Genetic ancestry: how do we figure out who we are?

Personalized medicine, precision medicine, direct-to-consumer genetic tests… all of these are hot topics currently. On the heels of my post on direct-to-consumer genetic testing, I started wondering exactly how 23andMe determined genetic ancestry using our DNA.

Luckily, they have a webpage explaining this. But it’s long… so I thought I’d give a summary.

The basis of a DNA ancestry test is the use of a DNA marker that is associated with a geographic location. 23andMe uses a variety of DNA markers to figure out where someone’s ancestors were from.

After acquiring DNA data, a complicated process called “phasing” begins. Essentially, it pieces bits of your DNA back together in the right order. Another way to put it is that your DNA gets separated based on which parent it came from (without actually telling you whether it came from your mother or your father). The DNA data obtained initially is “unphased”: it is jumbled up. 23andMe uses a version of the program Beagle called “Finch” to do the phasing (the link is to a paper describing the method behind Beagle).
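To make “unphased” a bit more concrete, here is a toy Python sketch. It is only an illustration of the problem, not how Beagle or Finch work: the function name and the three-site example are made up. At each site the raw data gives two letters without saying which copy of the chromosome each letter sits on, so several pairs of parental sequences fit the same data.

```python
from itertools import product


def consistent_haplotype_pairs(genotypes):
    """List every pair of haplotypes that fits the unphased genotypes.

    genotypes: list of two-letter tuples, one per site, e.g. ("A", "G").
    The order of the two letters inside a tuple carries no information.
    """
    # Only sites with two different letters are ambiguous.
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    pairs = set()
    for flips in product([False, True], repeat=len(het_sites)):
        flip_at = dict(zip(het_sites, flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # Sort so that (x, y) and (y, x) count as the same pair.
        pairs.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(pairs)


# Hypothetical data: three sites, two of them ambiguous.
unphased = [("A", "G"), ("C", "C"), ("T", "C")]
for pair in consistent_haplotype_pairs(unphased):
    print(pair)
```

With two ambiguous sites there are two possible answers; each extra ambiguous site doubles that number, which is why a statistical program is needed to pick the most likely one.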

Next, stretches of your DNA (“windows”) are compared to a reference dataset. This tells you which population a certain stretch of DNA is most closely related to. 23andMe uses a set of more than 10,000 people whose ancestry is known to make up the reference dataset (most of them 23andMe members). They also use information from a few public datasets.
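The window comparison can be sketched in a few lines of Python. This is a deliberately simplified stand-in, assuming each stretch of DNA is a string of 0s and 1s and that “most closely related” just means “fewest mismatches”; the real classifier is more sophisticated, and the names `assign_windows`, `reference`, and the populations “A” and “B” are invented for the example.

```python
def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_windows(haplotype, reference, window_size):
    """Label each window with the population of its closest reference.

    haplotype: string of '0'/'1' markers for one phased chromosome copy.
    reference: dict mapping population name -> list of reference strings,
               each the same length as haplotype.
    """
    labels = []
    for start in range(0, len(haplotype), window_size):
        end = start + window_size
        window = haplotype[start:end]

        def best_distance(population):
            # Closest reference sequence from this population, same window.
            return min(mismatches(window, ref[start:end])
                       for ref in reference[population])

        labels.append(min(reference, key=best_distance))
    return labels


# Hypothetical reference panel with one sequence per population.
reference = {"A": ["00000000"], "B": ["11111111"]}
print(assign_windows("00011110", reference, window_size=4))
```

Here the first window looks more like population A and the second more like population B, so the two halves of the same chromosome get different labels.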

The results are then further processed (likely mistakes are fixed) and calibrated (so that the confidence levels they report match how often the assignments are actually correct).

Now how accurate is the data you get? It varies by population group. They provide the “precision” and “recall” of the results. Precision is “if they say you’re from population A, how often are they right?” and recall is “how often did the system correctly identify the DNA associated with population A?” [correct me please if you disagree with my interpretation]

A high-precision, low-recall system will almost always be right when it says you’re from population A, but it will often fail to identify people who really are from population A (it won’t catch everyone from population A). A low-precision, high-recall system will catch nearly all people from population A but will also label some people who are not from population A as being from population A. Precision can also be called “positive predictive value” and recall can be called “sensitivity”.
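Here is a small Python sketch of the two definitions, using the standard formulas precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name and the six made-up samples are my own; none of the numbers come from 23andMe.

```python
def precision_recall(true_labels, predicted_labels, population):
    """Return (precision, recall) for one population."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: really from the population and labeled as such.
    tp = sum(1 for t, p in pairs if t == population and p == population)
    # False positives: labeled as the population but actually not.
    fp = sum(1 for t, p in pairs if t != population and p == population)
    # False negatives: really from the population but missed.
    fn = sum(1 for t, p in pairs if t == population and p != population)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Made-up truth: four samples from population A, two from population B.
truth = ["A", "A", "A", "A", "B", "B"]

# A cautious system: only calls "A" once, but is right when it does.
cautious = ["A", "B", "B", "B", "B", "B"]
print(precision_recall(truth, cautious, "A"))  # (1.0, 0.25)

# An eager system: catches every A, but mislabels one B as A.
eager = ["A", "A", "A", "A", "A", "B"]
print(precision_recall(truth, eager, "A"))  # (0.8, 1.0)
```

The cautious system is the high-precision, low-recall case and the eager one is the low-precision, high-recall case.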

The results for less-specific, broader groups like Oceanian, East Asian, European, Sub-Saharan African, and South Asian have high precision and recall. The lowest precision and recall are found in more specific groups, such as the Scandinavian, Mongolian, Balkan, Italian, and French & German groups.

Additionally, results are much more accurate if the DNA of the child and both parents (or at least one parent) is available.

Hope that was useful!

[Update: I found an amazing story about how a 23andMe ancestry test suggested that a baby swap between two unrelated families had occurred]
