Earlier this month Ancestry finally rolled out the updated version of its Ethnicity Estimates for all its customers. Sadly the concerns I raised in July have become reality. Many people are now left confused by their revised African breakdown as reported by AncestryDNA.1 Understandably so given the often drastic and seemingly incoherent changes compared with the previous set-up. In this three-part blog post I will argue that Ancestry’s pioneering analysis of especially West African DNA has been downgraded rather than upgraded! In the first part I evaluated the accuracy of Ancestry’s new African breakdown by analyzing the before & after results of 130 African customers. I found that in most cases the informational value to be derived from their results has decreased rather than improved. In the upcoming last part I will discuss FAQs about this update as well as look into promising new developments. See also:
- Did Ancestry kill their African breakdown? (part 1)
- Did Ancestry kill their African breakdown? (part 3)
Table 1
The title of this blog series was sort of meant to be tongue-in-cheek 😉 as I do believe that Ancestry still offers opportunities for those wanting to learn more about their African lineage. Nonetheless it seems very clear to me that Ancestry’s update may indeed have “killed it“, but only with their new Asian & European breakdowns! However not so with their African breakdown which has taken a big step backwards instead of forwards. At least in most aspects.
In this part 2 I will explore how the changed composition of Ancestry’s Reference Panel as well as Ancestry’s new algorithm may have contributed to this very disappointing outcome. Main topics:
- More is not always better: over-sampling for “Cameroon, Congo & Southern Bantu”, “Benin/Togo” and “Mali” causing inflated scores?
- What are the ethnic backgrounds of Ancestry’s African samples?
- New algorithm has issues with describing mixed/complex lineage?
1) More is not always better
As I have maintained throughout my AncestryDNA survey it is always essential to be aware of any shortcomings in DNA testing. Luckily Ancestry still provides sufficient information on their website to help you understand your results better. However you do need to actively seek it out and not be inclined to skip the small print 😉 I do find that the level of transparency has decreased somewhat when compared to their previous update in 2013.2 Then again Ancestry’s new white paper is still an insightful, albeit a rather technical account. Recommended reading:
- Ethnicity Estimate 2018 White Paper (Ancestry)
- Evaluating the AncestryDNA reference panel (Ancestry)
- Overview of previous Reference Panel (2013-2018) (Ancestry)
- AncestryDNA Regions (before update) (Tracing African Roots, 2015)
“We’ve added 13,000 more samples to our reference panel, which increases our ability to identify and find the genetic signature of a region within one’s DNA. ” (Source: Ancestry)
“The rollout of our enhanced ethnicity estimates will take place on September 12, 2018 and with this update, new and existing customers can expect more precise results across Asia and Europe.” (Source: Ancestry) [take note how Africa is not mentioned!]
“We’ve used the expanded reference panel and updated algorithm to add more specific regions in Asia and Europe.“ (Source: Ancestry) [take note how Africa is again not mentioned!]
“Africa presents special challenges. The African continent is the ancient birthplace of humanity, and humans there are the most genetically diverse on earth. This makes Africa a tricky place for ethnicity estimation because you need lots of DNA samples to account for all that diversity. We’re working to increase the number of African samples in our reference panel so we can take full advantage of our new methods of analysis and provide even better estimates for Africa.” (Source: Ancestry)
“More is better” is a very current belief. Not only in DNA testing but also generally speaking.3 However this assumption does not always hold true! As illustrated by the quotations above Ancestry itself also acknowledges its lackluster African update. There has been no meaningful improvement in performance when describing African DNA for people of African descent but rather a deterioration. This in spite of an impressive expansion of Ancestry’s Reference Panel by over 13,000 samples. The total number of samples Ancestry uses to compare your own DNA with is now 16,638 versus 3,000 previously.
Granted most of these new samples are either European or Asian. But in fact also the number of African samples has been increased from 464 to 1,395. Which represents a tripling in sample size! Then again the proportion of African samples in Ancestry’s previous Reference Panel used to be about 15% (464/3000). This share has now however decreased to 8% (1395/16638). This is already an indication of how it is not only absolute numbers you should be concerned with, but also relative standing.
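The arithmetic behind these proportions can be checked with a few lines of Python. This is only a sketch using the sample counts quoted above; the variable names are my own and purely illustrative.

```python
# Sample counts as quoted in the post; variable names are illustrative only.
african_before, total_before = 464, 3000
african_after, total_after = 1395, 16638

# Relative standing: African samples as a share of the whole Reference Panel.
share_before = african_before / total_before
share_after = african_after / total_after

# Absolute growth: how many times larger the African sample set has become.
growth_factor = african_after / african_before

print(f"African share before the update: {share_before:.1%}")
print(f"African share after the update:  {share_after:.1%}")
print(f"Growth in African samples: x{growth_factor:.2f}")
```

So the absolute number of African samples roughly tripled while their relative weight within the panel was almost halved.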
When we have a closer look at the newly updated Reference Panel (see table 1) and compare with the previous version (see this page) more imbalances are revealed. It can be seen for example how the “Cameroon, Congo & Southern Bantu” region now easily has the greatest sample size (n=579). Also the number of Malian samples has been expanded quite spectacularly. This region used to have the least number of samples (n=16) in the previous set-up but now there are 169 samples from Mali available! This is surely to be seen as an improvement in itself.4 However this accomplishment stands in stark contrast with how the already low number of samples for “Senegal” has practically stagnated (n=31 versus n=28). Also for “Nigeria” and “Ivory Coast/Ghana” only a modest increase in sample size has been achieved. In fact it turns out that more than 80% of the increase in African samples went to just three regions: “Cameroon, Congo & Southern Bantu” (+446), “Benin/Togo” (+164), and “Mali” (+153).
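How concentrated the increase is can be verified in the same way. The sketch below uses only the figures mentioned in this post (the totals of 464 and 1,395 African samples and the three regional increases); the dictionary and variable names are hypothetical.

```python
# Figures quoted in the post; names are illustrative only.
african_before, african_after = 464, 1395
increase_top_three = {
    "Cameroon, Congo & Southern Bantu": 446,
    "Benin/Togo": 164,
    "Mali": 153,
}

# Total number of African samples added with the update.
total_increase = african_after - african_before

# Part of that increase which went to the three over-sampled regions.
top_three_increase = sum(increase_top_three.values())
top_three_share = top_three_increase / total_increase

print(f"Total increase in African samples: {total_increase}")
print(f"Increase going to the three regions: {top_three_increase}")
print(f"Share of the increase: {top_three_share:.1%}")
```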
Is it any coincidence that these three regions are also the ones which seem to appear in very inflated amounts among the updated results of Afro-Diasporans? I will elaborate further below. However based on only table 1 one might already (intuitively) say that the over-sampled regions seem to suck in ethnicity estimate %’s at the expense of under-sampled regions. In a way functioning like a magnet. Given the reasonable predictive accuracy of the previous version (see this post) it makes you wonder why Ancestry bothered with adding more African samples in such an imbalanced manner when it only ended up making things worse and not better! No change is better than bad change after all!
As highlighted in one of the quotations above Ancestry has added mostly new European & Asian regions in their update. Resulting in a grand total of 43 global regions compared with 26 global regions in the previous version. To be fair Ancestry did introduce one new African region labeled “Eastern Africa”. But at the same time they also merged the former “Cameroon/Congo” region with the “Southeastern Bantu” region, undoing the former useful distinction between Central & Southern Africans.5 Leaving the total number of African regions within AncestryDNA unchanged at 9.
Going by my survey findings before the update there were however only 7 African regions which really mattered for Afro-Diasporans when describing their main ancestral connections with Africa. The so-called “Hunter-Gatherers” and “Africa North” regions usually being minimal or introduced by way of Iberian detour. Arguably after Ancestry’s update the African breakdown now only has 6 regions instead of 7 which really matter. The “Eastern Africa” region again being minimal for Afro-descendants in the Americas (and not even having a good prediction accuracy). While 3 out of 6 remaining regions (“Senegal”, “Ivory Coast/Ghana” and “Nigeria”) have been severely compromised because of Ancestry’s faulty sampling strategy. So basically it all became more generic instead of more specific. Even when the total number of African samples did increase…
Chart 1
“We predicted nearly 100% of the genetic ethnicity from the correct region for the following groups: […],Cameroon, Congo & Southern Bantu Peoples, […], Africa South-Central Hunter-Gatherers. For some regions, such as Nigeria […], the numbers are not as high, with average assignment of 28% […] to the correct region, respectively.” Source: Ancestry’s White Paper (2018, p.20).
Based on 130 African customer results I have already established in part 1 of this blog series that AncestryDNA’s update seems to work out best for actual Central Africans, Malians and North Africans. For Beninese and Ghanaian Ewe there seems to be not much difference, on average. For Southern Africans it also seems to be mostly an intermediate outcome. But the worst hit would be Nigerians (both north & south), Ghanaians (Akan & Ga), Ivorians, Liberians, Senegambians as well as Northeast Africans. And by default also people in the Afro-Diaspora descended from these populations! Many of these patterns are also reflected in chart 1 above.
It is especially illuminating to compare with this overview of average prediction accuracy of each African region before the update. If I understood Ancestry’s White Paper correctly the coloured box in each boxplot spans the 25th to 75th percentiles, with the bold black vertical line marking the 50th percentile (the median, rather than the average). The complete range (also including outliers) extends even further though. One thing that stands out is that 100% accurate estimates seem to have become less common, while the range downwards to below 50% has increased. Otherwise:
- Regions with decreased prediction accuracy: “Nigeria”, “Ivory Coast/Ghana”, “Senegal”, “Northern Africa”.
- Regions with increased or equal prediction accuracy: “Cameroon, Congo & Southern Bantu”, “Mali”, “Hunter-Gatherers”, “Benin/Togo”.
Although chart 1 is quite insightful it remains regrettable that the former genetic diversity tabs have disappeared as they contained more specific statistical details (in numbers!) for each separate region. It should also be kept in mind that these indications of prediction accuracy are based on the African samples already included in Ancestry’s Reference Panel. From my survey findings based on randomly collected African customer results (before the update) I found that Ancestry tended to overestimate the prediction accuracy of its African regions. For example for 77 Nigerian survey participants I calculated an average “Nigeria” score of around 52% (see this link) while Ancestry mentioned a median score of 69% “Nigeria” for its 67 Nigerian samples (see this screenshot).
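To make the boxplot terminology and the difference between a median and an average more tangible, here is a minimal sketch in Python. The scores below are entirely made up for illustration; they are not Ancestry’s data nor my survey data. The only point is that for a skewed set of regional scores the median can lie well above the average, so the two kinds of figures cannot be compared one-to-one.

```python
import statistics

# Hypothetical "Nigeria" scores (in %) for nine imaginary Nigerian samples.
# These numbers are invented purely to illustrate the statistics.
scores = [5, 20, 35, 60, 69, 75, 80, 90, 100]

# The three cut points of a boxplot: 25th, 50th and 75th percentiles.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# The arithmetic average, which is pulled down by the few very low scores.
mean = statistics.mean(scores)

print(f"25th percentile: {q1}")
print(f"Median (50th percentile): {median}")
print(f"75th percentile: {q3}")
print(f"Average: {mean:.1f}")
```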
2) What are the ethnic backgrounds of Ancestry’s African samples?
“Our samples came from these sources (approximate numbers):
- 500 from Human Genome Diversity Project (HGDP) samples
- 800 from One Thousand Genome Project samples
- 4,400 from AncestryDNA proprietary samples [SMGF]
- 10,800 from AncestryDNA customers” (Source: Ancestry)
“Today there are five main companies in the United States offering genealogical testing, including 23andMe, AncestryDNA, National Geographic, MyHeritage and Living DNA. Along with their popularity has come controversy. Some scientists note that because none of them release their reference panel data, it’s impossible to evaluate them.” (source: Ancestry.com’s ethnicity updates likely won’t be the last, USA Today, 2018)
Table 2
Africa’s ethnic diversity is a fact. Even if often underestimated or misunderstood (see this page for maps). In many cases this means that also within any given African country a great deal of genetic diversity will exist. Invalidating regions referring to modern-day countries (with colonial borders). As Ancestry insists on maintaining in this update. For correct interpretation of AncestryDNA’s African regions it is however still crucial to not only know the nationality but also the ethnic backgrounds of the African samples included in Ancestry’s Reference Panel. Imagine for example a “Nigeria” region being defined solely by Hausa-Fulani samples from the north. Surely this will lead to markedly different “Nigeria” scores than if only southern Nigerian samples (Igbo, Yoruba etc.) had been used to compare your own DNA with!
When AncestryDNA first came out with its pioneering West African breakdown I therefore emailed them several times in 2014 for more details about the specific ethnic groups being included for each African region. They never provided this info… Earlier this month I once more asked about these ethnic details on their website. Again no reply… By necessity then this section will be mostly based on guesswork and (informed) speculation on my part. However to be fair Ancestry does mention some key aspects about their African samples on their website which I will try to incorporate as well.
I will not discuss the possible ethnic background of the samples being used for “Senegal” (Mandenka from HGDP?), “Ivory Coast/Ghana” (Akan/Brong and Ivorian Kru & southwestern Mande?), “Nigeria” (Igbo & Yoruba from HGDP?) and “Northern Africa” (Mozabite from HGDP?). The sample size of these regions has not been expanded that much after this update (see table 1). And frankly I suspect only customer samples have been used whenever there was modest addition.6 Otherwise the sample composition of these regions will have remained the same. For previous discussion see this page as well as this one.
It could very well be a different story though for the leading trio of “Mali”, “Benin/Togo” and “Cameroon, Congo & Southern Bantu”! As I highly suspect that most if not all of their notably greater increase in sampling may have been sourced by way of the former Sorenson (SMGF) database. This sample collection is referred to as proprietary by Ancestry in the first quote above. Because Ancestry acquired the Sorenson Molecular Genealogy Foundation (SMGF) in 2012. The previous version of Ancestry’s Reference Panel probably already contained samples from this collection. But it could very well be that the number of African samples from this invaluable database has increased even more so with this update.7
The website of the Sorenson database was regrettably taken down by Ancestry in 2015. But luckily it can still be accessed by way of the internet archive 🙂 By performing a search I could verify that all expected countries from AncestryDNA’s African breakdown have indeed been sampled by SMGF (see table 2). However it is to be kept in mind that these samples were originally obtained for either Y-DNA or Mitochondrial DNA testing. But *possibly* Ancestry has now also managed to extract autosomal DNA from these samples. Again this is speculation on my part!
Looking into the 153 newly added Malian samples for example it is very tempting to go with this SMGF scenario though. Because Malian samples are quite rare in other publicly available databases to my knowledge. Not at all present in either the HGDP or 1000 Genomes databases (mentioned as other sources for Ancestry’s Reference Panel). The number of Malian Ancestry customers may also be assumed to be much too small to support an increase of 153 samples. So by way of elimination only the Sorenson database seems to remain as a viable option. A similar line of reasoning might also be valid for the 446 (!) newly added samples from presumably either Cameroon or Congo. As well as the 164 newly added samples from Benin and/or Togo.
Regrettably I was not able to find any specific ethnic or other relevant details being mentioned on the former website of the Sorenson database. However a possible clue might be taken from the photo credits on Ancestry’s “Mali” page which explicitly refer to SMGF! Pursuing this lead a bit further it turns out that actually at the time SMGF also organized a photo exhibition called “Faces of Mali”. And from the description it seems that at least some of the sampling may have taken place in southwestern Mali.8 In fact from another source it may already be confirmed that both Bambara & Dogon samples were among them! For more details see:
- Toward a more Uniform Sampling of Human Genetic Diversity: A Survey of Worldwide Populations by High-density Genotyping (Xing et al, 2010)
I did not find many useful clues when looking into the possible origins of the newly added samples for the “Benin/Togo” and the “Cameroon, Congo & Southern Bantu” regions. However just going by my before & after survey findings for 130 continental Africans, I suspect that also samples with a non-Gbe origin might have been added for the “Benin/Togo” region. This might explain the lack of any substantial improvement in describing the DNA of my Ewe and Benin samples. Plus it might also (partially) explain the great extension of the Benin/Togo region across West Africa, and especially in southern Nigeria. One would hope that Ancestry did not also include Beninese Yoruba samples, as this would frankly be nothing less than a blunder…
The increase in sample size has by far been the greatest for the “Cameroon, Congo & Southern Bantu” region (+446!). However Ancestry has not been very informative of this major change. Even if this newly combined region now includes more than 20 countries! Again I have to go by my unconfirmed assumptions, but I highly suspect that most newly added samples may have been obtained from Cameroon and not from the Congo, given the vast overrepresentation of the former country in the Sorenson database (2,453 versus 87, see table 2). This would actually be in line with a general trend whereby the genetic importance of Cameroon in DNA testing for Diasporans has been overstated because of a relative abundance of Cameroonian samples to be matched with (both for haplogroup and autosomal testing). While other samples from especially southeastern Nigeria but also from the Congo and Angola are relatively lacking. See also:
New samples added from Kenya as well as Tanzania?
These are the updated results of a Kenyan with 57% “Eastern Africa”, as seen in preview code. See this screenshot for the website version. By using a trick (see this link) you were already able to see such results before Ancestry had rolled out their update to all their customers. The interesting thing is that programming codes have been used instead of the usual regional labeling. And quite tellingly the code name for “Eastern Africa” is “Luhya”!
Once more I would like to underline that I have no confirmation for what is about to follow. However starting with the newly updated map for the “Hunter-Gatherer” region one might wonder: did Ancestry perhaps replace their Central African hunter-gatherer samples (Mbuti and Baka Pygmies) with Tanzanian ones? These possibly new hunter-gatherer samples for Tanzania being either Sandawe and/or Hadza. All of these populations are heavily marginalized, living in very remote places and subsisting in small numbers. Frankly speaking I do not find them very relevant to understand the origins of especially Afro-Diasporans (one notable exception being the Khoi-San and their genetic legacy among the South African Coloureds). However because of their distinctive genetics they have been studied extensively and many academic samples are available. Which is why they are often featured in DNA testing.
In part 1 of this blog series I already mentioned the quite outlandish reporting of this “Hunter-Gatherer” region in clearly inflated amounts among Northeast Africans. According to Ancestry’s own information this region is now to be found as far north as Djibouti! Far removed from any historical Pygmy or Khoi-San population! But perhaps less absurd when also taking into account an (unconfirmed!) addition of Tanzanian samples. Despite having distinctive DNA markers it is also known that Tanzanian hunter-gatherers have intermingled with surrounding populations across time, incl. Bantu-speaking ones but also Nilotic and (South) Cushitic ones. Which might explain the genetic similarities now being detected (in absence of better fitting samples from Northeast Africa!). See also this very recent study:
- Genetic Ancestry of Hadza and Sandawe Peoples Reveals an Ancient Population Structure in Africa (Shriner et al., 2018)
It should be noted also that the “Hunter-Gatherer” scores have mostly disappeared for Central Africans themselves (as well as West Africans), going by my before & after survey. Additional sampling from Tanzania however would be clearly in contradiction with the regional overview given by Ancestry which still only mentions the Khoi-San and Pygmy (see map above). But perhaps this text is still under revision. Also the number of samples for the “Hunter-Gatherer” region (previously n=35) has not increased with this update but actually has been reduced by one sample (see table 1)! Then again it might also be that the regional map was made in error or possibly it’s just some quirk of Ancestry’s new algorithm which is causing these inflated “Hunter-Gatherer” amounts to appear among Northeast Africans. Either way the current outcome is very unsatisfying and hardly in support of decent quality control by Ancestry.
Moving on to the new “Eastern Africa” region I have more solid ground to believe that the Kenyan Luhya people have been used as a reference population. Perhaps in addition to other ones but I would not be surprised if they are the only defining samples being used for “Eastern Africa”. As shown above in the preview mode of the updated results of a Kenyan with 57% “Eastern Africa”, it can be revealed that Ancestry uses “Luhya” as a code name for their “Eastern Africa” region. By using a trick (see this link) you were already able to see a preview of your updated results before Ancestry had actually rolled out their update to all their customers. Which is how I obtained this insight 😉 The interesting thing is that in this preview mode programming codes have been used instead of the usual regional labeling. It may not be a watertight confirmation but certainly it is no coincidence that Ancestry’s programmers picked out the Luhya as their code name. When one reads the regional description for “Eastern Africa” provided by Ancestry (see this screenshot), again the Luhya are explicitly mentioned and seemingly singled out.
In fact there is more supporting evidence because Ancestry has itself mentioned that the One Thousand Genome Project has been one of their main sources for their newly added samples. And within this 1000 Genomes database 116 Luhya samples from Kenya can be found (see this link). Sufficiently covering the 82 samples being used for Ancestry’s “Eastern Africa” region (see table 1). These very same Luhya samples have actually also been used by 23andMe but quite perversely for their West African category (!) (this was before their current update which is still to be rolled out completely). Highly illustrative of the sometimes arbitrary and ill-designed usage of African reference populations by DNA testing companies…
In their white paper Ancestry makes it a point to emphasize how they are committed to “developing the best possible set of reference samples.” They mention that “the genetic distinctness of each region” should be kept in mind. And quite rightfully they mention that it is not only about quantity but also about quality when designing an appropriate Reference Panel. A tool based on comparing relevant populations with your own DNA and which is able then to achieve reasonably accurate ethnicity estimates in line with either historical plausibility or verifiable genealogy.
It would be quite contradictory therefore if the Luhya have indeed been chosen as the sole defining reference population for “Eastern Africa”. As genetic studies have already revealed this Bantu-speaking population not to be the perfect choice for strictly covering genetic similarity with Nilotic(-like) DNA among Northeast Africans. Many Kenyan populations actually being composites of Bantu-, Nilo-Saharan- and Cushitic-speaking populations to varying degrees. However for the Luhya it seems their ancestral ties with Bantu populations from Central Africa are the strongest. Which probably accounts for the rather disappointing prediction accuracy of the new “Eastern Africa” region among native Northeast Africans (around 50% according to my before & after survey), as well as the subdued appearance of this region among Southeast Africans and the occasional trace reporting (1%) among Afro-Diasporans. The Maasai (successfully used by 23andMe for their own “East Africa” category!) arguably make for a much better candidate. See also:
- The genetics of East African populations: a Nilo-Saharan component in the African genetic landscape (Dobon et al., 2015)
3) New algorithm has issues with describing mixed/complex lineage?
Table 3
“We also evaluated the accuracy of ethnicity estimates for “synthetic” individuals of mixed ethnicity origins. These test cases are simulations we construct with known mixtures of ethnicities. Each synthetically admixed individual can have as few as 2 or as many as 20 ethnicity regions, with various proportions. Since the true ethnicity proportions are known, we can calculate precision and recall for each ethnicity region. Precision and recall are two important factors in evaluating our estimation process.” (source: Ancestry)
“For regions with low recall, it’s mostly because part of the ethnicity from these regions are assigned to nearby regions. Hence underestimation and the low recall. For regions with low precision, it’s mostly likely part of the nearby regions are assigned there. Hence overestimation and low precision.” (source: Ancestry)
“Our new algorithm analyzes longer segments of genetic information and is a fundamental change in how we interpret DNA.” (source: Ancestry)
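To illustrate what precision and recall mean in this context, here is a minimal sketch in Python. Please note that this is my own simplified, proportion-based reading of the two quotes above and not necessarily the exact calculation Ancestry performs in its white paper. The function name, the “synthetic” individual and all percentages are hypothetical. For each region the correctly assigned share is taken as the overlap between the true and the estimated proportion.

```python
def precision_recall(true_props, est_props):
    """Per-region precision and recall for one synthetic individual.

    Assumption (not confirmed by Ancestry): the correctly assigned share of a
    region is min(true, estimated). Precision divides that share by the
    estimated amount, recall divides it by the true amount.
    """
    results = {}
    for region in sorted(set(true_props) | set(est_props)):
        true_share = true_props.get(region, 0.0)
        est_share = est_props.get(region, 0.0)
        overlap = min(true_share, est_share)
        # None marks cases where the ratio is undefined (nothing to divide by).
        precision = overlap / est_share if est_share > 0 else None
        recall = overlap / true_share if true_share > 0 else None
        results[region] = (precision, recall)
    return results


# Hypothetical synthetic individual: known mixture versus estimated mixture.
true_mix = {"Nigeria": 0.50, "Benin/Togo": 0.25, "Mali": 0.25}
est_mix = {"Nigeria": 0.25, "Benin/Togo": 0.50, "Mali": 0.25}

for region, (precision, recall) in precision_recall(true_mix, est_mix).items():
    print(f"{region}: precision={precision}, recall={recall}")
```

In this made-up case half of the true “Nigeria” share is assigned to the neighbouring “Benin/Togo” region. So “Nigeria” ends up with a low recall (underestimation) while “Benin/Togo” ends up with a low precision (overestimation), which is the pattern described in the second quote.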
The new algorithm used by Ancestry most likely also had a major impact on the updated African breakdown on AncestryDNA. It would be useful to see how the updated results would have turned out if Ancestry had maintained its former algorithm while still using their expanded Reference Panel. For proper understanding it will be mandatory to closely read Ancestry’s white paper. But in order not to digress too much I will keep this section brief and not overly technical. One important consequence of Ancestry’s new algorithm seems to be the tendency to stick everything in as few as possible big regions rather than having things divided up into a dozen small percentages. Something which some customers seemed to take in with great delight as an expression of “diversity” and “exotic” lineage 😉 . But in fact such overly detailed breakdowns often also were confusing or misleading.
The new algorithm also accounts for the disappearance of most “Low Confidence” a.k.a. “Trace regions”. These latter regional scores were often mislabeled and obviously to be taken with a grain of salt. On the one hand this may be considered an improvement as Ancestry is now focusing on larger stretches of DNA, which should be more reliable and less likely to represent statistical noise. But in my experience, given correct interpretation and proper follow-up research, these minimal scores could sometimes still be indicative of distinctive ancestors.
It has been said that this update seems good for people with low genetic diversity and good representation of their nationality within Ancestry’s Reference Panel. However for people with more complex background, incl. recently mixed individuals, Ancestry’s new algorithm does not always perform as expected. This has been observed for example for people with known mixed northern & southern European background, whereby the northern European component tends to get overestimated. Over-simplification that works well to eliminate noise for someone who is predominantly from one ethnic group has the opposite effect for someone who is recently mixed or has more complex origins from several generations ago (see this link for insightful discussion). Likewise also for Africans of known mixed background Ancestry often does not get it right. As shown in table 3 this goes especially for people of mixed Nigerian, Senegalese or Ivorian/Ghanaian background. These three regions have already stood out before as having a worse prediction accuracy than in the previous version (see chart 1).
Within its white paper Ancestry specifically makes a distinction between prediction accuracy for so-called “single-origin individuals” as opposed to “synthetic individuals” with known mixed ethnicities. The implications for Afro-Diasporans could be even more far-reaching as after all almost by default Trans-Atlantic Afro-descendants will have intricately mixed origins from across West, Central and Southeast Africa in mostly unknown regional proportions. Generally speaking only historically documented slave trade patterns, African ethnonyms being recorded among enslaved people as well as cultural retention serving as ways to roughly verify any DNA results (see this link). But more so on a group level than for individuals! Therefore the previous algorithm might have been more suitable to deal with this complexity. While the current one might serve to underestimate or simplify the various regional origins of Afro-Diasporans. At least on their African side. From what I have seen their Asian & European admixture is now however much more in line with historical plausibility. Which seems to illustrate you cannot always have it both ways.
If you are discontent about this update let Ancestry know about it!
It is often advised not to take your DNA results too seriously because of all the imperfections and inherent limitations. And it is indeed always good to be well-informed and critical without being over-dismissive. However for myself and many other Afro-Diasporans the African breakdown provided by AncestryDNA represented a promising and valuable tool for learning more about our previously unknown regional lineage within Africa. As we generally do not have much to go by otherwise this is not something to take lightly! Which is why many people have been unsettled by the drastic and seemingly random changes in their AncestryDNA results. I will discuss some of their reactions in the last part of this blog series. Right now I would like to repeat that whenever you are asked for feedback by Ancestry make sure to let them know! When in agreement please also forward them this link:
Achieving improved ethnicity estimates is more difficult than it seems at first sight. I can imagine it often involves balancing opposed considerations and making tough calls. It is an ongoing challenge which Ancestry in their own words is dedicated to taking on. And in fact I do appreciate the efforts which have gone into this update. I have noted any improvements whenever I came across them. But generally speaking, with regard to the African breakdown, the outcomes have been very disappointing and frankly a setback!
What I find particularly frustrating is that the current issue of highly inflated “Benin/Togo”, “Cameroon, Congo, Southern Bantu” and “Mali” scores could have been prevented if only Ancestry had carried out their update in a more thoughtful manner. I have been blogging about the misleading country name labeling of especially “Benin/Togo” for several years already (see this blog post). Also from the start I have pointed out that the “Cameroon/Congo” region is poorly designed as it covers ancestral connections to both the Bight of Biafra and Central Africa. Instead of addressing these issues or at least attempting to achieve some improvement Ancestry has only made things worse with this update…
1) It might be a different story for the European and Asian breakdowns. I have actually seen quite encouraging updated results in this regard. And generally speaking they could be an improvement indeed. Although there are also still some remaining issues. The non-African regional breakdowns are however not a topic of discussion in this blog post.
2) For example crucial statistical information to determine the predictive accuracy of each region is no longer provided as it used to be in the “genetic diversity” tabs (see end of this page for examples). Also I have not yet seen an equivalent of this chart below depicting the “Average ethnicity estimates for natives from each region”. It used to be available by way of this link. But this page has not been updated yet sofar…
***(click to enlarge)
3) Unlike commonly assumed you do not need to sample entire populations to obtain informational value with wider implications. Naturally greater sample size does (usually) help matters. But if you randomly test a given population, and if your sample group is fairly representative of the whole population, you can make generalizations. Naturally methodology and the assumptions being made should be made explicit, but this is common scientific practice. See also:
- Representative Samples: Does Sample Size Really Matter? (SurveyGizmo)
This is an important lesson I learnt while performing my AncestryDNA survey: robust patterns (in line with historical plausibility) might already be discernible from a sample-size of around n=30. Which is actually often considered a general rule of thumb. Adding more results will indeed lead to greater finesse and more detailed statistics but the main outline might then already be established. Even more so when you are aware of any possible sampling bias or substructure and know how to account for it in your analysis.
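To illustrate why a sample of around n=30 can already reveal the main outline, here is a minimal Python sketch of how the standard error of a group average shrinks with sample size. It assumes a simple random sample, and the spread of 15 percentage points (`ASSUMED_SD`) is a purely hypothetical figure chosen for illustration, not a value taken from my survey or from Ancestry.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean, assuming a simple random sample."""
    return sd / math.sqrt(n)

# Hypothetical spread (in percentage points) of one regional score
# within a single group; an assumption for illustration only.
ASSUMED_SD = 15.0

for n in (10, 30, 100, 300):
    se = standard_error(ASSUMED_SD, n)
    # 1.96 * SE approximates the half-width of a 95% confidence interval.
    print(f"n={n:>3}  standard error={se:.2f}  95% margin=+/-{1.96 * se:.2f}")
```

Because the standard error falls with the square root of n, a tenfold increase in samples narrows the margin only by a factor of about 3.2. And none of this corrects for sampling bias or substructure, which still has to be accounted for separately.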
4) This spectacular increase in Malian samples (+153) certainly is to be commended in itself. It was in fact one of the main suggestions for improvement I blogged about in July (see this link). In order to prevent the currently inflated “Mali” scores it would have been preferable though if Ancestry had also augmented the sample size of their “Senegal” region. Which is still very low now (n=31). Like I suggested in July many Senegambian samples are to be obtained from either the 1000 Genomes database (which was actually used by Ancestry in this update!) or the MalariaGEN database.
5) Given correct interpretation the distinction being made between “Cameroon/Congo” and “Southeastern Bantu” could be very useful for Afro-descendants as well as many Africans. This was demonstrated most clearly by the frequency of top-ranking scores for “Southeastern Bantu” for my Brazilian and Mexican survey participants, corroborating their strong ancestral ties with Angola (see this blog post).
6) I am making this assumption based on the observation of atypical 100% “Nigeria” scores for 4 Nigerian persons in my before & after update survey. As well as one single 100% “Northern Africa” score for a Moroccan. Such unexpected scores seem to be the result of including customer samples into Ancestry’s Reference Panel. Causing an overfitting or calculator effect. The ethnic backgrounds of the Nigerians scoring 100% “Nigeria” are quite diverse btw, but all hailing from southern Nigeria: Igbo (2x?), Yoruba, Urhobo (?).
7) You might wonder (like I did) why Ancestry did not use all of the available African samples in its Sorenson database right away in 2013. When they first provided their pioneering West African breakdown. However it seems that at the time there were still some issues to resolve about required consent for commercial purposes. Which perhaps may have caused the delay. See also these articles for more references:
- Sorenson Molecular Genealogy Foundation (ISOGG)
- Cruwys News (see comment by Debbie Kennett made on 8 January 2015 at 15:12)
8) Again I have to indulge in some speculation at this point. But it seems quite likely to me that Ancestry’s Malian samples were drawn from several ethnic groups and not just one or two. As southwestern Mali is very much a multi-ethnic region already let alone other parts of this large country (see this page). To repeat myself Ancestry has not disclosed the actual ethnic backgrounds of its 153 newly added Malian samples. Which is rather crucial because as I have argued before it is not only the number of additional samples which matters but also their relevancy and how they fit in the Reference Panel. Additional samples being a means to an end. But coherent regional scores in line with historical plausibility or even verifiable genealogy should remain the main goal!
Just to name one possibly problematic issue: the inclusion of Gur/Senoufo speaking samples from southern Mali could very well cause greater regional overlap with Mali’s neighbouring countries to the south & east. Possibly Dogon samples may already have a similar effect. While also any inclusion of Malian Fula samples from the Maasina area would be rather ill-advised as they have quite distinctive genetics, incl. a North African(-like) component. And in fact they are not unique to Mali at all as the Fula people are arguably the epitome of Africans migrating across the continent (see this map). In my previous survey findings I found however that despite great dispersion and also some degree of local intermingling many Fula people (incl. also the more hybrid Hausa-Fulani) still clearly preserve a distinctive genetic component tied to their presumed origins along the Senegal river. Which is why I find it quite lamentable that their formerly predominant “Senegal” scores have now been replaced by “Mali” ones. See also: