Thursday, September 29, 2016

PCA, Neighbor-joining

This is from the 2016 Reich paper:

The pink dots at the top of the PCA diagram are "West Eurasia". The green dots down the side are "South Asia". The blue dots further below are East Asia/C.A.S., clustered along with the dots for Amerindians.

The PCA diagram would lead you to believe that the green and pink groups are more closely related than the green and blue. But the first two principal eigenvectors account for only 7.8% and 4.0% of the variance; the remaining 88% of the variance lies in dimensions not shown in the diagram. This is a very high-dimensional space, and perhaps the normal intuitions about distance do not apply. That may be why we see the counter-intuitive result that neighbor-joining connects the green dots more closely to each other, rather than placing some green dots near pink dots and other green dots near blue dots.

The spread of the green dots along PC2 does not preclude them from being closer to each other than to any pink dot (the Tajik being the exception). Likewise, the wide spread of the blue dots along PC2 does not preclude them from joining in one group. Finally, the green and blue join together before the green joins with the pink.
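A minimal sketch of why a two-component plot can mislead about distance, using synthetic data (not the paper's genotypes) and assuming plain Euclidean distance as a stand-in for whatever distance a tree-building method would use. The helper names `pca_project`, `pairwise_dist` and `nearest_neighbor` are made up for this illustration. Projection onto the top two components can only shrink distances, so two points that look adjacent in the plot may be far apart in the dimensions that are not shown.

```python
import numpy as np

def pca_project(X, k=2):
    """Center X (samples x features) and project onto its top-k principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes; s**2 is proportional to variance per axis.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Xc @ Vt[:k].T, explained[:k], Xc

def pairwise_dist(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))

def nearest_neighbor(D):
    """Index of the closest other point for each row of a distance matrix."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)
    return D.argmin(axis=1)

# Synthetic stand-in: 60 "individuals", 500 "markers", three loose groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
X[:20, :10] += 1.5      # hypothetical group A
X[20:40, 5:15] -= 1.5   # hypothetical group B

proj, explained, Xc = pca_project(X, k=2)
d_full = pairwise_dist(Xc)   # distances using all 500 dimensions
d_2d = pairwise_dist(proj)   # distances as they appear in a PC1-PC2 plot

changed = np.mean(nearest_neighbor(d_full) != nearest_neighbor(d_2d))
print("variance shown in the 2-D plot:", explained.sum())
print("fraction of points whose nearest neighbor differs:", changed)
```

The two printed numbers depend on the synthetic data, so they say nothing about the real populations; the point is only that a method working from full distances can group points differently from what the plot suggests.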

As far as I can tell, the PCA in the 2009 Reich paper does not report how much of the variance is captured by its first two principal eigenvectors.


Comments (3)

The PCA diagram is very interesting in that African variance is large in both components, while Eurasian variance is almost entirely concentrated in PC2. South Asians and East Asians are each very diverse in PC2, with South Asians intermediate between West Eurasians (a slightly less diverse group) and East Asians. The concentration of all groups along two lines is probably important. (Does it reflect geography?)
1 reply · active 441 weeks ago
I wish we had the PC3, PC4, and a few more principal components.

The out-of-Africa population has very little PC1 variance; it sits in a narrow vertical band relative to PC1. If we eliminate the African population, I take it that we would lose approximately 8% of human variation. Renormalizing the other 92% still keeps PC2 at about 4% (4.3% to be precise).
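The renormalization is simple arithmetic and can be checked in a few lines. This sketch assumes, as the comment does, that dropping the African samples removes exactly the PC1 share and leaves everything else unchanged, which is a simplification; the variable names are illustrative.

```python
pc1_pct = 7.8   # share of variance on PC1, from the post
pc2_pct = 4.0   # share of variance on PC2, from the post

# Assumption: removing PC1 entirely leaves the other shares untouched.
remaining_pct = 100.0 - pc1_pct
pc2_renormalized = 100.0 * pc2_pct / remaining_pct

print(f"remaining variance: {remaining_pct:.1f}%")
print(f"PC2 share after renormalizing: {pc2_renormalized:.2f}%")
```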

The European population has little PC2 variance as you noted. The PC2 variation may represent geography. I think to understand it better, we need a projection of the non-African population along the PC2-PC3 plane. Of course PC3 would account for less than 4% of the variance. Then there is the issue of whether the principal components computed with a purely non-African population would match up with these basis vectors.

In a way, the significance of accounting for 4% of the variance is huge. Since they look at on the order of 500K SNPs, I think of each person as a point in a 500K-dimensional space, one dimension for each SNP, with the person at +1 if they have the SNP and -1 if they don't (yes, that is too simplistic). If there were no structure, each dimension should account for 100%/500K of the variance, or 2 * 10^-4%. So 4% is not small for a principal component in a 500K-dimensional space.
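A quick check of that baseline: 100% spread evenly over 500,000 dimensions, compared with a component that carries 4%. This is only the per-SNP baseline the comment describes, not a null distribution for a principal component computed from a finite number of samples, and the 500K figure is the comment's round number.

```python
n_snps = 500_000   # order-of-magnitude SNP count from the comment
pc_pct = 4.0       # variance share of the component in question

# Even split of 100% across all dimensions.
baseline_pct = 100.0 / n_snps
ratio = pc_pct / baseline_pct

print(f"per-dimension baseline: {baseline_pct:.4f}%")
print(f"a 4% component is about {ratio:,.0f} times that baseline")
```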

On the other hand, non-Africans have 1-4% Neanderthal DNA, so variation in the amount of Neanderthal DNA among non-Africans would by itself account for a few percent of the variation (and it is not clear along which principal component axis in the 2016 paper this would fall). I think the geneticists have to dive deeper into their analyses, not just revealing structure in their principal components but also trying to identify what the principal components represent.
There are several other graphs in the paper or associated materials, including two more PCA charts featuring short tandem repeats. They tell a very similar story, except that there is a bit more dispersion in Eurasian PC1 and South Asian DNA now spans almost the entire range of West Eurasian to East Asian.
