Thursday, September 29, 2016

PCA, Neighbor-joining

This is from the 2016 Reich paper:

The pink dots at the top of the PCA diagram are "West Eurasia". The green dots down the side are "South Asia". The blue dots further below are East Asia/C.A.S., clustered along with the dots for Amerindians.

The PCA diagram would lead you to believe that the green and pink are more closely related than the green and blue. But the first two principal eigenvectors account for only 7.8% and 4.0% of the variance, 11.8% in total. The remaining 88.2% of the variance is in dimensions not shown in the diagram. It is a very high-dimensional space, and perhaps the normal intuitions of distance do not apply. That may be why we see the counter-intuitive result in the neighbor-joining tree: the green dots are joined to one another, rather than some green dots being placed close to pink dots and other green dots being placed close to blue dots. The spread of green dots along PC2 does not preclude them from being closer to each other than to any pink dot (the Tajik being the exception). Likewise, the wide spread of the blue dots along PC2 does not preclude them from joining in one group. And finally, the green and blue join together before the green joins with the pink.
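To make the point about hidden dimensions concrete, here is a toy sketch in Python with NumPy. The six samples and their coordinates are made up for illustration and are not data from either paper; the labels only borrow the colors used above. The coordinates are chosen so that a green point's nearest neighbor in the PC1/PC2 plane is a pink point, while its nearest neighbor in the full space is the other green point.

```python
import numpy as np

# Made-up coordinates for six hypothetical samples (NOT data from the paper).
# Columns 1-2 carry the most variance; columns 3-4 are "hidden" dimensions
# that a two-component PCA plot would not show.
labels = ["pink1", "pink2", "green1", "green2", "blue1", "blue2"]
X = np.array([
    [-1.1,  1.0,  2.2,  0.0],   # pink1
    [-1.1,  1.0, -2.2,  0.0],   # pink2
    [ 2.2,  2.0,  0.0,  0.0],   # green1
    [ 2.2, -2.0,  0.0,  0.0],   # green2
    [-1.1, -1.0,  0.0,  2.2],   # blue1
    [-1.1, -1.0,  0.0, -2.2],   # blue2
])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of variance per component
scores_2d = (U * S)[:, :2]        # coordinates on PC1 and PC2 only


def pairwise(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


d_2d = pairwise(scores_2d)   # distances as seen in the 2-D plot
d_full = pairwise(Xc)        # distances using every dimension


def nearest(D, i):
    """Label of the sample closest to sample i under distance matrix D."""
    row = D[i].copy()
    row[i] = np.inf          # ignore the sample itself
    return labels[int(np.argmin(row))]


i = labels.index("green1")
print("variance shown by PC1+PC2:", explained[:2].sum())
print("nearest to green1 in the 2-D plot:", nearest(d_2d, i))
print("nearest to green1 in full space:  ", nearest(d_full, i))
```

Distances in the plot can only be smaller than or equal to the true distances, because the projection drops the contribution of every other dimension. So two points that look close in the plot may be far apart once the missing dimensions are counted, which is all that is needed for a tree built on full distances to disagree with the picture.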

The PCA in the 2009 Reich paper does not report how much of the variance is captured by its first two principal eigenvectors, as far as I can tell.