Sunday, June 26, 2016


The BBC reports:

In other words, had you been washed ashore four millennia ago on the banks of the now lost river of Saraswati and hitched a bullock cart ride to Farmana in the Ghaggar valley near modern-day Delhi, here's what you might have eaten - a curry.

For in 2010, when advanced science met archaeology at an excavation site in Farmana - southeast of the largest Harappan city of Rakhigarhi - they made history, and it was edible.

Archaeologists Arunima Kashyap and Steve Webber of Vancouver's Washington State University used the method of starch analysis to trace the world's first-known or "oldest" proto-curry of aubergine, ginger and turmeric from the pot shard of a bulbous handi (pot). 

Wednesday, June 22, 2016

Hoisted from the comments

In a comment on a previous post, I wrote:
I would say that the modern population genetic data and the archaeological data are consistent with a Yamnaya incursion into Europe. The ancient DNA data is also consistent. There is no literary record, and neither genes nor archaeology inform us about language, but presumably the Yamnaya incursion is a candidate to which to tie I.E. expansion into Europe. The problem is that the earlier farmers are also a candidate for I.E. introduction into Europe. Neither genes nor archaeology can help us decide between the two. Most historical linguists like the shorter time depth and favor the Yamnaya theory. But their tools, e.g., "glottochronology", are riddled with flaws.

The archaeological record for India does not show a Yamnaya incursion, an Andronovo incursion or any other significant incursion in the 1900 BC - 1200 BC timeframe. Language-wise, the Rg Veda is the oldest attested I.E. example but is known via oral tradition; the Mitanni records, with a few Vedic deities and I.E. words, are its closest competitor; but the Indian archaeological record does not provide any evidence of language. (Re: Hittite, see below.) India does not yet have any ancient DNA. Genetics of the modern population so far rules out any significant incursion into India 4000-2500 years ago, but India remains a grossly undersampled population.

The Saraswati River mentioned in the Rg Veda is named with other rivers; in the hymn, these other rivers are in the correct geographic sequence of rivers of the Indus and Gangetic systems. If we place the Saraswati by that sequence, it was a mighty river then, and it now corresponds to the seasonal Ghaggar-Hakra, the channel along which the majority of Harappan civilization sites are found. That channel corresponds to the Saraswati, which had already dried up by the time of the Mahabharata. If we accept this identification, it places the Rg Veda before 2000 BC. Since Hittite is attested only in written records of 1600-1300 BC, the Rg Veda would be the oldest attested I.E. example.

We need not accept this identification of the Saraswati; linguists such as Harvard's Sanskritist M. Witzel have theorized that some other river in Afghanistan was the original Saraswati, and that for some reason the Rg Vedic people transferred the name to the already dried-up river bed that they found when they entered India.

The Rg Veda was composed in India; of that there can be little doubt. But the Saraswati timeline throws the historical linguists' 1900 BC - 1200 BC timeline into confusion, so they hypothesize that the hymns are sometimes a memory of some other place where the "original" Saraswati was. They also dismiss the Rg Vedic mention of the sea and of hundred-oared boats, which do not fit the deep inland Afghan location where they want to place the "original" Saraswati, by saying that the composers were incorrigible boasters and exaggerators. Then, by their estimates, these memories were transformed into Rg Vedic hymns around 1400 BC. Then Hittite becomes the oldest attested I.E. language.

Tuesday, June 21, 2016

Some things to keep an eye on

This is from 2011, and it is about "next generation sequencing". I'm told newer technology is on the way, but it is still not in wide use, so this is still relevant.
Genotype and SNP calling from next-generation sequencing data
Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data. We review these methods and provide a guide for their use in NGS studies.

This next one is intriguing too. Frankly, I would not have guessed that laboratory conditions, reagent lots and personnel differences might lead to large problems.
Tackling the widespread and critical impact of batch effects in high-throughput data
High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
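To convince myself of why a batch effect correlated with the outcome is so dangerous, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration (the offset of 1.0 for the second batch, the noise level, the group sizes, and the function names); it is not taken from the paper. Cases and controls have no true difference at all, yet the confounded design reports one.

```python
import random

def measure(rng, batch_offset, noise_sd=0.5):
    """One measurement: zero true signal, plus a batch offset, plus noise."""
    return batch_offset + rng.gauss(0.0, noise_sd)

def mean(values):
    return sum(values) / len(values)

def apparent_difference(confounded, n=200, seed=0):
    """Case mean minus control mean when there is NO true case/control effect.

    confounded=True  : every case is run in batch 2, every control in batch 1.
    confounded=False : each group is split evenly across the two batches.
    """
    rng = random.Random(seed)
    offsets = {1: 0.0, 2: 1.0}  # hypothetical batch offsets
    if confounded:
        case_batches = [2] * n
        control_batches = [1] * n
    else:
        case_batches = [1, 2] * (n // 2)
        control_batches = [1, 2] * (n // 2)
    cases = [measure(rng, offsets[b]) for b in case_batches]
    controls = [measure(rng, offsets[b]) for b in control_batches]
    return mean(cases) - mean(controls)

print("confounded design:", apparent_difference(confounded=True))
print("balanced design:  ", apparent_difference(confounded=False))
```

In the confounded design the apparent difference is essentially the batch offset; in the balanced design the offsets cancel and only noise remains.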


The next one is a personal observation, possibly of little merit. All these studies that try to figure out the ancestry of populations essentially compile some data, apply some statistical models and computation and then try to come to some conclusions. One way of looking at it is that they are trying to create classifiers, trees or directed graphs with some level of statistical reliability. But another way of looking at it is that they are doing one-half of machine learning. They have created a training set and applied it. The second part, which is to use the model to make predictions, is missing. That is, having done their first thousand samples, they should now let their trained model work on the next thousand samples, and see how well it performs. IMO, this would be as good a demonstration of the statistical significance of their model as any other.
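What I have in mind can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two populations, their allele frequencies (0.3 and 0.7), the nearest-centroid classifier and all function names are hypothetical stand-ins, not anything used in the studies themselves. The point is just the shape of the exercise: fit on the first batch of samples, then score the model on a second batch it has never seen.

```python
import random

def simulate_genotypes(rng, allele_freq, n_samples, n_snps):
    """Draw diploid genotypes: 0, 1 or 2 copies of the alternate allele per SNP."""
    return [
        [sum(rng.random() < allele_freq for _ in range(2)) for _ in range(n_snps)]
        for _ in range(n_samples)
    ]

def centroid(rows):
    """Per-SNP mean genotype of a group of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def fit(labelled):
    """labelled maps a population label to its genotype rows; returns label -> centroid."""
    return {label: centroid(rows) for label, rows in labelled.items()}

def predict(model, row):
    """Assign a sample to the population with the nearest centroid."""
    return min(model, key=lambda label: manhattan(model[label], row))

def holdout_accuracy(seed=0, n_train=100, n_test=100, n_snps=50):
    rng = random.Random(seed)
    freqs = {"pop_A": 0.3, "pop_B": 0.7}  # hypothetical allele frequencies
    train = {k: simulate_genotypes(rng, f, n_train, n_snps) for k, f in freqs.items()}
    test = {k: simulate_genotypes(rng, f, n_test, n_snps) for k, f in freqs.items()}
    model = fit(train)  # the "first thousand samples"
    hits = sum(
        predict(model, row) == label  # the "next thousand samples"
        for label, rows in test.items()
        for row in rows
    )
    return hits / (len(freqs) * n_test)

print("held-out accuracy:", holdout_accuracy())
```

With populations this well separated the held-out accuracy is close to 1; the interesting cases are the real ones, where it may not be.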

"Population Structure Analysis of Globally Diverse Bull Genomes" has this fascinating plot.
Among these genomes, there are m = 3,967,995 single nucleotide polymorphisms (SNPs) with no missing values and minor allele frequencies > 0.05 (Supplementary Fig. 2). To explore structural complexity, whole genome sequences of n = 432 selected samples were hierarchically clustered using Manhattan distances (Figure 3, colored by 13 different breeds). It is evident that official breed codes (or countries of origin) do not necessarily represent the genetic diversity among bulls represented by SNPs.
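For my own understanding, here is a rough Python sketch of the two steps the excerpt describes: dropping SNPs below a minor allele frequency threshold, then hierarchically clustering samples by Manhattan distance. The toy genotype matrix, the function names and the choice of average linkage are my assumptions; the excerpt does not say which linkage the authors used, and their data are of course vastly larger.

```python
def minor_allele_freq(column):
    """Minor allele frequency of one SNP, given 0/1/2 genotypes for all samples."""
    alt = sum(column) / (2 * len(column))
    return min(alt, 1 - alt)

def filter_by_maf(genotypes, threshold=0.05):
    """Keep only the SNP columns whose minor allele frequency exceeds the threshold."""
    columns = list(zip(*genotypes))
    keep = [j for j, col in enumerate(columns) if minor_allele_freq(col) > threshold]
    return [[row[j] for j in keep] for row in genotypes]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def average_linkage(genotypes, n_clusters):
    """Agglomerative clustering (average linkage, an assumption) on Manhattan distance."""
    clusters = [[i] for i in range(len(genotypes))]

    def dist(c1, c2):
        total = sum(manhattan(genotypes[i], genotypes[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data: 6 samples x 4 SNPs; the last SNP is monomorphic and gets filtered out.
samples = [
    [0, 0, 2, 0],
    [0, 1, 2, 0],
    [0, 0, 1, 0],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 0],
]
filtered = filter_by_maf(samples)
print(average_linkage(filtered, n_clusters=2))
```

Whether the resulting clusters line up with breed labels is then an empirical question, which is exactly what the paper's plot addresses.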

The most effective way of learning about a complex topic

I'm sure there are many ways of learning about a complex topic.  The one that works least well for me is to simply be a sponge.  The most effective way I've found is to take a position - have some explicit or implicit hypothesis or major assumption - and then proceed therefrom.   It helps in many ways, too numerous to enumerate.

Clusters and Clines

Front Genet. 2016; 7: 22.
Published online 2016 Feb 17. doi: 10.3389/fgene.2016.00022
PMCID: PMC4756148
Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research
Koffi N. Maglo, Tesfaye B. Mersha, and Lisa J. Martin
From the abstract: "...contrasts the scientific status of the “cluster” and “cline” constructs in human population genomics, and shows how cluster may be instrumentally produced."

To be frank, I do not yet fully understand the paper.  But this is intriguing (you'll have to read the paper to get the context):

Furthermore, it has been shown that the rate of individuals having membership in multiple clusters increases with the inclusion of admixed populations in studies. This does not however negate the computational possibility of clustering admixed individuals. But under this scenario, many individuals will typically have mixed membership in different clusters (Pritchard et al., 2007; Bryc et al., 2010; Maglo, 2011; Jin et al., 2012). As mentioned above, the correlated allele model was specifically designed to resolve “subtle admixture problems.” Curiously, some researchers perform cluster analysis on admixed populations by bypassing this model (Tang et al., 2005), raising questions about their findings (Graves, 2011). Yet the user guide of Structure states that “Admixture is a common feature of real data, and you probably won't find it if you use the no-admixture model” (Pritchard et al., 2000; Elhaik, 2012).
In a word, computational success does not by itself alone entail the natural reality of clustered entities in evolutionary classification.

Sunday, June 12, 2016

On Orlando

The commission of a crime requires the confluence of motive, means and opportunity.

A terrorist has a lot of opportunities in an open society.  There is no means to secure every venue from somebody who wants to spray a crowd with bullets or to blow them up.  Only some opportunities, such as those provided by commercial airline flights, can be diminished.

On the motive side - ISIS has shown itself capable of long-distance conversion of Americans into jihadis, whether or not these converts were previously from a Muslim culture. Unless we shut down all social media there is no way to prevent the local crazy from connecting up with the far-away crazies. The Trump solution of a temporary (or permanent) ban on Muslims entering the US of A can't work, even if somehow it were made palatable to a majority of Americans.

Regarding the means - Americans are quite determined to keep the type of weapons and ammunition that are designed to kill and maim people freely available. The principle involved is called the Second Amendment.

It seems we are stymied on all fronts. But we are not entirely devoid of hope. Governments are actually quite good at squashing ISIS-like organizations -- provided such organizations are not themselves state-sponsored (or aided by a wealthy diaspora - e.g., the Irish Republican Army or the Sri Lankan Tamil Tigers). If ISIS still exists, it is because eradicating it is not any government's top priority, and because it likely has state-level covert sponsors. This is what we can change.

Saturday, June 04, 2016

R1a-M780 map in India and AIT

If I had failed to notice before, I notice now that the following paper, if its conclusions stand, places the genetic record completely at odds with the linguistic Aryan Immigration/Migration Theory.
European Journal of Human Genetics (2015) 23, 124–131; doi:10.1038/ejhg.2014.50; published online 26 March 2014
The phylogenetic and geographic structure of Y-chromosome haplogroup R1a,
Peter Underhill et al.
A brief explanation of why I say so.
Figure 1 from the above publication:

Haplogroup (hg) R1a-M420 topology, shown within the context of hg R-M207. Common names of the SNPs discussed in this study are shown along the branches, with those genotyped presented in color and those for which phylogenetic placement was previously unknown in orange. Hg labels are assigned according to YCC nomenclature principles with an asterisk (*) denoting a paragroup.63 Dashed lines indicate lineages not observed in our sample. The marker Z280 was not used as it maps to duplicated ampliconic tracts.

Notice the positions of M417 -- Z93 -- M780.  Also note:
Of the 1693 European R1a-M417/Page7 samples, more than 96% were assigned to R1a-Z282 (Figure 2), whereas 98.4% of the 490 Central and South Asian R1a lineages belonged to hg R1a-Z93 (Figure 3), consistent with the previously proposed trend.
Let us take the position that M417 (the common ancestor of Indians and Europeans with R1a) originated outside India and its descendants in India are a result of immigration.  This would be (so far) in accord with the Aryan immigration theory.

Here is Figure 3-d from the paper
Caption: Spatial frequency distributions of Z93 affiliated haplogroups. Maps were generated as described in Figure 2.

The M780 map above might make sense if M780 arose well after the Aryans supposedly arrived in India, perhaps just prior to the urbanizing period of the Gangetic plain, well into the Iron Age, i.e., ~500 BC.  But the paper places this at least two thousand years earlier!
The corresponding diversification {of R1a} in the Middle East and South Asia is more obscure. However, early urbanization within the Indus Valley also occurred at this time57 and the geographic distribution of R1a-M780 (Figure 3d) may reflect this.
(The "mature Harappan phase" is 2600-1900 BC.  Wiki says Early Harappan has two phases - 3300 BC- 2800 BC, and 2800 BC - 2600 BC.)

The paper does say:
The four subhaplogroups of Z93 (branches 9-M582, 10-M560, 12-Z2125, and 17-M780, L657) constitute a multifurcation unresolved by 10Mb of sequencing; it is likely that no further resolution of this part of the tree will be possible with current technology. Similarly, the shared European branch has just three SNPs.
If R1a-M780 was present at the early urbanization within the Indus Valley, then the "genetic Indo-Aryans" had arrived in India earlier than 2600 BC, well before the first spoke-wheeled chariots (Andronovo, ~2000 BC). Traditional Aryan immigration theory has them arrive after 1900 BC (after the collapse of the Indus Valley cities) and before 1200 BC (start of the Iron Age in India), typically around 1400 BC, at or just after the time of the Sanskritic words (supposedly pre-Sanskritic) that appear in the Mitanni written records.

Wednesday, June 01, 2016

The Andronovo culture and AIT

Koenraad Elst goes over Elena Kuzmina's Origin of the Indo-Iranians and finds:
While this is undoubtedly an important book, and as far as I can judge, it is a classic of Andronovo archaeology, but it fails in its primary mission: to show that this culture was the staging-ground for an Aryan invasion of Iran and India. It only assumes that much, but doesn’t demonstrate it.