Thursday, January 07, 2010

Earthquake puzzler

Professor Eli Rabett has a New Year's Puzzler.

Count me as puzzled. Since all the details and links are on Rabett's page, I'll pose it briefly: The 118,425 earthquake events of magnitude 4 or greater in the period 1999-2009 (average of 32.4 events a day) disproportionately occur on Thursdays and Sundays.

I followed the links, extracted the data (which looks like this):
NEIC: Earthquake Search Results


Year,Month,Day,Time(hhmmss.mm)UTC,Latitude,Longitude,Magnitude,Depth
1999,01,01,010239.86,-35.59, -71.75,4.1, 96
1999,01,01,013507.06, 21.68, 143.12,4.3,310
1999,01,01,020326.52,-23.44, 179.99,4.3,550
1999,01,01,030230.57,-10.76, 117.82,4.2, 33


and obtained this chart after suitable crunching:

chart0
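
The "suitable crunching" behind a chart like this can be sketched roughly as follows. This is a minimal sketch, not the exact script used for the post: it assumes the comma-separated layout shown in the extract above, treats the dates as UTC calendar days, and the function name count_by_weekday is made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def count_by_weekday(csv_text):
    """Tally events per weekday from rows shaped like the extract above."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, the header row, and any other non-data lines.
        if len(row) < 3 or not row[0].strip().isdigit():
            continue
        year, month, day = (int(field) for field in row[:3])
        counts[WEEKDAYS[date(year, month, day).weekday()]] += 1
    return counts

# The four sample rows from the extract; all fall on the same UTC day.
sample = """Year,Month,Day,Time(hhmmss.mm)UTC,Latitude,Longitude,Magnitude,Depth
1999,01,01,010239.86,-35.59, -71.75,4.1, 96
1999,01,01,013507.06, 21.68, 143.12,4.3,310
1999,01,01,020326.52,-23.44, 179.99,4.3,550
1999,01,01,030230.57,-10.76, 117.82,4.2, 33
"""
print(count_by_weekday(sample))
```

January 1, 1999 was a Friday, so all four sample events land in the Friday bin.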

Naively, the deviations from the average should be of the order of 50 or 100, not 800. (The distribution is expected to be uniform. In polls with a sample size of 2000, the sampling error is said to be 3%; this sample is of the order of a hundred times larger, so the sampling error shrinks by a factor of 1/squareroot(100), to about 0.3%. 0.3% of 17000 is about 50.)
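
The same back-of-the-envelope estimate can be done a little more directly. The sketch below assumes every event picks its weekday independently with probability 1/7 (a binomial model, which is an assumption added here and exactly the thing that may not hold for earthquakes) and uses the total event count quoted earlier.

```python
import math

n_events = 118_425   # total events quoted in the post
p = 1 / 7            # chance of any one weekday if the distribution is uniform

expected = n_events * p                      # expected events per weekday
sigma = math.sqrt(n_events * p * (1 - p))    # binomial standard deviation

print(f"expected per weekday: {expected:.0f}")
print(f"one standard deviation: {sigma:.0f}")
```

Under that independence assumption, one standard deviation comes out at roughly 120 events on an expected count of roughly 17,000 per weekday, the same order as the 50 to 100 guessed above and far below 800.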

Now, while the average number of events per day is 32, there are days with more than 100 events. For instance, Sunday, December 26, 2004 had 306 events! The distribution of events looks like this:

chart1

i.e., the plot shows the number of days in 1999-2009 on which there were 1, 2, 3, ..., 306 events.

Note: Earthquakes are correlated in time. For instance, a large earthquake will typically have a lot of aftershocks soon after.

My initial guess was that the long tail of this distribution is what was causing the discrepancy. That is, even though 10 years would seem to be a suitably long time over which to average, we have only a few days in that period with 200 or 300 events, and those few days happen to cause the discrepancies. To put it another way, if there were a huge once-in-ten-years event and it just so happened to occur on a Sunday during 1999-2009, then Sunday would have a lot more earthquakes than the other days of the week. Only over hundreds of years would the average be smooth.

i.e., the existence of large rare events disrupts the naive expectation.

Another way of looking at it is that the distributions of earthquakes on Mondays, Tuesdays, etc., for 1999-2009 should look pretty much the same, except at the high end where days have very many events. Unfortunately, that is not clear on the chart drawn accordingly:

chart2

Then I thought, suppose I drop days that have more than a certain number of events - say, twice the average or around 64. Would the remaining days be closer to the uniform distribution? I tried, and it doesn't work.
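
The trimming experiment can be sketched like this. It is only an illustration under stated assumptions: the helper name weekday_totals and the toy numbers are made up, and the real input would be the per-day counts built from the USGS extract.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekday_totals(daily_counts, start, cap=None):
    """Sum events per weekday (0=Mon .. 6=Sun), optionally skipping busy days."""
    totals = defaultdict(int)
    for offset, n in enumerate(daily_counts):
        if cap is not None and n > cap:
            continue  # drop days above the cutoff entirely
        totals[(start + timedelta(days=offset)).weekday()] += n
    return dict(totals)

# Made-up daily counts for illustration only; not the USGS data.
toy = [30, 35, 306, 28, 33, 31, 29, 34]
start = date(1999, 1, 1)  # a Friday

print(weekday_totals(toy, start))          # every day included
print(weekday_totals(toy, start, cap=64))  # days with more than 64 events dropped
```

One caveat with this approach: dropping days also changes how many days each weekday contributes, so comparing the mean per remaining day may be fairer than comparing raw totals.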

Another way of looking at the data is to sort the Mondays from minimum to maximum number of earthquakes and draw a cumulative sum. E.g., the number of quakes that occurred on different Mondays might be, when sorted,
8,8,9,11,12,13,....,
then I plot the curve passing through the points
(1,8), (2, 8+8), (3, 8+8+9), (4,8+8+9+11), etc.

If I do that for each day of the week and superpose, I expect the lines to essentially lie on top of each other until the days with high number of events cause the lines to diverge.
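
The sorted cumulative sum described above takes only a few lines. This is a minimal sketch using the example numbers from the text; the function name sorted_cumulative is made up for illustration.

```python
from itertools import accumulate

def sorted_cumulative(counts):
    """Sort one weekday's daily counts ascending and return the running totals."""
    return list(accumulate(sorted(counts)))

mondays = [12, 8, 11, 9, 13, 8]   # the example values from the text, unsorted
curve = sorted_cumulative(mondays)

# Pair each running total with its rank to get the points to plot.
points = list(enumerate(curve, start=1))
print(points)
```

Running the same function on each weekday's list and plotting the seven curves on one set of axes gives the superposed picture.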

Instead, I get this:
chart3

The upshot is that I have no real clue as to why the distribution of earthquakes over days of the week for 1999-2009 is not uniform. (Perhaps the correct thing to do is to restrict to, say, magnitude 4 to magnitude 6 earthquakes.)

PS: Accumulated in time order, the curves look more like what I expected, though notice that Wednesday gets its deficit early!
chart4

PPS: I have a time series (the number of earthquakes on each day from Jan 1, 1999 to Jan 1, 2009). Convert it into a zero-mean, unit-variance series. Here is a crude computation of the autocorrelation function. As you can see, it shows no sign of going to zero, and that is why this distribution has funny properties.

chart5
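
For reference, one crude way to compute such an autocorrelation is sketched below. It is not necessarily the computation behind the chart: it assumes the simple biased estimator (dividing by the full series length at every lag), and the function names are made up for illustration.

```python
import math

def standardize(series):
    """Rescale a series to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series]  # fails on a constant series

def autocorrelation(series, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    z = standardize(series)
    n = len(z)
    return [
        sum(z[t] * z[t + lag] for t in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]

# Made-up alternating series for illustration only; not the earthquake data.
toy = [1, -1, 1, -1, 1, -1, 1, -1]
acf = autocorrelation(toy, max_lag=2)
print(acf)
```

By construction the lag-0 value is 1. For independent daily counts the values at later lags would hover near zero, which is the behaviour the chart does not show.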

PPPS: in response to the comment by Arthur Smith:

1999-2000
2001-2002
2003-2004
2005-2006
2007-2008