Since I'll likely be doing this statistics stuff the rest of my life, I started on a small personal project last week. I devised a means to provide concrete statistical evidence of global warming using data that can be obtained for free from the internet.
Last week, I discovered the vast amount of recorded weather data that can be accessed completely free from the National Weather Service online (www.weather.com). Seriously, you have free access to information on what the weather was like (temperature, pressure, precipitation, and conditions) in any major city in the world on any given day for the last 100 years. It's amazing!
One of the first archives I came to was historical data for New York City, which told me that the record high for April 17 was 96 degrees (goddamn!), set in 2002, and the record low for April 17 was 28, set in 1875. So I came up with the idea of comparing the average year in which the record low for April 17 was set to the average year in which the record high for April 17 was set. My hypothesis is that, on average, the record high was set more recently than the record low.
Since I expected the standard error to be quite large, I needed a pretty large sample to be able to detect anything. Using information from Wikipedia, I chose a sample of 80 of the 100 U.S.* cities with the largest metropolitan populations, based on the additional criterion that each city in the sample had to be at least 100 miles distant from every other city in the sample.
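For anyone who wants to apply a spacing rule like this programmatically, here is a minimal sketch. It assumes a greedy pass through the cities in descending population order, keeping a city only if it is at least 100 miles (great-circle distance) from every city already kept. That is just one way to satisfy the criterion and not necessarily how the sample above was picked. The function names and the four-city toy list with approximate coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def select_spaced_cities(cities, min_miles=100.0):
    """Greedy selection (an assumption, not the post's documented method).

    `cities` is a list of (name, lat, lon) tuples already sorted by
    population, largest first. A city is kept only if it is at least
    `min_miles` from every city kept so far.
    """
    chosen = []
    for name, lat, lon in cities:
        far_enough = all(
            haversine_miles(lat, lon, clat, clon) >= min_miles
            for _, clat, clon in chosen
        )
        if far_enough:
            chosen.append((name, lat, lon))
    return chosen

# Toy input with approximate coordinates, for illustration only.
toy_cities = [
    ("New York", 40.71, -74.01),
    ("Los Angeles", 34.05, -118.24),
    ("Chicago", 41.88, -87.63),
    ("Philadelphia", 39.95, -75.17),  # under 100 miles from New York
]
picked = [name for name, _, _ in select_spaced_cities(toy_cities)]
print(picked)
```

Note that a greedy pass favors the larger city whenever two are too close together, so the result depends on the ordering of the input.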
So then I spent a couple of hours taking data, and I came up with the result that on average the record high was set about 64 years ago, while the record low was set about 37 years ago, with the difference being about 27 years and its standard error about 5.7 years. That gave me a z-statistic of about 4.7, with a p-value of 4*10^-6.
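The arithmetic behind the z-statistic can be sketched in a few lines. This assumes a plain normal approximation and uses the rounded figures quoted above (difference of about 27 years, standard error of about 5.7), so the p-value it prints will not necessarily match the one from the unrounded data, and it also depends on whether the test is one-sided or two-sided. The function name is just for illustration.

```python
import math

def z_test(diff, se):
    """Return (z, one-sided p, two-sided p) under a standard normal approximation."""
    z = diff / se
    # Upper-tail probability of a standard normal: P(Z > z) = erfc(z / sqrt(2)) / 2
    p_one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, p_one_sided, 2 * p_one_sided

# Rounded values from the text; the real calculation used unrounded data.
z, p_one, p_two = z_test(27.0, 5.7)
print(f"z = {z:.2f}, one-sided p = {p_one:.2e}, two-sided p = {p_two:.2e}")
```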
I thought about it some more, and I realized that the result may not be as significant if I accounted for the spatial correlation between the cities. I never took the course in spatial modeling, so now I'm going a different route: I'm going to collect data from the 80 cities over a full year, so instead I'll have 365 paired observations from each city. If I look at each city individually, then there will be no spatial correlation but instead temporal correlation, which I know how to deal with.
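One common way to handle temporal correlation in a single city's 365 daily differences is sketched below. It assumes the differences (year the record high was set minus year the record low was set, one per calendar day) behave roughly like an AR(1) series, estimates the lag-1 autocorrelation, and shrinks the sample size to an effective n of n(1 - r)/(1 + r) before computing the standard error of the mean. This is only one possible approach, not necessarily the one I'll end up using, and the data in the example is simulated, not real weather data.

```python
import math
import random

def ar1_adjusted_summary(diffs):
    """Mean of a series with an AR(1)-style effective-sample-size correction."""
    n = len(diffs)
    mean = sum(diffs) / n
    dev = [d - mean for d in diffs]
    ss = sum(x * x for x in dev)
    var = ss / (n - 1)
    # Lag-1 sample autocorrelation.
    r1 = sum(dev[i] * dev[i + 1] for i in range(n - 1)) / ss
    # Clip at zero so a negative estimate never inflates the effective n
    # (a deliberately conservative choice).
    r1_used = max(r1, 0.0)
    n_eff = n * (1 - r1_used) / (1 + r1_used)
    se_naive = math.sqrt(var / n)
    se_adjusted = math.sqrt(var / n_eff)
    return {
        "mean": mean,
        "r1": r1,
        "n_eff": n_eff,
        "se_naive": se_naive,
        "se_adjusted": se_adjusted,
        "z_adjusted": mean / se_adjusted,
    }

# Simulated AR(1) differences for one hypothetical city (made-up numbers).
random.seed(0)
phi, level = 0.6, 20.0
x = 0.0
fake_diffs = []
for _ in range(365):
    x = phi * x + random.gauss(0, 10)
    fake_diffs.append(level + x)

result = ar1_adjusted_summary(fake_diffs)
print(result)
```

With positively correlated days, the adjusted standard error comes out larger than the naive one, which is the whole point: neighboring days tend to share the same record-setting heat waves and cold snaps, so 365 days carry less information than 365 independent observations would.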
If anyone has any advice or sees any flaws in my reasoning, please let me know.
*I thought about trying cities around the world, but decided data from just the U.S. would have smaller variance, making it easier to detect a difference. I would've needed to increase my sample size to 200 or so to make up for it.
Yesterday, someone asked me where College Station, TX was, and completely without thinking I replied "somewhere down the long, dark road that leads to Hell."