Google BigQuery Public Datasets
An amazing set of public datasets, from the post:
- USA Names: A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
- NYC TLC Trips: Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
- Hacker News: A dataset that contains all stories and comments from Hacker News since its launch in 2006.
- USA Disease Surveillance: A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
- GDELT Book Corpus: A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
- NOAA GSOD: This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.
I can readily see myself losing serious time in the GDELT Book Corpus!
Enjoy!