You can now download a dataset of 1.65 billion Reddit comments: Beware the Redditor AI by Mic Wright.
From the post:
Once our species’ greatest trove of knowledge was the Library of Alexandria.
Now we have Reddit, a roiling mass of human ingenuity/douchebaggery that has recently focused on tearing itself apart like Tommy Wiseau in the legendarily awful flick ‘The Room.’
But unlike the ancient library, the fruits of Reddit’s labors, good and ill, will not be destroyed in fire.
In fact, thanks to Jason Baumgartner of PushShift.io (aided by The Internet Archive), a dataset of 1.65 billion comments, stretching from October 2007 to May 2015, is now available to download.
The data – pulled using Reddit’s API – is made up of JSON objects, including the comment, score, author, subreddit, position in the comment tree and a range of other fields.
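The post doesn't show a sample record, but because the dump is line-delimited JSON you can stream it without ever holding the whole archive in memory. Here is a minimal sketch, assuming the commonly distributed bzip2-compressed monthly files and field names such as body, score, author, subreddit and parent_id; the file name and fields are assumptions, not taken from the post:

    # Stream one monthly dump file: bzip2-compressed, one JSON object per line.
    # File name and field names below are assumptions about the distribution format.
    import bz2
    import json

    with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as dump:
        for line in dump:
            comment = json.loads(line)
            # Fields described in the post: comment text, score, author,
            # subreddit, and position in the comment tree (parent id).
            print(comment.get("subreddit"),
                  comment.get("author"),
                  comment.get("score"),
                  comment.get("parent_id"))
            break  # inspect just the first record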
The uncompressed dataset weighs in at over 1TB, meaning it’ll be most useful for major research projects with enough resources to really wrangle it.
Technically, the archive is incomplete, but not significantly. After 14 months of work and many API calls, Baumgartner was faced with approximately 350,000 comments that were not available. In most cases that’s because the comment resides in a private subreddit or was simply removed.
…
If you don’t have a spare TB of space at the moment, you will also be interested in: http://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/, where the full dataset has been loaded into Google BigQuery and several example queries are already posted.
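If you go the BigQuery route, a rough sketch of querying the hosted copy from Python follows, assuming you have a Google Cloud project and the google-cloud-bigquery client installed; the table name is an assumption, so check the linked thread for the actual dataset and table names:

    # Query the BigQuery copy of the Reddit comments (table name is assumed).
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT subreddit, COUNT(*) AS num_comments
        FROM `fh-bigquery.reddit_comments.2015_05`
        GROUP BY subreddit
        ORDER BY num_comments DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.subreddit, row.num_comments)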
The full data set certainly makes an interesting alternative to the Turing test for AI. Can your AI generate, without assistance from or access to this data set, the responses that appear therein? Is that a fair test of “intelligence?”
If you want updated data, consult the Reddit API.
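For comments newer than May 2015, a small sketch of pulling fresh comments from the public Reddit API with the requests library is shown below; the listing endpoint and parameters are illustrative, and Reddit expects a descriptive User-Agent and rate-limited use:

    # Fetch a batch of recent comments via Reddit's public JSON listing.
    import requests

    resp = requests.get(
        "https://www.reddit.com/r/all/comments.json",
        params={"limit": 100},
        headers={"User-Agent": "dataset-refresh-example/0.1"},
    )
    resp.raise_for_status()
    for child in resp.json()["data"]["children"]:
        comment = child["data"]
        print(comment["subreddit"], comment["author"], comment["body"][:60])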