Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 24, 2019

What The Hell Happened (2016) – Data Questions

Filed under: Data,Politics,Survey — Patrick Durusau @ 9:12 pm

What The Hell Happened (WTHH)

From the homepage:

Every progressive remembers waking up on November 9th, 2016. The question on everyone’s mind was… “What the hell happened?”

Pundits were quick to blame “identity politics” for Clinton’s loss. Recent research suggests this framing may have led voters to be less supportive of women candidates and candidates of color.

That’s why we’re introducing the What The Hell Happened Project, where we will work with academics, practitioners and advocates to explain the 2018 election from beginning to end.

Let’s cut to the data:

This survey is based on 3,215 interviews of registered voters conducted by YouGov. The sample was weighted according to age, sex, race, education, urban/rural status, partisanship, marital status, and Census region to be nationally representative of 2018 voters according to Catalist, and to a post-election correction consisting of the national two-party vote share. Respondents were selected from YouGov and other opt-in panels to be representative of registered voters. The weights range from 0.28 to 4.6, with a mean of 1 and a standard deviation of 0.53.

The survey dataset includes measures of political participation such as activism, group consciousness, and vote choice. It also includes measures of interest including items from a hostile sexism battery, racial resentment, fear of demographic change, fear of cultural change, and a variety of policy positions. It includes a rich demographic battery of items like age, race, ethnicity, sex, party identification, income, education, and US state. Please see the attached codebook for a full description and coding of the variables in this survey, as well as the toplines for breakdowns of some of the key variables.

The dataset also includes recodes to scale the hostile sexism items to a 0-1 scale of hostile sexism, the racial animus items to a 0-1 scale of racial animus, and the demographic change items to a 0-1 scale of fear of demographic change. See the codebook for more details. We created a two-way vote choice variable to capture Democrat/Republican voting by imputing the vote choice of undecided respondents based on a Catalist partisanship model for those respondents, who comprised about 5% of the sample.

To explore the data we have embedded a Crunchbox, which you can use to easily make crosstabs and charts of the data. Here, you can click around many of the political and demographic items and look around for interesting trends to explore.

If you want a winning candidate in 2020, repeat every morning: Focus on 2020, Focus on 2020.

Your candidate is not running in 2016 or even 2018.

And, your candidate needs better voter data than WTHH offers here.

First, how was the data gathered?

Respondents were selected from YouGov and other opt-in panels to be representative of registered voters.

Yikes! That’s not how professional pollsters conduct surveys. Opt-in panels may be fine for learning analysis tools, but not for serious political forecasting.

Second, what manipulation, if any, of the data, has been performed?

The sample was weighted according to age, sex, race, education, urban/rural status, partisanship, marital status, and Census region to be nationally representative of 2018 voters according to Catalist, and to a post-election correction consisting of the national two-party vote share.

Oh. So we don’t know what biases or faults the weighting process may have introduced to the data. Great.

How were the questions constructed and tested?

Don’t know. (Without this step we don’t know what the question may or may not be measuring.)

How many questions were asked? (56)

Fifty-six questions. Really?

In the 1960 presidential campaign, John F. Kennedy’s staff had a matrix of 480 voter types and 52 issue clusters.

Do you see such a matrix coming out of 56 questions? Neither do I.

The WTHH data is interesting in an amateurish sort of way, but winning in 2020 requires the latest data gathering and modeling techniques. Not to mention getting voters to the polling places (modeling a solution for registered but non-voting voters would be a real plus). Your Secretary of State should have prior voting behavior records.
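If you do pull the WTHH file down anyway, at least apply the survey weights to anything you compute. A minimal sketch in Python, assuming a CSV export; the file and column names here are hypothetical, so check the codebook for the real ones:

import pandas as pd

# Hypothetical file and column names -- consult the WTHH codebook for the real ones.
df = pd.read_csv("wthh_survey.csv")

def weighted_share(frame, column, weight="weight"):
    """Weighted share of each response category in `column`."""
    totals = frame.groupby(column)[weight].sum()
    return totals / totals.sum()

# Unweighted vs. weighted breakdown of the two-way vote choice variable.
print(df["vote_choice"].value_counts(normalize=True))
print(weighted_share(df, "vote_choice"))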

Weather Data – 100K stations – Hourly

Filed under: Data,Weather Data — Patrick Durusau @ 8:09 pm

Weather Directory Contents

From the webpage:

This directory contains hourly weather dumps. The files are compressed using Zstandard compression (.zst). Each file is a collection of JSON objects (ndjson) and can easily be parsed by any utility that has a JSON decode library (including Python, Java, Perl, PHP, etc.) Please contact me if you have any questions about the file format or the fields within the JSON objects. The field “retrieved_utc” is a field that I added that gives the time of when the data was retrieved. The format of the files is WEATHER_YYYY-MM-DD-HH (UTC time format).

Please consider making a donation (https://pushshift.io/donations) if you download a lot of data. This helps offset the costs of my time collecting data and providing bandwidth to make these files available to the public. Thank you!

If you have any questions about the data formats of the files or any other questions, please feel free to contact me at jason@pushshift.io

A project of pushshift.io, the homepage of which is a collection of statistics on Reddit posts.

Looking at the compressed files for today (24 January 2019), the earliest file is dated Jan 24 2019 AM and tips the scales at 35,067,516 bytes, with the hourly files running between 65,989,336 and 72,272,568 bytes. Remember that these files are compressed, so you either need a lot of space or need to work with them compressed.

The perfect data set if your boss is a weather freak. Be sure to mention the donation link to them.
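If you want to peek inside one of the hourly dumps without decompressing it to disk first, here is a minimal sketch using the zstandard Python package (the file name is an example; the max_window_size guard is there because some large .zst archives need a bigger decompression window):

import io
import json
import zstandard as zstd  # pip install zstandard

path = "WEATHER_2019-01-24-00.zst"  # example name, following the WEATHER_YYYY-MM-DD-HH pattern

with open(path, "rb") as fh:
    dctx = zstd.ZstdDecompressor(max_window_size=2**31)
    with dctx.stream_reader(fh) as reader:
        text = io.TextIOWrapper(reader, encoding="utf-8")
        for line in text:
            obj = json.loads(line)           # one JSON object per line (ndjson)
            print(obj.get("retrieved_utc"))  # field added by the archive maintainer
            break                            # drop the break to stream the whole file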

Enjoy!

October 8, 2018

Hurricane Florence Twitter Dataset – Better Twitter Interface?

Filed under: Data,Tweets,Twitter — Patrick Durusau @ 3:56 pm

Hurricane Florence Twitter Dataset by Mark Edward Phillips.

From the webpage:

This dataset contains Twitter JSON data for Tweets related to Hurricane Florence and the subsequent flooding along the Carolina coastal region. This dataset was created using the twarc (https://github.com/edsu/twarc) package that makes use of Twitter’s search API. A total of 4,971,575 Tweets and 347,205 media files make up the combined dataset.

No hyperlink in the post but see: twarc.

Have you considered using twarc to create a custom Twitter interface for yourself? At present just a thought but once you have the JSON, your ability to manipulate your Twitter feed is limited only by your imagination.

Once a base archive is constructed, create a cron job that updates the base. Not “real time” like Twitter, but then who makes decisions of any consequence in “real time?” You can, but it’s not a good idea.
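For example, once you have the line-oriented tweet JSON that twarc writes (one tweet per line), a personal “interface” can start as a simple filter script. A minimal sketch; the "full_text" and "user" fields are assumptions about the JSON layout (older archives may use "text"), so adjust to what you actually have:

import json
import sys

def matching_tweets(path, keyword):
    """Yield (screen_name, text) for tweets whose text contains keyword."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tweet = json.loads(line)
            text = tweet.get("full_text") or tweet.get("text", "")
            if keyword.lower() in text.lower():
                yield tweet.get("user", {}).get("screen_name", "?"), text

if __name__ == "__main__":
    # usage: python filter_tweets.py florence.jsonl flooding
    for name, text in matching_tweets(sys.argv[1], sys.argv[2]):
        print(f"@{name}: {text}")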

While you are learning twarc, consider what other datasets you could create.

November 3, 2017

Academic Torrents Update

Filed under: Data,Humanities,Open Access,Open Data,Social Sciences — Patrick Durusau @ 7:06 am

When I last mentioned Academic Torrents, in early 2014, it had 1.67TB of research data.

I dropped by Academic Torrents this week to find it now has 25.53TB of research data!

Some arbitrary highlights:

Richard Feynman’s Lectures on Physics (The Messenger Lectures)

A collection of sport activity datasets for data analysis and data mining 2017a

[Coursera] Machine Learning (Stanford University) (ml)

UC Berkeley Computer Science Courses (Full Collection)

[Coursera] Mining Massive Datasets (Stanford University) (mmds)

Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Original Dataset)

Your arbitrary highlights are probably different than mine so visit Academic Torrents to see what data captures your eye.

Enjoy!

June 17, 2017

The Quartz Directory of Essential Data (Directory of Directories Is More Accurate)

Filed under: Data,Journalism,News,Reporting — Patrick Durusau @ 1:51 pm

The Quartz Directory of Essential Data

From the webpage:

A curated list of useful datasets published by important sources. Please remember that “important” does not mean “correct.” You should vet these data as you would with any human source.

Switch to the “Data” tab at the bottom of this spreadsheet and use Find (⌘ + F) to search for datasets on a particular topic.

Note: Just because data is useful, doesn’t mean it’s easy to use. The point of this directory is to help you find data. If you need help accessing or interpreting one of these datasets, please reach out to your friendly Quartz data editor, Chris.

Slack: @chris
Email: c@qz.com

A directory of 77 data directories. The breadth of the organizing topics (health, trade, government, for example) creates a need for repeated data mining by every new user.

A low/no-friction method for creating more specific and re-usable directories has remained elusive.

June 13, 2017

Power Outage Data – 15 Years Worth

Filed under: Data,Security,Terrorism — Patrick Durusau @ 2:08 pm

Data: Explore 15 Years Of Power Outages by Jordan Wirfs-Brock.

From the post:

This database details 15 years of power outages across the United States, compiled and standardized from annual data available from the Department of Energy.

For an explanation of what it means, how it came about, and how we got here, listen to this conversation between Inside Energy Reporter Dan Boyce and Data Journalist Jordan Wirfs-Brock:

You can also view the data as a Google Spreadsheet (where you can download it as a CSV). This version of the database also includes information about the amount of time it took power to be restored, the demand loss in megawatts, the NERC region, (NERC refers to the North American Electricity Reliability Corporation, formed to ensure the reliability of the grid) and a list of standardized tags.

The data set isn’t useful for tactical information; the submissions are too general to replicate the events leading up to an outage.

On the other hand, identifiable outage events, dates, locations, etc., do make recovery of tactical data from grid literature a manageable search problem.
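If you download the CSV, a quick pass with pandas gives a feel for where and when the outages cluster. A minimal sketch; the column names are assumptions, so check the spreadsheet header before running it:

import pandas as pd

# Hypothetical file and column names -- check the header row of the downloaded CSV.
outages = pd.read_csv("power_outages.csv", parse_dates=["Event Date"])

# Outages per year and per NERC region.
print(outages["Event Date"].dt.year.value_counts().sort_index())
print(outages["NERC Region"].value_counts())

# Rough look at demand loss where it was reported.
loss = pd.to_numeric(outages["Demand Loss (MW)"], errors="coerce")
print(loss.describe())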

Enjoy!

February 16, 2017

behind the scenes: cleaning dirty data

Filed under: Data — Patrick Durusau @ 5:11 pm

behind the scenes: cleaning dirty data

From the post:

Dirty Data. It’s everywhere! And that’s expected and ok and even frankly good imho — it happens when people are doing complicated things, in the real world, with lots of edge cases, and moving fast. Perfect is the enemy of good.

Alas it’s definitely behind-the-scenes work to find and fix dirty data problems, which means none of us learn from each other in the process. So — here’s a quick post about a dirty data issue we recently dealt with. Hopefully it’ll help you feel comradery, and maybe help some people using the BASE data.

We traced some oaDOI bugs to dirty records from PMC in the BASE open access aggregation database.

BASE = Bielefeld Academic Search Engine.

oaDOI = an identifier similar to a DOI, but one that points to an open access version of the article.

PMC = PubMed Central.

Are you cleaning data or contributing more dirty data?

December 28, 2016

Washington Taxi Data 2015 – 2016 (Caution: 2.2 GB File Size)

Filed under: Data,Government — Patrick Durusau @ 9:22 pm

I was rummaging around on the Opendata.dc.gov website today when I encountered Taxicab Trips (2.2 GB), described as:

DC Taxicab trip data from April 2015 to August 2016. Pick up and drop off locations are assigned to block locations with times rounded to the nearest hour. Detailed metadata included in download. The Department of For-Hire Vehicles (DFHV) provided OCTO with a taxicab trip text file representing trips from May 2015 to August 2016. OCTO processed the data to assign block locations to pick up and drop off locations.

For your convenience, I extracted README_DC_Taxicab_trip.txt and it gives the data structure of the files (“|” separated) as follows:

TABLE STRUCTURE:

COLUMN_NAME	DATA_TYPE	DEFINITION	   
OBJECTID	NUMBER(9)	Table Unique Identifier	   
TRIPTYPE	VARCHAR2(255)	Type of Taxi Trip	   
PROVIDER	VARCHAR2(255)	Taxi Company that Provided trip	   
METERFARE	VARCHAR2(255)	Meter Fare	   
TIP	VARCHAR2(255)	Tip amount	   
SURCHARGE	VARCHAR2(255)	Surcharge fee	   
EXTRAS	VARCHAR2(255)	Extra fees	   
TOLLS	VARCHAR2(255)	Toll amount	   
TOTALAMOUNT	VARCHAR2(255)	Total amount from Meter fare, tip, 
                                surcharge, extras, and tolls. 	   
PAYMENTTYPE	VARCHAR2(255)	Payment type	   
PAYMENTCARDPROVIDER	VARCHAR2(255)	Payment card provider	   
PICKUPCITY	VARCHAR2(255)	Pick up location city	   
PICKUPSTATE	VARCHAR2(255)	Pick up location state	   
PICKUPZIP	VARCHAR2(255)	Pick up location zip	   
DROPOFFCITY	VARCHAR2(255)	Drop off location city	   
DROPOFFSTATE	VARCHAR2(255)	Drop off location state	   
DROPOFFZIP	VARCHAR2(255)	Drop off location zip	   
TRIPMILEAGE	VARCHAR2(255)	Trip milaege	   
TRIPTIME	VARCHAR2(255)	Trip time	   
PICKUP_BLOCK_LATITUDE	NUMBER	Pick up location latitude	   
PICKUP_BLOCK_LONGITUDE	NUMBER	Pick up location longitude	   
PICKUP_BLOCKNAME	VARCHAR2(255)	Pick up location street block name	   
DROPOFF_BLOCK_LATITUDE	NUMBER	Drop off location latitude	   
DROPOFF_BLOCK_LONGITUDE	NUMBER	Drop off location longitude	   
DROPOFF_BLOCKNAME	VARCHAR2(255)	Drop off location street block name	   
AIRPORT	CHAR(1)	Pick up or drop off location is a local airport (Y/N)	   
PICKUPDATETIME_TR	DATE	Pick up location city	   
DROPOFFDATETIME_TR	DATE	Drop off location city	 

The taxi data files are zipped by the month:

Archive:  taxitrip2015_2016.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
107968907  2016-11-29 14:27   taxi_201511.zip
117252084  2016-11-29 14:20   taxi_201512.zip
 99545739  2016-11-30 11:15   taxi_201601.zip
129755310  2016-11-30 11:24   taxi_201602.zip
152793046  2016-11-30 11:31   taxi_201603.zip
148835360  2016-11-30 11:20   taxi_201604.zip
143734132  2016-11-30 11:19   taxi_201605.zip
139396173  2016-11-30 11:13   taxi_201606.zip
121112859  2016-11-30 11:08   taxi_201607.zip
104015666  2016-11-30 12:04   taxi_201608.zip
154623796  2016-11-30 11:03   taxi_201505.zip
161666797  2016-11-29 14:15   taxi_201506.zip
153483725  2016-11-29 14:32   taxi_201507.zip
121135328  2016-11-29 14:06   taxi_201508.zip
142098999  2016-11-30 10:55   taxi_201509.zip
160977058  2016-11-30 10:35   taxi_201510.zip
     3694  2016-12-09 16:43   README_DC_Taxicab_trip.txt

I extracted taxi_201601.zip, decompressed it and created a 10,000 line sample, named taxi-201601-10k.ods.

I was hopeful that taxi trip times might allow inference of traffic conditions, but with rare exceptions columns AA and AB record the same time.
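To check that for yourself, count how often the two rounded timestamps actually differ. A minimal sketch against one of the extracted, "|"-delimited files (the file name is an example; the column names come from the README above):

import csv

same = diff = 0
with open("taxi_201601.txt", newline="", encoding="utf-8") as fh:  # example file name
    for row in csv.DictReader(fh, delimiter="|"):
        if row["PICKUPDATETIME_TR"] == row["DROPOFFDATETIME_TR"]:
            same += 1
        else:
            diff += 1

print(f"same rounded hour: {same}, different: {diff}")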

Rats!

I’m sure there are other patterns you can extract from the data but inferring traffic conditions doesn’t appear to be one of those.

Or am I missing something obvious?

More posts about Opendata.dc.gov coming as I look for blockade information.

PS: I didn’t explore any month other than January of 2016, but it’s late and I will tend to that tomorrow.

November 13, 2016

Outbrain Challenges the Research Community with Massive Data Set

Filed under: Contest,Data,Data Mining,Prediction — Patrick Durusau @ 8:15 pm

Outbrain Challenges the Research Community with Massive Data Set by Roy Sasson.

From the post:

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call out to the research community to analyze our data and model user reading patterns, in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging to some participants—though Outbrain does it on a daily basis.

The rules caution:


The data is anonymized. Please remember that participants are prohibited from de-anonymizing or reverse engineering data or combining the data with other publicly available information.

That would be a more interesting question than the ones presented for the contest.

After the 2016 U.S. presidential election we know that racists, sexists, nationalists, etc., are driven by single factors, so assuming you have good tagging, what’s the problem?

Yes?

Or is human behavior not only complex but variable?

Good luck!

August 17, 2016

Text [R, Scraping, Text]

Filed under: Data,R,Web Scrapers — Patrick Durusau @ 8:31 pm

Text by Amelia McNamara.

Covers “scraping, text, and timelines.”

Using R, focuses on scraping, works through some of “…Scott, Karthik, and Garrett’s useR tutorial.”

In case you don’t know the useR tutorial:

Also known as (AKA) Extracting data from the web APIs and beyond:

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s html pages. In this 3 hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (Founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

Covers other resources and materials.

Enjoy!

May 9, 2016

Dark Matter: Driven by Data

Filed under: Dark Data,Data,LangSec — Patrick Durusau @ 8:47 pm

A delightful keynote by Dan Geer, presented at the 2015 LangSec Workshop at the IEEE Symposium on Security & Privacy Workshops, May 21, 2015, San Jose, CA.

Prepared text for the presentation.

A quote to interest you in watching the video:

Workshop organizer Meredith Patterson gave me a quotation from Taylor Hornby that I hadn’t seen. In it, Hornby succinctly states the kind of confusion we are in and which LANGSEC is all about:

The illusion that your program is manipulating its data is powerful. But it is an illusion: The data is controlling your program.

It almost appears that we are building weird machines on purpose, almost the weirder the better. Take big data and deep learning. Where data science spreads, a massive increase in tailorability to conditions follows. But even if Moore’s Law remains forever valid, there will never be enough computing hence data driven algorithms must favor efficiency above all else, yet the more efficient the algorithm, the less interrogatable it is,[MO] that is to say that the more optimized the algorithm is, the harder it is to know what the algorithm is really doing.[SFI]

And there is a feedback loop here: The more desirable some particular automation is judged to be, the more data it is given. The more data it is given, the more its data utilization efficiency matters. The more its data utilization efficiency matters, the more its algorithms will evolve to opaque operation. Above some threshold of dependence on such an algorithm in practice, there can be no going back. As such, if science wishes to be useful, preserving algorithm interrogatability despite efficiency-seeking, self-driven evolution is the research grade problem now on the table. If science does not pick this up, then Lessig’s characterization of code as law[LL] is fulfilled. But if code is law, what is a weird machine?

If you can’t interrogate an algorithm, could you interrogate a topic map that is an “inefficient” implementation of the algorithm?

Or put differently, could there be two representations of the same algorithm, one that is “efficient,” and one that can be “interrogated?”

Read the paper version but be aware the video has a very rich Q&A session that follows the presentation.

February 15, 2016

People NOT Technology Produce Data ROI

Filed under: BigData,Data,Data Science,Data Silos — Patrick Durusau @ 4:00 pm

Too many tools… not enough carpenters! by Nicholas Hartman.

From the webpage:

Don’t let your enterprise make the expensive mistake of thinking that buying tons of proprietary tools will solve your data analytics challenges.

tl;dr = The enterprise needs to invest in core data science skills, not proprietary tools.

Most of the world’s largest corporations are flush with data, but frequently still struggle to achieve the vast performance increases promised by the hype around so called “big data.” It’s not that the excitement around the potential of harvesting all that data was unwarranted, but rather these companies are finding that translating data into information and ultimately tangible value can be hard… really hard.

In your typical new tech-based startup the entire computing ecosystem was likely built from day one around the need to generate, store, analyze and create value from data. That ecosystem was also likely backed from day one with a team of qualified data scientists. Such ecosystems spawned a wave of new data science technologies that have since been productized into tools for sale. Backed by mind-blowingly large sums of VC cash many of these tools have set their eyes on the large enterprise market. A nice landscape of such tools was recently prepared by Matt Turck of FirstMark Capital (host of Data Driven NYC, one of the best data science meetups around).

Consumers stopped paying money for software a long time ago (they now mostly let the advertisers pay for the product). If you want to make serious money in pure software these days you have to sell to the enterprise. Large corporations still spend billions and billions every year on software and data science is one of the hottest areas in tech right now, so selling software for crunching data should be a no-brainer! Not so fast.

The problem is, the enterprise data environment is often nothing like that found within your typical 3-year-old startup. Data can be strewn across hundreds or thousands of systems that don’t talk to each other. Devices like mainframes are still common. Vast quantities of data are generated and stored within these companies, but until recently nobody ever really envisioned ever accessing — let alone analyzing — these archived records. Often, it’s not initially even clear how all the data generated by these systems directly relates to a large blue chip’s core business operations. It does, but a lack of in-house data scientists means that nobody is entirely even sure what data is really there or how it can be leveraged.

I would delete “proprietary” from the above because non-proprietary tools create data problems just as easily.

Thus I would re-write the second quote as:

Tools won’t replace skilled talent, and skilled talent doesn’t typically need many particular tools.

I substituted “particular” tools to avoid religious questions about particular non-proprietary tools.

Understanding data, recognizing where data integration is profitable and where it is a dead loss, creating tests to measure potential ROI, etc., are all tasks of a human data analyst and not any proprietary or non-proprietary tool.

That all enterprise data has some intrinsic value that can be extracted if it were only accessible is an article of religious faith, not business ROI.

If you want business ROI from data, start with human analysts and not the latest buzzwords in technological tools.

January 14, 2016

Yahoo News Feed dataset, version 1.0 (1.5TB) – Sorry, No Open Data At Yahoo!

Filed under: Data,Dataset,Yahoo! — Patrick Durusau @ 9:10 pm

R10 – Yahoo News Feed dataset, version 1.0 (1.5TB)

From the webpage:

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interaction of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information of the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.

The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.

The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.

A great data set but one you aren’t going to see unless you have a university email account.

I thought that when it took my regular Yahoo! login and I accepted the license agreement, I was in. Not a chance!

No open data at Yahoo!

Why Yahoo! would have such a restriction, particularly in light of the progress towards open data, is a complete mystery.

To be honest, even if I heard Yahoo!’s “reasons,” I doubt I would find them convincing.

If you have a university email address, good for you, download and use the data.

If you don’t have a university email address, can you ping me with the email of a decision maker at Yahoo! who can void this no open data policy?

Thanks!

December 16, 2015

20 Big Data Repositories You Should Check Out [Data Source Checking?]

Filed under: BigData,Data,Data Science — Patrick Durusau @ 11:43 am

20 Big Data Repositories You Should Check Out by Vincent Granville.

Vincent lists some additional sources along with a link to Bernard Marr’s original selection.

One of the issues with such lists is that they are rarely maintained.

For example, Bernard listed:

Topsy http://topsy.com/

Free, comprehensive social media data is hard to come by – after all their data is what generates profits for the big players (Facebook, Twitter etc) so they don’t want to give it away. However Topsy provides a searchable database of public tweets going back to 2006 as well as several tools to analyze the conversations.

But if you follow http://topsy.com/, you will find it points to:

Use Search on your iPhone, iPad, or iPod touch

With iOS 9, Search lets you look for content from the web, your contacts, apps, nearby places, and more. Powered by Siri, Search offers suggestions and updates results as you type.

That sucks, doesn’t it? You were expecting to be able to search public tweets back to 2006, along with analytical tools, and what you get is a kiddie guide to search on a malware honeypot.

For a fuller explanation or at least the latest news on Topsy, check out: Apple shuts down Twitter analytics service Topsy by Sam Byford, dated December 16, 2015 (that’s today as I write this post).

So, strike Topsy off your list of big data sources.

Rather than bare lists, what big data needs is a curated list of big data sources that does more than list sources. Those sources need to be broken down to data sets to enable big data searchers to find all the relevant data sets and retrieve only those that remain accessible.

Like “link checking” but for big data resources. Data Source Checking?
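A first cut at such a checker is easy to sketch. The source list here is hypothetical, and a real service would need retries, rate limits, and content-level checks, but the idea is just link checking applied to data sources:

import requests

DATA_SOURCES = {
    "Topsy": "http://topsy.com/",
    "Common Crawl": "https://commoncrawl.org/",
}  # hypothetical starter list; a real registry would track individual datasets

def check(url, timeout=10):
    try:
        # Some servers reject HEAD; fall back to GET if needed.
        r = requests.head(url, timeout=timeout, allow_redirects=True)
        return r.status_code, r.url
    except requests.RequestException as exc:
        return None, str(exc)

for name, url in DATA_SOURCES.items():
    status, final = check(url)
    note = "OK" if status == 200 else "CHECK ME"
    print(f"{name}: status={status} final_url={final} [{note}]")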

That would be the “go to” place for big data sets and, as much as I hate advertising, a high traffic area for advertising to make it cost effective, if not profitable.

July 12, 2015

Reddit Archive! 1 TB of Comments

Filed under: Data,Reddit — Patrick Durusau @ 2:34 pm

You can now download a dataset of 1.65 billion Reddit comments: Beware the Redditor AI by Mic Wright.

From the post:

Once our species’ greatest trove of knowledge was the Library of Alexandria.

Now we have Reddit, a roiling mass of human ingenuity/douchebaggery that has recently focused on tearing itself apart like Tommy Wiseau in legendarily awful flick ‘The Room.’

But unlike the ancient library, the fruits of Reddit’s labors, good and ill, will not be destroyed in fire.

In fact, thanks to Jason Baumgartner of PushShift.io (aided by The Internet Archive), a dataset of 1.65 billion comments, stretching from October 2007 to May 2015, is now available to download.

The data – pulled using Reddit’s API – is made up of JSON objects, including the comment, score, author, subreddit, position in the comment tree and a range of other fields.

The uncompressed dataset weighs in at over 1TB, meaning it’ll be most useful for major research projects with enough resources to really wrangle it.

Technically, the archive is incomplete, but not significantly. After 14 months of work and many API calls, Baumgartner was faced with approximately 350,000 comments that were not available. In most cases that’s because the comment resides in a private subreddit or was simply removed.
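If you do grab one of the monthly dumps, you can stream it without decompressing the whole thing first. A minimal sketch, assuming a bzip2-compressed, line-delimited JSON file (check the actual compression of the file you download):

import bz2
import json
from collections import Counter

counts = Counter()
with bz2.open("RC_2015-05.bz2", mode="rt", encoding="utf-8") as fh:  # example file name
    for line in fh:
        comment = json.loads(line)
        counts[comment["subreddit"]] += 1  # other fields include author, score and the comment text

for subreddit, n in counts.most_common(20):
    print(subreddit, n)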

If you don’t have a spare TB of space at the moment, you will also be interested in: http://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/, where you will find several BigQueries already.

The full data set certainly makes an interesting alternative to the Turing test for AI. Can your AI generate, without assistance or access to this data set, the responses that appear therein? Is that a fair test for “intelligence?”

If you want updated data, consult the Reddit API.

June 28, 2015

New York Philharmonic Performance History

Filed under: Data,Music — Patrick Durusau @ 2:31 pm

New York Philharmonic Performance History

From the post:

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

In an effort to make this data available for study, analysis, and reuse, the New York Philharmonic joins organizations like The Tate and the Cooper-Hewitt Smithsonian National Design Museum in making its own contribution to the Open Data movement.

The metadata here is released under the Creative Commons Public Domain CC0 licence. Please see the enclosed LICENCE file for more detail.

The data:

General Info (applies to the entire program):

  • id: GUID (to view a program: archives.nyphil.org/index.php/artifact/GUID/fullview)
  • ProgramID: Local NYP ID
  • Orchestra: Full orchestra name
  • Season: Defined as Sep 1 – Aug 31, displayed “1842-43”

Concert Info (repeated for each individual performance within a program):

  • eventType: See term definitions
  • Location: Geographic location of concert (countries are identified by their current name; for example, even though the orchestra played in Czechoslovakia, it is now identified in the data as the Czech Republic)
  • Venue: Name of hall, theater, or building where the concert took place
  • Date: Full ISO date used, but ignore the TIME part (1842-12-07T05:00:00Z = Dec. 7, 1842)
  • Time: Actual time of concert, e.g. “8:00PM”

Works Info (the fields below are repeated for each work performed on a program; by matching the index number of each field, you can tell which soloist(s) and conductor(s) performed a specific work on each of the concerts listed above):

  • worksConductorName: Last name, first name
  • worksComposerTitle: Composer last name, first / TITLE (NYP short titles used)
  • worksSoloistName: Last name, first name (if multiple soloists on a single work, delimited by semicolon)
  • worksSoloistInstrument: Last name, first name (if multiple soloists on a single work, delimited by semicolon)
  • worksSoloistRole: “S” means “Soloist”; “A” means “Assisting Artist” (if multiple soloists on a single work, delimited by semicolon)
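A minimal sketch of lining up those parallel, semicolon-delimited soloist fields for a single work; the record below is purely illustrative, but the field names follow the descriptions above:

# Illustrative record only -- see the field descriptions above.
work = {
    "worksConductorName": "Hill, Ureli Corelli",
    "worksSoloistName": "Otto, Antoinette; Horn, Charles Edward",
    "worksSoloistInstrument": "Soprano; Tenor",
    "worksSoloistRole": "S; S",
}

def split_field(value):
    """Split a semicolon-delimited field into a clean list."""
    return [part.strip() for part in value.split(";")] if value else []

soloists = zip(
    split_field(work["worksSoloistName"]),
    split_field(work["worksSoloistInstrument"]),
    split_field(work["worksSoloistRole"]),
)

for name, instrument, role in soloists:
    label = "Soloist" if role == "S" else "Assisting Artist"
    print(f"{name} ({instrument}) - {label}")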

A great starting place for a topic map for performances of the New York Philharmonic or for combination with topic maps for composers or soloists.

I first saw this in a tweet by Anna Kijas.

May 5, 2015

New York City Subway Anthrax/Plague

Filed under: Data,Skepticism — Patrick Durusau @ 3:27 pm

Spoiler Alert: This paper discusses a possible find of anthrax and plague DNA in the New York Subway. It concludes that a related but harmless strain wasn’t considered and/or there was random sequencing error. In either case, it is a textbook example of the need for data skepticism.

Searching for anthrax in the New York City subway metagenome by Robert A Petit, III, Matthew Ezewudo, Sandeep J. Joseph, Timothy D. Read.

From the introduction:

In February 2015 Chris Mason and his team published an in-depth analysis of metagenomic data (environmental shotgun DNA sequence) from samples isolated from public surfaces in the New York City (NYC) subway system. Along with a ton of really interesting findings, the authors claimed to have detected DNA from the bacterial biothreat pathogens Bacillus anthracis (which causes anthrax) and Yersinia pestis (causes plague) in some of the samples. This predictably led to a huge interest from the press and scientists on social media. The authors followed up with an re-analysis of the data on microbe.net, where they showed some results that suggested the tools that they were using for species identification overcalled anthrax and plague.

The NYC subway metagenome study raised very timely questions about using unbiased DNA sequencing for pathogen detection. We were interested in this dataset as soon as the publication appeared and started looking deeper into why the analysis software gave false positive results and indeed what exactly was found in the subway samples. We decided to wrap up the results of our preliminary analysis and put it on this site. This report focuses on the results for B. anthracis but we also did some preliminary work on Y.pestis and may follow up on this later.

The article gives a detailed accounting of the tools and issues involved in the identification of DNA fragments from pathogens. It is hard core science, but it also illustrates how iffy hard core science can be. Sure, you have the data, but that doesn’t mean you will reach the correct conclusion from it.

The authors also mention a followup study by Chris Mason, the author of the original paper, entitled:

The long road from Data to Wisdom, and from DNA to Pathogen by Christopher Mason.

From the introduction:

There is an oft-cited progression from Data to Information to Knowledge to Wisdom (DIKW). Just because you have data, it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom.

And from his conclusion:

But, perhaps the bigger issue is social. I confess I grossly underestimated how the press would sensationalize these results, and even the Department of Health (DOH) did not believe it, claiming it simply could not be true. We sent the MTA and the DOH our first draft upon submission in October 2014, the raw and processed data, as well as both of our revised drafts in December 2014 and January 2015, and we did get some feedback, but they also had other concerns at the time (Ebola guy in the subway). This is also different from what they normally do (PCR for specific targets), so we both learned from each other. Yet, upon publication, it was clear that Twitter and blogs provided some of the same scrutiny as the three reviewers during the two rounds of peer review. But, they went even deeper and dug into the raw data, within hours of the paper coming online, and I would argue that online reviewers have become an invaluable part of scientific publishing. Thus, published work is effectively a living entity before (bioRxiv), during (online), and after publication (WSJ, Twitter, and others), and online voices constitute an critical, ensemble 4th reviewer.

Going forward, the transparency of the methods, annotations, algorithms, and techniques has never been more essential. To this end, we have detailed our work in the supplemental methods, but we have also posted complete pipelines in this blog post on how to go from raw data to annotated species on Galaxy. Even better, the precise virtual machines and instantiation of what was run on a server needs to be tracked and guaranteed to be 100% reproducible. For example, for our .vcf characterizations of the human alleles, we have set up our entire pipeline on Arvados/Curoverse, free to use, so that anyone can take a .vcf file and run the exact same ancestry analyses and get the exact same results. Eventually, tools like this can automate and normalize computational aspects of metagenomics work, which is an ever-increasingly important component of genomics.

Mason’s

Data –> Information –> Knowledge –> Wisdom (DIKW)

sounds like:

evidence based data science.

to me.

Another quick point: note that Chris Mason and team made all their data available for others to review, and Chris states that informal review was a valuable contributor to the scientific process.

That is an illustration of the value of transparency. Contrast that with the Obama Administration’s default position of opacity. Which one do you think serves a fact finding process better?

Perhaps that is the answer. The Obama administration isn’t interested in a fact finding process. It has found the “facts” that it wants and reaches its desired conclusions. What is there left to question or discuss?

April 29, 2015

800,000 NPR Audio Files!

Filed under: Audio,Data — Patrick Durusau @ 6:54 pm

There Are Now 800,000 Reasons To Share NPR Audio On Your Site by Patrick Cooper.

From the post:

From NPR stories to shows to songs, today we’re making more than 800,000 pieces of our audio available for you to share around the Web. We’re throwing open the doors to embedding, putting our audio on your site.

Complete with simple instructions for embedding!

I often think of topic maps when listening to NPR so don’t be surprised if you start seeing embedded NPR audio in the very near future!

Enjoy!

April 20, 2015

Twitter cuts off ‘firehose’ access…

Filed under: Data,Twitter — Patrick Durusau @ 3:11 pm

Twitter cuts off ‘firehose’ access, eyes Big Data bonanza by Mike Wheatley.

From the post:

Twitter upset the applecart on Friday when it announced it would no longer license its stream of half a billion daily tweets to third-party resellers.

The social media site said it had decided to terminate all current agreements with third parties to resell its ‘firehose’ data – an unfiltered, full stream of tweets and all of the metadata that comes with them. For companies that still wish to access the firehose, they’ll still be able to do so, but only by licensing the data directly from Twitter itself.

Twitter’s new plan is to use its own Big Data analytics team, which came about as a result of its acquisition of Gnip in 2014, to build direct relationships with data companies and brands that rely on Twitter data to measure market trends, consumer sentiment and other metrics that can be best understood by keeping track of what people are saying online. The company hopes to complete the transition by August this year.

Not that I had any foreknowledge of Twitter’s plans but I can’t say this latest move is all that surprising.

What I hope also emerges from the “new plan” is a fixed pricing structure for smaller users of Twitter content. I’m really not interested in an airline pricing model where the price you pay has no rational relationship to the value of the product. If it’s the day before the end of a sales quarter I get a very different price for a Twitter feed than mid-way through the quarter. That sort of thing.

Along with that, I’d like to be able to specify users to follow, searches, and tweet streams in daily increments of 250,000, 500,000, 750,000, or 1,000,000 tweets, spooled for daily pickup over high speed connections (to put less stress on infrastructure).

I suppose renewable contracts would be too much to ask? 😉

Unannotated Listicle of Public Data Sets

Filed under: Data,Dataset — Patrick Durusau @ 2:50 pm

Great Github list of public data sets by Mirko Krivanek.

A large list of public data sets, previously published on GitHub, but with no annotations to guide you to particular datasets.

Just in case you know of any legitimate aircraft wiring sites, i.e., ones that existed prior to the GAO report on hacking aircraft networks, ping me with the links. Thanks!

April 7, 2015

33% of Poor Business Decisions Track Back to Data Quality Issues

Filed under: BigData,Data,Data Quality — Patrick Durusau @ 3:46 pm

Stupid errors in spreadsheets could lead to Britain’s next corporate disaster by Rebecca Burn-Callander.

From the post:

Errors in company spreadsheets could be putting billions of pounds at risk, research has found. This is despite high-profile spreadsheet catastrophes, such as the collapse of US energy giant Enron, ringing alarm bells more than a decade ago.

Almost one in five large businesses have suffered financial losses as a result of errors in spreadsheets, according to F1F9, which provides financial modelling and business forecasting to blue chips firms. It warns of looming financial disasters as 71pc of large British business always use spreadsheets for key financial decisions.

The company’s new whitepaper entitled Capitalism’s Dirty Secret showed that the abuse of the humble spreadsheet could have far-reaching consequences. Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion and the UK manufacturing sector uses spreadsheets to make pricing decisions for up to £170bn worth of business.

Felienne Hermans, of Delft University of Technology, analysed 15,770 spreadsheets obtained from over 600,000 emails from 158 former employees. He found 755 files with more than a hundred errors, with the maximum number of errors in one file being 83,273.

Dr Hermans said: “The Enron case has given us a unique opportunity to look inside the workings of a major corporate organisation and see first hand how widespread poor spreadsheet practice really is.

First, a gender correction, Dr. Hermans is not a he. The post should read: “She found 755 files with more than….

Second, how bad is poor spreadsheet quality? The download page has this summary:

  • 33% of large businesses report poor decision making due to spreadsheet problems.
  • Nearly 1 in 5 large businesses have suffered direct financial loss due to poor spreadsheets.
  • Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion.

You read that correctly: not that 33% of spreadsheets have quality issues, but that 33% of poor business decisions can be traced to spreadsheet problems.

A comment to the blog post supplied a link for the report: A Research Report into the Uses and Abuses of Spreadsheets.

Spreadsheets are small to medium sized data.

Care to comment on the odds of big data and its processes pushing the percentage of poor business decisions past 33%?

How would you discover you are being misled by big data and/or its processing?

How do you validate the results of big data? Run another big data process?

When you hear sales pitches about big data, be sure to ask about the impact of dirty data. If assured that your domain doesn’t have a dirty data issue, grab your wallet and run!

PS: A Research Report into the Uses and Abuses of Spreadsheets is a must have publication.

The report itself is useful, but Appendix A 20 Principles For Good Spreadsheet Practice is a keeper. With a little imagination all of those principles could be applied to big data and its processing.

Just picking one at random:

3. Ensure that everyone involved in the creation or use of spreadsheet has an appropriate level of knowledge and understanding.

For big data, reword that to:

Ensure that everyone involved in the creation or use of big data has an appropriate level of knowledge and understanding.

Your IT staff are trained, but do the managers who will use the results understand the limitations of the data and/or its processing? Or do they follow the results because “the data says so?”

February 24, 2015

Wiki New Zealand

Filed under: Data,Geography — Patrick Durusau @ 1:01 pm

Wiki New Zealand

From the about page:

It’s time to democratise data. Data is a language in which few are literate, and the resulting constraints at an individual and societal level are similar to those experienced when the proportion of the population able to read was small. When people require intermediaries before digesting information, the capacity for generating insights is reduced.

To democratise data we need to put users at the centre of our models, we need to design our systems and processes for users of data, and we need to realise that everyone can be a user. We will all win when everyone can make evidence-based decisions.

Wiki New Zealand is a charity devoted to getting people to use data about New Zealand.

We do this by pulling together New Zealand’s public sector, private sector and academic data in one place and making it easy for people to use in simple graphical form for free through this website.

We believe that informed decisions are better decisions. There is a lot of data about New Zealand available online today, but it is too difficult to access and too hard to use. We think that providing usable, clear, digestible and unbiased information will help you make better decisions, and will lead to better outcomes for you, for your community and for New Zealand.

We also believe that by working together we can build the most comprehensive, useful and accurate representation of New Zealand’s situation and performance: the “wiki” part of the name means “collaborative website”. Our site is open and free to use for everyone. Soon, anyone will be able to upload data and make graphs and submit them through our auditing process. We are really passionate about engaging with domain and data experts on their speciality areas.

We will not tell you what to think. We present topics from multiple angles, in wider contexts and over time. All our data is presented in charts that are designed to be compared easily with each other and constructed with as little bias as possible. Our job is to present data on a wide range of subjects relevant to you. Your job is to draw your own conclusions, develop your own opinions and make your decisions.

Whether you want to make a business decision based on how big your market is, fact-check a newspaper story, put together a school project, resolve an argument, build an app based on clean public licensed data, or just get to know this country better, we have made this for you.

Isn’t New Zealand a post-apocalypse destination? However great it may be now, the neighborhood is going down when all the post-apocalypse folks arrive. Something on the order of Mister Rogers’ Neighborhood to Mad Max Beyond Thunderdome. 😉

Hopefully, if there is an apocalypse, it will happen quickly enough to prevent a large influx of undesirables into New Zealand.

I first saw this in a tweet by Neil Saunders.

February 21, 2015

Yelp Dataset Challenge

Filed under: Challenges,Data — Patrick Durusau @ 4:38 pm

Yelp Dataset Challenge

From the webpage:

Yelp Dataset Challenge is doubling up: Now 10 cities across 4 countries! Two years, four highly competitive rounds, over $35,000 in cash prizes awarded and several hundred peer-reviewed papers later: the Yelp Dataset Challenge is doubling up. We are proud to announce our latest dataset that includes information about local businesses, reviews and users in 10 cities across 4 countries. The Yelp Challenge dataset is much larger and richer than the Academic Dataset. This treasure trove of local business data is waiting to be mined and we can’t wait to see you push the frontiers of data science research with our data.

The Challenge Dataset:

  • 1.6M reviews and 500K tips by 366K users for 61K businesses
  • 481K business attributes, e.g., hours, parking availability, ambience.
  • Social network of 366K users for a total of 2.9M social edges.
  • Aggregated check-ins over time for each of the 61K businesses
The deadline for the fifth round of the Yelp Dataset Challenge is June 30, 2015. Submit your project to Yelp by visiting yelp.com/challenge/submit. You can submit a research paper, video presentation, slide deck, website, blog, or any other medium that conveys your use of the Yelp Dataset Challenge data.

Pitched at students but it is an interesting dataset.

I first saw this in a tweet by Marin Dimitrov.

    February 10, 2015

    Data as “First Class Citizens”

    Filed under: Annotation,Data,Data Preservation,Data Repositories,Documentation — Patrick Durusau @ 7:34 pm

    Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

    The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

    Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

    The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

    We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

    The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title but not all data is going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. Great essay on using multiple sources to annotate entities once disambiguation had occurred. But some entities are going to have more annotations than others and some entities may not be recognized at all. Not to mention that it is rather doubtful the markup containing those entities was annotated at all.

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As well as the instructions for processing that data? Perhaps not in every case, but shouldn’t data and the formats that hold data be equally treated as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built but we have not one accurate report on the process.

    Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

    While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.

    February 1, 2015

    Data Sources on the Web

    Filed under: Data,R — Patrick Durusau @ 4:23 pm

    Data Sources on the Web

    From the post:

    The following list of data sources has been modified as of January 2015. Most of the data sets listed below are free, however, some are not.

If an (R!) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what's out there.

    Want to add to or update this list? Send to mran@revolutionanalytics.com

As you know, there are any number of data lists on the Net. This one is different: it is a maintained data list.

    Enjoy!

    I first saw this in a tweet by David Smith.

    January 1, 2015

    MemSQL releases a tool to easily ship big data into its database

    Filed under: BigData,Data,Data Pipelines — Patrick Durusau @ 5:43 pm

    MemSQL releases a tool to easily ship big data into its database by Jordan Novet.

    From the post:

    Like other companies pushing databases, San Francisco startup MemSQL wants to solve low-level problems, such as easily importing data from critical sources. Today MemSQL is acting on that impulse by releasing a tool to send data from the S3 storage service on the Amazon Web Services cloud and from the Hadoop open-source file system into its proprietary in-memory SQL database — or the open-source MySQL database.

    Engineers can try out the new tool, named MemSQL Loader, today, now that it’s been released under an open-source MIT license.

    The existing “LOAD DATA” command in MemSQL and MySQL can bring data in, although it has its shortcomings, as Wayne Song, a software engineer at the startup, wrote in a blog post today. Song and his colleagues ran into those snags and started coding.

    How very cool!

    Not every database project seeks to “easily import… data from critical sources.” but I am very glad to see MemSQL take up the challenge.

Reducing the friction between data stores and tools will make data pipelines more robust, reducing the amount of time spent troubleshooting routine data traffic issues and increasing the time spent on analysis that fuels your ROI from data science.

True enough, if you want to make ASCII importing a task that requires paid custom assistance from your staff, that is one business model. On the whole I would not say it is a very viable one, particularly with more production minded folks like MemSQL around.

    What database are you going to extend MemSQL Loader to support?

    November 30, 2014

    Jeopardy! clues data

    Filed under: Data,Games — Patrick Durusau @ 5:10 pm

Jeopardy! clues data

Nathan Yau writes:

    Here’s some weekend project data for you. Reddit user trexmatt dumped a dataset for 216,930 Jeopardy! questions and answers in JSON and CSV formats, a scrape from the J! Archive. Each clue is represented by category, money value, the clue itself, the answer, round, show number, and air date.

    Nathan suggests hunting for Daily Doubles but then discovers someone has done that. (See Nathan’s post for the details.)
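If you would rather roll your own, here is a minimal sketch against the JSON dump; the file and key names are assumptions, so inspect one record before trusting them:

import json
from collections import Counter

with open("jeopardy_clues.json", encoding="utf-8") as fh:  # example file name
    clues = json.load(fh)  # a list of clue objects

print(len(clues), "clues")

# Most common categories (key names are assumptions).
categories = Counter(clue["category"] for clue in clues)
for category, n in categories.most_common(10):
    print(category, n)

# Clues with no dollar value (e.g., Final Jeopardy!) vs. valued clues.
no_value = sum(1 for clue in clues if not clue.get("value"))
print("clues without a value:", no_value)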

    Enjoy!

    November 21, 2014

    Land Matrix

    Filed under: Data,Government,Transparency — Patrick Durusau @ 6:34 pm

    Land Matrix: The Online Public Database on Land Deals

    From the webpage:

    The Land Matrix is a global and independent land monitoring initiative that promotes transparency and accountability in decisions over land and investment.

    This website is our Global Observatory – an open tool for collecting and visualising information about large-scale land acquisitions.

    The data represented here is constantly evolving; to make this resource more accurate and comprehensive, we encourage your participation.

    The deals collected as data must meet the following criteria:

    • Entail a transfer of rights to use, control or ownership of land through sale, lease or concession;
    • Have been initiated since the year 2000;
    • Cover an area of 200 hectares or more;
    • Imply the potential conversion of land from smallholder production, local community use or important ecosystem service provision to commercial use.

    FYI, 200 hectares = 2 square kilometers.

    Land ownership and its transfer are matters of law and law means government.

    The project describes its data this way:

    The dataset is inherently unreliable, but over time it is expected to become more accurate. Land deals are notoriously un-transparent. In many countries, established procedures for decision-making on land deals do not exist, and negotiations and decisions do not take place in the public realm. Furthermore, a range of government agencies and levels of government are usually responsible for approving different kinds of land deals. Even official data sources in the same country can therefore vary, and none may actually reflect reality on the ground. Decisions are often changed, and this may or may not be communicated publically.

    I would start earlier than the year 2000 but the same techniques could be applied along the route of the Keystone XL pipeline. I am assuming that you are aware that pipelines, roads and other public works are not located purely for physical or aesthetic reasons. Yes?

    Please take the time to view and support the Land Matrix project and consider similar projects in your community.

    If the owners can be run to ground, you may find the parties to the transactions are linked by other “associations.”

    October 2014 Crawl Archive Available

    Filed under: Common Crawl,Data — Patrick Durusau @ 5:41 pm

    October 2014 Crawl Archive Available by Stephen Merity.

    From the post:

    The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

    To assist with exploring and using the dataset, we’ve provided gzipped files that list:

    By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

    Thanks again to blekko for their ongoing donation of URLs for our crawl!
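The prefixing step is a one-liner in practice. A minimal sketch, assuming you have downloaded one of the gzipped path listings (the file name is an example):

import gzip

PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"  # or "s3://aws-publicdatasets/"

with gzip.open("warc.paths.gz", mode="rt", encoding="utf-8") as fh:  # example listing file
    for line in fh:
        print(PREFIX + line.strip())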

    Just in time for weekend exploration! 😉

    Enjoy!
