Archive for the ‘Data’ Category

Academic Torrents Update

Friday, November 3rd, 2017

When I last mentioned Academic Torrents, in early 2014, it had 1.67TB of research data.

I dropped by Academic Torrents this week to find it now has 25.53TB of research data!

Some arbitrary highlights:

Richard Feynman’s Lectures on Physics (The Messenger Lectures)

A collection of sport activity datasets for data analysis and data mining 2017a

[Coursera] Machine Learning (Stanford University) (ml)

UC Berkeley Computer Science Courses (Full Collection)

[Coursera] Mining Massive Datasets (Stanford University) (mmds)

Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Original Dataset)

Your arbitrary highlights are probably different from mine, so visit Academic Torrents to see what data catches your eye.


The Quartz Directory of Essential Data (Directory of Directories Is More Accurate)

Saturday, June 17th, 2017

The Quartz Directory of Essential Data

From the webpage:

A curated list of useful datasets published by important sources. Please remember that “important” does not mean “correct.” You should vet these data as you would with any human source.

Switch to the “Data” tab at the bottom of this spreadsheet and use Find (⌘ + F) to search for datasets on a particular topic.

Note: Just because data is useful, doesn’t mean it’s easy to use. The point of this directory is to help you find data. If you need help accessing or interpreting one of these datasets, please reach out to your friendly Quartz data editor, Chris.

Slack: @chris

A directory of 77 data directories. The breadth of organizing topics (health, trade, government, for example) means every new user must repeat the same data mining.

A low/no-friction method for creating more specific and re-usable directories has remained elusive.

Power Outage Data – 15 Years Worth

Tuesday, June 13th, 2017

Data: Explore 15 Years Of Power Outages by Jordan Wirfs-Brock.

From the post:

This database details 15 years of power outages across the United States, compiled and standardized from annual data available from the Department of Energy.

For an explanation of what it means, how it came about, and how we got here, listen to this conversation between Inside Energy Reporter Dan Boyce and Data Journalist Jordan Wirfs-Brock:

You can also view the data as a Google Spreadsheet (where you can download it as a CSV). This version of the database also includes information about the amount of time it took power to be restored, the demand loss in megawatts, the NERC region, (NERC refers to the North American Electricity Reliability Corporation, formed to ensure the reliability of the grid) and a list of standardized tags.

The data set isn’t useful for tactical information; the submissions are too general to replicate the events leading up to an outage.

On the other hand, identifiable outage events, dates, locations, etc., do make recovery of tactical data from grid literature a manageable search problem.
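If you grab the CSV version, a few lines of Python are enough to start slicing outages by NERC region. Note the column names below are my guesses for illustration, not the spreadsheet's actual headers, so adjust to match the download:

```python
import csv
import io

# Hypothetical sample rows mimicking the Inside Energy outage CSV;
# the real column names may differ -- check the Google Spreadsheet export.
sample = """Year,NERC Region,Demand Loss (MW),Tags
2014,WECC,300,severe weather
2014,RFC,,vandalism
2015,WECC,1200,severe weather
"""

def outages_by_region(csv_text):
    """Count outage events per NERC region, skipping blank regions."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        region = row["NERC Region"].strip()
        if region:
            counts[region] = counts.get(region, 0) + 1
    return counts

print(outages_by_region(sample))  # {'WECC': 2, 'RFC': 1}
```

The same dictionary-accumulator pattern works for the standardized tags or any other column in the download.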


behind the scenes: cleaning dirty data

Thursday, February 16th, 2017

behind the scenes: cleaning dirty data

From the post:

Dirty Data. It’s everywhere! And that’s expected and ok and even frankly good imho — it happens when people are doing complicated things, in the real world, with lots of edge cases, and moving fast. Perfect is the enemy of good.

Alas it’s definitely behind-the-scenes work to find and fix dirty data problems, which means none of us learn from each other in the process. So — here’s a quick post about a dirty data issue we recently dealt with. Hopefully it’ll help you feel comradery, and maybe help some people using the BASE data.

We traced some oaDOI bugs to dirty records from PMC in the BASE open access aggregation database.

BASE = Bielefeld Academic Search Engine.

oaDOI = a DOI-like link that resolves to an open access version of an article.

PMC = PubMed Central.
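One cheap first pass at this kind of dirty data is a syntactic sanity check. A sketch (my regex, not oaDOI's or BASE's actual code) that flags identifiers that cannot be DOIs:

```python
import re

# Crude syntactic check only: a string shaped like "10.<registrant>/<suffix>".
# Passing this test does not mean the DOI resolves, let alone to an OA copy.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(value):
    """Return True if the string is at least shaped like a DOI."""
    return bool(DOI_RE.match(value.strip()))

print(looks_like_doi("10.1371/journal.pone.0000000"))  # True
print(looks_like_doi("PMC1234567"))                    # False -- a PMC id is not a DOI
```

Filters like this won't catch subtle corruption, but they do catch the records where one kind of identifier has been shoved into a field meant for another.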

Are you cleaning data or contributing more dirty data?

Washington Taxi Data 2015 – 2016 (Caution: 2.2 GB File Size)

Wednesday, December 28th, 2016

I was rummaging around on the website today when I encountered Taxicab Trips (2.2 GB), described as:

DC Taxicab trip data from April 2015 to August 2016. Pick up and drop off locations are assigned to block locations with times rounded to the nearest hour. Detailed metadata included in download. The Department of For-Hire Vehicles (DFHV) provided OCTO with a taxicab trip text file representing trips from May 2015 to August 2016. OCTO processed the data to assign block locations to pick up and drop off locations.

For your convenience, I extracted README_DC_Taxicab_trip.txt and it gives the data structure of the files (“|” separated) as follows:


OBJECTID	NUMBER(9)	Table Unique Identifier	   
TRIPTYPE	VARCHAR2(255)	Type of Taxi Trip	   
PROVIDER	VARCHAR2(255)	Taxi Company that Provided trip	   
METERFARE	VARCHAR2(255)	Meter Fare	   
TIP	VARCHAR2(255)	Tip amount	   
SURCHARGE	VARCHAR2(255)	Surcharge fee	   
EXTRAS	VARCHAR2(255)	Extra fees	   
TOLLS	VARCHAR2(255)	Toll amount	   
TOTALAMOUNT	VARCHAR2(255)	Total amount from Meter fare, tip, 
                                surcharge, extras, and tolls. 	   
PAYMENTTYPE	VARCHAR2(255)	Payment type	   
PAYMENTCARDPROVIDER	VARCHAR2(255)	Payment card provider	   
PICKUPCITY	VARCHAR2(255)	Pick up location city	   
PICKUPSTATE	VARCHAR2(255)	Pick up location state	   
PICKUPZIP	VARCHAR2(255)	Pick up location zip	   
DROPOFFCITY	VARCHAR2(255)	Drop off location city	   
DROPOFFSTATE	VARCHAR2(255)	Drop off location state	   
DROPOFFZIP	VARCHAR2(255)	Drop off location zip	   
TRIPMILEAGE	VARCHAR2(255)	Trip mileage	   
TRIPTIME	VARCHAR2(255)	Trip time	   
PICKUP_BLOCK_LATITUDE	NUMBER	Pick up location latitude	   
PICKUP_BLOCK_LONGITUDE	NUMBER	Pick up location longitude	   
PICKUP_BLOCKNAME	VARCHAR2(255)	Pick up location street block name	   
DROPOFF_BLOCK_LATITUDE	NUMBER	Drop off location latitude	   
DROPOFF_BLOCK_LONGITUDE	NUMBER	Drop off location longitude	   
DROPOFF_BLOCKNAME	VARCHAR2(255)	Drop off location street block name	   
AIRPORT	CHAR(1)	Pick up or drop off location is a local airport (Y/N)	   
PICKUPDATETIME_TR	DATE	Pick up date and time (rounded)	   
DROPOFFDATETIME_TR	DATE	Drop off date and time (rounded)	 
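Assuming the "|"-separated layout above (and remembering that the money fields ship as VARCHAR2), a minimal Python parser might look like this. The sample row and abridged field list are mine, not from the actual files:

```python
import csv
import io

# One hypothetical record in the README's "|"-separated layout (fields abridged).
row_text = "1|PSP|Yellow Cab|12.50|2.00|0.25|0.00|0.00|14.75|CREDIT"

FIELDS = ["OBJECTID", "TRIPTYPE", "PROVIDER", "METERFARE", "TIP",
          "SURCHARGE", "EXTRAS", "TOLLS", "TOTALAMOUNT", "PAYMENTTYPE"]

def parse_trip(line):
    """Parse one pipe-separated trip record into a dict, converting money fields."""
    reader = csv.reader(io.StringIO(line), delimiter="|")
    values = next(reader)
    record = dict(zip(FIELDS, values))
    # Money fields are stored as VARCHAR2, so convert explicitly.
    for money in ("METERFARE", "TIP", "SURCHARGE", "EXTRAS", "TOLLS", "TOTALAMOUNT"):
        record[money] = float(record[money])
    return record

trip = parse_trip(row_text)
print(trip["TOTALAMOUNT"])  # 14.75
```

For the real files you would feed `csv.reader` the open file object directly rather than one line at a time.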

The taxi data files are zipped by the month:

  Length      Date    Time    Name
---------  ---------- -----   ----
107968907  2016-11-29 14:27
117252084  2016-11-29 14:20
 99545739  2016-11-30 11:15
129755310  2016-11-30 11:24
152793046  2016-11-30 11:31
148835360  2016-11-30 11:20
143734132  2016-11-30 11:19
139396173  2016-11-30 11:13
121112859  2016-11-30 11:08
104015666  2016-11-30 12:04
154623796  2016-11-30 11:03
161666797  2016-11-29 14:15
153483725  2016-11-29 14:32
121135328  2016-11-29 14:06
142098999  2016-11-30 10:55
160977058  2016-11-30 10:35
     3694  2016-12-09 16:43   README_DC_Taxicab_trip.txt

I extracted and decompressed it, then created a 10,000-line sample named taxi-201601-10k.ods.

I was hopeful that taxi trip times might allow inference of traffic conditions, but with rare exceptions, columns AA and AB record the same time.
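To see why, compute the duration implied by the hour-rounded timestamps. A short trip that starts and ends inside the same hour collapses to zero:

```python
from datetime import datetime

def rounded_duration_hours(pickup_iso, dropoff_iso):
    """Duration (in hours) implied by the hour-rounded ISO timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    pick = datetime.strptime(pickup_iso, fmt)
    drop = datetime.strptime(dropoff_iso, fmt)
    return (drop - pick).total_seconds() / 3600

# A 20-minute trip entirely inside one hour rounds to the same timestamp,
# so the implied duration is zero -- hence no traffic inference.
print(rounded_duration_hours("2016-01-05T14:00:00Z", "2016-01-05T14:00:00Z"))  # 0.0
```

Only trips that happen to straddle an hour boundary show a non-zero duration, and even then it is quantized to whole hours.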


I’m sure there are other patterns you can extract from the data but inferring traffic conditions doesn’t appear to be one of those.

Or am I missing something obvious?

More posts coming as I look for blockade information.

PS: I didn’t explore any month other than January of 2016, but it’s late and I will tend to that tomorrow.

Outbrain Challenges the Research Community with Massive Data Set

Sunday, November 13th, 2016

Outbrain Challenges the Research Community with Massive Data Set by Roy Sasson.

From the post:

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call out to the research community to analyze our data and model user reading patterns, in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging to some participants—though Outbrain does it on a daily basis.

The rules caution:

The data is anonymized. Please remember that participants are prohibited from de-anonymizing or reverse engineering data or combining the data with other publicly available information.

That would be a more interesting question than the ones presented for the contest.

After the 2016 U.S. presidential election we know that racists, sexists, nationalists, etc., are driven by single factors so assuming you have good tagging, what’s the problem?


Or is human behavior not only complex but variable?

Good luck!

Text [R, Scraping, Text]

Wednesday, August 17th, 2016

Text by Amelia McNamara.

Covers “scraping, text, and timelines.”

Using R, focuses on scraping, works through some of “…Scott, Karthik, and Garrett’s useR tutorial.”

In case you don’t know the useR tutorial:

Also known as (AKA) Extracting data from the web APIs and beyond:

No matter what your domain of interest or expertise, the internet is a treasure trove of useful data that comes in many shapes, forms, and sizes, from beautifully documented fast APIs to data that need to be scraped from deep inside of 1990s html pages. In this 3 hour tutorial you will learn how to programmatically read in various types of web data from experts in the field (Founders of the rOpenSci project and the training lead of RStudio). By the end of the tutorial you will have a basic idea of how to wrap an R package around a standard API, extract common non-standard data formats, and scrape data into tidy data frames from web pages.

Covers other resources and materials.


Dark Matter: Driven by Data

Monday, May 9th, 2016

A delightful keynote by Dan Geer, presented at the 2015 LangSec Workshop at the IEEE Symposium on Security & Privacy Workshops, May 21, 2015, San Jose, CA.

Prepared text for the presentation.

A quote to interest you in watching the video:

Workshop organizer Meredith Patterson gave me a quotation from Taylor Hornby that I hadn’t seen. In it, Hornby succinctly states the kind of confusion we are in and which LANGSEC is all about:

The illusion that your program is manipulating its data is powerful. But it is an illusion: The data is controlling your program.

It almost appears that we are building weird machines on purpose, almost the weirder the better. Take big data and deep learning. Where data science spreads, a massive increase in tailorability to conditions follows. But even if Moore’s Law remains forever valid, there will never be enough computing hence data driven algorithms must favor efficiency above all else, yet the more efficient the algorithm, the less interrogatable it is,[MO] that is to say that the more optimized the algorithm is, the harder it is to know what the algorithm is really doing.[SFI]

And there is a feedback loop here: The more desirable some particular automation is judged to be, the more data it is given. The more data it is given, the more its data utilization efficiency matters. The more its data utilization efficiency matters, the more its algorithms will evolve to opaque operation. Above some threshold of dependence on such an algorithm in practice, there can be no going back. As such, if science wishes to be useful, preserving algorithm interrogatability despite efficiency-seeking, self-driven evolution is the research grade problem now on the table. If science does not pick this up, then Lessig’s characterization of code as law[LL] is fulfilled. But if code is law, what is a weird machine?

If you can’t interrogate an algorithm, could you interrogate a topic map that is an “inefficient” implementation of the algorithm?

Or put differently, could there be two representations of the same algorithm, one that is “efficient,” and one that can be “interrogated?”

Read the paper version but be aware the video has a very rich Q&A session that follows the presentation.

People NOT Technology Produce Data ROI

Monday, February 15th, 2016

Too many tools… not enough carpenters! by Nicholas Hartman.

From the webpage:

Don’t let your enterprise make the expensive mistake of thinking that buying tons of proprietary tools will solve your data analytics challenges.

tl;dr = The enterprise needs to invest in core data science skills, not proprietary tools.

Most of the world’s largest corporations are flush with data, but frequently still struggle to achieve the vast performance increases promised by the hype around so called “big data.” It’s not that the excitement around the potential of harvesting all that data was unwarranted, but rather these companies are finding that translating data into information and ultimately tangible value can be hard… really hard.

In your typical new tech-based startup the entire computing ecosystem was likely built from day one around the need to generate, store, analyze and create value from data. That ecosystem was also likely backed from day one with a team of qualified data scientists. Such ecosystems spawned a wave of new data science technologies that have since been productized into tools for sale. Backed by mind-blowingly large sums of VC cash many of these tools have set their eyes on the large enterprise market. A nice landscape of such tools was recently prepared by Matt Turck of FirstMark Capital (host of Data Driven NYC, one of the best data science meetups around).

Consumers stopped paying money for software a long time ago (they now mostly let the advertisers pay for the product). If you want to make serious money in pure software these days you have to sell to the enterprise. Large corporations still spend billions and billions every year on software and data science is one of the hottest areas in tech right now, so selling software for crunching data should be a no-brainer! Not so fast.

The problem is, the enterprise data environment is often nothing like that found within your typical 3-year-old startup. Data can be strewn across hundreds or thousands of systems that don’t talk to each other. Devices like mainframes are still common. Vast quantities of data are generated and stored within these companies, but until recently nobody ever really envisioned accessing — let alone analyzing — these archived records. Often, it’s not initially even clear how all the data generated by these systems directly relates to a large blue chip’s core business operations. It does, but a lack of in-house data scientists means that nobody is entirely even sure what data is really there or how it can be leveraged.

I would delete “proprietary” from the above because non-proprietary tools create data problems just as easily.

Thus I would re-write the second quote as:

Tools won’t replace skilled talent, and skilled talent doesn’t typically need many particular tools.

I substituted “particular” tools to avoid religious questions about particular non-proprietary tools.

Understanding data, recognizing where data integration is profitable and where it is a dead loss, creating tests to measure potential ROI, etc., are all tasks of a human data analyst and not any proprietary or non-proprietary tool.

That all enterprise data has some intrinsic value that can be extracted if it were only accessible is an article of religious faith, not business ROI.

If you want business ROI from data, start with human analysts and not the latest buzzwords in technological tools.

Yahoo News Feed dataset, version 1.0 (1.5TB) – Sorry, No Open Data At Yahoo!

Thursday, January 14th, 2016

R10 – Yahoo News Feed dataset, version 1.0 (1.5TB)

From the webpage:

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information of the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.

The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.

The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.

A great data set but one you aren’t going to see unless you have a university email account.

I thought when it took my regular Yahoo! login and I accepted the license agreement I was in. Not a chance!

No open data at Yahoo!

Why Yahoo! would have such a restriction, particularly in light of the progress towards open data, is a complete mystery.

To be honest, even if I heard Yahoo!’s “reasons,” I doubt I would find them convincing.

If you have a university email address, good for you, download and use the data.

If you don’t have a university email address, can you ping me with the email of a decision maker at Yahoo! who can void this no open data policy?


20 Big Data Repositories You Should Check Out [Data Source Checking?]

Wednesday, December 16th, 2015

20 Big Data Repositories You Should Check Out by Vincent Granville.

Vincent lists some additional sources along with a link to Bernard Marr’s original selection.

One of the issues with such lists is that they are rarely maintained.

For example, Bernard listed:


Free, comprehensive social media data is hard to come by – after all their data is what generates profits for the big players (Facebook, Twitter etc) so they don’t want to give it away. However Topsy provides a searchable database of public tweets going back to 2006 as well as several tools to analyze the conversations.

But if you follow the link, you will find it points to:

Use Search on your iPhone, iPad, or iPod touch

With iOS 9, Search lets you look for content from the web, your contacts, apps, nearby places, and more. Powered by Siri, Search offers suggestions and updates results as you type.

That sucks, doesn’t it? You expect to be able to search public tweets back to 2006, along with analytical tools, and what you get is a kiddie guide to search on a malware honeypot.

For a fuller explanation or at least the latest news on Topsy, check out: Apple shuts down Twitter analytics service Topsy by Sam Byford, dated December 16, 2015 (that’s today as I write this post).

So, strike Topsy off your list of big data sources.

Rather than bare lists, what big data needs is a curated list of big data sources that does more than list sources. Those sources need to be broken down to data sets to enable big data searchers to find all the relevant data sets and retrieve only those that remain accessible.

Like “link checking” but for big data resources. Data Source Checking?
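A "Data Source Checker" could start as little more than a link checker with a pluggable fetcher. A sketch, where the URLs and the stub fetcher are invented for illustration:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def check_source(url, fetch=None):
    """Return (url, status) where status is an HTTP code or 'unreachable'.

    `fetch` is injectable so the checker can be tested without a network.
    """
    if fetch is None:
        def fetch(u):
            # HEAD keeps the check cheap: no payload, just the status line.
            req = Request(u, method="HEAD")
            with urlopen(req, timeout=10) as resp:
                return resp.status
    try:
        return (url, fetch(url))
    except (URLError, HTTPError, OSError):
        return (url, "unreachable")

# Offline demonstration with a stubbed fetcher:
statuses = {"http://example.com/data.csv": 200}
def stub(u):
    if u in statuses:
        return statuses[u]
    raise OSError("no route")

print(check_source("http://example.com/data.csv", fetch=stub))  # ('http://example.com/data.csv', 200)
print(check_source("http://topsy.com", fetch=stub))             # ('http://topsy.com', 'unreachable')
```

The hard part isn't the HTTP plumbing, of course; it's curating the catalog of dataset-level URLs to feed it, which is exactly the directory work described above.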

That would be the “go to” place for big data sets and, as much as I hate advertising, a high traffic area for advertising to make it cost effective if not profitable.

Reddit Archive! 1 TB of Comments

Sunday, July 12th, 2015

You can now download a dataset of 1.65 billion Reddit comments: Beware the Redditor AI by Mic Wright.

From the post:

Once our species’ greatest trove of knowledge was the Library of Alexandria.

Now we have Reddit, a roiling mass of human ingenuity/douchebaggery that has recently focused on tearing itself apart like Tommy Wiseau in legendarily awful flick ‘The Room.’

But unlike the ancient library, the fruits of Reddit’s labors, good and ill, will not be destroyed in fire.

In fact, thanks to Jason Baumgartner of (aided by The Internet Archive), a dataset of 1.65 billion comments, stretching from October 2007 to May 2015, is now available to download.

The data – pulled using Reddit’s API – is made up of JSON objects, including the comment, score, author, subreddit, position in the comment tree and a range of other fields.

The uncompressed dataset weighs in at over 1TB, meaning it’ll be most useful for major research projects with enough resources to really wrangle it.

Technically, the archive is incomplete, but not significantly. After 14 months of work and many API calls, Baumgartner was faced with approximately 350,000 comments that were not available. In most cases that’s because the comment resides in a private subreddit or was simply removed.
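Since the dump is one JSON object per line, a streaming pass keeps memory use flat even at 1TB. The sample comments below are invented, and real objects carry many more fields:

```python
import json

# Two hypothetical lines in the dump's one-JSON-object-per-line format;
# real objects also carry score, gilded, parent_id, created_utc, etc.
lines = [
    '{"author": "alice", "subreddit": "datasets", "score": 42, "body": "nice dump"}',
    '{"author": "bob", "subreddit": "python", "score": 3, "body": "TIL"}',
]

def top_comments(jsonl, min_score=10):
    """Stream line-delimited JSON and yield (author, subreddit) for high scorers."""
    for line in jsonl:
        comment = json.loads(line)
        if comment["score"] >= min_score:
            yield comment["author"], comment["subreddit"]

print(list(top_comments(lines)))  # [('alice', 'datasets')]
```

Against the real archive you would iterate over a decompressing file object (e.g. `bz2.open`) instead of an in-memory list, and the generator works unchanged.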

If you don’t have a spare TB of space at the moment, you will also be interested in the BigQuery version of the dataset, where you will find several BigQueries already.

The full data set certainly makes an interesting alternative to the Turing test for AI. Can your AI generate, without assistance or access to this data set, the responses that appear therein? Is that a fair test for “intelligence?”

If you want updated data, consult the Reddit API.

New York Philharmonic Performance History

Sunday, June 28th, 2015

New York Philharmonic Performance History

From the post:

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

In an effort to make this data available for study, analysis, and reuse, the New York Philharmonic joins organizations like The Tate and the Cooper-Hewitt Smithsonian National Design Museum in making its own contribution to the Open Data movement.

The metadata here is released under the Creative Commons Public Domain CC0 licence. Please see the enclosed LICENCE file for more detail.

The data:

Field Description
General Info: Info that applies to entire program
id GUID (To view program:
ProgramID Local NYP ID
Orchestra Full orchestra name Learn more…
Season Defined as Sep 1 – Aug 31, displayed “1842-43”
Concert Info: Repeated for each individual performance within a program
eventType See term definitions
Location Geographic location of concert (Countries are identified by their current name. For example, even though the orchestra played in Czechoslovakia, it is now identified in the data as the Czech Republic)
Venue Name of hall, theater, or building where the concert took place
Date Full ISO date used, but ignore TIME part (1842-12-07T05:00:00Z = Dec. 7, 1842)
Time Actual time of concert, e.g. “8:00PM”
Works Info: the fields below are repeated for each work performed on a program. By matching the index number of each field, you can tell which soloist(s) and conductor(s) performed a specific work on each of the concerts listed above.
worksConductorName Last name, first name
worksComposerTitle Composer Last name, first / TITLE (NYP short titles used)
worksSoloistName Last name, first name (if multiple soloists on a single work, delimited by semicolon)
worksSoloistInstrument Instrument played by the soloist (if multiple soloists on a single work, delimited by semicolon)
worksSoloistRole “S” means “Soloist”; “A” means “Assisting Artist” (if multiple soloists on a single work, delimited by semicolon)
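Because the soloist fields are parallel semicolon-delimited lists, you have to zip them back together to get one record per soloist. A sketch (the sample values are mine, not from the NYP data):

```python
def split_soloists(names, instruments, roles):
    """Zip the parallel semicolon-delimited soloist fields into records."""
    role_names = {"S": "Soloist", "A": "Assisting Artist"}
    records = []
    for name, inst, role in zip(names.split(";"),
                                instruments.split(";"),
                                roles.split(";")):
        records.append({
            "name": name.strip(),
            "instrument": inst.strip(),
            "role": role_names.get(role.strip(), role.strip()),
        })
    return records

print(split_soloists("Kreisler, Fritz; Smith, Jane",
                     "Violin; Piano",
                     "S; A"))
```

Note the split is on ";" only, since "Last name, first name" means commas appear inside individual names.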

A great starting place for a topic map for performances of the New York Philharmonic or for combination with topic maps for composers or soloists.

I first saw this in a tweet by Anna Kijas.

New York City Subway Anthrax/Plague

Tuesday, May 5th, 2015

Spoiler Alert: This paper discusses a possible find of anthrax and plague DNA in the New York Subway. It concludes that either a related but harmless strain wasn’t considered and/or there was random sequencing error. In either case, it is a textbook example of the need for data skepticism.

Searching for anthrax in the New York City subway metagenome by Robert A Petit, III, Matthew Ezewudo, Sandeep J. Joseph, Timothy D. Read.

From the introduction:

In February 2015 Chris Mason and his team published an in-depth analysis of metagenomic data (environmental shotgun DNA sequence) from samples isolated from public surfaces in the New York City (NYC) subway system. Along with a ton of really interesting findings, the authors claimed to have detected DNA from the bacterial biothreat pathogens Bacillus anthracis (which causes anthrax) and Yersinia pestis (causes plague) in some of the samples. This predictably led to a huge interest from the press and scientists on social media. The authors followed up with a re-analysis of the data, where they showed some results that suggested the tools that they were using for species identification overcalled anthrax and plague.

The NYC subway metagenome study raised very timely questions about using unbiased DNA sequencing for pathogen detection. We were interested in this dataset as soon as the publication appeared and started looking deeper into why the analysis software gave false positive results and indeed what exactly was found in the subway samples. We decided to wrap up the results of our preliminary analysis and put it on this site. This report focuses on the results for B. anthracis but we also did some preliminary work on Y.pestis and may follow up on this later.

The article gives a detailed accounting of the tools and issues involved in the identification of DNA fragments from pathogens. It is hard core science but it also illustrates how iffy hard core science can be. Sure, you have the data, that doesn’t mean you will reach the correct conclusion from it.

The authors also mention a followup study by Chris Mason, the author of the original paper, entitled:

The long road from Data to Wisdom, and from DNA to Pathogen by Christopher Mason.

From the introduction:

There is an oft-cited progression from Data to Information to Knowledge to Wisdom (DIKW). Just because you have data, it takes some processing to get quality information, and even good information is not necessarily knowledge, and knowledge often requires context or application to become wisdom.

And from his conclusion:

But, perhaps the bigger issue is social. I confess I grossly underestimated how the press would sensationalize these results, and even the Department of Health (DOH) did not believe it, claiming it simply could not be true. We sent the MTA and the DOH our first draft upon submission in October 2014, the raw and processed data, as well as both of our revised drafts in December 2014 and January 2015, and we did get some feedback, but they also had other concerns at the time (Ebola guy in the subway). This is also different from what they normally do (PCR for specific targets), so we both learned from each other. Yet, upon publication, it was clear that Twitter and blogs provided some of the same scrutiny as the three reviewers during the two rounds of peer review. But, they went even deeper and dug into the raw data, within hours of the paper coming online, and I would argue that online reviewers have become an invaluable part of scientific publishing. Thus, published work is effectively a living entity before (bioRxiv), during (online), and after publication (WSJ, Twitter, and others), and online voices constitute a critical, ensemble 4th reviewer.

Going forward, the transparency of the methods, annotations, algorithms, and techniques has never been more essential. To this end, we have detailed our work in the supplemental methods, but we have also posted complete pipelines in this blog post on how to go from raw data to annotated species on Galaxy. Even better, the precise virtual machines and instantiation of what was run on a server needs to be tracked and guaranteed to be 100% reproducible. For example, for our .vcf characterizations of the human alleles, we have set up our entire pipeline on Arvados/Curoverse, free to use, so that anyone can take a .vcf file and run the exact same ancestry analyses and get the exact same results. Eventually, tools like this can automate and normalize computational aspects of metagenomics work, which is an ever-increasingly important component of genomics.


Data –>Information –>Knowledge –>Wisdom (DIKW).

sounds like:

evidence based data science.

to me.

Another quick point, note that Chris Mason and team made all their data available for others to review and Chris states that informal review was a valuable contributor to the scientific process.

That is an illustration of the value of transparency. Contrast that with the Obama Administration’s default position of opacity. Which one do you think serves a fact finding process better?

Perhaps that is the answer. The Obama administration isn’t interested in a fact finding process. It has found the “facts” that it wants and reaches its desired conclusions. What is there left to question or discuss?

800,000 NPR Audio Files!

Wednesday, April 29th, 2015

There Are Now 800,000 Reasons To Share NPR Audio On Your Site by Patrick Cooper.

From the post:

From NPR stories to shows to songs, today we’re making more than 800,000 pieces of our audio available for you to share around the Web. We’re throwing open the doors to embedding, putting our audio on your site.

Complete with simple instructions for embedding!

I often think of topic maps when listening to NPR so don’t be surprised if you start seeing embedded NPR audio in the very near future!


Twitter cuts off ‘firehose’ access…

Monday, April 20th, 2015

Twitter cuts off ‘firehose’ access, eyes Big Data bonanza by Mike Wheatley.

From the post:

Twitter upset the applecart on Friday when it announced it would no longer license its stream of half a billion daily tweets to third-party resellers.

The social media site said it had decided to terminate all current agreements with third parties to resell its ‘firehose’ data – an unfiltered, full stream of tweets and all of the metadata that comes with them. For companies that still wish to access the firehose, they’ll still be able to do so, but only by licensing the data directly from Twitter itself.

Twitter’s new plan is to use its own Big Data analytics team, which came about as a result of its acquisition of Gnip in 2014, to build direct relationships with data companies and brands that rely on Twitter data to measure market trends, consumer sentiment and other metrics that can be best understood by keeping track of what people are saying online. The company hopes to complete the transition by August this year.

Not that I had any foreknowledge of Twitter’s plans but I can’t say this latest move is all that surprising.

What I hope also emerges from the “new plan” is a fixed pricing structure for smaller users of Twitter content. I’m really not interested in an airline pricing model where the price you pay has no rational relationship to the value of the product. If it’s the day before the end of a sales quarter I get a very different price for a Twitter feed than mid-way through the quarter. That sort of thing.

I would also like to be able to specify users to follow, searches, and tweet streams in daily increments of 250,000, 500,000, 750,000, or 1,000,000 tweets, spooled for daily pickup over high-speed connections (to put less stress on infrastructure).

I suppose renewable contracts would be too much to ask? 😉

Unannotated Listicle of Public Data Sets

Monday, April 20th, 2015

Great Github list of public data sets by Mirko Krivanek.

A large list of public data sets, previously published on GitHub, but with no annotations to guide you to particular datasets.

Just in case you know of any legitimate aircraft wiring sites, i.e., ones that existed prior to the GAO report on hacking aircraft networks, ping me with the links. Thanks!

33% of Poor Business Decisions Track Back to Data Quality Issues

Tuesday, April 7th, 2015

Stupid errors in spreadsheets could lead to Britain’s next corporate disaster by Rebecca Burn-Callander.

From the post:

Errors in company spreadsheets could be putting billions of pounds at risk, research has found. This is despite high-profile spreadsheet catastrophes, such as the collapse of US energy giant Enron, ringing alarm bells more than a decade ago.

Almost one in five large businesses have suffered financial losses as a result of errors in spreadsheets, according to F1F9, which provides financial modelling and business forecasting to blue chips firms. It warns of looming financial disasters as 71pc of large British business always use spreadsheets for key financial decisions.

The company’s new whitepaper entitled Capitalism’s Dirty Secret showed that the abuse of the humble spreadsheet could have far-reaching consequences. Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion and the UK manufacturing sector uses spreadsheets to make pricing decisions for up to £170bn worth of business.

Felienne Hermans, of Delft University of Technology, analysed 15,770 spreadsheets obtained from over 600,000 emails from 158 former employees. He found 755 files with more than a hundred errors, with the maximum number of errors in one file being 83,273.

Dr Hermans said: “The Enron case has given us a unique opportunity to look inside the workings of a major corporate organisation and see first hand how widespread poor spreadsheet practice really is.

First, a gender correction: Dr. Hermans is not a he. The post should read: “She found 755 files with more than….”

Second, how bad is poor spreadsheet quality? The download page has this summary:

  • 33% of large businesses report poor decision making due to spreadsheet problems.
  • Nearly 1 in 5 large businesses have suffered direct financial loss due to poor spreadsheets.
  • Spreadsheets are used in the preparation of British company accounts worth up to £1.9 trillion.

You read that correctly, not that 33% of spreadsheet have quality issues but that 33% of poor business decisions can be traced to spreadsheet problems.

A comment to the blog post supplied a link for the report: A Research Report into the Uses and Abuses of Spreadsheets.

Spreadsheets are small to medium sized data.

Care to comment on the odds of big data and its processes pushing the percentage of poor business decisions past 33%?

How would you discover you are being misled by big data and/or its processing?

How do you validate the results of big data? Run another big data process?

When you hear sales pitches about big data, be sure to ask about the impact of dirty data. If assured that your domain doesn’t have a dirty data issue, grab your wallet and run!
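One crude way to get a feel for dirty data before trusting a spreadsheet-derived dataset is to count obvious error cells. A minimal sketch (the error tokens and the sample data below are my own illustration, not from the report):

```python
import csv
import io

# Common error tokens that spreadsheet applications write into cells.
ERROR_TOKENS = {"#REF!", "#DIV/0!", "#VALUE!", "#NAME?", "#N/A"}

def count_error_cells(csv_text):
    """Return the number of cells containing a spreadsheet error token."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(1 for row in reader for cell in row
               if cell.strip() in ERROR_TOKENS)

# Invented example: two broken cells in a three-row sheet.
sample = "revenue,cost\n100,#REF!\n#DIV/0!,50\n"
print(count_error_cells(sample))  # → 2
```

A count of zero proves nothing, of course; it only catches the errors the spreadsheet itself already flagged.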

PS: A Research Report into the Uses and Abuses of Spreadsheets is a must have publication.

The report itself is useful, but Appendix A 20 Principles For Good Spreadsheet Practice is a keeper. With a little imagination all of those principles could be applied to big data and its processing.

Just picking one at random:

3. Ensure that everyone involved in the creation or use of spreadsheets has an appropriate level of knowledge and understanding.

For big data, reword that to:

Ensure that everyone involved in the creation or use of big data has an appropriate level of knowledge and understanding.

Your IT staff are trained, but do the managers who will use the results understand the limitations of the data and/or its processing? Or do they follow the results because “the data says so?”

Wiki New Zealand

Tuesday, February 24th, 2015

Wiki New Zealand

From the about page:

It’s time to democratise data. Data is a language in which few are literate, and the resulting constraints at an individual and societal level are similar to those experienced when the proportion of the population able to read was small. When people require intermediaries before digesting information, the capacity for generating insights is reduced.

To democratise data we need to put users at the centre of our models, we need to design our systems and processes for users of data, and we need to realise that everyone can be a user. We will all win when everyone can make evidence-based decisions.

Wiki New Zealand is a charity devoted to getting people to use data about New Zealand.

We do this by pulling together New Zealand’s public sector, private sector and academic data in one place and making it easy for people to use in simple graphical form for free through this website.

We believe that informed decisions are better decisions. There is a lot of data about New Zealand available online today, but it is too difficult to access and too hard to use. We think that providing usable, clear, digestible and unbiased information will help you make better decisions, and will lead to better outcomes for you, for your community and for New Zealand.

We also believe that by working together we can build the most comprehensive, useful and accurate representation of New Zealand’s situation and performance: the “wiki” part of the name means “collaborative website”. Our site is open and free to use for everyone. Soon, anyone will be able to upload data and make graphs and submit them through our auditing process. We are really passionate about engaging with domain and data experts on their speciality areas.

We will not tell you what to think. We present topics from multiple angles, in wider contexts and over time. All our data is presented in charts that are designed to be compared easily with each other and constructed with as little bias as possible. Our job is to present data on a wide range of subjects relevant to you. Your job is to draw your own conclusions, develop your own opinions and make your decisions.

Whether you want to make a business decision based on how big your market is, fact-check a newspaper story, put together a school project, resolve an argument, build an app based on clean public licensed data, or just get to know this country better, we have made this for you.

Isn’t New Zealand a post-apocalypse destination? However great it may be now, the neighborhood is going down when all the post-apocalypse folks arrive. Something on the order of Mister Rogers’ Neighborhood to Mad Max Beyond Thunderdome. 😉

Hopefully, if there is an apocalypse, it will happen quickly enough to prevent a large influx of undesirables into New Zealand.

I first saw this in a tweet by Neil Saunders.

Yelp Dataset Challenge

Saturday, February 21st, 2015

Yelp Dataset Challenge

From the webpage:

Yelp Dataset Challenge is doubling up: Now 10 cities across 4 countries! Two years, four highly competitive rounds, over $35,000 in cash prizes awarded and several hundred peer-reviewed papers later: the Yelp Dataset Challenge is doubling up. We are proud to announce our latest dataset that includes information about local businesses, reviews and users in 10 cities across 4 countries. The Yelp Challenge dataset is much larger and richer than the Academic Dataset. This treasure trove of local business data is waiting to be mined and we can’t wait to see you push the frontiers of data science research with our data.

The Challenge Dataset:

  • 1.6M reviews and 500K tips by 366K users for 61K businesses
  • 481K business attributes, e.g., hours, parking availability, ambience.
  • Social network of 366K users for a total of 2.9M social edges.
  • Aggregated check-ins over time for each of the 61K businesses

The deadline for the fifth round of the Yelp Dataset Challenge is June 30, 2015. Submit your project to Yelp via the challenge page; you can submit a research paper, video presentation, slide deck, website, blog, or any other medium that conveys your use of the Yelp Dataset Challenge data.

Pitched at students but it is an interesting dataset.

I first saw this in a tweet by Marin Dimitrov.
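The challenge data is distributed as JSON; assuming one object per line (as past Yelp dumps used), a first pass might group businesses by city. The field names and records below are invented stand-ins, not the official schema:

```python
import json

# Hypothetical records in a JSON-lines layout; real Yelp field
# names may differ, so check the dump before relying on these.
lines = [
    '{"business_id": "b1", "city": "Phoenix", "stars": 4.0}',
    '{"business_id": "b2", "city": "Edinburgh", "stars": 3.5}',
    '{"business_id": "b3", "city": "Phoenix", "stars": 5.0}',
]

def businesses_by_city(json_lines):
    """Parse JSON-lines business records and group their ids by city."""
    by_city = {}
    for line in json_lines:
        record = json.loads(line)
        by_city.setdefault(record["city"], []).append(record["business_id"])
    return by_city

print(businesses_by_city(lines))
# → {'Phoenix': ['b1', 'b3'], 'Edinburgh': ['b2']}
```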

Data as “First Class Citizens”

Tuesday, February 10th, 2015

Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title, but not all data is going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. It is a great essay on using multiple sources to annotate entities once disambiguation has occurred. But some entities are going to have more annotations than others, and some entities may not be recognized at all. Not to mention that the markup containing those entities was, in all likelihood, never annotated at all.

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As well as the instructions for processing that data? Perhaps not in every case, but shouldn’t data and the formats that hold data be treated equally as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built, yet we have not one accurate report on the process.

Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.

Data Sources on the Web

Sunday, February 1st, 2015

Data Sources on the Web

From the post:

The following list of data sources has been modified as of January 2015. Most of the data sets listed below are free, however, some are not.

If an (R!) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what's out there.

Want to add to or update this list? Send to

As you know, there are any number of data lists on the Net. This one is different: it is a maintained data list.

I first saw this in a tweet by David Smith.

MemSQL releases a tool to easily ship big data into its database

Thursday, January 1st, 2015

MemSQL releases a tool to easily ship big data into its database by Jordan Novet.

From the post:

Like other companies pushing databases, San Francisco startup MemSQL wants to solve low-level problems, such as easily importing data from critical sources. Today MemSQL is acting on that impulse by releasing a tool to send data from the S3 storage service on the Amazon Web Services cloud and from the Hadoop open-source file system into its proprietary in-memory SQL database — or the open-source MySQL database.

Engineers can try out the new tool, named MemSQL Loader, today, now that it’s been released under an open-source MIT license.

The existing “LOAD DATA” command in MemSQL and MySQL can bring data in, although it has its shortcomings, as Wayne Song, a software engineer at the startup, wrote in a blog post today. Song and his colleagues ran into those snags and started coding.

How very cool!

Not every database project seeks to “easily import… data from critical sources,” but I am very glad to see MemSQL take up the challenge.

Reducing the friction between data stores and tools will make data pipelines more robust, reducing the amount of time spent troubleshooting routine data traffic issues and increasing the time spent on analysis that fuels your ROI from data science.
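For context, the “LOAD DATA” command the post refers to is plain SQL. A hedged sketch of assembling one from Python follows; the file path and table name are invented, and real code should use a client library with proper identifier validation rather than string interpolation:

```python
def build_load_data_sql(csv_path, table, fields_terminated=","):
    """Build a MySQL/MemSQL LOAD DATA statement for a local CSV file.

    Illustrative only: interpolating paths and table names into SQL
    is unsafe with untrusted input; a driver should handle quoting.
    """
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{fields_terminated}' "
        f"IGNORE 1 LINES"
    )

# Hypothetical file and table names.
sql = build_load_data_sql("/tmp/events.csv", "events")
print(sql)
```

MemSQL Loader exists precisely because one-shot statements like this get awkward at scale (many files, retries, parallelism).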

True enough, making ASCII imports a task that requires custom assistance from your staff is one business model. On the whole I would not say it is a very viable one, particularly with more production-minded folks like MemSQL around.

What database are you going to extend MemSQL Loader to support?

Jeopardy! clues data

Sunday, November 30th, 2014

Jeopardy! clues data

Nathan Yau writes:

Here’s some weekend project data for you. Reddit user trexmatt dumped a dataset for 216,930 Jeopardy! questions and answers in JSON and CSV formats, a scrape from the J! Archive. Each clue is represented by category, money value, the clue itself, the answer, round, show number, and air date.

Nathan suggests hunting for Daily Doubles but then discovers someone has done that. (See Nathan’s post for the details.)
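A starting point for weekend exploration might be tallying clues by round. A sketch against an inline stand-in for the JSON dump (the exact key names in the real dump may differ, so verify them first):

```python
import json

# Inline stand-in for the scraped dump; real records carry more
# fields (category, value, question, answer, show number, air date).
clues_json = '''[
  {"category": "HISTORY", "value": "$400", "round": "Jeopardy!"},
  {"category": "SCIENCE", "value": null, "round": "Final Jeopardy!"},
  {"category": "HISTORY", "value": "$1,000", "round": "Double Jeopardy!"}
]'''

def count_by_round(clues):
    """Tally how many clues appear in each round."""
    counts = {}
    for clue in clues:
        counts[clue["round"]] = counts.get(clue["round"], 0) + 1
    return counts

clues = json.loads(clues_json)
print(count_by_round(clues))
# → {'Jeopardy!': 1, 'Final Jeopardy!': 1, 'Double Jeopardy!': 1}
```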


Land Matrix

Friday, November 21st, 2014

Land Matrix: The Online Public Database on Land Deals

From the webpage:

The Land Matrix is a global and independent land monitoring initiative that promotes transparency and accountability in decisions over land and investment.

This website is our Global Observatory – an open tool for collecting and visualising information about large-scale land acquisitions.

The data represented here is constantly evolving; to make this resource more accurate and comprehensive, we encourage your participation.

The deals collected as data must meet the following criteria:

• Entail a transfer of rights to use, control or ownership of land through sale, lease or concession;
• Have been initiated since the year 2000;
• Cover an area of 200 hectares or more;
• Imply the potential conversion of land from smallholder production, local community use or important ecosystem service provision to commercial use.

FYI, 200 hectares = 2 square kilometers.

Land ownership and its transfer are matters of law and law means government.

The project describes its data this way:

The dataset is inherently unreliable, but over time it is expected to become more accurate. Land deals are notoriously un-transparent. In many countries, established procedures for decision-making on land deals do not exist, and negotiations and decisions do not take place in the public realm. Furthermore, a range of government agencies and levels of government are usually responsible for approving different kinds of land deals. Even official data sources in the same country can therefore vary, and none may actually reflect reality on the ground. Decisions are often changed, and this may or may not be communicated publically.

I would start earlier than the year 2000 but the same techniques could be applied along the route of the Keystone XL pipeline. I am assuming that you are aware that pipelines, roads and other public works are not located purely for physical or aesthetic reasons. Yes?

Please take the time to view and support the Land Matrix project and consider similar projects in your community.

If the owners can be run to ground, you may find the parties to the transactions are linked by other “associations.”

October 2014 Crawl Archive Available

Friday, November 21st, 2014

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.
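The prefixing step the post describes can be sketched in a few lines; the bucket prefix comes from the post, while the sample path below is invented for illustration:

```python
S3_PREFIX = "s3://aws-publicdatasets/"

def to_s3_urls(paths):
    """Turn relative crawl-archive paths into full S3 URLs."""
    return [S3_PREFIX + p.strip() for p in paths if p.strip()]

# Invented sample line in the shape of a crawl path listing.
sample = ["common-crawl/crawl-data/CC-MAIN-2014-42/segment-0/warc-0.warc.gz\n"]
print(to_s3_urls(sample))
# → ['s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/segment-0/warc-0.warc.gz']
```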

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration! 😉


CERN frees LHC data

Friday, November 21st, 2014

CERN frees LHC data

From the post:

Today CERN launched its Open Data Portal, which makes data from real collision events produced by LHC experiments available to the public for the first time.

“Data from the LHC program are among the most precious assets of the LHC experiments, that today we start sharing openly with the world,” says CERN Director General Rolf Heuer. “We hope these open data will support and inspire the global research community, including students and citizen scientists.”

The LHC collaborations will continue to release collision data over the coming years.

The first high-level and analyzable collision data openly released come from the CMS experiment and were originally collected in 2010 during the first LHC run. Open source software to read and analyze the data is also available, together with the corresponding documentation. The CMS collaboration is committed to releasing its data three years after collection, after they have been thoroughly studied by the collaboration.

“This is all new and we are curious to see how the data will be re-used,” says CMS data preservation coordinator Kati Lassila-Perini. “We’ve prepared tools and examples of different levels of complexity from simplified analysis to ready-to-use online applications. We hope these examples will stimulate the creativity of external users.”

In parallel, the CERN Open Data Portal gives access to additional event data sets from the ALICE, ATLAS, CMS and LHCb collaborations that have been prepared for educational purposes. These resources are accompanied by visualization tools.

All data on are shared under a Creative Commons CC0 public domain dedication. Data and software are assigned unique DOI identifiers to make them citable in scientific articles. And software is released under open source licenses. The CERN Open Data Portal is built on the open-source Invenio Digital Library software, which powers other CERN Open Science tools and initiatives.

Awesome is the only term for this data release!

But when you dig just a little bit further, you discover that embargoes still exist on three (3) out of four (4) experiments, both on data and software.

Disappointing, but hopefully a dying practice when it comes to publicly funded data.

I first saw this in a tweet by Ben Evans.

Over 1,000 research data repositories indexed in

Thursday, November 20th, 2014

Over 1,000 research data repositories indexed in

From the post:

In August 2012 – the Registry of Research Data Repositories went online with 23 entries. Two years later the registry provides researchers, funding organisations, libraries and publishers with over 1,000 listed research data repositories from all over the world making it the largest and most comprehensive online catalog of research data repositories on the web. provides detailed information about the research data repositories, and its distinctive icons help researchers easily identify relevant repositories for accessing and depositing data sets.

To more than 5,000 unique visitors per month the registry offers reliable orientation in the heterogeneous landscape of research data repositories. An average of 10 repositories are added to the registry every week. The latest indexed data infrastructure is the new CERN Open Data Portal.

Add it to your short list of major data repositories!

Science fiction fanzines to be digitized as part of major UI initiative

Wednesday, November 19th, 2014

Science fiction fanzines to be digitized as part of major UI initiative by Kristi Bontrager.

From the post:

The University of Iowa Libraries has announced a major digitization initiative, in partnership with the UI Office of the Vice President for Research and Economic Development. 10,000 science fiction fanzines will be digitized from the James L. “Rusty” Hevelin Collection, representing the entire history of science fiction as a popular genre and providing the content for a database that documents the development of science fiction fandom.

Hevelin was a fan and a collector for most of his life. He bought pulp magazines from newsstands as a boy in the 1930s, and by the early 1940s began attending some of the first organized science fiction conventions. He remained an active collector, fanzine creator, book dealer, and fan until his death in 2011. Hevelin’s collection came to the UI Libraries in 2012, contributing significantly to the UI Libraries’ reputation as a major international center for science fiction and fandom studies.

Interesting content for many of us but an even more interesting workflow model for the content:

Once digitized, the fanzines will be incorporated into the UI Libraries’ DIY History interface, where a select number of interested fans (up to 30) will be provided with secure access to transcribe, annotate, and index the contents of the fanzines. This group will be modeled on an Amateur Press Association (APA) structure, a fanzine distribution system developed in the early days of the medium that required contributions of content from members in order to qualify for, and maintain, membership in the organization. The transcription will enable the UI Libraries to construct a full-text searchable fanzine resource, with links to authors, editors, and topics, while protecting privacy and copyright by limiting access to the full set of page images.

The similarity between the Amateur Press Association (APA) structure and modern open source projects is interesting. I checked the APA’s homepage; they have a more traditional membership fee now.

The Hevelin Collection homepage.