Archive for the ‘Dataset’ Category

Introducing Kaggle Datasets [No Data Feudalism Here]

Saturday, January 23rd, 2016

Introducing Kaggle Datasets

From the post:

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Kaggle Datasets has four core components:

  • Access: simple, consistent access to the data with clear licensing
  • Analysis: a way to explore the data without downloading it
  • Results: visibility to the previous work that’s been created on the data
  • Conversation: forums and comments for discussing the nuances of the data

Are you interested in publishing one of your datasets on Kaggle Datasets? Submit a sample here.

Unlike some data feudalists who publish in the New England Journal of Medicine, Kaggle not only makes the datasets freely available but also offers tools to help you along.

Kaggle will even assist you in making your own datasets available.

Yahoo News Feed dataset, version 1.0 (1.5TB) – Sorry, No Open Data At Yahoo!

Thursday, January 14th, 2016

R10 – Yahoo News Feed dataset, version 1.0 (1.5TB)

From the webpage:

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.

The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.

The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.

A great data set but one you aren’t going to see unless you have a university email account.

I thought that when the site took my regular Yahoo! login and I accepted the license agreement, I was in. Not a chance!

No open data at Yahoo!

Why Yahoo! would have such a restriction, particularly in light of the progress towards open data, is a complete mystery.

To be honest, even if I heard Yahoo!’s “reasons,” I doubt I would find them convincing.

If you have a university email address, good for you, download and use the data.

If you don’t have a university email address, can you ping me with the email of a decision maker at Yahoo! who can void this no open data policy?


Looking after Datasets

Tuesday, September 1st, 2015

Looking after Datasets by Antony Unwin.

Some examples that Antony uses to illustrate the problems with datasets in R:

You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.

As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.

The issue came up because I was looking for a dataset of the month for the website of my book "Graphical Data Analysis with R". The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising and what’s more it is related to Darwin’s work and there was what looked like an original reference from 1904.

As if Antony’s list of issues wasn’t enough, how do you capture your understanding of a problem with a dataset?

That is, you have discovered the meaning of a variable that isn’t recorded with the dataset. Where are you going to put that information?

You could modify the original dataset to capture that new information but then people will have to discover your version of the original dataset. Not to mention you need to avoid stepping on something else in the original dataset.

Antony concludes:

…returning to Moore’s definition of data, wouldn’t it be a help to distinguish proper datasets from mere sets of numbers in R?

Most people have an intersecting idea of a “proper dataset” but I would spend less time trying to define that and more on capturing the context of whatever appears to me to be a “proper dataset.”

More data is never a bad thing.

Unannotated Listicle of Public Data Sets

Monday, April 20th, 2015

Great Github list of public data sets by Mirko Krivanek.

A large list of public datasets, previously published on GitHub, but with no annotations to guide you to particular datasets.

Just in case you know of any legitimate aircraft wiring sites, i.e., ones that existed prior to the GAO report on hacking aircraft networks, ping me with the links. Thanks!

Student Data Sets

Wednesday, December 10th, 2014

Christopher Lortie tweeted today that his second year ecology students have posted 415 datasets this year!

Which is a great example for others!

However, how do other people find these and similar datasets?

Not a criticism of the students or their datasets but a reminder that findability remains an unsolved issue.

Exemplar Public Health Datasets

Friday, November 14th, 2014

Exemplar Public Health Datasets, edited by Jonathan Tedds.

From the post:

This special collection contains papers describing exemplar public health datasets published as part of the Enhancing Discoverability of Public Health and Epidemiology Research Data project commissioned by the Wellcome Trust and the Public Health Research Data Forum.

The publication of the datasets included in this collection is intended to promote faster progress in improving health, better value for money and higher quality science, in accordance with the joint statement made by the forum members in January 2011.

Submission to this collection is by invitation only, and papers have been peer reviewed. The article template and instructions for submission are available here.

Data for analysis as well as examples of best practices for public health datasets.


I first saw this in a tweet by Christophe Lallanne.

Open science in machine learning

Wednesday, February 26th, 2014

Open science in machine learning by Joaquin Vanschoren, Mikio L. Braun, and Cheng Soon Ong.


We present OpenML and mldata, open science platforms that provide easy access to machine learning data, software and results to encourage further study and application. They go beyond the more traditional repositories for data sets and software packages in that they allow researchers to also easily share the results they obtained in experiments and to compare their solutions with those of others.

From 2 OpenML:

OpenML is a website where researchers can share their data sets, implementations and experiments in such a way that they can easily be found and reused by others. It offers a web API through which new resources and results can be submitted automatically, and it is being integrated into a number of popular machine learning and data mining platforms, such as Weka, RapidMiner, KNIME, and data mining packages in R, so that new results can be submitted automatically. Vice versa, it enables researchers to easily search for certain results (e.g. evaluations of algorithms on a certain data set), to directly compare certain techniques against each other, and to combine all submitted data in advanced queries.
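As a sketch of the kind of "certain results" query the web API supports, here is a hedged Python fragment. The endpoint path and the JSON record shape are my assumptions for illustration, not OpenML's documented API; check their API documentation before relying on either:

```python
from urllib.parse import urlencode

# Hypothetical endpoint layout -- an assumption for illustration,
# not OpenML's documented API.
OPENML_BASE = ""

def evaluation_query_url(task_id, measure="predictive_accuracy"):
    """Build a URL asking for evaluations of algorithms on one task,
    the kind of result search the post describes."""
    return f"{OPENML_BASE}/evaluation/list/task/{task_id}?" + urlencode(
        {"measure": measure})

def best_flow(evaluations):
    """Pick the top-scoring flow (algorithm implementation) from decoded
    evaluation records; the {'flow': ..., 'value': ...} shape is assumed."""
    return max(evaluations, key=lambda e: e["value"])["flow"]
```

Once the response is decoded, directly comparing techniques against each other reduces to sorting the records by their evaluation value.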

From 3 mldata:

mldata is a community-based website for the exchange of machine learning data sets. Data sets can either be raw data files or collections of files, or use one of the supported file formats like HDF5 or ARFF, in which case mldata looks at metadata contained in the files to display more information. Similar to OpenML, mldata can define learning tasks based on data sets, where mldata currently focuses on supervised learning data. Learning tasks identify which features are used for input and output and also which score is used to evaluate the functions. mldata also allows users to create learning challenges by grouping learning tasks together, and lets users submit results in the form of predicted labels which are then automatically evaluated.

Interesting sites.

They do raise the question: who will index the indexers of datasets?

I first saw this in a tweet by Stefano Betolo.

Social Science Dataset Prize!

Wednesday, January 22nd, 2014

Statwing is awarding $1,500 for the best insights from its massive social science dataset by Derrick Harris.

All submissions are due through the form on this page by January 30 at 11:59pm PST.

From the post:

Statistics startup Statwing has kicked off a competition to find the best insights from a 406-variable social science dataset. Entries will be voted on by the crowd, with the winner getting $1,000, second place getting $300 and third place getting $200. (Check out all the rules on the Statwing site.) Even if you don’t win, though, it’s a fun dataset to play with.

The data comes from the General Social Survey and dates back to 1972. It contains variables ranging from sex to feelings about education funding, from education level to whether respondents think homosexual men make good parents. I spent about an hour slicing and dicing variables within the Statwing service, and found some at least marginally interesting stuff. Contest entries can use whatever tools they want, and all 79 megabytes and 39,662 rows are downloadable from the contest page.

Time is short so you better start working.

The rules page, where you make your submission, emphasizes:

Note that this is a competition for the most interesting finding(s), not the best visualization.

Use any tool or method, just find the “most interesting finding(s)” as determined by crowd vote.

On the dataset:

Every other year since 1972, the General Social Survey (GSS) has asked thousands of Americans 90 minutes of questions about religion, culture, beliefs, sex, politics, family, and a lot more. The resulting dataset has been cited by more than 14,000 academic papers, books, and dissertations—more than any except the U.S. Census.

I can’t decide if Americans have more odd opinions now than before. 😉

Maybe some number crunching will help with that question.
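A first pass at that number crunching could be a simple cross-tabulation over the downloaded CSV. A sketch with the Python standard library; the column names here are invented stand-ins, not actual GSS variable names:

```python
import csv
from collections import Counter
from io import StringIO

def crosstab(rows, var_a, var_b):
    """Count co-occurrences of two survey variables from csv.DictReader
    rows. The variable names used below are invented stand-ins for
    real GSS column names."""
    counts = Counter()
    for row in rows:
        counts[(row[var_a], row[var_b])] += 1
    return counts

# Tiny inline sample standing in for the 39,662-row download.
sample = "sex,educ\nmale,college\nfemale,college\nfemale,hs\n"
table = crosstab(csv.DictReader(StringIO(sample)), "sex", "educ")
```

Swap `StringIO(sample)` for `open("gss.csv")` and any pair of the 406 variables to start hunting for interesting findings.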

Everpix-Intelligence [Failed Start-up Data Set]

Sunday, January 12th, 2014


From the webpage:

About Everpix

Everpix was started in 2011 with the goal of solving the Photo Mess, an increasingly real pain point in people’s life photo collections, through ambitious engineering and user experience. Our startup was angel and VC funded with $2.3M raised over its lifetime.

After 2 years of research and product development, and although having a very enthusiastic user base of early adopters combined with strong PR momentum, we didn’t succeed in raising our Series A in the highly competitive VC funding market. Unable to continue operating our business, we had to announce our upcoming shutdown on November 5th, 2013.

High-Level Metrics

At the time of its shutdown announcement, the Everpix platform had 50,000 signed up users (including 7,000 subscribers) with 400 million photos imported, while generating subscription sales of $40,000 / month during the last 3 months (i.e. enough money to cover variable costs, but not the fixed costs of the business).

Complete Dataset

Building a startup is about taking on a challenge and working countless hours on solving it. Most startups do not make it, but rarely do they reveal the story behind it, often leaving their users frustrated. Because we wanted the Everpix community to understand some of the dynamics in the startup world and why we had to come to such a painful ending, we worked closely with a reporter from The Verge who chronicled our last couple weeks. The resulting article generated extensive coverage and also some healthy discussions around some of our high-level metrics and financials. There was a lot more internal data we wanted to share but it wasn’t the right time or place.

With the Everpix shutdown behind us, we had the chance to put together a significant dataset covering our business from fundraising to metrics. We hope this rare and uncensored inside look at the internals of a startup will benefit the startup community.

Here are some examples of common startup questions this dataset helps answer:

  • What are investment terms for consecutive convertible notes and an equity seed round? What does the end cap table look like? (see here)
  • How does a Silicon Valley startup spend its raised money during 2 years? (see here)
  • What does a VC pitch deck look like? (see here)
  • What kinds of reasons do VCs give when they pass? (see here)
  • What are the open rate and click rate of transactional and marketing emails? (see here)
  • What web traffic do various news websites generate? (see here and here)
  • What is the conversion rate from product landing page to sign-up for new visitors? (see here)
  • How fast do people purchase a subscription after signing up to a freemium service? (see here and here)
  • Which countries have higher subscription rates? (see here and here)

The dataset is organized as follows:

Every IT startup, but especially data-oriented startups, should work with this data set before launch.

I thought the comments from VCs were particularly interesting.

I would summarize those comments as:

  1. There is a problem.
  2. You have a great idea to solve the problem.
  3. Will consumers pay you to solve the problem?

What evidence do you have on #3?

Bearing in mind that “should,” “ought to,” “the value is obvious,” etc., are wishes, not evidence.

I first saw this in a tweet by Emil Eifrem.

Big data sets available for free

Wednesday, January 1st, 2014

Big data sets available for free by Vincent Granville.

From the post:

A few data sets are accessible from our data science apprenticeship web page.

(graphic omitted)

  • Source code and data for our Big Data keyword correlation API (see also section in separate chapter, in our book)
  • Great statistical analysis: forecasting meteorite hits (see also section in separate chapter, in our book)
  • Fast clustering algorithms for massive datasets (see also section in separate chapter, in our book)
  • 53.5 billion clicks dataset available for benchmarking and testing
  • Over 5,000,000 financial, economic and social datasets
  • New pattern to predict stock prices, multiplies return by factor 5 (stock market data, S&P 500; see also section in separate chapter, in our book)
  • 3.5 billion web pages: The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages
  • Another large data set – 250 million data points: This is the full resolution GDELT event dataset running January 1, 1979 through March 31, 2013 and containing all data fields for each event record.
  • 125 Years of Public Health Data Available for Download

Just in case you are looking for data for a 2014 demo or data project!

Curated Dataset Lists

Tuesday, December 31st, 2013

6 dataset lists curated by data scientists by Scott Haylon.

From the post:

Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar. We’re saving bookmarks and sharing datasets with our team on a nearly-daily basis.

There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists.

Below is a collection of six great dataset lists from both famous data scientists and those who aren’t well-known:

Here you will find lists of datasets by:

  • Peter Skomoroch
  • Hilary Mason
  • Kevin Chai
  • Jeff Hammerbacher
  • Jerry Smith
  • Gregory Piatetsky-Shapiro

Great lists of datasets; unfortunately, they are neither deduped nor ranked by the number of collections in which they appear.

Requesting Datasets from the Federal Government

Friday, December 13th, 2013

Requesting Datasets from the Federal Government by Eruditio Loginquitas.

From the post:

Much has been made of “open government” of late, with the U.S.’s federal government releasing tens of thousands of data sets from pretty much all public-facing offices. Many of these sets are available off of their respective websites. Many are offered in a centralized way through the federal government’s central data portal. I finally spent some time on this site in search of datasets with location data to continue my learning of Tableau Public (with an eventual planned move to ArcMap).

I’ve been appreciating how much data are required to govern effectively but also how much data are created in the work of governance, particularly in an open and transparent society. There are literally billions of records and metrics required to run an efficient modern government. In a democracy, the tendency is to make information available—through sunshine laws and open meetings laws and data requests. The openness is particularly pronounced in cases of citizen participation, academic research, and journalistic requests. These are all aspects of a healthy interchange between citizens and their government…and further, digital government.

Public Requests for Data

One of the more charming aspects of the site involves a public thread which enables people to request the creation of certain data sets by developers. People would make the case for the need for certain information. Some would offer “trades” by making promises about how they would use the data and what they would make available to the larger public. Others would simply make a request for the data. Still others would just post “requests,” which were actually just political or personal statements.

What datasets would you like to see?

The rejected requests can be interesting, for example:

  • Properties Owned by Congressional Members: Rejected
  • Congressional voting records: Rejected

I don’t think the government has detailed information sufficient to answer the one about property owned by members of Congress.

On the other hand, there are only 535 members, so manual data mining in each state should turn up most of the public information fairly easily. The information that isn’t public could be more difficult.

The voting records request is puzzling since that is public record. And various rant groups print up their own analysis of voting records.

I don’t know, given the number of requests “Under Review,” if it would be a good use of time, but requesting the data behind opaque reports might illuminate the areas being hidden from transparency.

Quandl exceeds 7.8 million datasets!

Monday, December 9th, 2013

From The R Backpages 2 by Joseph Rickert.

From the post:

Quandl continues its mission to seek out and make available the world’s financial and econometric data. Recently added data sets include:

That’s a big jump since our last post when Quandl broke 5 million datasets! (April 30, 2013)

Any thoughts on how many of these datasets have semantic mapping data to facilitate their re-use and/or combination with other datasets?

Selling the mapping data might be a tough sell because the customer still has to make intelligent use of it.

Selling mapped data on the other hand, that is offering consolidation of specified data sets on a daily, weekly, monthly basis, that might be a different story.

Something to think about.

PS: Do remember that a documented mapping for any dataset at Quandl will work for that same dataset elsewhere. So you won’t be re-discovering the mapping every time a request comes in for that dataset.

Not a “…butts in seats…” approach but then you probably aren’t a prime contractor.
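The documented mapping in the PS doesn't have to be elaborate to be reusable; a lookup table from a source's column names to canonical names, written once, works wherever that same dataset appears. A sketch in Python (all of the names are invented examples, not Quandl's):

```python
# A documented mapping, written once per dataset and reusable wherever
# the same dataset appears. All names here are invented examples.
SP500_MAPPING = {
    "Close Px": "closing_price",
    "Trade Dt": "trade_date",
    "Tkr": "ticker",
}

def apply_mapping(record, mapping):
    """Rename a record's keys to canonical names; keys absent from the
    mapping pass through unchanged so nothing is silently dropped."""
    return {mapping.get(k, k): v for k, v in record.items()}
```

Consolidating specified datasets on a schedule is then just running every incoming record through the appropriate mapping before merging.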

2013 Arrives! (New Crawl Data)

Thursday, November 28th, 2013

New Crawl Data Available! by Jordan Mendelson.

From the post:

We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group.

Format Changes

We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. There are also many good open source tools for working with WARC files.

We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.

We have switched our text file format from Hadoop sequence files to WET files (WARC Encapsulated Text) that properly reference the original requests. This makes it far easier for your processes to disambiguate which text extracts belong to which specific page fetches.

Jordan continues to outline the directory structure of the 2013 crawl data and lists additional resources that will be of interest.
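For the curious, a WARC record starts with a version line, then colon-separated named headers, then a blank line before the payload. A stdlib-only Python sketch of parsing one record's headers; for real work use an established WARC library rather than this toy:

```python
def parse_warc_headers(record_text):
    """Parse the header block of a single WARC record into a dict.
    Expects a 'WARC/...' version line first, then 'Name: value' lines,
    terminated by a blank line before the payload."""
    lines = record_text.split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        if line == "":          # blank line ends the header block
            break
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers
```

The named headers are exactly what makes it possible to cross-reference WET text extracts with the specific response they were generated from.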

If you aren’t Google or some reasonable facsimile thereof (yet), the Common Crawl data set is your doorway into the wild wild content of the WWW.

How do your algorithms fare when matched against the full range of human expression?

Yelp Dataset Challenge

Monday, November 25th, 2013

Yelp Dataset Challenge

Deadline: Monday, February 10, 2014.

From the webpage:

Yelp is proud to introduce a deep dataset for research-minded academics from our wealth of data. If you’ve used our Academic Dataset and want something richer to train your models on and use in publications, this is it. Tired of using the same standard datasets? Want some real-world relevance in your research project? This data is for you!

Yelp is bringing you a generous sample of our data from the greater Phoenix, AZ metropolitan area including:

  • 11,537 businesses
  • 8,282 checkin sets
  • 43,873 users
  • 229,907 reviews


If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000. Yes, that’s $5,000 for showing us how you use our data in insightful, unique, and compelling ways.

Additionally, if you publish a research paper about your winning research in a peer-reviewed academic journal, then you’ll be awarded an additional $1,000 as recognition of your publication. If you are published, Yelp will also contribute up to $500 to travel expenses to present your research using our data at an academic or industry conference.

If you are a student, see the Yelp webpage for more details. If you are not a student, pass this along to someone who is.

Yes, this is the dataset mentioned in How-to: Index and Search Data with Hue’s Search App.

Data Repositories…

Thursday, November 14th, 2013

Data Repositories – Mother’s Milk for Data Scientists by Jerry A. Smith.

From the post:

Mothers are life givers, giving the milk of life. While very few analogies are so apropos, data is often considered the Mother’s Milk of Corporate Valuation. So, as data scientists, we should treat dearly all those sources of data, understanding their place in the overall value chain of corporate existence.

A Data Repository is a logical (and sometimes physical) partitioning of data where multiple databases which apply to specific applications or sets of applications reside. For example, several databases (revenues, expenses) which support financial applications (A/R, A/P) could reside in a single financial Data Repository. Data Repositories can be found both internal (e.g., in data warehouses) and external (see below) to an organization. Here are a few repositories from KDnuggets that are worth taking a look at: (emphasis in original)

I count sixty-four (64) collections of data sets as of today.

What I haven’t seen, perhaps you have, is an index across the most popular data set collections that dedupes data sets and has thumbnail information for each one.

Suggested indexes across data set collections?
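The index I am wishing for could start as nothing more than normalized dataset names mapped to the collections that list them, which both dedupes and gives a popularity ranking for free. A minimal Python sketch:

```python
from collections import defaultdict

def normalize(name):
    """Crude canonical form for a dataset name: lowercase, alphanumerics
    only, so 'Common Crawl' and 'common-crawl' collide as intended."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def build_index(collections):
    """Map each normalized dataset name to the set of collections listing
    it; entries can then be ranked by how many collections carry them."""
    index = defaultdict(set)
    for collection, names in collections.items():
        for name in names:
            index[normalize(name)].add(collection)
    return index
```

Real deduping would also need checksums or source URLs, since distinct datasets share names (the Titanic problem again), but name normalization catches the easy bulk.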

…A new open Scientific Data journal

Friday, October 18th, 2013

Publishing one’s research data : A new open Scientific Data journal

From the post:

A new journal called ‘Scientific Data’, to be launched by Nature in May 2014, has made a call for submissions. What makes this publication unique is that it is an open-access, online-only publication for descriptions of scientifically valuable datasets, which aims to foster data sharing and reuse, and ultimately to accelerate the pace of scientific discovery.

Sample publications, 1 and 2.

From the journal homepage:

Launching in May 2014 and open now for submissions, Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets, initially focusing on the life, biomedical and environmental science communities

Scientific Data exists to help you publish, discover and reuse research data and is built around six key principles:

  • Credit: Credit, through a citable publication, for depositing and sharing your data
  • Reuse: Complete, curated and standardized descriptions enable the reuse of your data
  • Quality: Rigorous community-based peer review
  • Discovery: Find datasets relevant to your research
  • Open: Promotes and endorses open science principles for the use, reuse and distribution of your data, and is available to all through a Creative Commons license
  • Service: In-house curation, rapid peer-review and publication of your data descriptions

Possibly an important source of scientific data in the not so distant future.

International Tracing Service Archive

Wednesday, June 5th, 2013

International Tracing Service Archive (U.S. Holocaust Memorial Museum)

The posting on Crowdsourcing + Machine Learning… reminded me to check on access to the archives of the International Tracing Service.

Let’s just say the International Tracing Service has a poor track record on accessibility to its archives. An archive of documents the ITS describes as:

Placed end-to-end, the documents in the ITS archives would extend to a length of about 26,000 metres.

Fortunately digitized copies of portions of the archives are available at other locations, such as the U.S. Holocaust Memorial Museum.

The FAQ on the archives answers the question “Are the records going to be on the Internet?” this way:

Regrettably, the collection was neither organized nor digitized to be directly searchable online. Therefore, the Museum’s top priority is to develop software and a database that will efficiently search the records so we can quickly respond to survivor requests for information.

Only a small fraction of the records are machine readable. In order to be searched by Google or Yahoo! search engines, all of the data must be machine readable.

Searching the material is an arduous task in any event. The ITS records are in some 25 different languages and contain millions of names, many with multiple spellings. Many of the records are entirely handwritten. In cases where forms were used, the forms are written in German and the entries are often handwritten in another language.

The best way to ensure that survivors receive accurate information quickly and easily will be by submitting requests to the Museum by e-mail, regular mail, or fax, and trained Museum staff will assist with the research. The Museum will provide copies of all relevant original documents to survivors who wish to receive them via e-mail or regular mail.

The priority of the Museum is in answering requests for information from survivors.

However, we do know that multiple languages and handwritten texts are not barriers to creating machine readable texts for online searching.

The searches would not be perfect, but even double-key entry of all the data would not be perfect either.
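Double-key entry, for reference, has two people transcribe the same record independently and flags every field where they disagree for adjudication. A minimal Python sketch of the comparison step:

```python
def double_key_compare(entry_a, entry_b):
    """Compare two independent transcriptions of the same record, field
    by field; return the fields where the keyers disagree so a third
    pass (or a crowd) can adjudicate them."""
    disagreements = {}
    for field in entry_a.keys() | entry_b.keys():
        a, b = entry_a.get(field), entry_b.get(field)
        if a != b:
            disagreements[field] = (a, b)
    return disagreements
```

The same comparison works just as well between one human transcription and a machine transcription, which is exactly the proofing task a crowd could take on.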

What better way to introduce digital literate generations to the actuality of the Holocaust than to involve them in crowd-sourcing the proofing of a machine transcription of this archive?

Then the Holocaust would not be a few weeks in history class, or a museum or memorial to visit, but an experience with documents recording the fates of millions.

PS: Creating trails through the multiple languages, spellings, locations, etc., by researchers, trails that can then be enhanced by other researchers, would highlight the advantages of topic maps in historical research.

Distributing the Edit History of Wikipedia Infoboxes

Thursday, May 30th, 2013

Distributing the Edit History of Wikipedia Infoboxes by Enrique Alfonseca.

From the post:

Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.


For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.

How much data do you need beyond the infoboxes of Wikipedia?

And knowing what values were in the past … isn’t that like knowing prior identifiers for subjects?

Medicare Provider Charge Data

Thursday, May 30th, 2013

Medicare Provider Charge Data

From the webpage:

As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.

The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.

For these DRGs, average charges and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amount charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.

Data are being made available in Microsoft Excel (.xlsx) format and comma separated values (.csv) format.

Inpatient Charge Data, FY2011, Microsoft Excel version
Inpatient Charge Data, FY2011, Comma Separated Values (CSV) version
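For a quick look at the variation the release describes, the CSV file can be summarized with a short script. This is a minimal sketch: the column names ("DRG Definition", "Average Covered Charges") are assumptions based on the FY2011 release and should be checked against the downloaded file's header.

```python
import csv
from collections import defaultdict

def charge_spread(path):
    """Group average covered charges by DRG and report the lowest and
    highest hospital charge seen for each procedure group."""
    # Column names are assumptions; verify against the file's header row.
    by_drg = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_drg[row["DRG Definition"]].append(
                float(row["Average Covered Charges"]))
    return {drg: (min(charges), max(charges))
            for drg, charges in by_drg.items()}
```

Comparing the min/max pair for a single DRG across hospitals is enough to see the "significant variation" the announcement mentions.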

A nice start towards a useful data set.

A next step would be tying identifiable physicians to the medical procedures and tests they order.

The only times I have arrived at a hospital by ambulance, I never thought to ask for a comparison of their prices with other local hospitals. Nor did I see any signs advertising discounts on particular procedures.

Have you?

Let’s not pretend medical care is a consumer market, where “consumers” are penalized for not being good shoppers.

I first saw this at Nathan Yau’s Medicare provider charge data released.

Thursday, May 30th, 2013

From the post:

An increasing number of universities and research organisations are starting to build research data repositories to allow permanent access in a trustworthy environment to data sets resulting from research at their institutions. Due to varying disciplinary requirements, the landscape of research data repositories is very heterogeneous. This makes it difficult for researchers, funding bodies, publishers, and scholarly institutions to select an appropriate repository for storage of research data or to search for data.

The registry allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence.

By April 2013, 338 research data repositories were indexed in the registry. 171 of these are described by a comprehensive vocabulary, which was developed with the involvement of the data repository community.

The registry's search interface, and an explanation of its information icons, are available on its website.

Does this sound like any of these?


The Dataverse Network Project

IOGDS: International Open Government Dataset Search

PivotPaths: a Fluid Exploration of Interlinked Information Collections

Quandl [> 2 million financial/economic datasets]

Just to name five (5) that came to mind right off hand?

Addressing the heterogeneous nature of data repositories by creating another, semantically different data repository, seems like a non-solution to me.

What would be useful would be to create a mapping of this “new” classification, which I assume works for some group of users, against the existing classifications.

That would allow users of the “new” classification to access data in existing repositories, without having to learn their classification systems.

The heterogeneous nature of information is never vanquished but we can incorporate it into our systems.
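Such a mapping can be declared as data rather than buried in each repository's documentation. A minimal sketch, with hypothetical category labels and repository names chosen purely for illustration:

```python
# Hypothetical category labels and repository names, for illustration
# only; a real mapping would be built from each repository's published
# classification vocabulary.
CLASSIFICATION_MAP = {
    "Geosciences": {
        "dataverse": "Earth and Environmental Sciences",
        "datacite": "Earth Science",
    },
    "Life Sciences": {
        "dataverse": "Medicine, Health and Life Sciences",
        "datacite": "Biology",
    },
}

def translate(category, target_repository):
    """Translate a category from the 'new' classification into the
    vocabulary of an existing repository, if a mapping is known."""
    return CLASSIFICATION_MAP.get(category, {}).get(target_repository)
```

A user who knows only the "new" classification can then query existing repositories without learning their vocabularies, which is the point made above.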

Quandl – Update

Tuesday, April 30th, 2013


When I last wrote about Quandl, they were at over 2,000,000 datasets.

Following a recent link to their site, I found they are now over 5,000,000 data sets.

No mean feat, but among the questions that remain:

How do I judge the interoperability of data sets?

Where do I find the information needed to make data sets interoperable?

And just as importantly,

Where do I write down information I discovered or created to make a data set interoperable? (To avoid doing the labor over again.)

Cool GSS training video! And cumulative file 1972-2012!

Sunday, March 10th, 2013

Cool GSS training video! And cumulative file 1972-2012! by Andrew Gelman.

From the post:

Felipe Osorio made the above video to help people use the General Social Survey and R to answer research questions in social science. Go for it!

From the GSS: General Social Survey website:

The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society with a data-collection program designed to both monitor societal change within the United States and to compare the United States to other nations.

The GSS contains a standard ‘core’ of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.

The information “gap” is becoming more of a matter of skill than access to underlying data.

How would you match the GSS data up to other data sets?
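At its simplest, matching GSS data to another data set means joining on a shared key such as survey year. A minimal sketch follows; the variable names are illustrative, and real matching would also require aligned question codings and survey weights:

```python
def join_by_year(gss_by_year, other_by_year):
    """Pair two yearly series on their common years, returning
    (year, gss_value, other_value) tuples in year order."""
    common = sorted(set(gss_by_year) & set(other_by_year))
    return [(y, gss_by_year[y], other_by_year[y]) for y in common]
```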


Crossfilter: Fast Multidimensional Filtering for Coordinated Views

Friday, March 8th, 2013

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.
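Crossfilter itself is JavaScript, but the sorted-index idea behind its speed can be sketched in a few lines of Python: keep one sorted index per dimension, so that a range filter becomes two binary searches instead of a scan. This illustrates the technique only; it is not Crossfilter's actual implementation:

```python
from bisect import bisect_left

class Dimension:
    """Toy version of the sorted-index idea: one sorted index per
    dimension makes a range filter two binary searches."""
    def __init__(self, records, key):
        self.records = records
        # Permutation of record positions, sorted by the dimension's key.
        self.order = sorted(range(len(records)),
                            key=lambda i: key(records[i]))
        self.keys = [key(records[i]) for i in self.order]

    def filter_range(self, lo, hi):
        """Return records with lo <= key < hi."""
        start = bisect_left(self.keys, lo)
        stop = bisect_left(self.keys, hi)
        return [self.records[self.order[i]] for i in range(start, stop)]
```

Because the index is built once, each subsequent filter adjustment costs O(log n) to locate, which is what makes live histograms over a million records feasible.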

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

Will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

Data, Data, Data: Thousands of Public Data Sources

Monday, March 4th, 2013

Data, Data, Data: Thousands of Public Data Sources

From the post:

We love data, big and small and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data.

A rather remarkable list of data sets. You are sure to find something of interest!


Saturday, February 23rd, 2013


While looking for more information on ArangoDB, I stumbled across this collection of graph data sets:

Brief descriptions: ArangoDB-Data [San Francisco, for example]

Wednesday, February 13th, 2013

From the homepage:

a comprehensive list of open data catalogs curated by experts from around the world.

Cited in Simon Roger’s post: Competition: visualise open government data and win $2,000.

As of today, 288 registered data catalogs.

The reservation I have about "open" government data is that much of what is "open" is not terribly useful.

I am sure there is useful “open” government data but let me give you an example of non-useful “open” government data.

Consider San Francisco, CA and cases of police misconduct against its citizens.

A really interesting data visualization would be to plot those incidents against the neighborhoods of San Francisco. Where the neighborhoods are colored by economic status.

The maps of San Francisco are available at DataSF, specifically, Planning Neighborhoods.

What about the police data?

I found summaries like: OCC Caseload/Disposition Summary – 1993-2009

Which listed:

  • Opened
  • Closed
  • Pending
  • Sustained

Not exactly what is needed for neighborhood by neighborhood mapping.

Note: No police misconduct since 2009 according to these data sets. (I find that rather hard to credit.)

How would you vote on this data set from San Francisco?

Open, Opaque, Semi-Transparent?

Call for KDD Cup Competition Proposals

Sunday, February 10th, 2013

Call for KDD Cup Competition Proposals

From the post:

Please let us know if you are interested in being considered for the 2013 KDD Cup Competition by filling out the form below.

This is the official call for proposals for the KDD Cup 2013 competition. The KDD Cup is the well known data mining competition of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD-2013 conference will be held in Chicago from August 11 – 14, 2013. The competition will last between 6 and 8 weeks and the winners should be notified by end-June. The winners will be announced in the KDD-2013 conference and we are planning to run a workshop as well.

A good competition task is one that is practically useful, scientifically or technically challenging, can be done without extensive application domain knowledge, and can be evaluated objectively. Of particular interest are non-traditional tasks/data that require novel techniques and/or thoughtful feature construction.

Proposals should involve data and a problem whose successful completion will result in a contribution of some lasting value to a field or discipline. You may assume that Kaggle will provide the technical support for running the contest. The data needs to be available no later than mid-March.

If you have initial questions about the suitability of your data/problem feel free to reach out to claudia.perlich [at]

Do you have:

non-traditional tasks/data that require[s] novel techniques and/or thoughtful feature construction?

Is collocation of information on the basis of multi-dimensional subject identity a non-traditional task?

Does extraction of multiple dimensions of a subject identity from users require novel techniques?

If so, what data sets would you suggest using in this challenge?

I first saw this at: 19th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

OneMusicAPI Simplifies Music Metadata Collection

Friday, February 8th, 2013

OneMusicAPI Simplifies Music Metadata Collection by Eric Carter.

From the post:

Elsten software, digital music organizer, has announced OneMusicAPI. Proclaimed to be “OneMusicAPI to rule them all,” the API acts as a music metadata aggregator that pulls from multiple sources across the web through a single interface. Elsten founder and OneMusicAPI creator, Dan Gravell, found keeping pace with constant changes from individual sources became too tedious a process to adequately organize music.

Currently covers over three million albums but only returns cover art.

Other data will be added but when and to what degree isn’t clear.

When launched, pricing plans will be available.

A lesson that will need to be reinforced from time to time.

Collation of data/information consumes time and resources.

To encourage collation, collators need to be paid.

If you need an example of what happens without paid collators, search your favorite search engine for the term “collator.”

Depending on how you count “sameness,” I get eight or nine different notions of collator from mine.

Doing More with the Hortonworks Sandbox

Tuesday, February 5th, 2013

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.
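The "design once, run many times" idea can be sketched as a declarative merge specification: the field-to-property mappings are data, so changed requirements mean editing the spec, not rewriting code. Dataset and field names below are hypothetical:

```python
# A merge specification declared once: each dataset's field names are
# mapped to shared subject properties. Names here are hypothetical.
MERGE_SPEC = {
    "hospital_a": {"prov_name": "provider", "chg": "charge"},
    "hospital_b": {"ProviderName": "provider", "AvgCharge": "charge"},
}

def normalize(dataset, record, spec=MERGE_SPEC):
    """Apply the declared mapping so records from different datasets
    merge on the same properties."""
    return {shared: record[field]
            for field, shared in spec[dataset].items()
            if field in record}
```

When a data set changes its schema, only that data set's entry in the spec changes; every record already normalized through the shared properties continues to merge correctly.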

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?