Archive for the ‘Dataset’ Category

Quandl – Update

Tuesday, April 30th, 2013

Quandl

When I last wrote about Quandl, they were at over 2,000,000 datasets.

Following a recent link to their site, I found they are now over 5,000,000 data sets.

No mean feat, but among the questions that remain:

How do I judge the interoperability of data sets?

Where do I find the information needed to make data sets interoperable?

And just as importantly,

Where do I write down information I discovered or created to make a data set interoperable? (To avoid doing the labor over again.)

Cool GSS training video! And cumulative file 1972-2012!

Sunday, March 10th, 2013

Cool GSS training video! And cumulative file 1972-2012! by Andrew Gelman.

From the post:

Felipe Osorio made the above video to help people use the General Social Survey and R to answer research questions in social science. Go for it!

From the GSS: General Social Survey website:

The General Social Survey (GSS) conducts basic scientific research on the structure and development of American society with a data-collection program designed to both monitor societal change within the United States and to compare the United States to other nations.

The GSS contains a standard ‘core’ of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.

The information “gap” is becoming more of a matter of skill than access to underlying data.

How would you match the GSS data up to other data sets?

Crossfilter

Friday, March 8th, 2013

Crossfilter: Fast Multidimensional Filtering for Coordinated Views

From the webpage:

Crossfilter is a JavaScript library for exploring large multivariate datasets in the browser. Crossfilter supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records; we built it to power analytics for Square Register, allowing merchants to slice and dice their payment history fluidly.

Since most interactions only involve a single dimension, and then only small adjustments are made to the filter values, incremental filtering and reducing is significantly faster than starting from scratch. Crossfilter uses sorted indexes (and a few bit-twiddling hacks) to make this possible, dramatically increasing the performance of live histograms and top-K lists. For more details on how Crossfilter works, see the API reference.

See the webpage for an impressive demonstration with a 5.3 MB dataset.
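
The incremental idea in the quoted description can be sketched outside of JavaScript. The Python below is a minimal illustration, not Crossfilter's implementation or API: the class name `SortedDimension` and its method are hypothetical, and the values are made up. It keeps one sorted index for a dimension and, when the filter range moves, touches only the records that enter or leave the range.

```python
from bisect import bisect_left


class SortedDimension:
    """Hypothetical sketch of one filterable dimension backed by a sorted index."""

    def __init__(self, values):
        # order[k] is the id of the record holding the k-th smallest value.
        self.order = sorted(range(len(values)), key=values.__getitem__)
        self.sorted_values = [values[i] for i in self.order]
        # Start unfiltered: every position in the sorted index is selected.
        self.lo, self.hi = 0, len(values)

    def filter_range(self, low, high):
        """Select low <= value < high; return (added_ids, removed_ids)."""
        new_lo = bisect_left(self.sorted_values, low)
        new_hi = max(new_lo, bisect_left(self.sorted_values, high))
        old_lo, old_hi = self.lo, self.hi
        # Only the positions where the old and new intervals differ are visited.
        added = [*range(new_lo, min(new_hi, old_lo)),
                 *range(max(new_lo, old_hi), new_hi)]
        removed = [*range(old_lo, min(old_hi, new_lo)),
                   *range(max(old_lo, new_hi), old_hi)]
        self.lo, self.hi = new_lo, new_hi
        return ([self.order[k] for k in added],
                [self.order[k] for k in removed])


amounts = [5, 1, 9, 3, 7]            # made-up values, record ids 0..4
dim = SortedDimension(amounts)
selected_count = len(amounts)        # a running "reduce" kept up to date

added, removed = dim.filter_range(3, 8)
selected_count += len(added) - len(removed)

added2, removed2 = dim.filter_range(3, 10)   # small adjustment of the upper bound
selected_count += len(added2) - len(removed2)
```

The second call only visits the single record that enters the widened range, which is the point of keeping the index sorted.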

Is there a trend towards “big data” manipulation on clusters and “less big data” in browsers?

Will be interesting to see how the benchmarks for “big” and “less big” move over time.

I first saw this in Nat Torkington’s Four Short links: 4 March 2013.

Data, Data, Data: Thousands of Public Data Sources

Monday, March 4th, 2013

Data, Data, Data: Thousands of Public Data Sources

From the post:

We love data, big and small and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data.

A rather remarkable list of data sets. You are sure to find something of interest!

ArangoDB-Data

Saturday, February 23rd, 2013

ArangoDB-Data

While looking for more information on ArangoDB, I stumbled across this collection of graph data sets:

Brief descriptions: ArangoDB-Data

datacatalogs.org [San Francisco, for example]

Wednesday, February 13th, 2013

datacatalogs.org

From the homepage:

a comprehensive list of open data catalogs curated by experts from around the world.

Cited in Simon Rogers’ post: Competition: visualise open government data and win $2,000.

As of today, 288 registered data catalogs.

The reservation I have about “open” government data is that when it is “open,” it’s not terribly useful.

I am sure there is useful “open” government data but let me give you an example of non-useful “open” government data.

Consider San Francisco, CA and cases of police misconduct against its citizens.

A really interesting data visualization would be to plot those incidents against the neighborhoods of San Francisco, with the neighborhoods colored by economic status.

The maps of San Francisco are available at DataSF, specifically, Planning Neighborhoods.

What about the police data?

I found summaries like: OCC Caseload/Disposition Summary – 1993-2009

Which listed:

  • Opened
  • Closed
  • Pending
  • Sustained

Not exactly what is needed for neighborhood by neighborhood mapping.

Note: No police misconduct since 2009 according to these data sets. (I find that rather hard to credit.)

How would you vote on this data set from San Francisco?

Open, Opaque, Semi-Transparent?

Call for KDD Cup Competition Proposals

Sunday, February 10th, 2013

Call for KDD Cup Competition Proposals

From the post:

Please let us know if you are interested in being considered for the 2013 KDD Cup Competition by filling out the form below.

This is the official call for proposals for the KDD Cup 2013 competition. The KDD Cup is the well known data mining competition of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD-2013 conference will be held in Chicago from August 11 – 14, 2013. The competition will last between 6 and 8 weeks and the winners should be notified by end-June. The winners will be announced in the KDD-2013 conference and we are planning to run a workshop as well.

A good competition task is one that is practically useful, scientifically or technically challenging, can be done without extensive application domain knowledge, and can be evaluated objectively. Of particular interest are non-traditional tasks/data that require novel techniques and/or thoughtful feature construction.

Proposals should involve data and a problem whose successful completion will result in a contribution of some lasting value to a field or discipline. You may assume that Kaggle will provide the technical support for running the contest. The data needs to be available no later than mid-March.

If you have initial questions about the suitability of your data/problem feel free to reach out to claudia.perlich [at] gmail.com.

Do you have:

non-traditional tasks/data that require[s] novel techniques and/or thoughtful feature construction?

Is collocation of information on the basis of multi-dimensional subject identity a non-traditional task?

Does extraction of multiple dimensions of a subject identity from users require novel techniques?

If so, what data sets would you suggest using in this challenge?

I first saw this at: 19th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

OneMusicAPI Simplifies Music Metadata Collection

Friday, February 8th, 2013

OneMusicAPI Simplifies Music Metadata Collection by Eric Carter.

From the post:

Elsten software, digital music organizer, has announced OneMusicAPI. Proclaimed to be “OneMusicAPI to rule them all,” the API acts as a music metadata aggregator that pulls from multiple sources across the web through a single interface. Elsten founder and OneMusicAPI creator, Dan Gravell, found keeping pace with constant changes from individual sources became too tedious a process to adequately organize music.

Currently covers over three million albums but only returns cover art.

Other data will be added but when and to what degree isn’t clear.

When launched, pricing plans will be available.

A lesson that will need to be reinforced from time to time.

Collation of data/information consumes time and resources.

To encourage collation, collators need to be paid.

If you need an example of what happens without paid collators, search your favorite search engine for the term “collator.”

Depending on how you count “sameness,” I get eight or nine different notions of collator from mine.

Doing More with the Hortonworks Sandbox

Tuesday, February 5th, 2013

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change, just change that part of the map; don’t re-create the entire map.
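
As a rough illustration of "document once, run many times," the sketch below records the mapping between two data sets as plain data and applies it with a generic function. It is written in Python, it is not a topic map engine, and every source name, field name, and record in it is invented for the example.

```python
# Hypothetical, documented mapping: for each source, which field carries
# the subject's identity and how local field names map to shared names.
MERGE_MAP = {
    "payroll": {"key": "emp_id", "fields": {"emp_id": "employee", "dept": "department"}},
    "badges": {"key": "staff_no", "fields": {"staff_no": "employee", "building": "site"}},
}


def merge(sources, merge_map):
    """Merge records from several sources on their mapped identity field."""
    merged = {}
    for name, records in sources.items():
        rule = merge_map[name]
        for record in records:
            # Records with the same identity value land in the same subject.
            subject = merged.setdefault(record[rule["key"]], {})
            for local, shared in rule["fields"].items():
                if local in record:
                    subject[shared] = record[local]
    return merged


# Invented sample records.
sources = {
    "payroll": [{"emp_id": "17", "dept": "Sales"}],
    "badges": [{"staff_no": "17", "building": "HQ"}],
}
result = merge(sources, MERGE_MAP)
```

If a source later renames a field, only the matching entry in `MERGE_MAP` changes; `merge` and the other entries stay as they are.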

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

Case study: million songs dataset

Sunday, February 3rd, 2013

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive suggestion, we have now an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 

Just in case you need some data for practice with your GraphChi installation. ;-)

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.
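
For readers new to ranking by item similarity, here is a minimal sketch in plain Python. It uses ordinary cosine similarity on made-up listening data; it is not Aiolli's cost function and not the GraphChi toolkit, whose details are in the linked post and paper.

```python
from collections import defaultdict
from math import sqrt

# Made-up listening histories: user -> set of songs.
listens = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"a"},
}

# Invert to song -> set of users who listened to it.
listeners = defaultdict(set)
for user, songs in listens.items():
    for song in songs:
        listeners[song].add(user)


def similarity(i, j):
    """Cosine similarity between two songs over binary listen data."""
    shared = len(listeners[i] & listeners[j])
    return shared / sqrt(len(listeners[i]) * len(listeners[j]))


def recommend(user):
    """Rank songs the user has not heard by summed similarity to heard songs."""
    heard = listens[user]
    scores = {
        song: sum(similarity(song, h) for h in heard)
        for song in listeners
        if song not in heard
    }
    # Highest score first; ties broken by song name for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))
```

Calling `recommend("u4")` ranks the songs that user has not yet heard, most similar first.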

Bad News From UK: … brows up, breasts down

Tuesday, January 29th, 2013

UK plastic surgery statistics 2012: brows up, breasts down by Ami Sedghi.

From the post:

Despite a recession and the government launching a review into cosmetic surgery following the breast implant scandal, plastic surgery procedures in the UK were up last year.

A total of 43,172 surgical procedures were carried out in 2012 according to the British Association of Aesthetic Plastic Surgeons (BAAPS), an increase of 0.2% on the previous year. Although there wasn’t a big change for overall procedures, anti-ageing treatments such as eyelid surgery and face lifts saw double digit increases.

Breast augmentation (otherwise known as ‘boob jobs’) were still the most popular procedure overall although the numbers dropped by 1.6% from 2011 to 2012. Last year’s stats took no account of the breast implant scandal so this is the first release of figures from BAAPS to suggest what impact the scandal has had on the popular procedure.

Just for comparison purposes:

Country   Procedures   Population    Percent of Population Treated
UK        43,172       62,641,000    0.069%
US        9,200,000    313,914,000   2.93%
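
The last column can be recomputed from the other two. A short check in Python, using only the procedure counts and populations given in the table; note that a ratio has to be multiplied by 100 to be stated as a percent:

```python
# Procedure counts and populations as given in the table.
rows = {
    "UK": (43_172, 62_641_000),
    "US": (9_200_000, 313_914_000),
}

# Percent of population treated = 100 * procedures / population.
percent_treated = {
    country: 100 * procedures / population
    for country, (procedures, population) in rows.items()
}

for country, pct in percent_treated.items():
    print(f"{country}: {pct:.3f}% of population")
```

On these inputs the UK figure comes to roughly 0.069% and the US figure to roughly 2.93%.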

Perhaps beauty isn’t one of the claimed advantages of socialized medicine?

Click Dataset [HTTP requests]

Tuesday, January 22nd, 2013

Click Dataset

From the webpage:

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.

Data available under terms and restrictions, including transfer by physical hard drive (~ 2.5 TB of data).
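
The webpage does not publish the regular expressions it used, so the following is only a guess at what such a check might look like: a Python sketch with hypothetical patterns that tests whether a TCP payload starts with an HTTP GET request and pulls out the path, host, and referrer.

```python
import re

# Hypothetical patterns; the dataset's actual expressions are not given.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HEADER = re.compile(rb"^(Host|Referer):[ \t]*([^\r\n]*)",
                    re.IGNORECASE | re.MULTILINE)


def parse_get(payload):
    """Return path/host/referer for an HTTP GET payload, else None."""
    request = REQUEST_LINE.match(payload)
    if request is None:
        return None
    info = {"path": request.group(1).decode("latin-1"),
            "host": None, "referer": None}
    for name, value in HEADER.findall(payload):
        info[name.decode("latin-1").lower()] = value.decode("latin-1").strip()
    return info


# Invented sample payload.
sample = (b"GET /index.html HTTP/1.1\r\n"
          b"Host: example.org\r\n"
          b"Referer: http://example.com/start\r\n\r\n")
parsed = parse_get(sample)
```

The referrer field is what would let one link each request back to the page it came from.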

Intrigued by the notion of a “subset of the Web graph actually traversed by users.”

Does that mean that semantic annotation should occur on the portion of the “…Web graph actually traversed by users” before reaching other parts?

If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?

Complete Guardian Dataset Listing!

Thursday, January 17th, 2013

All our datasets: the complete index by Chris Cross.

From the post:

Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it. Click below to find out

I am simply in awe of the number of datasets produced by the Guardian since 2009.

A few of the more interesting titles include:

You will find things in the hundreds of datasets you have wondered about and other things you can’t imagine wondering about. ;-)

Enjoy!

Quandl [> 2 million financial/economic datasets]

Tuesday, December 25th, 2012

Quandl (alpha)

From the homepage:

Quandl is a collaboratively curated portal to over 2 million financial and economic time-series datasets from over 250 sources. Our long-term mission is to make all numerical data on the internet easy to find and easy to use.

Interesting enough, but the details from the “about” page are even more so:

Our Vision

The internet offers a rich collection of high quality numerical data on thousands of subjects. But the potential of this data is not being reached at all because the data is very difficult to actually find. Furthermore, it is also difficult to extract, validate, format, merge, and share.

We have a solution: We’re building an intelligent search engine for numerical data. We’ve developed technology that lets people quickly and easily add data to Quandl’s index. Once this happens, the data instantly becomes easy to find and easy to use because it gains 8 essential attributes:

  • Findability: Quandl is essentially a search engine for numerical data. Every search result on Quandl is an actual data set that you can use right now. Once data from anywhere on the internet becomes known to Quandl, it becomes findable by search and (soon) by browse.
  • Structure: Quandl is a universal translator for data formats. It accepts numerical data no matter what format it happens to be published in and then delivers it in any format you request it. When you find a dataset on Quandl, you’ll be able to export anywhere you want, in any format you want.
  • Validity: Every dataset on Quandl has a simple link back to the same data on the publisher’s web site which gives you 100% certainty on validity.
  • Fusibility: Any data set on Quandl is totally compatible with any and all other data on Quandl. You can merge multiple datasets on Quandl quickly and easily (coming soon).
  • Permanence: Once a dataset is on Quandl, it stays there forever. It is always up-to-date and available at a permanent, unchanging URL.
  • Connectivity: Every dataset on Quandl is accessible by a simple API. Whether or not the original publisher offered an API no longer matters because Quandl always does. Quandl is the universal API for numerical data on the internet.
  • Recency: Every single dataset on Quandl is guaranteed to be the most recent version of that data, retrieved afresh directly from the original publisher.
  • Utility: Data on Quandl is organized and presented for maximum utility: Actual data is examinable immediately; the data is graphed (properly); description, attribution, units, and export tools are clear and concise.

I have my doubts about the “fusibility” claims. You can check the US Leading Indicators data list and note that “level” and “units” use different units of measurement. Other semantic issues lurk just beneath the surface.

Still, the name of the engine does not begin with “B” or “G” and illustrates there is enormous potential for curated data collections.

Come to think of it, topic maps are curated data collections.

Are you in need of a data curator?

I first saw this in a tweet by Gregory Piatetsky.

d8taplex [UK bicycle theft = terrorism?]

Friday, November 23rd, 2012

d8taplex

Bills itself as:

Explore over 50 thousand data sets containing over 1 million time series.

But searching at random (there was no description of which 50,000 datasets were in play):

Astronomy – 8 “hits” – All doctorates awarded by field of study.

Chemistry – 22 “hits” – Degrees, students, periodical prices.

Physics – 38 “hits” – Degrees, students, periodical prices, staff.

Automobile accidents – 492 “hits” – What you would expect about road conditions, condition of drivers, etc.

Terrorist attacks – 11 “hits” –

containing document: Crime in England and Wales 2009/10: Supplementary Tables: Nature of burglary, vehicle-related theft, bicycle theft, other household theft, personal and other theft, vandalism and violent crime | data.gov.uk
anchor text: Personal theft

I really don’t equate “bicycle theft” with an act of terrorism. Inconvenient yes, terrorism no.

Unless you are getting money from the U.S. Department of Homeland Security of course. They fund studies of how to hide power transmission stations that are too large and dependent on air cooling to be enclosed.

I guess putting blank spots on maps would only serve to highlight their presence. DHS could ban the manufacture of printed maps. Only allow electronic ones. Which can be distorted to show or conceal whatever the flavor of terrorism is for the week.

It would not take long for the only content of the map to be “You are here.” With no markers as to where “here” might be. But then you are there so look around.

Archive of datasets bundled with R

Wednesday, November 21st, 2012

Archive of datasets bundled with R by Nathan Yau.

From the post:

R comes with a lot of datasets, some with the core distribution and others with packages, but you’d never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

Many of the datasets are dated, going back to the original distribution of R, but it’s a great resource for teaching or if you’re just looking for some data to play with.

A great find! Thanks Nathan and to Vincent for pulling it together!

FindTheData

Monday, November 19th, 2012

FindTheData

From the about page:

At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

FindTheData is organized into 9 broad categories

Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

Traditional search is a great hammer but sometimes you need a wrench.

Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

If you think in the same categories as the authors, sorta like using LCSH, you are in like Flint. If you don’t, well, your mileage may vary.

While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

Still, you may find it useful to browse for datasets that are new to you.

IOGDS: International Open Government Dataset Search

Saturday, November 10th, 2012

IOGDS: International Open Government Dataset Search

Description:

The TWC International Open Government Dataset Search (IOGDS) is a linked data application based on metadata “scraped” from hundreds of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data and re-published via the TWC LOGD SPARQL endpoint and made available for download. The TWC IOGDS demo site features an efficient, reconfigurable faceted browser with search capabilities offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe that the vocabulary choices demonstrated by IOGDS highlights the potential for useful linked data applications to be created from open government catalogs and will encourage the adoption of such a standard worldwide.

In addition to the datasets you will find tutorials, videos, demos, tools and technologies and other resources.

Worth a visit whether you are looking for Linked Data as such or for Linked Data to re-use in other ways.

Seen in a tweet by Tim O’Reilly.

Using (Spring Data) Neo4j for the Hubway Data Challenge [Boston Biking]

Thursday, October 11th, 2012

Using (Spring Data) Neo4j for the Hubway Data Challenge by Michael Hunger.

From the post:

Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

(graphics omitted)

Hubway is a bike sharing service which is currently expanding worldwide. In the Data challenge they offer the CSV-data of their 95 Boston stations and about half a million bike rides up until the end of September. The challenge is to provide answers to some posted questions and develop great visualizations (or UI’s) for the Hubway data set. The challenge is also supported by MAPC (Metropolitan Area Planning Council).

Useful import tips for data into Neo4j and on modeling this particular dataset.

Not to mention the resulting database as well!
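
The modeling idea, stations as nodes and rides as relationships between them, can be tried without a graph database. The sketch below is plain Python, not Spring Data Neo4j, and the column names `start_station` and `end_station` as well as the three sample rows are invented; the real Hubway CSV layout may differ.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for the real trips file.
SAMPLE_TRIPS = """start_station,end_station
Station A,Station B
Station B,Station A
Station A,Station B
"""


def ride_graph(csv_file):
    """Count rides per directed (start, end) station pair."""
    edges = Counter()
    for row in csv.DictReader(csv_file):
        edges[(row["start_station"], row["end_station"])] += 1
    return edges


edges = ride_graph(io.StringIO(SAMPLE_TRIPS))
stations = {station for pair in edges for station in pair}
busiest = edges.most_common(1)[0]
```

With the real file, passing an open file handle to `ride_graph` in place of the `io.StringIO` wrapper should work the same way, provided the column names are adjusted.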

PS: From the challenge site:

Submission will open here on Friday, October 12, 2012.

Deadline

MIDNIGHT (11:59 p.m.) on Halloween,
Wednesday, October 31, 2012.

Winners will be announced on Wednesday, November 7, 2012.

Prizes:

  • A one-year Hubway membership
  • Hubway T-shirt
  • Bern helmet
  • A limited edition Hubway System Map—one of only 61 installed in the original Hubway stations.

For other details, see the challenge site.

Interesting large scale dataset: D4D mobile data [Deadline: October 31, 2012]

Wednesday, October 10th, 2012

Interesting large scale dataset: D4D mobile data by Danny Bickson.

From the post:

I got the following from Prof. Scott Kirkpatrick.

Write a 250-words research project and get access within a week to the largest ever released mobile phone datasets: datasets based on 2.5 billion records, calls and text messages exchanged between 5 million anonymous users over 5 months.

Participation rules: http://www.d4d.orange.com/

Description of the datasets: http://arxiv.org/abs/1210.0137

The “Terms and Conditions” by Orange allows the publication of results obtained from the datasets even if they do not directly relate to the challenge.

Cash prizes for winning participants and an invitation to present the results at the NetMob conference to be held May 2-3, 2013 at the Medialab at MIT (www.netmob.org).

Deadline: October 31, 2012

Looking to exercise your graph software? Compare to other graph software? Do interesting things with cell phone data?

This could be your chance!

Datasets! Datasets! Get Your Datasets Here!

Saturday, September 22nd, 2012

Datasets from René Pickhardt’s group:

The project KONECT (Koblenz Network Collection) has extracted and made available four new network datasets based on information in the English Wikipedia, using data from the DBpedia project. The four network datasets are: The bipartite network of writers and their works (113,000 nodes and 122,000 edges) The bipartite network of producers and the works they [...]

Assume you have a knowledge base containing entities and their properties or relations with other entities. For instance, think of a knowledge base about movies, actors and directors. For the movies you have structured knowledge about the title and the year they were made in, while for the actors and directors you might have their [...]

The Institute for Web Science and Technologies (WeST) at the University of Koblenz-Landau is making available a new series of datasets: The Wikipedia hyperlink networks in the eight largest Wikipedia languages: http://konect.uni-koblenz.de/networks/wikipedia_link_en – English http://konect.uni-koblenz.de/networks/wikipedia_link_de – German http://konect.uni-koblenz.de/networks/wikipedia_link_fr – French http://konect.uni-koblenz.de/networks/wikipedia_link_ja – Japanese http://konect.uni-koblenz.de/networks/wikipedia_link_it – Italian http://konect.uni-koblenz.de/networks/wikipedia_link_pt – Portuguese http://konect.uni-koblenz.de/networks/wikipedia_link_ru – Russian The largest dataset, [...]

I found an article about ohloh, a directory created by Black Duck Software with over 500,000 open source projects. They offer a RESTful API and the data is available under the Creative Commons Attribution 3.0 licence. An interesting aspect are Kudos. With a Kudo, an ohloh user can thank another user for his or her contribution, so [...]

I started to mention these earlier in the week but decided they needed a separate post.

Europeana opens up data on 20 million cultural items

Thursday, September 13th, 2012

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe’s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?

Do You Just Talk About The Weather?

Wednesday, September 12th, 2012

After reading this post by Alex you will still just be talking about the weather, but you may have something interesting to say. ;-)

Locating Mountains and More with Mahout and Public Weather Dataset by Alex Baranau

From the post:

Recently I was playing with Mahout and public weather dataset. In this post I will describe how I used Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!

Tools & Data

The data and tools used during this effort are: Apache Mahout project and public weather statistics dataset. Mahout is a machine learning library which provided a handful of machine learning tools. During this effort I used just small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

What other questions could you explore with the weather data set?

The real power of “big data” access and tools may be that we no longer have to rely on the summaries of others.

Summaries still have a value-add, perhaps even more so when the original data is available for verification.

Women’s representation in media:… [Counting and Evaluation]

Saturday, September 8th, 2012

Women’s representation in media: the best data on the subject to date

From the post:

In the first of a series of datablog posts looking at women in the media, we present one year of every article published by the Guardian, Telegraph and Daily Mail, with each article tagged by section, gender, and social media popularity.

(images omitted)

The Guardian datablog has joined forces with J. Nathan Matias of the MIT media lab and data scientist Lynn Cherny to collect what is to our knowledge, the most comprehensive, high resolution dataset available on news content by gender and audience interest.

The dataset covers from July 2011 to June 2012. The post describes the data collection and some rough counts by gender, etc. More analysis to follow.

The data should not be impacted by:

Opinion sections can shape a society’s opinions and therefore are an important measure of women’s voices in society.

It isn’t clear how those claims go together.

Anything being possible, the statement that “…opinion sections can shape a society’s opinions…” is trivially true.

But even if true (an unwarranted assumption), how does that lead to it being “…an important measure of women’s voices in society[?]”

Could be true and have nothing to do with measuring “…women’s voices in society.”

Could be false and have nothing to do with measuring “…women’s voices in society.”

As well as the other possibilities.

Being able to count something doesn’t imbue it with relevance for something else that is harder to evaluate.

Women’s voices in society are important. Let’s not demean them by grabbing the first thing we can count as their measure.

London 2012 Olympic athletes: the full list

Friday, July 27th, 2012

London 2012 Olympic athletes: the full list

Simon Rogers of the Guardian reports scraping together the full list of Olympic athletes into a single data set.

Simon says:

We’ve just scratched the surface of this dataset – you can download it below. What can you do with it?

I would ask the question somewhat differently: Having the data set, what can you reliably add to it?

Aggregate data analysis is interesting but then so is aggregated data on the individual athletes.

PS: If you do something interesting with the data set, be sure to let the Guardian know.

Kiss the Weatherman [Weaponizing Data]

Wednesday, June 27th, 2012

Kiss the Weatherman by James Locus.

From the post:

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas. In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact the landscape of the planet.

The effects of extreme weather can send terrible ripples throughout an entire community. Unexpected cold snaps or overly hot summers can devastate crop yields and force producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current measuring technologies to more accurately predict large-scale weather patterns. Weathermen are good at predicting weather but poor at predicting climate. Weather occurs over a shorter period of time and can be reliably predicted within a 3-day timeframe. Climate stretches many months, years, or even centuries. Matching historical climate data with current weather data to make future weather and climate [predictions] is a major challenge for scientists.

James has a good survey of both data sources and researchers working on using “big data” (read historical weather data) for both weather (short term) and climate (longer term) prediction.

Weather data by itself is just weather data.

What other data would you combine with it and on what basis to weaponize the data?

No one can control the weather but you can control your plans for particular weather events.

Closing In On A Million Open Government Data Sets

Sunday, June 24th, 2012

Closing In On A Million Open Government Data Sets by Jennifer Zaino.

From the post:

A million data sets. That’s the number of government data sets out there on the web that we have closed in on.

“The question is, when you have that many, how do you search for them, find them, coordinate activity between governments, bring in NGOs,” says James A. Hendler, Tetherless World Senior Constellation Professor, Department of Computer Science and Cognitive Science Department at Rensselaer Polytechnic Institute, and a principal investigator of its Linking Open Government Data project lives, as well as Internet web expert for data.gov. He also is connected with many other governments’ open data projects. “Semantic web tools organize and link the metadata about these things, making them searchable, explorable and extensible.”

To be more specific, Hendler at SemTech a couple of weeks ago said there are 851,000 open government data sets across 153 catalogues from 30-something countries, with the three biggest representatives, in terms of numbers, at the moment being the U.S., the U.K, and France. Last week, the one million threshold was crossed.

About 410,000 of these data sets are from the U.S. (federal, state, city, county, tribal included), including quite a large number of geo-data sets. The U.S. government’s goal is to put “lots and lots and lots of stuff out there” and let people figure out what they want to do with it, he notes.

My question about data that “…[is] searchable, explorable and extensible” is whether anyone wants to search, explore or extend it.

Simply piling up data to say you have a large pile of data doesn’t sound very useful.

I would rather have a smaller pile of data that included contract/testing transparency on anti-terrorism IT projects, for example. If the systems aren’t working, then disclosing them isn’t going to make them work any less well.

Not that anyone need fear transparency or failure to perform. The TSA has failed to perform for more than a decade now, failed to catch a single terrorist and it remains funded. Even when it starts groping children, passengers are so frightened that even that outrage passes without serious opposition.

Still, it would be easier to get people excited about mining government data if the data weren’t so random or marginal.

Data.gov launches developer community

Wednesday, May 9th, 2012

Data.gov launches developer community

Federal Computer Week reports:

Data.gov has launched a new community for software developers to share ideas, collaborate or compete on projects and request new datasets.

Developer.data.gov joins a growing list of communities and portals tapping into Data.gov’s datasets, including those for health, energy, education, law, oceans and the Semantic Web.

The developer site is set up to offer access to federal agency datasets, source code, applications and ongoing developer challenges, along with blogs and forums where developers can discuss projects and share ideas.

Source: FCW (http://s.tt/1azwt)

Depending upon your developer skills, this could be a good place to hone them.

Not to mention having a wealth of free data sets at hand.

Simple tools for building a recommendation engine

Thursday, April 26th, 2012

Simple tools for building a recommendation engine by Joseph Rickert.

From the post:

Revolution’s resident economist, Saar Golde, is very fond of saying that “90% of what you might [want] from a recommendation engine can be achieved with simple techniques”. To illustrate this point (without doing a lot of work), we downloaded the million row movie dataset from www.grouplens.org with the idea of just taking the first obvious exploratory step: finding the good movies. Three zipped up .dat files comprise this data set. The first file, ratings.dat, contains 1,000,209 records of UserID, MovieID, Rating, and Timestamp for 6,040 users rating 3,952 movies. Ratings are whole numbers on a 1 to 5 scale. The second file, users.dat, contains the UserID, Gender, Age, Occupation and Zip-code for each user. The third file, movies.dat, contains the MovieID, Title and Genre associated with each movie.
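To make the “first obvious exploratory step” concrete, here is a rough sketch in Python (not the R code from Joseph’s post) of ranking movies by mean rating. It assumes each ratings.dat record is a single line of the form UserID::MovieID::Rating::Timestamp; the function names, the min_count cutoff and the five sample lines are my own inventions for illustration.

```python
from collections import defaultdict

def mean_ratings(lines, min_count=2):
    """Return {movie_id: mean rating} for movies with at least min_count ratings."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum of ratings, number of ratings]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed record layout: UserID::MovieID::Rating::Timestamp
        user_id, movie_id, rating, timestamp = line.split("::")
        entry = totals[int(movie_id)]
        entry[0] += int(rating)
        entry[1] += 1
    return {m: s / n for m, (s, n) in totals.items() if n >= min_count}

def top_movies(lines, k=3, min_count=2):
    """Return up to k (movie_id, mean rating) pairs, best first, ties broken by id."""
    means = mean_ratings(lines, min_count)
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Made-up sample records, not taken from the real data set.
sample = [
    "1::10::5::978300760",
    "2::10::4::978300761",
    "1::20::3::978300762",
    "3::20::2::978300763",
    "2::30::5::978300764",
]

print(top_movies(sample))
```

The min_count cutoff is there because a movie with a single five-star rating would otherwise outrank everything, which is probably not what “good” means here.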

Curious, if a topic map engine performed 90% of the possible merges in a topic map, would that be enough?

Would your answer differ if the topic map had fewer than 10,000 topics and associations versus a topic map with 100 million topics and associations?

Would your answer differ based on a timeline of the data? Say the older the data, the less reliable the merging: recent medical data (up to ten years old), < 1% error rate; ten to twenty years, <= 10% error rate; more than twenty years, best efforts.

Which of course raises the question of how you would test for conformance to such requirements?
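One possible answer, sketched in Python: hand-check a sample of merges, bucket them by the age of the data, and compare the observed error rate in each bucket to its limit. The tier boundaries below are one reading of the error rates suggested above, and the function names and the audited sample are hypothetical.

```python
# Hypothetical tiers: (upper age bound in years, max error rate, strict comparison?)
# A max error rate of None stands for "best efforts" (no bound to test).
TIERS = [
    (10, 0.01, True),              # up to ten years: error rate must be < 1%
    (20, 0.10, False),             # ten to twenty years: error rate must be <= 10%
    (float("inf"), None, False),   # more than twenty years: best efforts
]

def tier_index(age_years):
    """Return the index of the first tier whose upper bound covers age_years."""
    for i, (upper, _, _) in enumerate(TIERS):
        if age_years <= upper:
            return i
    raise ValueError("age_years must be a number")

def check_conformance(audited):
    """audited: iterable of (age_years, merged_correctly) pairs from a hand-checked sample.

    Returns one (upper_bound, observed_error_rate, passed) tuple per tier.
    """
    counts = [[0, 0] for _ in TIERS]  # per tier: [errors, total]
    for age, ok in audited:
        c = counts[tier_index(age)]
        c[1] += 1
        if not ok:
            c[0] += 1
    report = []
    for (upper, limit, strict), (errors, total) in zip(TIERS, counts):
        rate = errors / total if total else 0.0
        if limit is None:
            passed = True
        elif strict:
            passed = rate < limit
        else:
            passed = rate <= limit
        report.append((upper, rate, passed))
    return report

# Made-up audit results: 1 bad merge in 100 recent, 1 in 10 mid-age, 1 of 1 old.
audited = [(2, True)] * 99 + [(2, False)] + [(15, True)] * 9 + [(15, False)] + [(30, False)]
for upper, rate, passed in check_conformance(audited):
    print(upper, rate, passed)
```

Passing on a sample only estimates the error rate of the whole map, so the size and selection of the audited sample would have to be part of any real conformance requirement.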

The Shades of Time Project

Thursday, April 26th, 2012

The Shades of TIME project by Drew Conway.

Drew writes:

A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March, 1923 to March, 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set I was left wondering, “What can I ask of the data?”

After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?

I developed a data visualization tool, I’m calling the Shades of TIME, to explore the answer to that question.

An interesting data set and an illustration of why topic map applications are more useful if they have dynamic merging (user selected).

Presented with the same evidence, the covers of TIME magazine, I most likely would have:

  • Mapped people on the covers to historical events
  • Mapped people on the covers to additional historical resources
  • Mapped covers into library collections
  • etc.

I would not have set out to explore the diversity in skin color on the covers. In part because I remember when it changed. That is part of my world knowledge. I don’t have to go looking for evidence of it.

My purpose isn’t to say authors, even topic map authors, should avoid having a point of view. Isn’t possible in any event. What I am suggesting is that to the extent possible, users be enabled to impose their views on a topic map as well.
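As a toy illustration of user-selected merging (a sketch in Python, not any topic map engine’s API), the same cover records can be merged on whatever the user says makes two records “about the same subject.” The merge_by function, the record fields and the placeholder values are all hypothetical.

```python
from collections import defaultdict

def merge_by(records, key):
    """Group records that the user-supplied key function treats as the same subject."""
    merged = defaultdict(list)
    for record in records:
        merged[key(record)].append(record)
    return dict(merged)

# Placeholder records, not real TIME covers.
covers = [
    {"issue": "cover-1", "person": "Person A", "event": "Event X"},
    {"issue": "cover-2", "person": "Person A", "event": "Event Y"},
    {"issue": "cover-3", "person": "Person B", "event": "Event X"},
]

# Two users, two views of the same data.
by_person = merge_by(covers, key=lambda r: r["person"])
by_event = merge_by(covers, key=lambda r: r["event"])

print(sorted(by_person))
print(sorted(by_event))
```

The data never changes; only the key does, which is the sense in which the view belongs to the user rather than the author.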