Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 5, 2013

Doing More with the Hortonworks Sandbox

Filed under: Data,Dataset,Hadoop,Hortonworks — Patrick Durusau @ 2:01 pm

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced, garnering an incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office re-doing that merging as a one-off exercise, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.
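To make the “design once, run many times” point concrete, here is a minimal sketch in Python. It is not a topic map engine and the field names are hypothetical; it only shows the shape of the idea: the mapping between data sets lives in one small, editable structure, separate from the merge logic.

```python
# Minimal sketch of "design the merge once, run it many times".
# Field names are hypothetical; a real topic map engine does far more
# (scopes, association types, merge rules), but the editable-mapping
# idea is the same.

MERGE_MAP = {
    "crm.csv":   {"id": "customer_id", "email": "email_addr"},
    "sales.csv": {"id": "cust_no",     "email": "contact_email"},
}

def normalize(record, source):
    """Map a source-specific record onto shared keys."""
    fields = MERGE_MAP[source]
    return {shared: record[native] for shared, native in fields.items()}

def merge(records_by_source):
    """Merge records that share a normalized email address."""
    merged = {}
    for source, records in records_by_source.items():
        for rec in records:
            norm = normalize(rec, source)
            merged.setdefault(norm["email"].lower(), {}).update(norm)
    return merged

# If requirements change (say sales.csv renames cust_no), only MERGE_MAP
# changes; the merge logic and everything built on it stays put.
```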

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

February 3, 2013

Case study: million songs dataset

Filed under: Data,Dataset,GraphChi,Graphs,Machine Learning — Patrick Durusau @ 6:58 pm

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive’s suggestion, we now have an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 
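If you want the core idea without the GraphChi toolkit, here is a minimal sketch (my reading of the item-similarity approach, not Aiolli’s or GraphChi’s code): score each unseen song by summing its similarity to the songs in the user’s listening history, with an exponent controlling how much the closest matches dominate.

```python
import numpy as np

def recommend(user_items, sim, q=3.0, top_n=10):
    """Rank unseen items for one user from an item-item similarity matrix.

    user_items : indices of items the user has listened to
    sim        : (n_items, n_items) similarity matrix, values in [0, 1]
    q          : exponent weighting the closest matches (illustrative default)
    """
    scores = sim[:, user_items] ** q   # similarity of every item to the user's items
    scores = scores.sum(axis=1)        # aggregate over the listening history
    scores[user_items] = -np.inf       # never recommend what was already heard
    return np.argsort(scores)[::-1][:top_n]

# Toy example: 4 items, user has heard items 0 and 2.
sim = np.array([[1.0, 0.2, 0.7, 0.1],
                [0.2, 1.0, 0.3, 0.9],
                [0.7, 0.3, 1.0, 0.4],
                [0.1, 0.9, 0.4, 1.0]])
print(recommend([0, 2], sim, top_n=2))
```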

Just in case you need some data for practice with your GraphChi installation. 😉

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.

January 29, 2013

Bad News From UK: … brows up, breasts down

Filed under: Data,Dataset,Humor,Medical Informatics — Patrick Durusau @ 6:51 pm

UK plastic surgery statistics 2012: brows up, breasts down by Ami Sedghi.

From the post:

Despite a recession and the government launching a review into cosmetic surgery following the breast implant scandal, plastic surgery procedures in the UK were up last year.

A total of 43,172 surgical procedures were carried out in 2012 according to the British Association of Aesthetic Plastic Surgeons (BAAPS), an increase of 0.2% on the previous year. Although there wasn’t a big change for overall procedures, anti-ageing treatments such as eyelid surgery and face lifts saw double digit increases.

Breast augmentation (otherwise known as ‘boob jobs’) were still the most popular procedure overall although the numbers dropped by 1.6% from 2011 to 2012. Last year’s stats took no account of the breast implant scandal so this is the first release of figures from BAAPS to suggest what impact the scandal has had on the popular procedure.

Just for comparison purposes:

Country   Procedures   Population     Percent of population treated
UK        43,172       62,641,000     0.069%
US        9,200,000    313,914,000    2.93%
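A quick check of the arithmetic behind that table (population figures as above):

```python
# Quick check of the rates in the table above.
uk = 43_172 / 62_641_000
us = 9_200_000 / 313_914_000
print(f"UK: {uk:.3%}")   # ~0.069%
print(f"US: {us:.2%}")   # ~2.93%
```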

Perhaps beauty isn’t one of the claimed advantages of socialized medicine?

January 22, 2013

Click Dataset [HTTP requests]

Filed under: Dataset,Graphs,Networks,WWW — Patrick Durusau @ 2:41 pm

Click Dataset

From the webpage:

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.
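Not the project’s collection code, but a minimal scapy sketch of the same idea, if you want to see what “filter on TCP port 80, then regex the payload for GET requests” looks like in practice (assumes scapy is installed and you have capture privileges):

```python
import re
from scapy.all import sniff, IP, Raw  # pip install scapy; needs capture rights

GET_RE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")

def handle(pkt):
    # Coarse selection is done by the BPF filter below; here we confirm
    # the payload actually starts with an HTTP GET request.
    if pkt.haslayer(IP) and pkt.haslayer(Raw):
        m = GET_RE.match(bytes(pkt[Raw].load))
        if m:
            print(pkt[IP].src, "->", pkt[IP].dst, m.group(1).decode(errors="replace"))

sniff(filter="tcp dst port 80", prn=handle, store=False)
```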

Data available under terms and restrictions, including transfer by physical hard drive (~ 2.5 TB of data).

Intrigued by the notion of a “subset of the Web graph actually traversed by users.”

Does that mean that semantic annotation should occur on the portion of the “…Web graph actually traversed by users” before reaching other parts?

If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?

January 17, 2013

Complete Guardian Dataset Listing!

Filed under: Data,Dataset,News — Patrick Durusau @ 7:28 pm

All our datasets: the complete index by Chris Cross.

From the post:

Lost track of the hundreds of datasets published by the Guardian Datablog since it began in 2009? Thanks to ScraperWiki, this is the ultimate list and resource. The table below is live and updated every day – if you’re still looking for that ultimate dataset, the chance is we’ve already done it. Click below to find out

I am simply in awe of the number of datasets produced by the Guardian since 2009.

A few of the more interesting titles include:

You will find things in the hundreds of datasets you have wondered about and other things you can’t imagine wondering about. 😉

Enjoy!

December 25, 2012

Quandl [> 2 million financial/economic datasets]

Filed under: Data,Dataset,Time Series — Patrick Durusau @ 4:19 pm

Quandl (alpha)

From the homepage:

Quandl is a collaboratively curated portal to over 2 million financial and economic time-series datasets from over 250 sources. Our long-term mission is to make all numerical data on the internet easy to find and easy to use.

Interesting enough, but the details from the “about” page are even more so:

Our Vision

The internet offers a rich collection of high quality numerical data on thousands of subjects. But the potential of this data is not being reached at all because the data is very difficult to actually find. Furthermore, it is also difficult to extract, validate, format, merge, and share.

We have a solution: We’re building an intelligent search engine for numerical data. We’ve developed technology that lets people quickly and easily add data to Quandl’s index. Once this happens, the data instantly becomes easy to find and easy to use because it gains 8 essential attributes:

Findability: Quandl is essentially a search engine for numerical data. Every search result on Quandl is an actual data set that you can use right now. Once data from anywhere on the internet becomes known to Quandl, it becomes findable by search and (soon) by browse.
Structure: Quandl is a universal translator for data formats. It accepts numerical data no matter what format it happens to be published in and then delivers it in any format you request it. When you find a dataset on Quandl, you’ll be able to export anywhere you want, in any format you want.
Validity: Every dataset on Quandl has a simple link back to the same data on the publisher’s web site which gives you 100% certainty on validity.
Fusibility: Any data set on Quandl is totally compatible with any and all other data on Quandl. You can merge multiple datasets on Quandl quickly and easily (coming soon).
Permanence: Once a dataset is on Quandl, it stays there forever. It is always up-to-date and available at a permanent, unchanging URL.
Connectivity: Every dataset on Quandl is accessible by a simple API. Whether or not the original publisher offered an API no longer matters because Quandl always does. Quandl is the universal API for numerical data on the internet.
Recency: Every single dataset on Quandl is guaranteed to be the most recent version of that data, retrieved afresh directly from the original publisher.
Utility: Data on Quandl is organized and presented for maximum utility: Actual data is examinable immediately; the data is graphed (properly); description, attribution, units, and export tools are clear and concise.

I have my doubts about the “fusibility” claims. You can check the US Leading Indicators data list and note that “level” and “units” use different units of measurement. Other semantic issues lurk just beneath the surface.
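A hypothetical example of the kind of reconciliation “fusibility” glosses over: the series names, values and units below are made up, but the point stands that two series have to be converted to a shared unit before merging them means anything.

```python
import pandas as pd

# Hypothetical series: same indicator from two publishers, different units.
gdp_millions = pd.Series([15_000_000, 15_500_000], index=["2011", "2012"],
                         name="gdp_usd_millions")
gdp_dollars  = pd.Series([1.50e13, 1.55e13], index=["2011", "2012"],
                         name="gdp_usd")

# Naive "fusion": the numbers sit side by side but are not comparable.
naive = pd.concat([gdp_millions, gdp_dollars], axis=1)

# Reconciled merge: convert to a shared unit before comparing or combining.
reconciled = pd.concat([(gdp_millions * 1e6).rename("gdp_usd_from_millions"),
                        gdp_dollars], axis=1)
print(naive, reconciled, sep="\n\n")
```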

Still, the name of the engine does not begin with “B” or “G” and it illustrates that there is enormous potential for curated data collections.

Come to think of it, topic maps are curated data collections.

Are you in need of a data curator?

I first saw this in a tweet by Gregory Piatetsky.

November 23, 2012

d8taplex [UK bicycle theft = terrorism?]

Filed under: Dataset — Patrick Durusau @ 11:27 am

d8taplex

Bills itself as:

Explore over 50 thousand data sets containing over 1 million time series.

But searching at random (there was no description of which 50,000 datasets were in play):

Astronomy – 8 “hits” – All doctorates awarded by field of study.

Chemistry – 22 “hits” – Degrees, students, periodical prices.

Physics – 38 “hits” – Degrees, students, periodical prices, staff.

Automobile accidents – 492 “hits” – What you would expect about road conditions, condition of drivers, etc.

Terrorist attacks – 11 “hits” –

containing document: Crime in England and Wales 2009/10: Supplementary Tables: Nature of burglary, vehicle-related theft, bicycle theft, other household theft, personal and other theft, vandalism and violent crime | data.gov.uk
anchor text: Personal theft

I really don’t equate “bicycle theft” with an act of terrorism. Inconvenient yes, terrorism no.

Unless you are getting money from the U.S. Department of Homeland Security of course. They fund studies of how to hide power transmission stations that are too large and dependent on air cooling to be enclosed.

I guess putting blank spots on maps would only serve to highlight their presence. DHS could ban the manufacture of printed maps. Only allow electronic ones. Which can be distorted to show or conceal whatever the flavor of terrorism is for the week.

It would not take long for the only content of the map to be “You are here.” With no markers as to where “here” might be. But then you are there so look around.

November 21, 2012

Archive of datasets bundled with R

Filed under: Data,Dataset,R — Patrick Durusau @ 12:19 pm

Archive of datasets bundled with R by Nathan Yau.

From the post:

R comes with a lot of datasets, some with the core distribution and others with packages, but you’d never know which ones unless you went through all the examples found at the end of help documents. Luckily, Vincent Arel-Bundock cataloged 596 of them in an easy-to-read page, and you can quickly download them as CSV files.

Many of the datasets are dated, going back to the original distribution of R, but it’s a great resource for teaching or if you’re just looking for some data to play with.
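If you would rather pull one of those CSV files into Python than R, something like this works. The URL pattern is my assumption from the current mirror of Vincent’s catalog, so check the index page for exact paths.

```python
import pandas as pd

# Assumed URL pattern for the Rdatasets CSV mirror; verify against the index page.
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.describe())
```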

A great find! Thanks Nathan and to Vincent for pulling it together!

November 19, 2012

FindTheData

Filed under: Data,Data Source,Dataset — Patrick Durusau @ 7:03 pm

FindTheData

From the about page:

At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

FindTheData is organized into 9 broad categories

Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

Traditional search is a great hammer but sometimes you need a wrench.

Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

If you think in the same categories as the authors, sorta like using LCSH, you are in like Flint. If you don’t, well, your mileage may vary.

While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

Still, you may find it useful to browse for datasets that are new to you.

November 10, 2012

IOGDS: International Open Government Dataset Search

Filed under: Dataset,Linked Data,RDF,SPARQL — Patrick Durusau @ 9:21 am

IOGDS: International Open Government Dataset Search

Description:

The TWC International Open Government Dataset Search (IOGDS) is a linked data application based on metadata “scraped” from hundreds of international dataset catalog websites publishing a rich variety of government data. Metadata extracted from these catalog websites is automatically converted to RDF linked data and re-published via the TWC LOGD SPARQL endpoint and made available for download. The TWC IOGDS demo site features an efficient, reconfigurable faceted browser with search capabilities offering a compelling demonstration of the value of a common metadata model for open government dataset catalogs. We believe that the vocabulary choices demonstrated by IOGDS highlights the potential for useful linked data applications to be created from open government catalogs and will encourage the adoption of such a standard worldwide.
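If you want to poke at the metadata programmatically rather than through the faceted browser, a minimal SPARQLWrapper sketch looks roughly like this. The endpoint URL and the title predicate are my assumptions; check the IOGDS/LOGD pages for the current details.

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install SPARQLWrapper

# Endpoint URL is an assumption; confirm it on the TWC LOGD / IOGDS pages.
sparql = SPARQLWrapper("http://logd.tw.rpi.edu/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT DISTINCT ?dataset ?title WHERE {
        ?dataset <http://purl.org/dc/terms/title> ?title .
    } LIMIT 10
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```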

In addition to the datasets you will find tutorials, videos, demos, tools and technologies and other resources.

Useful whether you are looking for Linked Data as such or for data to re-use in other ways.

Seen in a tweet by Tim O’Reilly.

October 11, 2012

Using (Spring Data) Neo4j for the Hubway Data Challenge [Boston Biking]

Filed under: Challenges,Data,Dataset,Graphs,Neo4j,Networks,Spring — Patrick Durusau @ 12:33 pm

Using (Spring Data) Neo4j for the Hubway Data Challenge by Michael Hunger.

From the post:

Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

(graphics omitted)

Hubway is a bike sharing service which is currently expanding worldwide. In the Data challenge they offer the CSV-data of their 95 Boston stations and about half a million bike rides up until the end of September. The challenge is to provide answers to some posted questions and develop great visualizations (or UI’s) for the Hubway data set. The challenge is also supported by MAPC (Metropolitan Area Planning Council).

Useful tips on importing data into Neo4j and on modeling this particular dataset.

Not to mention the resulting database as well!
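For a rough feel of the modeling, here is a minimal sketch. It is not Michael’s Spring Data code: it assumes a local Neo4j instance, the modern py2neo API, and hypothetical CSV column names.

```python
import csv
from py2neo import Graph  # pip install py2neo; assumes Neo4j running locally

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Column names are hypothetical; check the headers in the Hubway CSV export.
with open("hubway_trips.csv", newline="") as f:
    for trip in csv.DictReader(f):
        graph.run(
            """
            MERGE (a:Station {id: $start})
            MERGE (b:Station {id: $end})
            CREATE (a)-[:TRIP {duration: toInteger($duration)}]->(b)
            """,
            start=trip["start_station"],
            end=trip["end_station"],
            duration=trip["duration"],
        )

# Example query: the ten busiest station-to-station routes.
print(graph.run(
    "MATCH (a)-[t:TRIP]->(b) RETURN a.id, b.id, count(t) AS trips "
    "ORDER BY trips DESC LIMIT 10").data())
```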

PS: From the challenge site:

Submission will open here on Friday, October 12, 2012.

Deadline

MIDNIGHT (11:59 p.m.) on Halloween,
Wednesday, October 31, 2012.

Winners will be announced on Wednesday, November 7, 2012.

Prizes:

  • A one-year Hubway membership
  • Hubway T-shirt
  • Bern helmet
  • A limited edition Hubway System Map—one of only 61 installed in the original Hubway stations.

For other details, see the challenge site.

October 10, 2012

Interesting large scale dataset: D4D mobile data [Deadline: October 31, 2012]

Filed under: Data,Data Mining,Dataset,Graphs,Networks — Patrick Durusau @ 4:19 pm

Interesting large scale dataset: D4D mobile data by Danny Bickson.

From the post:

I got the following from Prof. Scott Kirkpatrick.

Write a 250-words research project and get access within a week to the largest ever released mobile phone datasets: datasets based on 2.5 billion records, calls and text messages exchanged between 5 million anonymous users over 5 months.

Participation rules: http://www.d4d.orange.com/

Description of the datasets: http://arxiv.org/abs/1210.0137

The “Terms and Conditions” by Orange allows the publication of results obtained from the datasets even if they do not directly relate to the challenge.

Cash prizes for winning participants and an invitation to present the results at the NetMob conference to be held May 2-3, 2013 at the Medialab at MIT (www.netmob.org).

Deadline: October 31, 2012

Looking to exercise your graph software? Compare to other graph software? Do interesting things with cell phone data?

This could be your chance!

September 22, 2012

Datasets! Datasets! Get Your Datasets Here!

Filed under: Data,Dataset — Patrick Durusau @ 3:59 pm

Datasets from René Pichardt’s group:

The project KONECT (Koblenz Network Collection) has extracted and made available four new network datasets based on information in the English Wikipedia, using data from the DBpedia project. The four network datasets are: The bipartite network of writers and their works (113,000 nodes and 122,000 edges) The bipartite network of producers and the works they […]

Assume you have a knowledge base containing entities and their properties or relations with other entities. For instance, think of a knowledge base about movies, actors and directors. For the movies you have structured knowledge about the title and the year they were made in, while for the actors and directors you might have their […]

The Institute for Web Science and Technologies (WeST) at the University of Koblenz-Landau is making available a new series of datasets: The Wikipedia hyperlink networks in the eight largest Wikipedia languages: http://konect.uni-koblenz.de/networks/wikipedia_link_en – English http://konect.uni-koblenz.de/networks/wikipedia_link_de – German http://konect.uni-koblenz.de/networks/wikipedia_link_fr – French http://konect.uni-koblenz.de/networks/wikipedia_link_ja – Japanese http://konect.uni-koblenz.de/networks/wikipedia_link_it – Italian http://konect.uni-koblenz.de/networks/wikipedia_link_pt – Portuguese http://konect.uni-koblenz.de/networks/wikipedia_link_ru – Russian The largest dataset, […]

I found an article about ohloh, a directory created by Black Duck Software with over 500,000 open source projects. They offer a RESTful API and the data is available under the Creative Commons Attribution 3.0 licence. An interesting aspect is Kudos. With a Kudo, an ohloh user can thank another user for his or her contribution, so […]

I started to mention these earlier in the week but decided they needed a separate post.

September 13, 2012

Europeana opens up data on 20 million cultural items

Filed under: Archives,Data,Dataset,Europeana,Library,Museums — Patrick Durusau @ 3:25 pm

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe‘s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?

September 12, 2012

Do You Just Talk About The Weather?

Filed under: Dataset,Machine Learning,Mahout,Weather Data — Patrick Durusau @ 9:24 am

After reading this post by Alex you will still just be talking about the weather, but you may have something interesting to say. 😉

Locating Mountains and More with Mahout and Public Weather Dataset by Alex Baranau

From the post:

Recently I was playing with Mahout and a public weather dataset. In this post I will describe how I used the Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in the US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!
Tools & Data

The data and tools used during this effort are the Apache Mahout project and a public weather statistics dataset. Mahout is a machine learning library which provides a handful of machine learning tools. During this effort I used just a small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.
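Alex uses Mahout, but the gap-filling idea can be sketched without it. A naive version (hypothetical station data, not Alex’s code) fills a station’s missing daily temperature with the mean of its k nearest stations:

```python
import numpy as np

def fill_gaps(temps, coords, k=3):
    """Fill NaN daily temperatures with the mean of the k nearest stations.

    temps  : (n_stations, n_days) array with NaN for missing measurements
    coords : (n_stations, 2) array of lat/lon (treated as planar for simplicity)
    """
    filled = temps.copy()
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a station is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    for s, d in zip(*np.where(np.isnan(temps))):
        filled[s, d] = np.nanmean(temps[neighbours[s], d])
    return filled

# Toy example: 4 stations, 3 days, one missing reading at station 0, day 1.
temps = np.array([[10.0, np.nan, 12.0],
                  [11.0, 11.5, 12.5],
                  [ 9.0,  9.5, 10.0],
                  [20.0, 21.0, 22.0]])
coords = np.array([[0, 0], [0, 1], [1, 0], [10, 10]])
print(fill_gaps(temps, coords))
```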

What other questions could you explore with the weather data set?

The real power of “big data” access and tools may be that we no longer have to rely on the summaries of others.

Summaries still have a value-add, perhaps even more so when the original data is available for verification.

September 8, 2012

Women’s representation in media:… [Counting and Evaluation]

Filed under: Data,Dataset,News — Patrick Durusau @ 10:46 am

Women’s representation in media: the best data on the subject to date

From the post:

In the first of a series of datablog posts looking at women in the media, we present one year of every article published by the Guardian, Telegraph and Daily Mail, with each article tagged by section, gender, and social media popularity.

(images omitted)

The Guardian datablog has joined forces with J. Nathan Matias of the MIT media lab and data scientist Lynn Cherny to collect what is, to our knowledge, the most comprehensive, high resolution dataset available on news content by gender and audience interest.

The dataset covers from July 2011 to June 2012. The post describes the data collection and some rough counts by gender, etc. More analysis to follow.

The data should not be impacted by:

Opinion sections can shape a society’s opinions and therefore are an important measure of women’s voices in society.

It isn’t clear how those claims go together.

Anything being possible, the statement that “…opinion sections can shape a society’s opinions…” is trivially true.

But even if true (an unwarranted assumption), how does that lead to it being “…an important measure of women’s voices in society[?]”

Could be true and have nothing to do with measuring “…women’s voices in society.”

Could be false and have nothing to do with measuring “…women’s voices in society.”

As well as the other possibilities.

Just because we can count something, doesn’t imbue it with relevance for something else that is harder to evaluate.

Women’s voices in society are important. Let’s not demean them by grabbing the first thing we can count as their measure.

July 27, 2012

London 2012 Olympic athletes: the full list

Filed under: Data,Dataset — Patrick Durusau @ 4:10 am

London 2012 Olympic athletes: the full list

Simon Rogers of the Guardian reports scraping together the full list of Olympic athletes into a single data set.

Simon says:

We’ve just scratched the surface of this dataset – you can download it below. What can you do with it?

I would ask the question somewhat differently: Having the data set, what can you reliably add to it?

Aggregate data analysis is interesting but then so is aggregated data on the individual athletes.

PS: If you do something interesting with the data set, be sure to let the Guardian know.

June 27, 2012

Kiss the Weatherman [Weaponizing Data]

Filed under: BigData,Data,Dataset,Weather Data — Patrick Durusau @ 8:05 am

Kiss the Weatherman by James Locus.

From the post:

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas. In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact the landscape of the planet.

The effects of extreme weather can send terrible ripples throughout an entire community. Unexpected cold snaps or overly hot summers can devastate crop yields, forcing producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current measuring technologies to more accurately predict large-scale weather patterns. Weathermen are good at predicting weather but poor at predicting climate. Weather occurs over a shorter period of time and can be reliably predicted within a 3-day timeframe. Climate stretches many months, years, or even centuries. Matching historical climate data with current weather data to make future weather and climate predictions is a major challenge for scientists.

James has a good survey of both data sources and researchers working on using “big data” (read historical weather data) for both weather (short term) and climate (longer term) prediction.

Weather data by itself is just weather data.

What other data would you combine with it and on what basis to weaponize the data?

No one can control the weather but you can control your plans for particular weather events.

June 24, 2012

Closing In On A Million Open Government Data Sets

Filed under: Dataset,Geographic Data,Government,Government Data,Open Data — Patrick Durusau @ 7:57 pm

Closing In On A Million Open Government Data Sets by Jennifer Zaino.

From the post:

A million data sets. That’s the number of government data sets out there on the web that we have closed in on.

“The question is, when you have that many, how do you search for them, find them, coordinate activity between governments, bring in NGOs,” says James A. Hendler, Tetherless World Senior Constellation Professor, Department of Computer Science and Cognitive Science Department at Rensselaer Polytechnic Institute, a principal investigator of its Linking Open Government Data project, and Internet web expert for data.gov. He is also connected with many other governments’ open data projects. “Semantic web tools organize and link the metadata about these things, making them searchable, explorable and extensible.”

To be more specific, Hendler at SemTech a couple of weeks ago said there are 851,000 open government data sets across 153 catalogues from 30-something countries, with the three biggest representatives, in terms of numbers, at the moment being the U.S., the U.K, and France. Last week, the one million threshold was crossed.

About 410,000 of these data sets are from the U.S. (federal, state, city, county, tribal included), including quite a large number of geo-data sets. The U.S. government’s goal is to put “lots and lots and lots of stuff out there” and let people figure out what they want to do with it, he notes.

My question about data that “…[is] searchable, explorable and extensible” is whether anyone wants to search, explore or extend it.

Simply piling up data to say you have a large pile of data doesn’t sound very useful.

I would rather have a smaller pile of data that included contract/testing transparency on anti-terrorism IT projects, for example. If the systems aren’t working, then disclosing them isn’t going to make them work any less well.

Not that anyone need fear transparency or failure to perform. The TSA has failed to perform for more than a decade now, failed to catch a single terrorist and it remains funded. Even when it starts groping children, passengers are so frightened that even that outrage passes without serious opposition.

Still, it would be easier to get people excited about mining government data if the data weren’t so random or marginal.

May 9, 2012

Data.gov launches developer community

Filed under: Dataset,Government Data — Patrick Durusau @ 2:15 pm

Data.gov launches developer community

Federal Computer Week reports:

Data.gov has launched a new community for software developers to share ideas, collaborate or compete on projects and request new datasets.

Developer.data.gov joins a growing list of communities and portals tapping into Data.gov’s datasets, including those for health, energy, education, law, oceans and the Semantic Web.

The developer site is set up to offer access to federal agency datasets, source code, applications and ongoing developer challenges, along with blogs and forums where developers can discuss projects and share ideas.

Source: FCW (http://s.tt/1azwt)

Depending upon your developer skills, this could be a good place to hone them.

Not to mention having a wealth of free data sets at hand.

April 26, 2012

Simple tools for building a recommendation engine

Filed under: Dataset,R,Recommendation — Patrick Durusau @ 6:31 pm

Simple tools for building a recommendation engine by Joseph Rickert.

From the post:

Revolution’s resident economist, Saar Golde, is very fond of saying that “90% of what you might want from a recommendation engine can be achieved with simple techniques”. To illustrate this point (without doing a lot of work), we downloaded the million row movie dataset from www.grouplens.org with the idea of just taking the first obvious exploratory step: finding the good movies. Three zipped up .dat files comprise this data set. The first file, ratings.dat, contains 1,000,209 records of UserID, MovieID, Rating, and Timestamp for 6,040 users rating 3,952 movies. Ratings are whole numbers on a 1 to 5 scale. The second file, users.dat, contains the UserID, Gender, Age, Occupation and Zip-code for each user. The third file, movies.dat, contains the MovieID, Title and Genre associated with each movie.
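That “first obvious exploratory step” translates into a few lines of pandas. The “::” separator and file names follow the description above; the minimum rating count is my own cutoff to keep one-vote wonders out of the top ten.

```python
import pandas as pd

# ratings.dat and movies.dat use "::" as a separator, per the description above.
ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["UserID", "MovieID", "Rating", "Timestamp"])
movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["MovieID", "Title", "Genre"], encoding="latin-1")

stats = ratings.groupby("MovieID")["Rating"].agg(["mean", "count"])
good = stats[stats["count"] >= 1000].nlargest(10, "mean")   # cutoff is arbitrary
print(good.join(movies.set_index("MovieID")["Title"]))
```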

Curious, if a topic map engine performed 90% of the possible merges in a topic map, would that be enough?

Would your answer differ if the topic map had less than 10,000 topics and associations versus a topic map with 100 million topics and associations?

Would your answer differ based on a timeline of the data? Say the older the data, the less reliable the merging: recent medical data, < 1% error rate, up to ten years; ten to twenty years, <= 10% error rate; more than twenty years, best efforts. Which of course raises the question of how you would test for conformance to such requirements?

The Shades of Time Project

Filed under: Data,Dataset,Diversity — Patrick Durusau @ 6:31 pm

The Shades of TIME project by Drew Conway.

Drew writes:

A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March, 1923 to March, 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set I was left wondering, “What can I ask of the data?”

After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?

I developed a data visualization tool, I’m calling the Shades of TIME, to explore the answer to that question.

An interesting data set and an illustration of why topic map applications are more useful if they have dynamic merging (user selected).

Presented with the same evidence, the covers of TIME magazine I most likely would have:

  • Mapped people on the covers to historical events
  • Mapped people on the covers to additional historical resources
  • Mapped covers into library collections
  • etc.

I would not have set out to explore the diversity in skin color on the covers. In part because I remember when it changed. That is part of my world knowledge. I don’t have to go looking for evidence of it.

My purpose isn’t to say authors, even topic map authors, should avoid having a point of view. That isn’t possible in any event. What I am suggesting is that, to the extent possible, users be enabled to impose their own views on a topic map as well.

April 18, 2012

Windows Azure Marketplace

Filed under: Dataset,Windows Azure Marketplace — Patrick Durusau @ 6:09 pm

Windows Azure Marketplace

The location of the weather data sets for the Download 10,000 Days of Free Weather… post.

I was somewhat disappointed by the small number of data sets and equally overwhelmed when I saw the number of applications at this site.

One that stood out was an EDI to XML translation service, featuring “manual” translations. Yikes!

But the principle was what interested me.

That is the offering of an interface that “translates” data that users can then consume via some other application.

There are any number of government data sets, in a variety of formats, with diverse semantics, that could be useful, if they were only available in a common format with reconciled semantics. (True, I would prefer to capture their true diversity but also need to have a product users will buy.)

To make that repeatable for a large number of data sets, that is, to create the tool that offers the common format and reconciled semantics, I am thinking that a topic map would be quite appropriate.

Of course, need to find data sets that are of commercial interest (unlike campaign contribution datasets, businesses already know which members of government they own and which they don’t).

Thoughts? Suggestions?

April 17, 2012

Download 10,000 Days of Free Weather Data for Almost Any Location Worldwide

Filed under: Data,Dataset,PowerPivot — Patrick Durusau @ 7:12 pm

Download 10,000 Days of Free Weather Data for Almost Any Location Worldwide

A very cool demonstration of PowerPivot with weather data.

I don’t have PowerPivot (or Office 2010) but will be correcting that in the near future.

Pointers to importing diverse data into PowerPivot?

April 12, 2012

30 Places to Find Open Data on the Web

Filed under: Data,Dataset — Patrick Durusau @ 7:04 pm

30 Places to Find Open Data on the Web by Romy Misra.

From the post:

Finding an interesting data set and a story it tells can be the most difficult part of producing an infographic or data visualization.

Data visualization is the end artifact, but it involves multiple steps – finding reliable data, getting the data in the right format, cleaning it up (an often underestimated step in the amount of time it takes!) and then finding the story you will eventually visualize.

Following is a list of useful resources for finding data. Your needs will vary from one project to another, but this list is a great place to start — and bookmark.

A very good collection of data sources.

From the comments as of April 10, 2012, you may also want to consider:

http://data.gov.uk/

http://thedatahub.org/

http://www.freebase.com/

(The photography link in the comments is spam, don’t bother.)

Other data sources that you would suggest?

April 6, 2012

Mapped: British, Spanish and Dutch Shipping 1750-1800 (Stable Identifiers)

Filed under: Dataset,Graphics,R,Visualization — Patrick Durusau @ 6:49 pm

Mapped: British, Spanish and Dutch Shipping 1750-1800 by James Cheshire.

From the post:

I recently stumbled upon a fascinating dataset which contains digitised information from the log books of ships (mostly from Britain, France, Spain and The Netherlands) sailing between 1750 and 1850. The creation of this dataset was completed as part of the Climatological Database for the World’s Oceans 1750-1850 (CLIWOC) project. The routes are plotted from the lat/long positions derived from the ships’ logs. I have played around with the original data a little to clean it up (I removed routes where there was a gap of over 1000km between known points, and only mapped to the year 1800). As you can see the British (above) and Spanish and Dutch (below) had very different trading priorities over this period. What fascinates me most about these maps is the thousands (if not millions) of man hours required to create them. Today we churn out digital spatial information all the time without thinking, but for each set of coordinates contained in these maps a ship and her crew had to sail there and someone had to work out a location without GPS or reliable charts.

Truly awesome display of data! You will have to see the maps to appreciate it.

Note the space between creation and use of the data. Over two hundred (200) years.

“Stable” URIs are supposed to be what? Twelve (12) to fifteen (15) years?

What older identifiers can you think of? (Hint: Ask a librarian.)

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the cornerstone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.
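A minimal boto3 sketch of browsing the bucket. The bucket name and prefix are my assumptions, so check the 1000 Genomes page on AWS for the current layout, and use prefixes rather than trying to list 200 TB of keys.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public data set: unsigned requests are enough. Bucket name and prefix are
# assumptions; check the AWS 1000 Genomes page for the current layout.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="1000genomes", Prefix="release/", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(f'{obj["Size"]:>15,}  {obj["Key"]}')
```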

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in such texts?

It is only in our view of it that data looks smooth or simple. Or complex.

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Dataset,Time Series — Patrick Durusau @ 5:46 pm

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

March 28, 2012

NASA-GISS Datasets and Images

Filed under: Dataset,NASA — Patrick Durusau @ 4:22 pm

NASA-GISS Datasets and Images

Data and image sets from the Goddard Institute for Space Studies.

A number of interesting data/image sets along with links to similar material.

If you are looking for data sets to integrate with other public data sets, definitely worth a look.

March 27, 2012

Publicly available large data sets for database research

Filed under: Data,Dataset — Patrick Durusau @ 7:17 pm

Publicly available large data sets for database research by Daniel Lemire.

Daniel summaries large (> 20 GB) data sets that may be useful for database research.

If you know of any data sets that have been overlooked or that become available, please post a note on this entry at Daniel’s blog.
