Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 1, 2014

Big data sets available for free

Filed under: BigData,Data,Dataset — Patrick Durusau @ 7:54 pm

Big data sets available for free by Vincent Granville.

From the post:

A few data sets are accessible from our data science apprenticeship web page.

(graphic omitted)

  • Source code and data for our Big Data keyword correlation API (see also section in separate chapter, in our book)
  • Great statistical analysis: forecasting meteorite hits (see also section in separate chapter, in our book)
  • Fast clustering algorithms for massive datasets (see also section in separate chapter, in our book)
  • 53.5 billion clicks dataset available for benchmarking and testing
  • Over 5,000,000 financial, economic and social datasets
  • New pattern to predict stock prices, multiplies return by factor 5 (stock market data, S&P 500; see also section in separate chapter, in our book)
  • 3.5 billion web pages: The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages
  • Another large data set – 250 million data points: This is the full resolution GDELT event dataset running January 1, 1979 through March 31, 2013 and containing all data fields for each event record.
  • 125 Years of Public Health Data Available for Download

Just in case you are looking for data for a 2014 demo or data project!

December 31, 2013

Curated Dataset Lists

Filed under: Data,Dataset — Patrick Durusau @ 3:01 pm

6 dataset lists curated by data scientists by Scott Haylon.

From the post:

Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar. We’re saving bookmarks and sharing datasets with our team on a nearly-daily basis.

There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists.

Below is a collection of six great dataset lists from both famous data scientists and those who aren’t well-known:

Here you will find lists of datasets by:

  • Peter Skomoroch
  • Hilary Mason
  • Kevin Chai
  • Jeff Hammerbacher
  • Jerry Smith
  • Gregory Piatetsky-Shapiro

Great lists of datasets; unfortunately, they are neither deduped nor ranked by the number of collections in which they appear.

December 29, 2013

Sanity Checks

Filed under: Data,Data Analysis — Patrick Durusau @ 3:11 pm

Being paranoid about data accuracy! by Kunal Jain.

Kunal knew a long meeting was developing after this exchange at its beginning:

Kunal: How many rows do you have in the data set?

Analyst 1: (After going through the data set) X rows

Kunal: How many rows do you expect?

Analyst 1 & 2: Blank looks on their faces

Kunal: How many events / data points do you expect in the period / every month?

Analyst 1 & 2: …. (None of them had a clue)
The number of rows in the data set looked higher to me. The analysts had missed it clearly, because they did not benchmark it against business expectation (or did not have it in the first place). On digging deeper, we found that some events had multiple rows in the data sets and hence the higher number of rows.
….

You have probably seen them before but Kunal has seven (7) sanity check rules that should be applied to every data set.

Unless, of course, the inability to answer simple questions about your data sets* is tolerated by your employer.

*Data sets become “yours” when you are asked to analyze them. Better to spot and report problems before they become evident in your results.
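To make the duplicate-row problem above concrete, here is a minimal sketch of the kind of check the analysts could have run. It assumes pandas and a hypothetical event_id column; the expected count still has to come from the business side:

```python
import pandas as pd

def sanity_check(df: pd.DataFrame, expected_rows: int, key: str = "event_id") -> None:
    """Print basic row-count and duplication checks before any analysis starts."""
    actual = len(df)
    print(f"rows: {actual} (expected about {expected_rows})")
    if abs(actual - expected_rows) > 0.1 * expected_rows:
        print("WARNING: row count differs from the business expectation by more than 10%")

    # The problem in Kunal's meeting: several rows per event.
    dupes = df[key].duplicated().sum()
    if dupes:
        print(f"WARNING: {dupes} rows repeat an existing {key!r} value")

# Toy data: three rows but only two distinct events.
events = pd.DataFrame({"event_id": [1, 2, 2], "amount": [10.0, 5.0, 5.0]})
sanity_check(events, expected_rows=2)
```

Nothing fancy, but it forces the "how many rows do you expect?" question to be answered before the analysis starts.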

December 24, 2013

Design, Math, and Data

Filed under: Dashboard,Data,Design,Interface Research/Design — Patrick Durusau @ 2:58 pm

Design, Math, and Data: Lessons from the design community for developing data-driven applications by Dean Malmgren.

From the post:

When you hear someone say, “that is a nice infographic” or “check out this sweet dashboard,” many people infer that they are “well-designed.” Creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We, at Datascope Analytics, argue that the same is true of designing effective data science tools, dashboards, engines, etc — in order to design effective dashboards, you must know what is possible.

As founders of Datascope Analytics, we have taken inspiration from Julio Ottino’s Whole Brain Thinking, learned from Stanford’s d.school, and even participated in an externship swap with IDEO to learn how the design process can be adapted to the particular challenges of data science (see interspersed images throughout).

If you fear “some assembly required,” imagine how users feel with new interfaces.

Good advice on how to explore potential interface options.

Do you think HTML5 will lead to faster mock-ups?

See for example:

21 Fresh Examples of Websites Using HTML5 (2013)

40+ Useful HTML5 Examples and Tutorials (2012)

HTML5 Website Showcase: 48 Potential Flash-Killing Demos (2009, est.)

December 23, 2013

Where Does the Data Go?

Filed under: Data,Semantic Inconsistency,Semantics — Patrick Durusau @ 2:20 pm

Where Does the Data Go?

A brief editorial on The Availability of Research Data Declines Rapidly with Article Age by Timothy H. Vines et al., which reads in part:

A group of researchers in Canada examined 516 articles published between 1991 and 2011, and “found that availability of the data was strongly affected by article age.” For instance, the team reports that the odds of finding a working email address associated with a paper decreased by 7 percent each year and that the odds of an extant dataset decreased by 17 percent each year since publication. Some data was technically available, the researchers note, but stored on floppy disk or on zip drives that many researchers no longer have the hardware to access.

One of the highlights of the article (which appears in Current Biology) reads:

Broken e-mails and obsolete storage devices were the main obstacles to data sharing

Curious because I would have ventured that semantic drift over twenty (20) years would have been a major factor as well.

Then I read the paper and discovered:

To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). [Under Results, the online edition has no page numbers]

The authors appear to have dodged the semantic bullet through their selection of data and by not reporting difficulties, if any, in using the data (19.5%) that was shared by the original authors.

Preservation of data is a major concern for researchers but I would urge that the semantics of data be preserved as well.

Imagine that feeling when you “ls -l” a directory and recognize only some of the file names writ large. Writ very large.
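As a back-of-the-envelope illustration of the 17% figure quoted above (the starting odds are my assumption, not a number from the paper):

```python
# The paper reports the odds of an extant dataset fall by 17% per year,
# i.e. the odds are multiplied by 0.83 for every year since publication.
start_odds = 4.0   # assumed: 4:1 odds (an 80% chance) that the data exists at publication
decline = 0.83

for years in (0, 5, 10, 20):
    odds = start_odds * decline ** years
    probability = odds / (1 + odds)
    print(f"{years:2d} years after publication: {probability:.0%} chance the data is still available")
```

Run forward twenty years, even generous starting odds leave you in single digits, which matches the gloomy tone of the editorial.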

December 21, 2013

Who Owns This Data?

Filed under: Data — Patrick Durusau @ 10:43 am

Visualizing Google’s million-row copyright claim dataset by Derrick Harris.

From the post:

Google released its latest transparency report on Thursday, and while much coverage of those reports rightfully focuses on governmental actions — requests for user data and requests to remove content — Google is also providing a trove of copyright data. In fact, the copyright section of its transparency report includes a massive, nearly 1-million-row dataset regarding claims of copyright infringement on URLs. (You can download all the data here. Unfortunately, it doesn’t include YouTube data, just search.) Here are some charts highlighting which copyright owners have been the most active since 2011.

The top four (4) takedown artists were:

  • The British Recorded Music Industry
  • The Recording Industry Association of America
  • Porn copyright owner Froytal Services
  • Fox
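If you want to reproduce counts like these from the raw download, here is a rough sketch; the file name and column name are my assumptions, so check them against the actual CSV:

```python
import pandas as pd

# Assumed file and column names; adjust to match the actual transparency report download.
requests = pd.read_csv("copyright-removal-requests.csv")
top_owners = (requests.groupby("Copyright owner")
                      .size()
                      .sort_values(ascending=False)
                      .head(10))
print(top_owners)
```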

Remember that the next time copyright discussions come up.

Copyright protects music companies (not artists), porn and Fox.

Makes you wish the copyright period were back at seven (7) years, doesn't it?

December 19, 2013

Military footprint

Filed under: Data,Maps — Patrick Durusau @ 7:50 pm

Military footprint by Nathan Yau.

Nathan has found a collection of aerial photographs of military bases around the world. Along with their locations.

Excellent information for repackaging with other information about military bases and their surroundings.

WARNING: Laws concerning the collection and/or sale of information about military bases vary from one jurisdiction to another.

Just so you know and can price your services appropriately.

UNESCO Open Access Publications [Update]

Filed under: Data,Government,Government Data,Open Data — Patrick Durusau @ 7:22 pm

UNESCO Open Access Publications

From the webpage:

Building peaceful, democratic and inclusive knowledge societies across the world is at the heart of UNESCO’s mandate. Universal access to information is one of the fundamental conditions to achieve global knowledge societies. This condition is not a reality in all regions of the world.

In order to help reduce the gap between industrialized countries and those in the emerging economy, UNESCO has decided to adopt an Open Access Policy for its publications by making use of a new dimension of knowledge sharing – Open Access.

Open Access means free access to scientific information and unrestricted use of electronic data for everyone. With Open Access, expensive prices and copyrights will no longer be obstacles to the dissemination of knowledge. Everyone is free to add information, modify contents, translate texts into other languages, and disseminate an entire electronic publication.

For UNESCO, adopting an Open Access Policy means to make thousands of its publications freely available to the public. Furthermore, Open Access is also a way to provide the public with an insight into the work of the Organization so that everyone is able to discover and share what UNESCO is doing.

You can access and use our resources for free by clicking here.

In May of 2013 UNESCO announced its Open Access policy.

Many organizations profess a belief in “Open Access.”

The real test is whether they practice “Open Access.”

December 14, 2013

Financial Data Accessible from R – Part IV

Filed under: Data,Finance Services,R — Patrick Durusau @ 7:31 pm

The R Trader blog is collecting sources of financial data accessible from R.

Financial Data Accessible from R IV

From the post:

DataMarket is the latest data source of financial data accessible from R I came across. A good tutorial can be found here. I updated the table and the descriptions below.

R Trader is a fairly new blog but I like the emphasis on data sources.

Not the largest list of data sources for financial markets I have ever seen but then it isn’t the quantity of data that makes a difference. (Ask the NSA about 9/11.)

What makes a difference is your skill at collecting the right data and at analyzing it.

December 13, 2013

Ancient texts published online…

Filed under: Bible,Data,Library — Patrick Durusau @ 5:58 pm

Ancient texts published online by the Bodleian and the Vatican Libraries

From the post:

The Bodleian Libraries of the University of Oxford and the Biblioteca Apostolica Vaticana (BAV) have digitized and made available online some of the world’s most unique and important Bibles and biblical texts from their collections, as the start of a major digitization initiative undertaken by the two institutions.

The digitized texts can be accessed on a dedicated website which has been launched today (http://bav.bodleian.ox.ac.uk). This is the first launch of digitized content in a major four-year collaborative project.

Portions of the Bodleian and Vatican Libraries’ collections of Hebrew manuscripts, Greek manuscripts, and early printed books have been selected for digitization by a team of scholars and curators from around the world. The selection process has been informed by a balance of scholarly and practical concerns; conservation staff at the Bodleian and Vatican Libraries have worked with curators to assess not only the significance of the content, but the physical condition of the items. While the Vatican and the Bodleian have each been creating digital images from their collections for a number of years, this project has provided an opportunity for both libraries to increase the scale and pace with which they can digitize their most significant collections, whilst taking great care not to expose books to any damage, as they are often fragile due to their age and condition.

The newly-launched website features zoomable images which enable detailed scholarly analysis and study. The website also includes essays and a number of video presentations made by scholars and supporters of the digitization project including the Archbishop of Canterbury and Archbishop Jean-Louis Bruguès, o.p. The website blog will also feature articles on the conservation and digitized techniques and methods used during the project. The website is available both in English and Italian.

Originally announced in April 2012, the four-year collaboration aims to open up the two libraries’ collections of ancient texts and to make a selection of remarkable treasures freely available online to researchers and the general public worldwide. Through the generous support of the Polonsky Foundation, this project will make 1.5 million digitized pages freely available over the next three years.

Only twenty-one (21) works up now but 1.5 million pages by the end of the project. This is going to be a treasure trove without end!

Associating these items with their cultural contexts of production, influence on other works, textual history, comments by subsequent works, across multiple languages, is a perfect fit for topic maps.

Kudos to both the Bodleian and the Vatican Libraries!

A million first steps [British Library Image Release]

Filed under: Data,Image Understanding,Library — Patrick Durusau @ 4:48 pm

A million first steps by Ben O’Steen.

From the post:

We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images into the Public Domain. The images themselves cover a startling mix of subjects: There are maps, geological diagrams, beautiful illustrations, comical satire, illuminated and decorative letters, colourful illustrations, landscapes, wall-paintings and so much more that even we are not aware of.

Which brings me to the point of this release. We are looking for new, inventive ways to navigate, find and display these ‘unseen illustrations’. The images were plucked from the pages as part of the ‘Mechanical Curator’, a creation of the British Library Labs project. Each image is individually addressable, online, and Flickr provides an API to access it and the image’s associated description.

We may know which book, volume and page an image was drawn from, but we know nothing about a given image. Consider the image below. The title of the work may suggest the thematic subject matter of any illustrations in the book, but it doesn’t suggest how colourful and arresting these images are.

(Aside from any educated guesses we might make based on the subject matter of the book of course.)

(image omitted)

See more from this book: “Historia de las Indias de Nueva-España y islas de Tierra Firme…” (1867)

Next steps

We plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray. Our intention is to use this data to train automated classifiers that will run against the whole of the content. The data from this will be as openly licensed as is sensible (given the nature of crowdsourcing) and the code, as always, will be under an open licence.

The manifests of images, with descriptions of the works that they were taken from, are available on github and are also released under a public-domain ‘licence’. This set of metadata being on github should indicate that we fully intend people to work with it, to adapt it, and to push back improvements that should help others work with this release.

There are very few datasets of this nature free for any use and by putting it online we hope to stimulate and support research concerning printed illustrations, maps and other material not currently studied. Given that the images are derived from just 65,000 volumes and that the library holds many millions of items.

If you need help or would like to collaborate with us, please contact us on email, or twitter (or me personally, on any technical aspects)

Think about the numbers. One million images from 65,000 volumes. The British Library holds millions of items.

Encourage more releases like this one by making good use of this release and offering suggestions for it!

December 12, 2013

Harvard gives new meaning to meritocracy

Filed under: Data,Government,IT — Patrick Durusau @ 7:00 pm

Harvard gives new meaning to meritocracy by Kaiser Fung.

From the post:

Due to the fastidious efforts of Professor Harvey Mansfield, Harvard has confirmed the legend that “the hard part is to get in”. Not only does it appear impossible to flunk out but according to the new revelation (link), the median grade given is A- and “the most frequently awarded grade at Harvard College is actually a straight A”.

The last sentence can be interpreted in two ways. If “straight A” means As across the board, then he is saying a lot of graduates end up with As in all courses taken. If “straight A” is used to distinguish between A and A-, then all he is saying is that the median grade is A- and the mode is A. Since at least 50% of the grades given are A or A- and there are more As than A-s, there would be at least 25% As, possibly a lot more.

Note also that the median being A- tells us nothing about the bottom half of the grades. If no professor even gave out anything below an A-, the median would still be A-. If such were to be the case, then the 5th percentile, 10th percentile, 25th percentile, etc. would all be A-.

For full disclosure, Harvard should tell us what proportion of grades are As and what proportion are A-s.

And to think, I complain about government contractors having a sense of entitlement, divorced from their performance.

Looks like that is also true for all those Harvard (and other) graduates that are now employed by the U.S. government.

Nothing you or I can do about it but something you need to take into account when dealing with the U.S. government.

I keep hoping that some department, agency, government or government in waiting will become interested in weapons grade IT.

Reasoning that when other departments, agencies, governments or governments in waiting start feeling the heat, it may set off an IT arms race.

Not the waste for the sake of waste sort of arms race we had in the 1960’s but one with real winners and losers.

Patent database of 15 million chemical structures goes public

Filed under: Cheminformatics,Data — Patrick Durusau @ 3:55 pm

Patent database of 15 million chemical structures goes public by Richard Van Noorden.

From the post:

The internet’s wealth of free chemistry data just got significantly larger. Today, the European Bioinformatics Institute (EBI) has launched a website — www.surechembl.org — that allows anyone to search through 15 million chemical structures, extracted automatically by data-mining software from world patents.

The initiative makes public a 4-terabyte database that until now had been sold on a commercial basis by a software firm, SureChem, which is folding. SureChem has agreed to transfer its information over to the EBI — and to allow the institute to use its software to continue extracting data from patents.

“It is the first time a world patent chemistry collection has been made publicly available, marking a significant advance in open data for use in drug discovery,” says a statement from Digital Science — the company that owned SureChem, and which itself is owned by Macmillan Publishers, the parent company of Nature Publishing Group.

This is one of those Selling Data opportunities that Vincent Granville was talking about.

You can harvest data here, combine it (hopefully using a topic map) with other data and market the results. Not everyone who has need for the data has the time or skills required to re-package the data.

What seems problematic to me is how to reach potential buyers of information.

If you produce data and license it to one of the large data vendors, what’s the likelihood your data will get noticed?

On the other hand, direct sale of data seems like a low percentage deal.

Suggestions?

December 11, 2013

Semantic Web Rides Into the Sunset

Filed under: CSV,Data,XQuery — Patrick Durusau @ 8:16 pm

W3C’s Semantic Web Activity Folds Into New Data Activity by Jennifer Zaino.

From the post:

The World Wide Web Consortium has headline news today: The Semantic Web, as well as eGovernment, Activities are being merged and superseded by the Data Activity, where Phil Archer serves as Lead. Two new workgroups also have been chartered: CSV on the Web and Data on the Web Best Practices.

The new CSV on the Web Working Group is an important step in that direction, following on the heels of efforts such as R2RML. It’s about providing metadata about CSV files, such as column headings, data types, and annotations, and, with it, making it easily possible to convert CSV into RDF (or other formats), easing data integration. “The working group will define a metadata vocabulary and then a protocol for how to link data to metadata (presumably using HTTP Link headers) or embed the metadata directly. Since the links between data and metadata can work in either direction, the data can come from an API that returns tabular data just as easily as it can a static file,” says Archer. “It doesn’t take much imagination to string together a tool chain that allows you to run SPARQL queries against ’5 Star Data’ that’s actually published as a CSV exported from a spreadsheet.”

The Data on the Web Best Practices working group, he explains, will not define any new technologies but will guide data publishers (government, research scientists, cultural heritage organizations) to better use the Web as a data platform. Additionally, the Data Activity, as well as the new Digital Publishing Activity that will be led by former Semantic Web Activity Lead Ivan Herman, are now in a new domain called the Information and Knowledge Domain (INK), led by Ralph Swick.

I will spare you all the tedious justification by Phil Archer of the Semantic Web venture.

The W3C is also the home of XSLT, XPath, XQuery, and other standards that require no defense or justification.

Maybe we will all get lucky and the CSV on the Web and Data on the Web Best Practices working groups will be successful at the W3C.
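To see why metadata about a CSV file matters, here is a toy sketch of pairing a CSV with a small metadata description and emitting triple-like statements from it. The vocabulary and URIs are invented for illustration; they are not the working group's format, which had not been defined at the time:

```python
import csv
import io

# Invented metadata: what each column means and the property URI it maps to.
metadata = {
    "about": "http://example.org/stations/",
    "columns": {
        "station": {"property": "http://example.org/vocab/name", "type": str},
        "temp_c": {"property": "http://example.org/vocab/temperature", "type": float},
    },
}

data = "station,temp_c\nOxford,9.5\nRome,14.2\n"

for row in csv.DictReader(io.StringIO(data)):
    subject = metadata["about"] + row["station"]
    for column, desc in metadata["columns"].items():
        value = desc["type"](row[column])
        print(f"<{subject}> <{desc['property']}> {value!r}")
```

The interesting part is that the metadata, not the CSV, carries the semantics; the same tabular file can be re-used by anyone who has the description.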

December 10, 2013

What’s Not There: The Odd-Lot Bias in TAQ Data

Filed under: Data,Finance Services — Patrick Durusau @ 2:07 pm

What’s Not There: The Odd-Lot Bias in TAQ Data by Maureen O’Hara, Chen Yao, and Mao Ye.

Abstract:

We investigate the systematic bias that arises from the exclusion of trades for less than 100 shares from TAQ data. In our sample, we find that the median number of missing trades per stock is 19%, but for some stocks missing trades are as high as 66% of total transactions. Missing trades are more pervasive for stocks with higher prices, lower liquidity, higher levels of information asymmetry and when volatility is low. We show that odd lot trades contribute 30 % of price discovery and trades of 100 shares contribute another 50%, consistent with informed traders splitting orders into odd-lots and smaller trade sizes. The truncation of odd-lot trades leads to a significant bias for empirical measures such as order imbalance, challenges the literature using trade size to proxy individual trades, and biases measures of individual sentiment. Because odd-lot trades are more likely to arise from high frequency traders, we argue their exclusion from TAQ and the consolidated tape raises important regulatory issues.

TAQ = Trade and Quote Detail.

Amazing what you can find if you go looking for it. O’Hara and friends find that missing trades can be as much as 66% of the total transactions for some stocks.

The really big news is that, prompted by this academic paper, US regulators required disclosure of this hidden data starting on December 9, 2013.

For access, see the Daily TAQ, where you will find the raw data for $1,500 per year for one user.

Despite its importance to the public, I don’t know of any time-delayed public archive of trade data.

Format specifications and sample data are available for:

  • Daily Trades File: Every trade reported to the consolidated tape, from all CTA participants. Each trade identifies the time, exchange, security, volume, price, sale condition, and more.
  • Daily TAQ Master File (Beta): (specification only)
  • Daily TAQ Master File: All master securities information in NYSE-listed and non-listed stocks, including Primary Exchange Indicator
  • Daily Quote and Trade Admin Message File: All Limit-up/Limit-down Price Band messages published on the CTA and UTP trade and quote feeds. The LULD trial is scheduled to go live with phase 1 on April 8, 2013.
  • Daily NBBO File: An addendum to the Daily Quotes file, containing continuous National Best Bid and Offer updates and consolidated trades and quotes for all listed and non-listed issues.
  • Daily Quotes File: Every quote reported to the consolidated tape, from all CTA participants. Each quote identifies the time, exchange, security, bid/ask volumes, bid/ask prices, NBBO indicator, and more.
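As a rough illustration of the truncation the paper measures, here is a sketch that computes the odd-lot share of trades per symbol. It assumes a pandas DataFrame with symbol and size columns rather than the actual Daily TAQ record layout:

```python
import pandas as pd

def odd_lot_share(trades: pd.DataFrame) -> pd.Series:
    """Fraction of trades under 100 shares, per symbol."""
    odd = trades["size"] < 100
    return odd.groupby(trades["symbol"]).mean().sort_values(ascending=False)

# Toy example: one symbol trades mostly in odd lots, the other does not.
trades = pd.DataFrame({
    "symbol": ["GOOG", "GOOG", "GOOG", "F", "F"],
    "size": [37, 50, 120, 500, 1000],
})
print(odd_lot_share(trades))
```

Once sub-100-share trades appear in the consolidated feed, the same two columns let you see how large the previously missing slice was.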

Merging financial data with other data, property transactions/ownership, marriage/divorce, and other activities are a topic map activity.

December 9, 2013

ROpenSci

Filed under: Data,Dublin Core,R,Science — Patrick Durusau @ 1:00 pm

ROpenSci

From the webpage:

At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. We hope that our tools will not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers. While all the pieces for connecting researchers with these data sources exist as disparate entities, our efforts will provide a unified framework that will quickly connect researchers to open data.

More than twenty (20) R packages are available today!

Great for data mining your favorite science data repository, but that isn’t the only reason I mention them.

One of the issues for topic maps has always been how to produce the grist for a topic map mill. There is a lot of data and production isn’t a thrilling task. 😉

But what if we could automate that production, at least to a degree?

The search functions in Treebase offer several examples of where auto-generation of semantics would benefit both the data set and potential users.

In Treebase: An R package for discovery, access and manipulation of online phylogenies, Carl Boettiger and Duncan Temple Lang point out that Treebase has search functions for “author” and “subject.”

Err, but Dublin Core 1.1 refers to authors as “creators.” And “subject,” for Treebase means: “Matches in the subject terms.”

The ACM would say “keywords,” as would many others, instead of “subject.”

Not a great semantic speed bump,* but one that, if left unnoticed, will result in poorer, not richer, search results.

What if, for an R package like Treebase, a user could request what is identified by a field?

That is, in addition to the fields being returned, one or more key/value pairs would be returned for each field, defining what is identified by that field.

For example, for “author” an --iden switch could return:

Author semantics:

  • Creator: http://purl.org/dc/elements/1.1/creator
  • Author/Creator: http://catalog.loc.gov/help/author-keyword.htm

and so on, perhaps even including identifiers in other languages.

While this step only addresses identifying what a field identifies, it would be a first step towards documenting identifiers that could be used over and over again to improve access to scientific data.

Future changes, and we know there will be future changes, are accommodated by simply appending to the currently documented identifiers.

Document identifier mappings once, Reuse identifier mappings many times.
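A minimal sketch of what such a documented, reusable mapping might look like in code. The Dublin Core and Library of Congress entries come from the example above; everything else is illustrative:

```python
# One-time documentation of what each search field identifies.
# New identifiers are appended as they turn up; nothing is overwritten.
FIELD_IDENTIFIERS = {
    "author": [
        ("Creator", "http://purl.org/dc/elements/1.1/creator"),
        ("Author/Creator", "http://catalog.loc.gov/help/author-keyword.htm"),
    ],
}

def identifiers_for(field: str):
    """The '--iden' idea: return the documented identifiers for a field."""
    return FIELD_IDENTIFIERS.get(field, [])

for label, identifier in identifiers_for("author"):
    print(f"{label}: {identifier}")
```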

PS: The mapping I suggest above is a blind mapping; no information is given about “why” I thought the alternatives given were alternatives to the main entry “author.”

Blind mappings are sufficient for many cases but are terribly insufficient for others. Biological taxonomies, for example, do change and capturing what characteristics underlie a particular mapping may be important in terms of looking forwards or backwards from some point in time in the development of a taxonomy.

* I note for your amusement that Wikipedia offers “vertical deflection traffic calming devices,” as a class that includes “speed bump, speed hump, speed cushion, and speed table.”

Like many Library of Congress subject headings, “vertical deflection traffic calming devices” doesn’t really jump to mind when doing a search for “speed bump.” 😉

Quandl exceeds 7.8 million datasets!

Filed under: Data,Dataset,Marketing — Patrick Durusau @ 9:01 am

From The R Backpages 2 by Joseph Rickert.

From the post:

Quandl continues its mission to seek out and make available the world’s financial and econometric data. Recently added data sets include:

That’s a big jump since our last post when Quandl broke 5 million datasets! (April 30, 2013)

Any thoughts on how many of these datasets have semantic mapping data to facilitate their re-use and/or combination with other datasets?

Selling the mapping data might be a tough sell because the customer still has to make intelligent use of it.

Selling mapped data, on the other hand, that is, offering consolidation of specified data sets on a daily, weekly, or monthly basis, might be a different story.

Something to think about.

PS: Do remember that a documented mapping for any dataset at Quandl will work for that same dataset elsewhere. So you won’t be re-discovering the mapping every time a request comes in for that dataset.

Not a “…butts in seats…” approach but then you probably aren’t a prime contractor.

December 7, 2013

Free GIS Data

Filed under: Data,GIS,Mapping — Patrick Durusau @ 2:13 pm

Free GIS Data by Robin Wilson.

Over 300 GIS data sets. As of 7 December 2013, last updated 6 December 2013.

A very wide ranging collection of “free” GIS data.

Robin recommends you check the licenses of individual data sets. The meaning of “free” varies from person to person.

If you discover “free” GIS resources not listed on Robin’s page, drop him a note.

I first saw this in Pete Warden’s Five Short Links for November 30, 2013.

Think Tank Review

Filed under: Data,EU,Government — Patrick Durusau @ 11:47 am

Think Tank Review by Central Library of the General Secretariat of the EU Council.

The title could mean a number of things so when I saw it at Full Text Reports, I followed it.

From the first page:

Welcome to issue 8 of the Think Tank Review compiled by the Council Library.* It references papers published in October 2013. As usual, we provide the link to the full text and a short abstract.

The current Review and past issues can be downloaded from the Intranet of the General Secretariat of the Council or requested to the Library.

A couple of technical points: the Think Tank Review will soon be made available – together with other bibliographic and research products from the Library – on our informal blog at http://www.councillibrary.wordpress.com. A Beta version is already online for you to comment.

More broadly, in the next months we will be looking for ways to disseminate the contents of the Review in a more sophisticated way than the current – admittedly spartan – collection of links cast in a pdf format. We will look at issues such as indexing, full text search, long-term digital preservation, ease of retrieval and readability on various devices. Ideas from our small but faithful community of readers are welcome. You can reach us at central.library@consilium.europa.eu.

I’m not a policy wonk, so scanning the titles didn’t excite me, but it might excite you or (more importantly) one of your clients.

It seemed like an odd enough resource that you may not encounter it by chance.

December 6, 2013

…: Selling Data

Filed under: Data,Marketing — Patrick Durusau @ 8:02 pm

A New Source of Revenue for Data Scientists: Selling Data by Vincent Granville.

From the post:

What kind of data is salable? How can data scientists independently make money by selling data that is automatically generated: raw data, research data (presented as customized reports), or predictions. In short, using an automated data generation / gathering or prediction system, working from home with no boss and no employee, and possibly no direct interactions with clients. An alternate career path that many of us would enjoy!

Vincent gives a number of examples of companies selling data right now, some possible data sources, startup ideas and pointers to articles on data scientists.

Vincent makes me think there are at least three ways to sell topic maps:

  1. Sell people on using topic maps so they can produce high quality data through the use of topic maps.
  2. Sell people on hiring you to construct a topic map system so they can produce high quality data.
  3. Sell people high quality data because you are using a topic map.

Not everyone who likes filet mignon (#3) wants to raise the cow (#1) and/or butcher the cow (#2).

It is more expensive to buy filet mignon, but it also lowers the odds of stepping in cow manure and/or blood.

What data would you buy?

December 4, 2013

Free Language Lessons for Computers

Filed under: Data,Language,Natural Language Processing — Patrick Durusau @ 4:58 pm

Free Language Lessons for Computers by Dave Orr.

From the post:

50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year. In many cases including pointers to additional information on the datasets.

One that I have seen before and that strikes me as particularly relevant to topic maps is:

Dictionaries for linking Text, Entities, and Ideas

What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine the relevant concepts for a domain being automatically “tagged” as you composed prose about it.

Certainly less error-prone than marking concepts by hand!
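A toy sketch of the "tag as you compose" idea; the dictionary entries below are made up, while the real file maps 175 million strings to Wikipedia concepts with counts:

```python
# string -> (concept, count) pairs; in the real data the counts come from anchor text.
DICTIONARY = {
    "topic maps": [("Topic_map", 912)],
    "big data": [("Big_data", 15403)],
    "tm": [("Topic_map", 48), ("Trademark", 1675)],
}

def tag(text):
    """Return (string, most frequent concept) pairs found in the text."""
    lowered = text.lower()
    found = []
    for phrase, candidates in DICTIONARY.items():
        # Naive substring match; a real tagger would tokenize and score in context.
        if phrase in lowered:
            best = max(candidates, key=lambda c: c[1])
            found.append((phrase, best[0]))
    return found

print(tag("Big data and topic maps keep turning up together."))
```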

December 3, 2013

Five Stages of Data Grief

Filed under: Data,Data Quality — Patrick Durusau @ 2:15 pm

Five Stages of Data Grief by Jeni Tennison.

From the post:

As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.

Data analysts have a maxim:

If you don’t think you have a quality problem with your data, you haven’t looked at it yet.

Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.

But for the people who have collected and maintained such data — or more frequently their managers, who don’t work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that what the data curators need to go through was something like the five stages of grief described by the Kübler-Ross model.

Jeni covers the five stages of grief from a data quality standpoint and offers a sixth stage. (No spoilers follow, read her post.)

Correcting input/transformation errors is one level of data cleaning.

But the near-collapse of HealthCare.gov shows how streams of “clean” data can combine into a large pool of “dirty” data.

Every contributor supplied “clean” data, but when combined with other “clean” data, confusion was the result.

“Clean” data is an ongoing process at two separate levels:

Level 1: Traditional correction of input/transformation errors (as per Jeni).

Level 2: Preparation of data for transformation into “clean” data for new purposes.

The first level is familiar.

The second we all know as ad-hoc ETL.

Enough knowledge is gained to make a transformation work, but that knowledge isn’t passed on with the data or more generally.

Or as we all learned from television: “Lather, rinse, repeat.”

A good slogan if you are trying to maximize sales of shampoo, but a wasteful one when describing ETL for data.

What if data curators captured the knowledge required for ETL, making every subsequent ETL less resource intensive and less error prone?

I think that would qualify as data cleaning.

You?
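One way to capture that ETL knowledge so it travels with the data is to record the column mappings and transformations as data instead of burying them in a one-off script. A minimal sketch, with invented field names:

```python
import pandas as pd

# The reusable knowledge: what each incoming column means and how to normalize it.
MAPPING = {
    "cust_nm": {"target": "customer_name", "transform": str.strip},
    "amt_usd": {"target": "amount", "transform": float},
    "ord_date": {"target": "order_date", "transform": pd.to_datetime},
}

def apply_mapping(raw: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    clean = pd.DataFrame()
    for source, rule in mapping.items():
        clean[rule["target"]] = raw[source].map(rule["transform"])
    return clean

raw = pd.DataFrame({"cust_nm": [" Acme "], "amt_usd": ["19.99"], "ord_date": ["2013-12-03"]})
print(apply_mapping(raw, MAPPING))
```

The MAPPING table is the part worth curating: document it once and the next person who receives the same feed starts from a known transformation instead of rediscovering it.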

November 28, 2013

2013 Arrives! (New Crawl Data)

Filed under: Common Crawl,Data,Dataset,WWW — Patrick Durusau @ 10:56 am

New Crawl Data Available! by Jordan Mendelson.

From the post:

We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group.

Format Changes

We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. There are also many good open source tools for working with WARC files.

We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.


We have switched our text file format from Hadoop sequence files to WET files (WARC Encapsulated Text) that properly reference the original requests. This makes it far easier for your processes to disambiguate which text extracts belong to which specific page fetches.

Jordan continues to outline the directory structure of the 2013 crawl data and lists additional resources that will be of interest.

If you aren’t Google or some reasonable facsimile thereof (yet), the Common Crawl data set is your doorway into the wild wild content of the WWW.

How do your algorithms fare when matched against the full range of human expression?
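If you want to poke at the new format, warcio is one Python library (it post-dates this post) that reads WARC files; a sketch, with a placeholder file name:

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholder file name: substitute a segment downloaded from the crawl bucket.
with open("commoncrawl-segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```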

November 27, 2013

…Features from YouTube Videos…

Filed under: Data,Machine Learning,Multiview Learning — Patrick Durusau @ 1:30 pm

Released Data Set: Features Extracted from YouTube Videos for Multiview Learning by Omid Madani.

From the post:

“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”

The “duck test”.

Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.

We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled by one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows a gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.

The concept of multiview learning is clear enough but the term was unfamiliar.

In that regard, you may want to read: A Survey on Multi-view Learning by Chang Xu, Dacheng Tao, Chao Xu.

Abstract:

In recent years, a great many methods of learning from multi-view data by considering the diversity of different views have been proposed. These views may be obtained from multiple sources or different feature subsets. In trying to organize and highlight similarities and differences between the variety of multi-view learning approaches, we review a number of representative multi-view learning algorithms in different areas and classify them into three groups: 1) co-training, 2) multiple kernel learning, and 3) subspace learning. Notably, co-training style algorithms train alternately to maximize the mutual agreement on two distinct views of the data; multiple kernel learning algorithms exploit kernels that naturally correspond to different views and combine kernels either linearly or non-linearly to improve learning performance; and subspace learning algorithms aim to obtain a latent subspace shared by multiple views by assuming that the input views are generated from this latent subspace. Though there is significant variance in the approaches to integrating multiple views to improve learning performance, they mainly exploit either the consensus principle or the complementary principle to ensure the success of multi-view learning. Since accessing multiple views is the fundament of multi-view learning, with the exception of study on learning a model from multiple views, it is also valuable to study how to construct multiple views and how to evaluate these views. Overall, by exploring the consistency and complementary properties of different views, multi-view learning is rendered more effective, more promising, and has better generalization ability than single-view learning.

Be forewarned that the survey runs 59 pages and has 9 1/2 pages of references. Not something you take home for a quick read. 😉
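As a toy illustration of the multiview idea (simple late fusion, not any particular algorithm from the survey): train one classifier per feature family and average their predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)

# Two made-up "views" of the same items, e.g. textual and auditory features.
text_view = rng.normal(labels[:, None], 1.0, (n, 5))
audio_view = rng.normal(labels[:, None], 2.0, (n, 3))

views = (text_view, audio_view)
models = [LogisticRegression().fit(view, labels) for view in views]

# Late fusion: average the per-view probabilities of the positive class.
fused = np.mean([m.predict_proba(v)[:, 1] for m, v in zip(models, views)], axis=0)
print("fused accuracy on the training views:", ((fused > 0.5) == labels).mean())
```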

November 14, 2013

Data Repositories…

Filed under: Data,Data Repositories,Dataset — Patrick Durusau @ 6:00 pm

Data Repositories – Mother’s Milk for Data Scientists by Jerry A. Smith.

From the post:

Mothers are life givers, giving the milk of life. While there are so very few analogies so apropos, data is often considered the Mother’s Milk of Corporate Valuation. So, as a data scientist, we should treat dearly all those sources of data, understanding their place in the overall value chain of corporate existence.

A Data Repository is a logical (and sometimes physical) partitioning of data where multiple databases which apply to specific applications or sets of applications reside. For example, several databases (revenues, expenses) which support financial applications (A/R, A/P) could reside in a single financial Data Repository. Data Repositories can be found both internal (e.g., in data warehouses) and external (see below) to an organization. Here are a few repositories from KDnuggets that are worth taking a look at: (emphasis in original)

I count sixty-four (64) collections of data sets as of today.

What I haven’t seen, perhaps you have, is an index across the most popular data set collections that dedupes data sets and has thumb-nail information for each one.

Suggested indexes across data set collections?
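A crude sketch of the deduping step such an index would need, using normalized titles as the merge key; real collections would need stronger matching than this:

```python
from collections import defaultdict

# (collection, dataset title) pairs scraped from several lists; invented examples.
listings = [
    ("KDnuggets", "Common Crawl"),
    ("Peter Skomoroch", "common crawl "),
    ("Hilary Mason", "Enron Email Dataset"),
    ("KDnuggets", "Enron email dataset"),
]

index = defaultdict(set)
for collection, title in listings:
    index[" ".join(title.lower().split())].add(collection)

for title, collections in sorted(index.items(), key=lambda kv: -len(kv[1])):
    print(f"{title}: listed in {len(collections)} collection(s): {sorted(collections)}")
```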

November 7, 2013

Enhancing Time Series Data by Applying Bitemporality

Filed under: Data,Time,Time Series,Timelines,Topic Maps — Patrick Durusau @ 5:30 pm

Enhancing Time Series Data by Applying Bitemporality (It’s not just what you know, it’s when you know it) by Jeffrey Shmain.

A “white paper,” with all that implies, but it raises the interesting question of setting time boundaries for the validity of data.

From the context of the paper, “bitemporality” means setting a start and end time for the validity of some unit of data.

We all know the static view of the world presented by most data systems is false. But it works well enough in some cases.

The problem is that most data systems don’t allow you to choose static versus some other view of the world.

In part because, to get a non-static view of the world, you have to modify your data system (often not a good idea) or migrate to another data system (which is expensive and not risk free).

Jeffrey remarks in the paper that “all data is time series data” and he’s right. Data arrives at time X, was sent at time T, was logged at time Y, was seen by the CIO at Z, etc. To say nothing of tracking changes to that data.

Not all cases require that much detail but if you need it, wouldn’t it be nice to have?

Your present system may limit you to static views, but topic maps can enhance your system in place, avoiding the dangers of upgrading and/or migrating into unknown perils and hazards.

When did you know you needed time based validity for your data?

For a slightly more technical view of bitemporality, see the piece authored by Robbert van Dalen.
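Here is a small sketch of the "when you know it" part: each fact carries both a valid-time interval and the time it was recorded, so you can ask what the system believed as of any date. The field names and values are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Fact:
    key: str
    value: float
    valid_from: date      # when the value became true in the world
    valid_to: date        # when it stopped being true
    recorded_at: date     # when the system learned it

history = [
    Fact("rating", 0.72, date(2013, 1, 1), date(2013, 6, 30), recorded_at=date(2013, 1, 5)),
    # A later correction: the value was actually different for the same period.
    Fact("rating", 0.68, date(2013, 1, 1), date(2013, 6, 30), recorded_at=date(2013, 8, 2)),
]

def value_as_of(facts, on: date, known_by: date):
    """The latest value valid on `on`, using only what was recorded by `known_by`."""
    candidates = [f for f in facts
                  if f.valid_from <= on <= f.valid_to and f.recorded_at <= known_by]
    return max(candidates, key=lambda f: f.recorded_at).value if candidates else None

print(value_as_of(history, on=date(2013, 3, 1), known_by=date(2013, 2, 1)))  # 0.72
print(value_as_of(history, on=date(2013, 3, 1), known_by=date(2013, 9, 1)))  # 0.68
```

Nothing here requires replacing your data system; the extra columns can ride alongside the data you already have.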

Dealing with Data in the Hadoop Ecosystem…

Filed under: Cloudera,Data,Hadoop — Patrick Durusau @ 1:15 pm

Dealing with Data in the Hadoop Ecosystem – Hadoop, Sqoop, and ZooKeeper by Rachel Roumeliotis.

From the post:

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram discuss dealing with data in the Hadoop ecosystem. Highlights from their conversation:

  • ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
  • Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
  • Sqoop is a bulk data transfer tool [Discussed at 2:47]
  • Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
  • ZooKeeper is not for storage, but coordination, reliability, availability [Discussed at 4:44]
Conference interview, so not deep, but interesting.

For example, it was reported that 44% of production errors could be traced to misconfiguration errors.

November 5, 2013

Exoplanets.org

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:45 pm

Exoplanets.org

From the homepage:

The Exoplanet Data Explorer is an interactive table and plotter for exploring and displaying data from the Exoplanet Orbit Database. The Exoplanet Orbit Database is a carefully constructed compilation of quality, spectroscopic orbital parameters of exoplanets orbiting normal stars from the peer-reviewed literature, and updates the Catalog of nearby exoplanets.

A detailed description of the Exoplanet Orbit Database and Explorers is published here and is available on astro-ph.

In addition to the Exoplanet Data Explorer, we have also provided the entire Exoplanet Orbit Database in CSV format for a quick and convenient download here. A list of all archived CSVs is available here.

Help and documentation for the Exoplanet Data Explorer is available here. A FAQ and overview of our methodology is here, including answers to the questions “Why isn’t my favorite planet/datum in the EOD?” and “Why does site X list more planets than this one?”.

A small data set but an important one nonetheless.

I would point out that the term “here” occurs five (5) times with completely different meanings.

It’s a small thing, but had:

Help and documentation for the Exoplanet Data Explorer is available <a href="http://exoplanets.org/help/common/data">here</a>

been:

<a href="http://exoplanets.org/help/common/data">Exoplanet Data Explorer help and documentation</a>

even a not very bright search engine might have done a better job of indexing the page.

Please avoid labeling links with “here.”

November 3, 2013

Download Cooper-Hewitt Collections Data

Filed under: Data,Museums — Patrick Durusau @ 5:32 pm

Download Cooper-Hewitt Collections Data

From the post:

Cooper-Hewitt is committed to making its collection data available for public access. To date, we have made public approximately 60% of the documented collection available online. Whilst we have a web interface for searching the collection, we are now also making the dataset available for free public download. By being able to see “everything” at once, new connections and understandings may emerge.

What is it?

The download contains only text metadata, or “tombstone” information—a brief object description that includes temporal, geographic, and provenance information—for over 120,000 objects.

Is it complete?

No. The data is only tombstone information. Tombstone information is the raw data that is created by museum staff at the time of acquisition for recording the basic ‘facts’ about an object. As such, it is unedited. Historically, museum staff have used this data only for identifying the object, tracking its whereabouts in storage or exhibition, and for internal report and label creation. Like most museums, Cooper-Hewitt had never predicted that the public might use technologies, such as the web, to explore museum collections in the way that they do now. As such, this data has not been created with a “public audience” in mind. Not every field is complete for each record, nor is there any consistency in the way in which data has been entered over the many years of its accumulation. Considerable additional information is available in research files that have not yet been digitized and, as the research work of the museum is ongoing, the records will continue to be updated and change over time.

Which all sounds great, if you know what the Cooper-Hewitt collection houses.

From the about page:

Smithsonian’s Cooper-Hewitt, National Design Museum is the only museum in the nation devoted exclusively to historic and contemporary design. The Museum presents compelling perspectives on the impact of design on daily life through active educational and curatorial programming.

It is the mission of Cooper-Hewitt’s staff and Board of Trustees to advance the public understanding of design across the thirty centuries of human creativity represented by the Museum’s collection. The Museum was founded in 1897 by Amy, Eleanor, and Sarah Hewitt—granddaughters of industrialist Peter Cooper—as part of The Cooper Union for the Advancement of Science and Art. A branch of the Smithsonian since 1967, Cooper-Hewitt is housed in the landmark Andrew Carnegie Mansion on Fifth Avenue in New York City.

The campus also includes two historic townhouses renovated with state-of-the-art conservation technology and a unique terrace and garden. Cooper-Hewitt’s collections include more than 217,000 design objects and a world-class design library. Its exhibitions, in-depth educational programs, and on-site, degree-granting master’s program explore the process of design, both historic and contemporary. As part of its mission, Cooper-Hewitt annually sponsors the National Design Awards, a prestigious program which honors innovation and excellence in American design. Together, these resources and programs reinforce Cooper-Hewitt’s position as the preeminent museum and educational authority for the study of design in the United States.

Even without images, I can imagine enhancing library catalog holdings with annotations about particular artifacts being located at the Cooper-Hewitt.

October 29, 2013

A Checklist for Creating Data Products

Filed under: Data,Marketing — Patrick Durusau @ 6:30 pm

A Checklist for Creating Data Products by Zach Gemignani.

From the post:

Are you sitting on a gold mine — if only you could transform your unique data into a valuable, monetizable data product?

Over the years, we’ve worked with dozens of clients to create applications that refine data and package the results in a form users will love. We often talk with product managers early in the conception phase to help define the target market and end-user needs, even before designing interfaces for presenting and visualizing the data.

In the process, we’ve learned a few lessons and gathered a bunch of useful resources. Download our Checklist for Product Managers of Data Solutions. It is divided into four sections:

  1. Audience: Understand the people who need your data
  2. Data: Define and enhance the data for your solution
  3. Design: Craft an application that solves problems
  4. Delivery: Transition from application to profitable product

Zach and friends have done a good job packing this one-page checklist with helpful hints.

No turn-key solution to riches, but it may spark some ideas that will move you closer to a viable data product.
