Archive for the ‘Microdata’ Category

Black Friday Dreaming with Bob DuCharme

Saturday, November 15th, 2014

Querying aggregated Walmart and BestBuy data with SPARQL by Bob DuCharme.

From the post:

The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I’ve been learning more about microdata recently, but even before I did, I found that the W3C’s Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product’s web page, this makes some interesting queries possible to compare prices and other information from the two vendors.

Bob’s use of SPARQL won’t be ready for this coming Black Friday, but perhaps for some Black Friday in the future?

One can imagine “blue light specials” being input by shoppers on location and driving traffic patterns at the larger malls.

Well worth your time to see where Bob was able to get using public tools.
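To make the idea concrete, here is a minimal sketch of the general approach: pull property/value pairs out of microdata markup and compare prices across two sources. It is not Bob's method and not the algorithm used by the W3C distiller. It uses only Python's standard library, handles only flat (non-nested) items, and produces plain tuples rather than real RDF, so there is no SPARQL involved. The store names, the markup in SAMPLE_PAGES, and the helper names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; not taken from any retailer's site.
SAMPLE_PAGES = {
    "store-a": (
        '<div itemscope itemtype="http://schema.org/Product">'
        '<span itemprop="name">Example TV</span>'
        '<meta itemprop="price" content="299.99">'
        "</div>"
    ),
    "store-b": (
        '<div itemscope itemtype="http://schema.org/Product">'
        '<span itemprop="name">Example TV</span>'
        '<span itemprop="price">279.50</span>'
        "</div>"
    ),
}


class FlatMicrodataParser(HTMLParser):
    """Collects (subject, property, value) tuples from non-nested microdata."""

    def __init__(self, subject):
        super().__init__()
        self.subject = subject
        self.triples = []
        self._pending = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.triples.append((self.subject, "type", attrs["itemtype"]))
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # The value is carried in the content attribute (e.g. <meta>).
            self.triples.append((self.subject, prop, attrs["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((self.subject, self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        # An element that closed without text leaves no pending property.
        self._pending = None


def extract(pages):
    """Parse every page and return one combined list of tuples."""
    triples = []
    for subject, html in pages.items():
        parser = FlatMicrodataParser(subject)
        parser.feed(html)
        parser.close()
        triples.extend(parser.triples)
    return triples


def cheapest(triples):
    """Return the subject whose 'price' property is numerically lowest."""
    prices = {s: float(v) for s, p, v in triples if p == "price"}
    return min(prices, key=prices.get)


if __name__ == "__main__":
    for triple in extract(SAMPLE_PAGES):
        print(triple)
    print("cheapest:", cheapest(extract(SAMPLE_PAGES)))
```

A real pipeline would use a proper microdata-to-RDF converter and a triple store, which is what makes SPARQL queries over the aggregated data possible.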

I first saw this in a tweet by Ivan Herman.

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Friday, December 14th, 2012

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!


The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date: January-June 2012
Total Data: 40.1 Terabyte (compressed)
Parsed HTML URLs: 3,005,629,093
URLs with Triples: 369,254,196
Domains in Crawl: 40,600,000
Domains with Triples: 2,286,277
Typed Entities: 1,811,471,956
Triples: 7,350,953,995

See also:

Web Data Commons Extraction Report – August 2012 Corpus


Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

Where the authors report:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.

PLDs = Pay-Level-Domains.

The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.

Or to put it differently, if RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?
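For anyone who wants to check the arithmetic, the percentages follow directly from the figures quoted above. The short sketch below recomputes them. The variable names are mine, and the website counts are the rounded figures from the quoted report rather than exact values.

```python
# Figures as quoted from the Web Data Commons report (rounded by its authors).
WEBSITES = 40_500_000  # pay-level domains in the corpus
USERS = {"RDFa": 519_000, "Microdata": 140_000, "Microformats": 1_700_000}

# Page counts from the extraction statistics table.
PARSED_URLS = 3_005_629_093
URLS_WITH_TRIPLES = 369_254_196


def share(count, total):
    """Percentage of total, rounded to two decimal places."""
    return round(100 * count / total, 2)


site_shares = {name: share(n, WEBSITES) for name, n in USERS.items()}
page_share = round(100 * URLS_WITH_TRIPLES / PARSED_URLS, 1)
ratio = USERS["Microformats"] / (USERS["RDFa"] + USERS["Microdata"])

if __name__ == "__main__":
    print(site_shares)
    print(page_share)
    print(round(ratio, 2))
```

The RDFa share comes out at 1.28%, and Microformats lead RDFa and Microdata combined by a factor of roughly 2.6, which the report rounds to "approximately 2.5".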

I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.

Web Data Commons

Thursday, March 22nd, 2012

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdatas and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of million web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing, as it frees most of us from crawling the web and from extracting structured data from it. Or at least it gives us the basis for more pointed crawling of the web.

A very welcome development!

W3C HTML Data Task Force Publishes 2 Notes

Tuesday, March 13th, 2012

W3C HTML Data Task Force Publishes 2 Notes

From the post:

The W3C HTML Data Task Force has published two notes, the HTML Data Guide and Microdata to RDF. According to the abstract of the former, “This guide aims to help publishers and consumers of HTML data use it well. With several syntaxes and vocabularies to choose from, it provides guidance about how to decide which meets the publisher’s or consumer’s needs. It discusses when it is necessary to mix syntaxes and vocabularies and how to publish and consume data that uses multiple formats. It describes how to create vocabularies that can be used in multiple syntaxes and general best practices about the publication and consumption of HTML data.”

One can only hope that the W3C will eventually sanctify industry standard practices for metadata. Perhaps they will call it RDF-NG. Whatever.

rNews is here. And this is what it means.

Friday, February 17th, 2012

rNews is here. And this is what it means. by EVAN SANDHAUS.

From the post:

On January 23rd, 2012, The Times made a subtle change to articles published on nytimes.com. We rolled out phase one of our implementation of rNews – a new standard for embedding machine-readable publishing metadata into HTML documents. Many of our users will never see the change but the change will likely impact how they experience the news.

Far beneath the surface of nytimes.com lurk the databases — databases of articles, metadata and images, databases that took tremendous effort to develop, databases that the world only glimpses through the dark lens of HTML.

After a rather slow lead-in, the crux of the story: the New York Times has been embedding rNews snippets in its news stories since January 23rd, 2012, with the use of rNews to expand in the future.

Interesting result if you follow the request to paste the URL for The Bookstore’s Last Stand into the Google Rich Snippet tool. Go ahead, I’m not going anywhere, try it.

The New York Times has already diverged from the schema that it wants others to follow: “Warning: Page contains property “identifier” which is not part of the schema.”

Earlier in the article Evan notes:

Several extensions to HTML have emerged that allow web publishers to explicitly markup structural metadata. These technologies include Microformats, HTML 5 Microdata and the Resource Description Framework in Attributes (RDFa).

For these technologies to be usefully applied, however, everybody has to agree what things should be called. For example, what The Times calls a “Headline,” a blogger might call a “Title,” and a German publisher might call an “überschrift.”

To use these new technologies for expressing underlying structure, the web publishing industry has to agree on a standard set of names and attributes, not an easy task. (emphasis added)

Using common names whenever possible but adapting (rather than breaking) in the event of change would be a better strategy.

One that would serve the NYT until 2173 and keep articles back to January 23rd 2012 as accessible as the day they were published.
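What "adapting rather than breaking" might look like can be sketched in a few lines. This is only an illustration under my own assumptions: the alias table, the shared name "headline", and the normalize function are hypothetical, not part of rNews or any published schema. The point is that a name outside the agreed set is kept and reported rather than rejected.

```python
# Hypothetical alias table: local property names mapped to one shared name.
ALIASES = {
    "headline": "headline",
    "title": "headline",
    "überschrift": "headline",
}


def normalize(record, aliases=ALIASES):
    """Split a record into recognized properties and preserved leftovers.

    Recognized names are rewritten to the shared name. Names that are
    not in the alias table are kept as-is instead of raising an error.
    """
    known, unknown = {}, {}
    for name, value in record.items():
        key = name.casefold()  # case-insensitive match, including Ü -> ü
        if key in aliases:
            known[aliases[key]] = value
        else:
            unknown[name] = value
    return known, unknown


if __name__ == "__main__":
    print(normalize({"Title": "An example story", "identifier": "42"}))
    print(normalize({"Überschrift": "Ein Beispiel"}))
```

Here the out-of-schema "identifier" property survives alongside the recognized data, so a later revision of the alias table can pick it up without touching the stored articles.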

HTML Data Task Force

Sunday, October 2nd, 2011

HTML Data Task Force, chaired by Jeni Tennison.

Another opportunity to participate in important work at the W3C without a membership. The “details” of getting diverse formats to work together.

Close analysis may show the need for changes to syntaxes, etc., but as far as mapping goes, topic maps can take syntaxes as they are. Could be an opportunity to demonstrate working solutions for actual use cases.

From the wikipage:

This HTML Data Task Force considers RDFa 1.1 and microdata as separate syntaxes, and conducts a technical analysis on the relationship between the two formats. The analysis discusses specific use cases and provide guidance on what format is best suited for what use cases. It further addresses the question how different formats can be used within the same document when required and how data expressed in the different formats can be combined by consumers.

The task force MAY propose modifications in the form of bug reports and change proposals on the microdata and/or RDFa specifications, to help users to easily transition between the two syntaxes or use them together. As with all such comments, the ultimate decisions on implementing these will rest with the respective Working Groups.

Further, the Task Force should also produce a draft specifications of mapping algorithms from an HTML+microdata content to RDF, as well as a mapping of RDFa to microdata’s JSON format. These MAY serve as input documents to possible future recommendation track works. These mappings should be, if possible, generic, i.e., they should not be dependent on any particular vocabulary. A goal for these mappings should be to facilitate the use of both formats with the same vocabularies without creating incompatibilities.

The Task Force will also consider design patterns for vocabularies, and provide guidance on how vocabularies should be shaped to be usable with both microdata and RDFa and potentially with microformats. These patterns MAY lead to change proposals of existing (RDF) vocabularies, and MAY result in general guidelines for the design of vocabularies for structured data on the web, building on existing community work in this area.

The Task Force liaises with the SWIG Web Schemas Task Force to ensure that lessons from real-world experience are incorporated into the Task Force recommendations and that any best practices described by the Task Force are synchronised with real-world practice.

The Task Force conducts its work through the mailing list (use this link to subscribe or look at the public archives), as well as on the #html-data-tf channel of the (public) W3C IRC server.

Microdata and RDFa Living Together in Harmony

Sunday, August 21st, 2011

Microdata and RDFa Living Together in Harmony by Jeni Tennison.

From the post:

One of the options that the TAG put forward when it asked the W3C to put together task force on embedded data in HTML was the co-existence of RDFa and microdata. If that’s what we’re headed for, what might make things easier for consumers and publishers who have to live in that world?

In a situation where there are two competing standards, I think that developers — both on the publication and consumption sides — are going to want to hedge their bets. They will want to avoid being tied to one syntax in case it turns out that that syntax isn’t supported by the majority of publishers/consumers in the long term and they have to switch.

Publishers like us at legislation.gov.uk who are aiming to share their data to whoever is interested in it (rather than having a particular consumer in mind) are also likely to want to publish in both microdata and RDFa, rather than force potential consumers to adopt a particular processing model, and will therefore need to mix the syntaxes within their pages.

Interesting and detailed analysis of the issues of reconciling microdata and RDFa.

Jeni asks if this type of analysis is worthy of something more official than a blog post.

I would say yes. I think this sort of mapping analysis should be published along with any competing format.

You would not frequent a software project that lacks version control.

Why use a data format/annotation that doesn’t provide a mapping to “competing” formats? (The emphasis being on “competing” formats. Not mappings to any possible format but to those in direct competition with the proposed format/annotation system.)

I have no objection to new formats, but if there is an existing format, document its shortcomings and provide a mapping from it to the new format, noting where the mapping fails.

Doesn’t save us from competing formats but it may ease the evaluation and/or co-existence of formats.

From a topic map perspective, such a mapping is just more grist for the mill.