Archive for the ‘Microformats’ Category

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Friday, December 14th, 2012

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!

Access:

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date January-June 2012
Total Data 40.1 Terabyte (compressed)
Parsed HTML URLs 3,005,629,093
URLs with Triples 369,254,196
Domains in Crawl 40,600,000
Domains with Triples 2,286,277
Typed Entities 1,811,471,956
Triples 7,350,953,995

See also:

Web Data Commons Extraction Report – August 2012 Corpus

and,

Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

Where the authors report:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.

PLDs = Pay-Level-Domains.

The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.

Or to put it differently, if RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?

I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.

Web Data Commons

Thursday, March 22nd, 2012

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdatas and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-data web corpus that is currently available to the public, and provide the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of million web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing as it frees most of us from both crawling the web and/or extracting structured data from it. Or at least giving us the basis for more pointed crawling of the web.

A very welcome development!

rNews is here. And this is what it means.

Friday, February 17th, 2012

rNews is here. And this is what it means. by EVAN SANDHAUS.

From the post:

On January 23rd, 2012, The Times made a subtle change to articles published on nytimes.com. We rolled out phase one of our implementation of rNews – a new standard for embedding machine-readable publishing metadata into HTML documents. Many of our users will never see the change but the change will likely impact how they experience the news.

Far beneath the surface of nytimes.com lurk the databases — databases of articles, metadata and images, databases that took tremendous effort to develop, databases that the world only glimpses through the dark lens of HTML.

A rather slow lead into the crux of the story, the New York Times has started embedding rNews snippets in its news stories as of January 23rd, 2012. With the use of rNews to expand in the future.

Interesting result if you follow the request to paste the URL for The Bookstores Last Stand, http://www.nytimes.com/2012/01/29/business/barnes-noble-taking-on-amazon-in-the-fight-of-its-life.html, into the Google Rich Snippet tool. Go ahead, I’m not going anywhere, try it.

The New York Times has already diverged from the schema that it wants others to follow: “Warning: Page contains property “identifier” which is not part of the schema.

Earlier in the article Evan notes:

Several extensions to HTML have emerged that allow web publishers to explicitly markup structural metadata. These technologies include Microformats, HTML 5 Microdata and the Resource Description Framework in Attributes (RDFa).

For these technologies to be usefully applied, however, everybody has to agree what things should be called. For example, what The Times calls a “Headline,” a blogger might call a “Title,” and a German publisher might call an “überschrift.”

To use these new technologies for expressing underlying structure, the web publishing industry has to agree on a standard set of names and attributes, not an easy task. (emphasis added)

Using common names whenever possible but adapting (rather than breaking) in the event of change would be a better strategy.

One that would serve the NYT until 2173 and keep articles back to January 23rd 2012 as accessible as the day they were published.