Microformats « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 1, 2014

Baloo [KDE drops RDF]

Filed under: Metadata,Microformats,RDF — Patrick Durusau @ 4:28 pm

From the post:

Baloo is the next generation of the Nepomuk project. It’s responsible for handling user metadata such as tags, rating and comments. It also handles indexing and searching for files, emails, contacts, and so on. Baloo aims to be lighter on resources and more reliable than its parent project.

…

The Nepomuk project started as a research project in the European Union. The goal was to explore the use of relations between data for finding what you are looking for. It was built completely on top of RDF. While RDF is great from a theoretical point of view, it is not the simplest tool to understand or optimize. The databases which currently exist for RDF are not suited for desktop use.

The Nepomuk developers have tried very hard over the last years to optimize the indexing and searching infrastructure, and they have now come to the conclusion that Nepomuk cannot be further optimized without migrating away from RDF.

RDF also heavily relied on ontologies. These ontologies are a way to describe how the data should be stored and represented. They used the ontologies from the original EU research project – Shared Desktop Ontologies. These ontologies were designed at a time when it was not very clear how they would work, and they have sub-optimal performance and ease of use. They are quite vague in certain areas and often duplicate information. This leads to scenarios where it takes forever to figure out how the data should be stored. Additionally, since all the data needs to be stored in RDF, one cannot optimize for one specific data type.

Given these shortcomings and the many lessons learned over the last years the Nepomuk developers decided to drop RDF and rechristen the project under the name of Baloo. You can find more technical background and info on its architecture here.

I suggested to someone in synchronous time that authoring support for schema.org based metadata could be a win-win for users and document processing software.

For users, search appliances, local or even Google, can ingest “lite” schema definitions that provide immediate ROI on adding semantics to your documents. Well, I say immediate, as soon as they are indexed.

That should require no more skill than being able to type, assuming your document software can recognize the terms you use and annotate them properly.

Think of the difference between the number of people who can author XML using MS Office or Apache OpenOffice, etc. Now compare that to people who natively author DocBook documents.

If you want a successful strategy, do you follow the one that has resulted in a user base measured in increments of hundreds of millions, or do you prefer the righteous remnant approach with, say, fewer than 50,000?

I’m no marketing person but even I know the answer to that one.

PS: There are some ankle biters who complain about the MS Office user numbers. Let’s just say between MS Office and Apache OpenOffice and the other ODF based word processors, that DocBook users are out-numbered by at least 20,000 to 1. Who needs more accurate numbers than that?

PPS: Microformats don’t have the precision that RDF and/or Topic Maps have to offer. But precision without adoption can’t be very precise. With adoption of microformats, more precision can be added as required by particular use cases.

I first saw this in a tweet by Jan Schnasse.

December 14, 2012

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Filed under: Common Crawl,Microdata,Microformats,RDFa — Patrick Durusau @ 2:34 pm

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!

Access:

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date	January-June 2012
Total Data	40.1 Terabyte	(compressed)
Parsed HTML URLs	3,005,629,093
URLs with Triples	369,254,196
Domains in Crawl	40,600,000
Domains with Triples	2,286,277
Typed Entities	1,811,471,956
Triples	7,350,953,995
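
To put the table in proportion, the raw counts can be turned into shares. What follows is a minimal sketch in Python that uses only the figures listed above; the dictionary keys and the `share` helper are illustrative names, not anything published by Web Data Commons.

```python
# Figures copied from the extraction statistics above.
stats = {
    "parsed_html_urls": 3_005_629_093,
    "urls_with_triples": 369_254_196,
    "domains_in_crawl": 40_600_000,
    "domains_with_triples": 2_286_277,
    "triples": 7_350_953_995,
}


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of parsed URLs and of crawled domains that carry any structured data.
url_share = share(stats["urls_with_triples"], stats["parsed_html_urls"])
domain_share = share(stats["domains_with_triples"], stats["domains_in_crawl"])

# Average number of triples on a URL that has at least one triple.
triples_per_url = stats["triples"] / stats["urls_with_triples"]

print(f"URLs with triples:    {url_share:.1f}% of parsed URLs")
print(f"Domains with triples: {domain_share:.1f}% of crawled domains")
print(f"Triples per URL that has triples: {triples_per_url:.1f}")
```

By these numbers, roughly one parsed URL in eight and fewer than six domains in a hundred carry embedded structured data in any of the three formats.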

March 22, 2012

Web Data Commons

Filed under: Common Crawl,Microdata,Microformats,PageRank,RDFa — Patrick Durusau @ 7:42 pm

Web Data Commons

From the webpage:

More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages. The Web Data Commons project extracts this data from several billion web pages and provides the extracted data for download. Web Data Commons thus enables you to use the data without needing to crawl the Web yourself.

…

More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and (soon) also in the form of CSV-tables for common entity types (e.g. product, organization, location, …).

Web Data Commons thus enables you to use structured data originating from hundreds of millions of web pages within your applications without needing to crawl the Web yourself.

Pages in the Common Crawl corpora are included based on their PageRank score, thereby making the crawls snapshots of the current popular part of the Web.

This reminds me of the virtual observatory practice in astronomy. Astronomical data is too large to easily transfer and many who need to use the data lack the software or processing power. The solution? Holders of the data make it available via interfaces that deliver a sub-part of the data, processed according to the requester’s needs.

The Web Data Commons is much the same thing: it frees most of us from crawling the web and extracting structured data from it, or at least gives us a basis for more pointed crawling of the web.

A very welcome development!

February 17, 2012

rNews is here. And this is what it means.

Filed under: Microdata,Microformats,rNews — Patrick Durusau @ 5:02 pm

rNews is here. And this is what it means. by EVAN SANDHAUS.

From the post:

On January 23rd, 2012, The Times made a subtle change to articles published on nytimes.com. We rolled out phase one of our implementation of rNews – a new standard for embedding machine-readable publishing metadata into HTML documents. Many of our users will never see the change but the change will likely impact how they experience the news.

Far beneath the surface of nytimes.com lurk the databases — databases of articles, metadata and images, databases that took tremendous effort to develop, databases that the world only glimpses through the dark lens of HTML.

A rather slow lead-in to the crux of the story: the New York Times has been embedding rNews snippets in its news stories since January 23rd, 2012, with the use of rNews to expand in the future.

Interesting result if you follow the request to paste the URL for The Bookstore’s Last Stand, http://www.nytimes.com/2012/01/29/business/barnes-noble-taking-on-amazon-in-the-fight-of-its-life.html, into the Google Rich Snippet tool. Go ahead, I’m not going anywhere, try it.

The New York Times has already diverged from the schema that it wants others to follow: “Warning: Page contains property “identifier” which is not part of the schema.”
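
The kind of check behind that warning can be sketched in a few lines of Python. This is only an illustration, assuming the markup is HTML5 Microdata with `itemprop` attributes. The allow-list and the sample markup are made up for the example, and this is not how Google's tool is implemented.

```python
from html.parser import HTMLParser

# Illustrative allow-list only; a real check would load the full property
# list from the schema being validated against.
ALLOWED_PROPERTIES = {"headline", "datePublished", "author"}


class ItemPropCollector(HTMLParser):
    """Collect every itemprop name found in a page."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "itemprop" and value:
                # One itemprop attribute may hold several space-separated names.
                self.properties.extend(value.split())


def unknown_properties(html, allowed=ALLOWED_PROPERTIES):
    """Return the sorted property names in html that are not in allowed."""
    collector = ItemPropCollector()
    collector.feed(html)
    collector.close()
    return sorted(set(collector.properties) - set(allowed))


# Made-up sample markup, not taken from nytimes.com.
sample = """
<div itemscope itemtype="http://schema.org/NewsArticle">
  <h1 itemprop="headline">Example headline</h1>
  <meta itemprop="identifier" content="example-id">
</div>
"""

for prop in unknown_properties(sample):
    print(f'Warning: Page contains property "{prop}" which is not part of the schema.')
```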

Earlier in the article Evan notes:

Several extensions to HTML have emerged that allow web publishers to explicitly markup structural metadata. These technologies include Microformats, HTML 5 Microdata and the Resource Description Framework in Attributes (RDFa).

For these technologies to be usefully applied, however, everybody has to agree what things should be called. For example, what The Times calls a “Headline,” a blogger might call a “Title,” and a German publisher might call an “überschrift.”

To use these new technologies for expressing underlying structure, the web publishing industry has to agree on a standard set of names and attributes, not an easy task. (emphasis added)

Using common names whenever possible but adapting (rather than breaking) in the event of change would be a better strategy.

One that would serve the NYT until 2173 and keep articles back to January 23rd 2012 as accessible as the day they were published.
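
One way to picture "adapting rather than breaking" is as an alias table: names a consumer recognizes are mapped onto a common term, and names it does not recognize are carried along instead of being rejected. The Python sketch below is hypothetical throughout; the alias table, the `normalize` function and the sample record are invented for illustration and are not part of rNews.

```python
# Hypothetical alias table: publisher-specific names mapped to a common term.
ALIASES = {
    "headline": "headline",
    "title": "headline",
    "überschrift": "headline",
}


def normalize(record, aliases=ALIASES):
    """Split a metadata record into recognized and unrecognized properties.

    Recognized names are rewritten to their common term. Unrecognized names
    are kept under their original name, so nothing is lost when a publisher
    adds or renames a property.
    """
    known, extras = {}, {}
    for name, value in record.items():
        canonical = aliases.get(name.casefold())
        if canonical is None:
            extras[name] = value
        else:
            known[canonical] = value
    return known, extras


# Made-up record using a blogger's "Title" and an unlisted "identifier".
known, extras = normalize({"Title": "Example story", "identifier": "example-id"})
print(known)
print(extras)
```

Adding a new alias later changes how old records are read without touching the records themselves, which is the property that matters for keeping 2012 articles usable.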
