Wikibase DataModel released! « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 3, 2014

Wikibase DataModel released!

Filed under: Data Models,Identification,Precision,Subject Identity,Wikidata,Wikipedia — Patrick Durusau @ 5:04 pm

Wikibase DataModel released! by Jeroen De Dauw.

From the post:

I’m happy to announce the 0.6 release of Wikibase DataModel. This is the first real release of this component.

DataModel?

Wikibase is the software behind Wikidata.org. At its core, this software is about describing entities. Entities are collections of claims, which can have qualifiers, references and values of various different types. How this all fits together is described in the DataModel document written by Markus and Denny at the start of the project. The Wikibase DataModel component contains (PHP) domain objects representing entities and their various parts, as well as associated domain logic.
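To make that description concrete, here is a rough sketch in Python of an entity as a collection of claims, each carrying a value plus optional qualifiers and references. The names Entity, Claim, values_of and the identifier "item-orca" are my own illustration; they are assumptions, not the PHP classes that ship in Wikibase DataModel.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical value type: the real component supports richer value types.
Value = Union[str, int, float]


@dataclass
class Claim:
    """One statement about an entity (illustrative, not the Wikibase class)."""
    property_id: str
    value: Value
    # Qualifiers refine the claim; modeled here as (property, value) pairs.
    qualifiers: List[Tuple[str, Value]] = field(default_factory=list)
    # References say where the claim comes from; plain strings in this sketch.
    references: List[str] = field(default_factory=list)


@dataclass
class Entity:
    """An entity is a collection of claims under one stable identifier."""
    entity_id: str
    claims: List[Claim] = field(default_factory=list)

    def values_of(self, property_id: str) -> List[Value]:
        """Return every value claimed for the given property."""
        return [c.value for c in self.claims if c.property_id == property_id]


# "item-orca" is a made-up identifier for illustration only.
orca = Entity("item-orca")
orca.claims.append(
    Claim(
        property_id="binomial name",
        value="Orcinus orca (Linnaeus, 1758)",
        references=["hypothetical source"],
    )
)
```

Calling orca.values_of("binomial name") returns the single value stored above; a property with no claims returns an empty list.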

I wanted to draw your attention to this discussion of “items:”

Items are Entities that are typically represented by a Wikipage (at least in some Wikipedia languages). They can be viewed as “the thing that a Wikipage is about,” which could be an individual thing (the person Albert Einstein), a general class of things (the class of all Physicists), and any other concept that is the subject of some Wikipedia page (including things like History of Berlin).

The IRI of an Item will typically be closely related to the URL of its page on Wikidata. It is expected that Items store a shorter ID string (for example, as a title string in MediaWiki) that is used in both cases. ID strings might have a standardized technical format such as “wd1234567890” and will usually not be seen by users. The ID of an Item should be stable and not change after it has been created.

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

However, it is intended that the information stored in Wikidata is generally about the topic of the Item. For example, the Item for History of Berlin should store data about this history (if there is any such data), not about Berlin (the city). It is not intended that data about one subject is distributed across multiple Wikidata Items: each Item fully represents one thing. This also helps for data integration across languages: many languages have no separate article about Berlin’s history, but most have an article about Berlin.

What do you make of the claim:

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

I may write an information system that fails to distinguish between a species of whales, a class of whales and a particular whale, but that is a design choice, not a foregone conclusion.
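As a small illustration of that design choice, the sketch below (in Python) gives the species, the class and the individual whale separate identities, even where they share the label "Orca". Every name in it (Kind, Subject, register, subjects_labeled and the identifiers) is an assumption of mine for illustration, not anything taken from Wikidata or Wikibase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Kind(Enum):
    """What sort of subject an identifier stands for (hypothetical scheme)."""
    TAXON = "taxon"            # the species as a concept we can describe
    CLASS = "class"            # the set of all whales of that species
    INDIVIDUAL = "individual"  # one particular whale


@dataclass(frozen=True)
class Subject:
    subject_id: str
    kind: Kind
    label: str
    member_of: Optional[str] = None  # id of the class an individual belongs to


registry: Dict[str, Subject] = {}


def register(subject: Subject) -> Subject:
    """Add a subject, refusing to reuse an identifier for a second subject."""
    if subject.subject_id in registry:
        raise ValueError(f"identifier already in use: {subject.subject_id}")
    registry[subject.subject_id] = subject
    return subject


def subjects_labeled(label: str) -> List[str]:
    """List the ids of all subjects sharing a label, in sorted order."""
    return sorted(s.subject_id for s in registry.values() if s.label == label)


# The same label, two distinct subjects, plus one individual.
register(Subject("orca-species", Kind.TAXON, "Orca"))
register(Subject("orca-class", Kind.CLASS, "Orca"))
register(Subject("keiko", Kind.INDIVIDUAL, "Keiko", member_of="orca-class"))
```

A lookup by the label "Orca" then returns two identifiers rather than one, so a statement such as the binomial name can attach to the species while membership attaches to the class.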

In the case of Wikipedia, which relies upon individuals repeating the task of extracting relevant information from loosely gathered data, that approach works quite well.

But there isn’t one degree of precision of identification that works for all cases.

My suspicion is that for more demanding search applications, such as drug interactions, less precise identifications could lead to unfortunate, even fatal, results.

Yes?

1 Comment

I fully agree. Other systems can have a different way to separate entities. What the sentence claims is that no technical system can capture the exact meaning. You can get more or less exact, more or less vague, but there is no way you can capture the exact semantics for most terms in Wikipedia (say, for whales) completely in a system.

Comment by Denny Vrandečić — January 9, 2014 @ 5:51 pm
