Archive for the ‘Wikipedia’ Category

The Wikidata revolution is here:…

Friday, April 26th, 2013

The Wikidata revolution is here: enabling structured data on Wikipedia by Tilman Bayer.

From the post:

A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions,” said Wikimedia Foundation Executive Director Sue Gardner. “Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, can automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.”

This is a great source of curated data!

TSDW:… [Enterprise Disambiguation]

Monday, April 22nd, 2013

TSDW: Two-stage word sense disambiguation using Wikipedia by Chenliang Li, Aixin Sun, Anwitaman Datta. (Li, C., Sun, A. and Datta, A. (2013), TSDW: Two-stage word sense disambiguation using Wikipedia. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22829)

Abstract:

The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency.

TSDW works because Wikipedia is a source of unambiguous phrases, which can also be used to disambiguate phrases that on a first pass are not unambiguous.

But Wikipedia did not always exist and was built out of the collaboration of thousands of users over time.

Does that offer a clue as to building better search tools for enterprise data?

What if statistically improbable phrases are mined from new enterprise documents and links created to definitions for those phrases?

I think picking a current starting point avoids a “…boil the ocean…” scenario before benefits can be shown.

Current content is also more likely to be a search target.

Domain expertise and literacy required.

Expertise in logic or ontologies not.

WikiSynonyms: Find synonyms using Wikipedia redirects

Tuesday, February 26th, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects by Panos Ipeirotis.

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaced for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam’s idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

This rocks!

Talk about an easy path to populating variant names for a topic map!

Complete with examples, code, suggestions on hacking Wikipedia data sets (downloaded).
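
A minimal sketch of the redirect idea, not Panos's code: assume you have already extracted (redirect title, target title) pairs from a Wikipedia dump. The function name and the sample list are my own; the two C++ redirects are the ones named in the post.

```python
from collections import defaultdict

def synonyms_from_redirects(redirects):
    """Group redirect titles under the article title they point to."""
    groups = defaultdict(set)
    for source, target in redirects:
        groups[target].add(source)
    return groups

# Sample pairs; a real run would read them from a redirect table or dump.
redirects = [
    ("ISO/IEC 14882:2003", "C++"),
    ("X3J16", "C++"),
]

synonyms = synonyms_from_redirects(redirects)
print(sorted(synonyms["C++"]))
```

Each target title becomes a topic and its redirect titles become candidate variant names. Redirects that point at other redirects would need an extra resolution step that this sketch leaves out.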

Wikipedia and Legislative Data Workshop

Tuesday, February 26th, 2013

Wikipedia and Legislative Data Workshop

From the post:

Interested in the bills making their way through Congress?

Think they should be covered well in Wikipedia?

Well, let’s do something about it!

On Thursday and Friday, March 14th and 15th, we are hosting a conference here at the Cato Institute to explore ways of using legislative data to enhance Wikipedia.

Our project to produce enhanced XML markup of federal legislation is well under way, and we hope to use this data to make more information available to the public about how bills affect existing law, federal agencies, and spending, for example.

What better way to spread knowledge about federal public policy than by supporting the growth of Wikipedia content?

Thursday’s session is for all comers. Starting at 2:30 p.m., we will familiarize ourselves with Wikipedia editing and policy, and at 5:30 p.m. we’ll have a Sunshine Week reception. (You don’t need to attend in the afternoon to come to the reception. Register now!)

On Friday, we’ll convene experts in government transparency, in Wikipedia editorial processes and decisions, and in MediaWiki technology to think things through and plot a course.

I remain unconvinced about greater transparency into the “apparent” legislative process.

On the other hand, it may provide the “hook” or binding point to make who wins and who loses more evident.

If the Cato representatives mention their ideals being founded in the 18th century, you might want to remember that infant mortality was greater than 40% in foundling hospitals of the time.

People who speak glowingly of the 18th century didn’t live in the 18th century, and they imagine themselves as landed gentry of the time.

I first saw this at the Legal Informatics Blog.

Strong components of the Wikipedia graph

Friday, January 18th, 2013

Strong components of the Wikipedia graph

From the post:

I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?
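
If you want to try this at home, here is a small sketch of strong connectivity analysis using Kosaraju's algorithm. The post does not say which algorithm was used, so that choice is mine, and the links in the sample graph are invented for illustration, not taken from Wikipedia.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to the nodes it links to."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # First pass: record nodes in order of depth-first completion.
    visited, order = set(), []

    def visit(node):
        visited.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in visited:
            visit(node)

    # Build the reversed graph.
    reverse = {node: [] for node in nodes}
    for node, targets in graph.items():
        for nxt in targets:
            reverse[nxt].append(node)

    # Second pass: search the reversed graph in decreasing finish order.
    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)

    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components

# Hypothetical links between a few of the articles named above.
links = {
    "Heinz Hopf": ["Homotopy group"],
    "Homotopy group": ["Algebraic topology"],
    "Algebraic topology": ["Heinz Hopf"],
    "Utrecht University": ["Heinz Hopf"],
}
components = strongly_connected_components(links)
print(components)
```

The recursive searches are fine for small subsets like this one. For anything approaching the full link graph you would need an iterative version to stay within Python's recursion limit.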

One of the responses to this post suggests grabbing the entire Wikipedia dataset for purposes of trying out algorithms.

A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?

Recalling the research that claims people can only remember seven or so things at one time.

Wikipedia:Database download

Tuesday, November 20th, 2012

Wikipedia:Database download

From the webpage:

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

I know you are already aware of this as a data source but every time I want to confirm something about it, I have a devil of a time finding it at Wikipedia.

If I remember that I wrote about it here, perhaps it will be easier to find. ;-)

What I need to do is get one of those multi-terabyte network appliances for Christmas. Then copy large data sets that I don’t need updated as often as I need to consult their structures. (Like the next one I am about to mention.)

Mining a multilingual association dictionary from Wikipedia…

Saturday, November 17th, 2012

Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval by Zheng Ye, Jimmy Xiangji Huang, Ben He, Hongfei Lin.

Abstract:

Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality.

Is there a lesson here about using Wikipedia as a starter set of topics across languages?

Not the final product but a starting place other than ground zero for creation of a multi-lingual topic map.

Parsing Wikipedia Articles with Node.js and jQuery

Friday, August 31st, 2012

Parsing Wikipedia Articles with Node.js and jQuery by Ben Coe.

From the post:

For some NLP research I’m currently doing, I was interested in parsing structured information from Wikipedia articles. I did not want to use a full-featured MediaWiki parser. WikiFetch Crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON-representation of the page.

Harvesting of content (unless you are authoring all of it) is a major part of any topic map project.

Does this work for you?

Other small utilities or scripts you would recommend?

I first saw this at: DZone.

Open Source at Netflix [Open Source Topic Maps Are....?]

Friday, July 20th, 2012

Open Source at Netflix by Ruslan Meshenberg.

A great plug for open source (among others):

Improved code and documentation quality – we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.

A question as much to myself as anyone: Where are the open source topic maps?

There have been public dump sites for topic maps but have you seen an active community maintaining a public topic map?

Is it a technology/interface issue?

A control/authorship issue?

Something else?

Wikipedia works, although uneven. And there are a number of other similar efforts that are more or less successful.

Suggestions on what sets them apart?

Or suggestions you think should be tried? It isn’t possible to anticipate success. If the opposite were true, we would all be very successful. (Or at least that’s what I would wish for, your mileage may vary.)

Take it as given that any effort at a public topic map tool, a public topic map community or even a particular public topic map, or some combination thereof, is likely to fail.

But we simply have to dust ourselves off and try another subject, or another combination of those things or others.

Graphity source code and wikipedia raw data

Monday, July 9th, 2012

Graphity source code and wikipedia raw data is online (neo4j based social news stream framework) by René Pickhardt.

From the post:

8 months ago I posted the results of my research about fast retrieval of social news feeds and in particular my graph index graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that the system is independent of the node degree or network size. Therefor it scales to graphs of arbitrary size.

Today I am pleased to anounce that our joint work was accepted as a full research paper at IEEE SocialCom conference 2012. The conference will take place in early September 2012 in Amsterdam. As promised before I will now open the source code of Graphity to the community. Its documentation could / and might be improved in future also I am sure that one is even able to use a better data structure for our implementation of the priority queue.

Still the attention from the developer community for Graphity was quite high so maybe the source code is of help to anyone. The source code consists of the entire evaluation framework that we used for our evaluation against other baselines which will also help anyone to reproduce our evaluation.

There is some nice things one can learn in setting up multthreading for time measurements and also how to set up a good logging mechanism.

Just in case you are interested in all the changes ever made to the German entries in Wikipedia.

That’s one use case. ;-)

Deeply awesome work!

Please take a close look! This looks important!

Stability as Illusion

Monday, July 9th, 2012

In A Visual Way to See What is Changing Within Wikipedia, Jennifer Shockley writes:

Wikipedia is a go to source for quick answers outside the classroom, but many don’t realize Wiki is an ever evolving information source. Geekosystem’s article “Wikistats Show You What Parts Of Wikipedia Are Changing” provides a visual way to see what is changing within Wikipedia.

Is there any doubt that all of our information sources are constantly evolving?

Whether by edits to the sources or in our reading of those sources?

I wonder, have there been recall/precision studies done chronologically?

That is to say, studies of user evaluation of precision/recall on a given data set that repeat the evaluation with users at five (5) year intervals?

To learn if user evaluations of precision/recall change over time for the same queries on the same body of material?

My suspicion, without attributing a cause, is yes.

Suggestions or pointers welcome!

Mapping Research With WikiMaps

Tuesday, July 3rd, 2012

Mapping Research With WikiMaps

From the post:

An international research team has developed a dynamic tool that allows you to see a map of what is “important” on Wikipedia and the connections between different entries. The tool, which is currently in the “alpha” phase of development, displays classic musicians, bands, people born in the 1980s, and selected celebrities, including Lady Gaga, Barack Obama, and Justin Bieber. A slider control, or play button, lets you move through time to see how a particular topic or group has evolved over the last 3 or 4 years. The desktop version allows you to select any article or topic.

Wikimaps builds on the fact that Wikipedia contains a vast amount of high-quality information, despite the very occasional spot of vandalism and the rare instances of deliberate disinformation or inadvertent misinformation. It also carries with each article meta data about the page’s authors and the detailed information about every single contribution, edit, update and change. This, Reto Kleeb, of the MIT Center for Collective Intelligence, and colleagues say, “…opens new opportunities to investigate the processes that lie behind the creation of the content as well as the relations between knowledge domains.” They suggest that because Wikipedia has such a great amount of underlying information in the metadata it is possible to create a dynamic picture of the evolution of a page, topic or collection of connections.

See the demo version: http://www.ickn.org/wikimaps/.

For some very cutting edge thinking, see: Intelligent Collaborative Knowledge Networks (MIT) which has a download link to “Condor,” a local version of the wikimaps software.

Wikimaps builds upon a premise similar to the original premise of the WWW. Links break, deal with it. Hypertext systems prior to the WWW had tremendous overhead to make sure links remained viable. So much overhead that none of them could scale. The WWW allowed links to break and to be easily created. That scales. (The failure of the Semantic Web can be traced to the requirement that links not fail. Just the opposite of what made the WWW workable.)

Wikimaps builds upon the premise that the “facts” we have may be incomplete, incorrect, partial or even contradictory. All things that most semantic systems posit as verboten. An odd requirement, since our information is always incomplete, incorrect (possibly), partial or even contradictory. We have set requirements for our information systems that we can’t meet working by hand. Not surprising that our systems fail and fail to scale.

How much information failure can you tolerate?

A question that should be asked of every information system at the design stage. If the answer is none, move on to a project with some chance of success.

I was surprised at the journal reference, not one I would usually scan. Recent origin, expensive, not in library collections I access.

Journal reference:

Reto Kleeb et al. Wikimaps: dynamic maps of knowledge. Int. J. Organisational Design and Engineering, 2012, 2, 204-224

Abstract:

We introduce Wikimaps, a tool to create a dynamic map of knowledge from Wikipedia contents. Wikimaps visualise the evolution of links over time between articles in different subject areas. This visualisation allows users to learn about the context a subject is embedded in, and offers them the opportunity to explore related topics that might not have been obvious. Watching a Wikimap movie permits users to observe the evolution of a topic over time. We also introduce two static variants of Wikimaps that focus on particular aspects of Wikipedia: latest news and people pages. ‘Who-works-with-whom-on-Wikipedia’ (W5) links between two articles are constructed if the same editor has worked on both articles. W5 links are an excellent way to create maps of the most recent news. PeopleMaps only include links between Wikipedia pages about ‘living people’. PeopleMaps in different-language Wikipedias illustrate the difference in emphasis on politics, entertainment, arts and sports in different cultures.

Just in case you are interested: International Journal of Organisational Design and Engineering, Editor in Chief: Prof. Rodrigo Magalhaes, ISSN online: 1758-9800, ISSN print: 1758-9797.

Creating a Semantic Graph from Wikipedia

Sunday, June 3rd, 2012

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.

Abstract:

With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces “property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy that has a happy result.

Printing the thesis now for a close read.

(Source: Jack Park)

Using BerkeleyDB to Create a Large N-gram Table

Friday, May 18th, 2012

Using BerkeleyDB to Create a Large N-gram Table by Richard Marsden.

From the post:

Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory limitations limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.

Large datasets such as the Wikipedia and Gutenberg English language corpora cannot be used to create N-gram frequency tables using the previous script due to the script’s large in-memory requirements. The solution is to create the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal. This is an open source “NoSQL” library which supports a disk based database and in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package. This is included with Python 2.* but is an additional download for Python 3 installations.

Richard promises to make the resulting data sets available as an Azure service. Sample code, etc, will be posted to his blog.
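
To see the disk-based idea in miniature, here is a sketch that keeps bigram counts in an on-disk key-value store. It is not Richard's script: I have swapped in Python's standard library dbm module for BerkeleyDB/bsddb3 so that it runs without extra installs, and the function names and sample text are my own.

```python
import dbm
import os
import tempfile

def count_ngrams(tokens, n, db_path):
    """Accumulate n-gram counts in an on-disk key-value store."""
    with dbm.open(db_path, "c") as db:
        for i in range(len(tokens) - n + 1):
            key = " ".join(tokens[i:i + n]).encode("utf-8")
            current = int(db[key]) if key in db else 0
            db[key] = str(current + 1).encode("utf-8")

def lookup(ngram, db_path):
    """Return the stored count for one n-gram, or 0 if it was never seen."""
    with dbm.open(db_path, "r") as db:
        key = ngram.encode("utf-8")
        return int(db[key]) if key in db else 0

tokens = "the free encyclopedia is the free encyclopedia".split()
db_path = os.path.join(tempfile.mkdtemp(), "bigrams")
count_ngrams(tokens, 2, db_path)
print(lookup("the free", db_path))
```

Only one key and one value are held in memory at a time, which is the point of moving the table to disk. A real run over Wikipedia would stream tokens from the corpus rather than split a string.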

Another Wikipedia based analysis.

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Friday, May 18th, 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)
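
A toy version of the anchor-text idea, to make it concrete: count how often each string is used as link text for each concept. This is only a sketch under my own assumptions. The function names and sample links are hypothetical, and it does not reflect the format of the released data.

```python
from collections import Counter, defaultdict

def build_dictionary(links):
    """Count how often each anchor string points at each concept."""
    table = defaultdict(Counter)
    for anchor_text, concept in links:
        table[anchor_text][concept] += 1
    return table

def concept_shares(table, anchor_text):
    """Relative frequency of each concept for one anchor string."""
    counts = table.get(anchor_text, Counter())
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}

# Hypothetical (anchor text, target article) pairs harvested from links.
links = [
    ("Java", "Java_(programming_language)"),
    ("Java", "Java_(programming_language)"),
    ("Java", "Java_(programming_language)"),
    ("Java", "Java"),
]
table = build_dictionary(links)
shares = concept_shares(table, "Java")
print(shares)
```

An ambiguous string keeps every concept it has pointed to, with a weight for each, which fits a resource designed for recall rather than precision.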

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?

Explore Geographic Coverage in Mapping Wikipedia

Saturday, April 7th, 2012

Explore Geographic Coverage in Mapping Wikipedia

From the post:

TraceMedia, in collaboration with the Oxford Internet Institute, maps language use across Wikipedia in an interactive, fittingly named Mapping Wikipedia.

Simply select a language, a region, and the metric that you want to map, such as word count, number of authors, or the languages themselves, and you’ve got a view into “local knowledge production and representation” on the encyclopedia. Each dot represents an article with a link to the Wikipedia article. For the number of dots on the map, a maximum of 800,000, it works surprisingly without a hitch, other than the time it initially takes to load articles.

You need to follow the link to: Who represents the Arab world online? Mapping and measuring local knowledge production and representation in the Middle East and North Africa. The researchers are concerned with fairness and balance of coverage of the Arab world.

Rather than focusing on Wikipedia, an omnipresent resource on the WWW, I would prefer a mapping of who originates news feeds more generally, instead of a focus on who is absent. Moreover, I would ask why the Arab OPEC members have not been more effective at restoring balance in the news media.

Cross Domain Search by Exploiting Wikipedia

Monday, March 12th, 2012

Cross Domain Search by Exploiting Wikipedia by Chen Liu, Sai Wu, Shouxu Jiang, and Anthony K. H. Tung.

Abstract:

The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross domain resource search requirement, in which a query is a resource in one modal and the results are closely related resources in other modalities. With cross domain search, we can better exploit existing resources.

Intuitively, tags associated with Web 2.0 resources are a straightforward medium to link resources with different modality together. However, tagging is by nature an ad hoc activity. They often contain noises and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.

When the authors say “cross domain,” they are referring to different types of resources, say text vs. images or images vs. sound or any of those three vs. some other type of resource. One search can return “related” resources of different resource types.

Although the “cross domain” searching is interesting, I am more interested in the mapping that was performed on Wikipedia. The authors define three semantic relationships:

  • Link between Tag and Concept
  • Correlation of Concepts
  • Semantic Distance

It seems to me that the authors are attacking “big data,” which has unbounded semantics, from the “other” end. That is, they are mapping a finite universe of semantics (Wikipedia) and then using that finite mapping to mine a much larger, unbounded semantic universe.

Or perhaps creating a semantic lens through which to view “related resources” in a much larger semantic universe. And without the overhead of Linked Data, which is mentioned under other work.

Exploring Wikipedia with Gremlin Graph Traversals

Saturday, March 10th, 2012

Exploring Wikipedia with Gremlin Graph Traversals by Marko Rodriguez.

From the post:

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

If you aren’t interested in graph representations of data before reading this post, it is likely you will be afterwards.

Take a few minutes to read it and then let me know what you think.

Is Wikipedia Going To Explode?

Thursday, March 1st, 2012

I ran across a problem in Wikipedia that may mean it is about to explode. You decide.

You have heard about the danger of “combinatorial explosions” if we have more than one identifier. Every identifier has to be mapped to every other identifier.

Imagine that a – j represent different identifiers for the same subject.

This graphic represents a “small” combinatorial explosion.

[Graphic: combinatorial explosion]

If that looks hard to read, here is a larger version:

[Graphic: large explosion]

Is that better? ;-)

Here is where I noticed the problem: the Wikipedia XML file has synonyms for the entries.

The article on anarchism has one hundred and one other names:

  1. af:Anargisme
  2. als:Anarchismus
  3. ar:لاسلطوية
  4. an:Anarquismo
  5. ast:Anarquismu
  6. az:Anarxizm
  7. bn:নৈরাজ্যবাদ
  8. zh-min-nan:Hui-thóng-tī-chú-gī
  9. be:Анархізм
  10. be-x-old:Анархізм
  11. bo:གཞུང་མེད་ལམ་སྲོལ།
  12. bs:Anarhizam
  13. br:Anveliouriezh
  14. bg:Анархизъм
  15. ca:Anarquisme
  16. cs:Anarchismus
  17. cy:Anarchiaeth
  18. da:Anarkisme
  19. pdc:Anarchism
  20. de:Anarchismus
  21. et:Anarhism
  22. el:Αναρχισμός
  23. es:Anarquismo
  24. eo:Anarkiismo
  25. eu:Anarkismo
  26. fa:آنارشیسم
  27. hif:Khalbali
  28. fo:Anarkisma
  29. fr:Anarchisme
  30. fy:Anargisme
  31. ga:Ainrialachas
  32. gd:Ain-Riaghailteachd
  33. gl:Anarquismo
  34. ko:아나키즘
  35. hi:अराजकता
  36. hr:Anarhizam
  37. id:Anarkisme
  38. ia:Anarchismo
  39. is:Stjórnleysisstefna
  40. it:Anarchismo
  41. he:אנרכיזם
  42. jv:Anarkisme
  43. kn:ಅರಾಜಕತಾವಾದ
  44. ka:ანარქიზმი
  45. kk:Анархизм
  46. sw:Utawala huria
  47. lad:Anarkizmo
  48. krc:Анархизм
  49. la:Anarchismus
  50. lv:Anarhisms
  51. lb:Anarchismus
  52. lt:Anarchizmas
  53. jbo:nonje’asi’o
  54. hu:Anarchizmus
  55. mk:Анархизам
  56. ml:അരാജകത്വവാദം
  57. mr:अराजकता
  58. arz:اناركيه
  59. ms:Anarkisme
  60. mwl:Anarquismo
  61. mn:Анархизм
  62. nl:Anarchisme
  63. ja:アナキズム
  64. no:Anarkisme
  65. nn:Anarkisme
  66. oc:Anarquisme
  67. pnb:انارکی
  68. ps:انارشيزم
  69. pl:Anarchizm
  70. pt:Anarquismo
  71. ro:Anarhism
  72. rue:Анархізм
  73. ru:Анархизм
  74. sah:Анархизм
  75. sco:Anarchism
  76. simple:Anarchism
  77. sk:Anarchizmus
  78. sl:Anarhizem
  79. ckb:ئانارکیزم
  80. sr:Анархизам
  81. sh:Anarhizam
  82. fi:Anarkismi
  83. sv:Anarkism
  84. tl:Anarkismo
  85. ta:அரசின்மை
  86. th:อนาธิปไตย
  87. tg:Анархизм
  88. tr:Anarşizm
  89. uk:Анархізм
  90. ur:فوضیت
  91. ug:ئانارخىزم
  92. za:Fouzcwngfujcujyi
  93. vec:Anarchismo
  94. vi:Chủ nghĩa vô chính phủ
  95. fiu-vro:Anarkism
  96. war:Anarkismo
  97. yi:אנארכיזם
  98. zh-yue:無政府主義
  99. diq:Anarşizm
  100. bat-smg:Anarkėzmos
  101. zh:无政府主义

Now you can imagine the “combinatorial explosion” that awaits the Wikipedia entry on anarchism: one hundred and two names (the 101 above plus English), compared to my ten identifiers.

Except that Wikipedia leaves the relationships between all these identifiers for anarchism unspecified.

You can call them into existence, one to the other, as needed, but then you assume the burden of processing them. All the identifiers remain available to other users for their purposes as well.

Hmmm, with the language prefixes mapping to scopes, this looks like a good source for names and variant names for topics in a topic map.
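To make the idea concrete, here is a minimal sketch in Python of reading the "prefix:Title" entries above as scoped names. The names ScopedName, parse_interlanguage_link, scopes_by_name and pairwise_links are mine, invented for illustration; they are not part of any Wikipedia or topic map tool. The pair count assumes you wanted to state every identifier-to-identifier relationship explicitly, which, as noted above, Wikipedia does not do.

```python
from typing import Dict, Iterable, NamedTuple, Set


class ScopedName(NamedTuple):
    """A name plus the scope (language prefix) in which it is valid."""
    scope: str
    name: str


def parse_interlanguage_link(link: str) -> ScopedName:
    """Split 'prefix:Title' at the first colon only (titles may contain colons)."""
    scope, sep, name = link.partition(":")
    if not sep or not scope.strip() or not name.strip():
        raise ValueError(f"not a prefix:Title entry: {link!r}")
    return ScopedName(scope.strip(), name.strip())


def scopes_by_name(links: Iterable[str]) -> Dict[str, Set[str]]:
    """Group entries by name, so a name shared by several languages
    is stored once with a set of scopes."""
    grouped: Dict[str, Set[str]] = {}
    for link in links:
        scoped = parse_interlanguage_link(link)
        grouped.setdefault(scoped.name, set()).add(scoped.scope)
    return grouped


def pairwise_links(n: int) -> int:
    """Number of unordered pairs among n identifiers: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# A few entries copied from the list above.
sample = ["af:Anargisme", "als:Anarchismus", "de:Anarchismus", "la:Anarchismus"]
names = scopes_by_name(sample)   # "Anarchismus" carries three scopes
explicit_pairs = pairwise_links(102)  # 5151 pairs for 102 names
```

Grouping by name rather than by prefix shows how often the same string recurs across languages (Anarchismus, Anarquismo, Анархизм), which is exactly where scope earns its keep.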

What do you think?


According to my software, this is post #4,000. Looking for ways to better deliver information about topic maps and their construction. Suggestions (not to mention support) welcome!

Making sense of Wikipedia categories

Tuesday, February 21st, 2012

Making sense of Wikipedia categories

Hal Daume III writes:

Wikipedia’s category hierarchy forms a graph. It’s definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that “Chicago Stags coaches” are a subcategory of “Natural sciences”? If you don’t believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

(subcategories omitted)

I guess it kind of makes sense. There are some other fun ones, like “Rhaeto-Romance languages”, “American World War I flying aces” and “1911 films”. Of course, these are all quite deep in the “hierarchy” (all of those are at depth 15 or higher).

Hal examines several strategies and concludes asking:

Has anyone else tried and succeeded at using the Wikipedia category structure?

Some other questions:

Is Hal right that hand annotation doesn’t “scale”?

I have heard that more times than I can count but never seen any studies cited to support it.

After all, Wikipedia was manually edited and produced. Yes? No automated process created its content. So, what is the barrier to hand annotation?

If you think about it, the same could be said about email. Most email (yes?) is written by hand, not produced by automated processes (well, except for spam), so why can’t it be hand annotated? Or at least, why can’t we capture the semantics of email at the point of composition and annotate it there by automated means?

Hand annotation may not scale for sensor data or financial data streams but is hand annotation needed for such sources?

Hand annotation may not scale for, say, Twitter posts by non-English speakers. But only for agencies with very short-sighted, if not actively bigoted, hiring/contracting practices.

Has anyone loaded the Wikipedia categories into a graph database? What sort of interface would you suggest for trial arrangement of the categories?
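Short of a graph database, a plain adjacency map is enough to try out the two things Hal describes: finding a cycle and measuring how deep a category sits under a root. What follows is only a sketch. The function names are mine, and the edge from “Natural sciences” to “Behavior” is made up for illustration (I have not checked it against Wikipedia); only the Ethology/Behavior loop comes from Hal's post.

```python
from collections import deque
from typing import Dict, List, Optional

# Map each category to its direct subcategories.
Graph = Dict[str, List[str]]


def find_cycle(subcats: Graph, start: str) -> Optional[List[str]]:
    """Depth-first search; return one cycle reachable from start, or None."""
    path: List[str] = []
    on_path = set()
    finished = set()

    def visit(node: str) -> Optional[List[str]]:
        path.append(node)
        on_path.add(node)
        for child in subcats.get(node, ()):
            if child in on_path:
                # Back edge: the cycle runs from child's position to here.
                return path[path.index(child):] + [child]
            if child not in finished:
                found = visit(child)
                if found:
                    return found
        on_path.discard(node)
        finished.add(node)
        path.pop()
        return None

    return visit(start)


def depths_from(subcats: Graph, root: str) -> Dict[str, int]:
    """Breadth-first search; shortest depth of each reachable category.
    Already-seen categories are skipped, so cycles do not loop forever."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in subcats.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


toy: Graph = {
    "Natural sciences": ["Behavior"],  # hypothetical edge, for illustration only
    "Behavior": ["Ethology"],          # the loop Hal reports
    "Ethology": ["Behavior"],
}
cycle = find_cycle(toy, "Natural sciences")
depths = depths_from(toy, "Natural sciences")
```

On the toy data, cycle comes back as Behavior, Ethology, Behavior, and Ethology sits at depth 2. The recursive search is fine for a toy but would need an explicit stack for the full category dump, since Python's default recursion limit is far smaller than a graph of that size can require.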

PS: If you are interested in discussing how to establish assisted annotation for Twitter, email or other data streams, with or without user awareness, send me a post.

Semantic Web – Sweet Spot(s) and ‘Gold Standards’

Monday, January 23rd, 2012

Mike Bergman posted a two-part series on how to make the Semantic Web work:

Seeking a Semantic Web Sweet Spot

In Search of ‘Gold Standards’ for the Semantic Web

Both are worth your time to read but the second sets the bar for “Gold Standards” for the Semantic Web as:

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

There is one criterion that Mike leaves out: Choice of a majority of users.

Use by a majority of users is a sweet spot that brooks no argument.