Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 3, 2012

Mapping Research With WikiMaps

Filed under: Mapping,Maps,WikiMaps,Wikipedia — Patrick Durusau @ 5:12 am

Mapping Research With WikiMaps

From the post:

An international research team has developed a dynamic tool that allows you to see a map of what is “important” on Wikipedia and the connections between different entries. The tool, which is currently in the “alpha” phase of development, displays classic musicians, bands, people born in the 1980s, and selected celebrities, including Lady Gaga, Barack Obama, and Justin Bieber. A slider control, or play button, lets you move through time to see how a particular topic or group has evolved over the last 3 or 4 years. The desktop version allows you to select any article or topic.

Wikimaps builds on the fact that Wikipedia contains a vast amount of high-quality information, despite the very occasional spot of vandalism and the rare instances of deliberate disinformation or inadvertent misinformation. It also carries with each article meta data about the page’s authors and the detailed information about every single contribution, edit, update and change. This, Reto Kleeb, of the MIT Center for Collective Intelligence, and colleagues say, “…opens new opportunities to investigate the processes that lie behind the creation of the content as well as the relations between knowledge domains.” They suggest that because Wikipedia has such a great amount of underlying information in the metadata it is possible to create a dynamic picture of the evolution of a page, topic or collection of connections.

See the demo version: http://www.ickn.org/wikimaps/.

For some very cutting-edge thinking, see: Intelligent Collaborative Knowledge Networks (MIT), which has a download link to “Condor,” a local version of the Wikimaps software.

Wikimaps builds upon a premise similar to the original premise of the WWW. Links break, deal with it. Hypertext systems prior to the WWW had tremendous overhead to make sure links remained viable. So much overhead that none of them could scale. The WWW allowed links to break and to be easily created. That scales. (The failure of the Semantic Web can be traced to the requirement that links not fail. Just the opposite of what made the WWW workable.)

Wikimaps builds upon the premise that the facts we have may be incomplete, incorrect, partial or even contradictory, all things that most semantic systems treat as verboten. That is an odd requirement, since our information is always incomplete, possibly incorrect, partial or even contradictory. We have set requirements for our information systems that we cannot meet working by hand. It is not surprising that our systems fail and fail to scale.

How much information failure can you tolerate?

A question that should be asked of every information system at the design stage. If the answer is none, move on to a project with some chance of success.

I was surprised at the journal reference, not one I would usually scan. Recent origin, expensive, not in library collections I access.

Journal reference:

Reto Kleeb et al. Wikimaps: dynamic maps of knowledge. Int. J. Organisational Design and Engineering, 2012, 2, 204-224

Abstract:

We introduce Wikimaps, a tool to create a dynamic map of knowledge from Wikipedia contents. Wikimaps visualise the evolution of links over time between articles in different subject areas. This visualisation allows users to learn about the context a subject is embedded in, and offers them the opportunity to explore related topics that might not have been obvious. Watching a Wikimap movie permits users to observe the evolution of a topic over time. We also introduce two static variants of Wikimaps that focus on particular aspects of Wikipedia: latest news and people pages. ‘Who-works-with-whom-on-Wikipedia’ (W5) links between two articles are constructed if the same editor has worked on both articles. W5 links are an excellent way to create maps of the most recent news. PeopleMaps only include links between Wikipedia pages about ‘living people’. PeopleMaps in different-language Wikipedias illustrate the difference in emphasis on politics, entertainment, arts and sports in different cultures.

Just in case you are interested: International Journal of Organisational Design and Engineering, Editor in Chief: Prof. Rodrigo Magalhaes, ISSN online: 1758-9800, ISSN print: 1758-9797.
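The W5 idea from the abstract is simple enough to sketch. This is not the Wikimaps code, just a few lines of Python showing how “who-works-with-whom” links could be built from (editor, article) revision pairs; the editors and articles below are made up:

from collections import defaultdict
from itertools import combinations

# Hypothetical (editor, article) pairs pulled from revision metadata.
revisions = [
    ("alice", "Lady Gaga"), ("alice", "Barack Obama"),
    ("bob", "Barack Obama"), ("bob", "Justin Bieber"),
]

articles_by_editor = defaultdict(set)
for editor, article in revisions:
    articles_by_editor[editor].add(article)

w5_edges = defaultdict(int)
for articles in articles_by_editor.values():
    for a, b in combinations(sorted(articles), 2):
        w5_edges[(a, b)] += 1          # edge weight = number of shared editors

print(dict(w5_edges))
# {('Barack Obama', 'Lady Gaga'): 1, ('Barack Obama', 'Justin Bieber'): 1}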

June 3, 2012

Creating a Semantic Graph from Wikipedia

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.

Abstract:

With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces “property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.
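The “property maps” are easier to picture with a toy example. This is not Ryan’s pipeline (the thesis has the details); it is a rough sketch of the general idea, with spaCy standing in as the dependency parser and assuming its small English model is installed:

import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def property_maps(text):
    """Map each grammatical subject to {verb lemma: object} pairs taken from dependency arcs."""
    maps = defaultdict(dict)
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.dep_ == "nsubj":                  # subject of a verb
                verb = tok.head
                for child in verb.children:
                    if child.dep_ in ("dobj", "attr"):
                        maps[tok.text][verb.lemma_] = child.text
    return dict(maps)

print(property_maps("Napoleon invaded Russia in 1812."))
# e.g. {'Napoleon': {'invade': 'Russia'}}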

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy that has a happy result.

Printing the thesis now for a close read.

(Source: Jack Park)

May 18, 2012

Using BerkeleyDB to Create a Large N-gram Table

Filed under: BerkeleyDB,N-Gram,Natural Language Processing,Wikipedia — Patrick Durusau @ 3:16 pm

Using BerkeleyDB to Create a Large N-gram Table by Richard Marsden.

From the post:

Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory limitations limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.

Large datasets such as the Wikipedia and Gutenberg English language corpora cannot be used to create N-gram frequency tables using the previous script due to the script’s large in-memory requirements. The solution is to create the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal. This is an open source “NoSQL” library which supports a disk based database and in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package. This is included with Python 2.* but is an additional download for Python 3 installations.
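Richard’s post has the full script; the pattern itself fits in a few lines. A minimal sketch, assuming the bsddb3 package is installed and a plain-text corpus file (the file names are placeholders):

import bsddb3                                # third-party wrapper around Oracle BerkeleyDB

def count_ngrams(corpus_path, db_path, n=2):
    """Accumulate n-gram counts in a disk-based B-tree instead of an in-memory dict."""
    table = bsddb3.btopen(db_path, "c")      # "c" = create the database if it is missing
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            tokens = line.lower().split()
            for i in range(len(tokens) - n + 1):
                key = " ".join(tokens[i:i + n]).encode("utf-8")
                try:
                    count = int(table[key])
                except KeyError:
                    count = 0
                table[key] = str(count + 1).encode("utf-8")   # values must be bytes as well
    table.sync()
    table.close()

count_ngrams("wikipedia.txt", "bigrams.db", n=2)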

Richard promises to make the resulting data sets available as an Azure service. Sample code, etc., will be posted to his blog.

Another Wikipedia-based analysis.

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Filed under: Concept Detection,Dictionary,Entities,Wikipedia,Word Meaning — Patrick Durusau @ 2:12 pm

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?
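If you want to poke at the data, a string-to-concept lookup table is the obvious first step. The column layout below (string, score, concept) is an assumption for illustration only; check the README that accompanies the release for the real format:

from collections import defaultdict

def load_dictionary(path, min_score=0.01):
    """Build a string -> [(score, concept), ...] lookup from a tab-separated dump."""
    lookup = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            string, score, concept = line.rstrip("\n").split("\t")[:3]
            if float(score) >= min_score:
                lookup[string.lower()].append((float(score), concept))
    return lookup

lookup = load_dictionary("dictionary.tsv")       # placeholder file name
for score, concept in sorted(lookup.get("jaguar", []), reverse=True):
    print(score, concept)                        # car maker? cat? operating system?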

April 7, 2012

Explore Geographic Coverage in Mapping Wikipedia

Filed under: Mapping,Maps,Ontopia,Wikipedia — Patrick Durusau @ 7:42 pm

Explore Geographic Coverage in Mapping Wikipedia

From the post:

TraceMedia, in collaboration with the Oxford Internet Institute, maps language use across Wikipedia in an interactive, fittingly named Mapping Wikipedia.

Simply select a language, a region, and the metric that you want to map, such as word count, number of authors, or the languages themselves, and you’ve got a view into “local knowledge production and representation” on the encyclopedia. Each dot represents an article with a link to the Wikipedia article. For the number of dots on the map, a maximum of 800,000, it works surprisingly without a hitch, other than the time it initially takes to load articles.

You need to follow the link to: Who represents the Arab world online? Mapping and measuring local knowledge production and representation in the Middle East and North Africa. The researchers are concerned with fairness and balance of coverage of the Arab world.

Rather than focusing on Wikipedia, an omnipresent resource on the WWW, I would rather have a mapping of who originates news feeds more generally, instead of focusing on who is absent. Moreover, I would ask why the Arab OPEC members have not been more effective at restoring balance in the news media.

March 12, 2012

Cross Domain Search by Exploiting Wikipedia

Filed under: Linked Data,Searching,Wikipedia — Patrick Durusau @ 8:04 pm

Cross Domain Search by Exploiting Wikipedia by Chen Liu, Sai Wu, Shouxu Jiang, and Anthony K. H. Tung.

Abstract:

The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross domain resource search requirement, in which a query is a resource in one modality and the results are closely related resources in other modalities. With cross domain search, we can better exploit existing resources.

Intuitively, tags associated with Web 2.0 resources are a straightforward medium to link resources with different modalities together. However, tagging is by nature an ad hoc activity. Tags often contain noise and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.

When the authors say “cross domain,” they are referring to different types of resources, say text vs. images or images vs. sound or any of those three vs. some other type of resource. One search can return “related” resources of different resource types.

Although the “cross domain” searching is interesting, I am more interested in the mapping that was performed on Wikipedia. The authors define three semantic relationships:

  • Link between Tag and Concept
  • Correlation of Concepts
  • Semantic Distance

It seems to me that the authors are attacking “big data,” which has unbounded semantics, from the “other” end. That is, they are mapping a finite universe of semantics (Wikipedia) and then using that finite mapping to mine a much larger, unbounded semantic universe.

Or perhaps creating a semantic lens through which to view “related resources” in a much larger semantic universe. And without the overhead of Linked Data, which is mentioned under other work.
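The paper defines its own measures for those three relationships; as a rough illustration of “correlation of concepts,” here is a Milne–Witten style relatedness score computed from shared incoming Wikipedia links (not the authors’ formula, and the link sets below are toy data):

import math

def relatedness(inlinks_a, inlinks_b, total_articles):
    """Closer to 1.0 when two concepts share many incoming Wikipedia links."""
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not shared:
        return 0.0
    numerator = math.log(max(len(a), len(b))) - math.log(len(shared))
    denominator = math.log(total_articles) - math.log(min(len(a), len(b)))
    return max(0.0, 1.0 - numerator / denominator)

print(relatedness({"p1", "p2", "p3"}, {"p2", "p3", "p4"}, total_articles=4_000_000))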

March 10, 2012

Exploring Wikipedia with Gremlin Graph Traversals

Filed under: Gremlin,Neo4j,Wikipedia — Patrick Durusau @ 8:21 pm

Exploring Wikipedia with Gremlin Graph Traversals by Marko Rodriguez.

From the post:

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

If you aren’t interested in graph representations of data before reading this post, it is likely you will be afterwards.

Take a few minutes to read it and then let me know what you think.
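Marko works in Gremlin against a graph database; if you just want to see the shape of his “middle ground” representation, it can be mocked up in a few lines of Python with networkx (the article and category names are toy examples, not his data set):

import networkx as nx

g = nx.MultiDiGraph()
g.add_node("Graph theory", type="article")
g.add_node("Leonhard Euler", type="article")
g.add_node("Category:Mathematics", type="category")
g.add_edge("Graph theory", "Leonhard Euler", label="href")             # hyperlink edge
g.add_edge("Graph theory", "Category:Mathematics", label="category")   # taxonomy edge

# "Which categories does this article point to?"
for _, target, data in g.out_edges("Graph theory", data=True):
    if data["label"] == "category":
        print(target)                                                   # Category:Mathematics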

March 1, 2012

Is Wikipedia Going To Explode?

Filed under: Combinatorics,Wikipedia — Patrick Durusau @ 9:10 pm

I ran across a problem in Wikipedia that may mean it is about to explode. You decide.

You have heard about the danger of “combinatorial explosions” if we have more than one identifier. Every identifier has to be mapped to every other identifier.

Imagine that a – j represent different identifiers for the same subject.

This graphic represents a “small” combinatorial explosion.

(graphic: combinatorial explosion among identifiers a – j)

If that looks hard to read, here is a larger version:

(graphic: larger version of the combinatorial explosion)

Is that better? 😉

Here is where I noticed the problem: the Wikipedia XML file has synonyms for the entries.

The article on anarchism has one hundred and one other names:

  1. af:Anargisme
  2. als:Anarchismus
  3. ar:لاسلطوية
  4. an:Anarquismo
  5. ast:Anarquismu
  6. az:Anarxizm
  7. bn:নৈরাজ্যবাদ
  8. zh-min-nan:Hui-thóng-tī-chú-gī
  9. be:Анархізм
  10. be-x-old:Анархізм
  11. bo:གཞུང་མེད་ལམ་སྲོལ།
  12. bs:Anarhizam
  13. br:Anveliouriezh
  14. bg:Анархизъм
  15. ca:Anarquisme
  16. cs:Anarchismus
  17. cy:Anarchiaeth
  18. da:Anarkisme
  19. pdc:Anarchism
  20. de:Anarchismus
  21. et:Anarhism
  22. el:Αναρχισμός
  23. es:Anarquismo
  24. eo:Anarkiismo
  25. eu:Anarkismo
  26. fa:آنارشیسم
  27. hif:Khalbali
  28. fo:Anarkisma
  29. fr:Anarchisme
  30. fy:Anargisme
  31. ga:Ainrialachas
  32. gd:Ain-Riaghailteachd
  33. gl:Anarquismo
  34. ko:아나키즘
  35. hi:अराजकता
  36. hr:Anarhizam
  37. id:Anarkisme
  38. ia:Anarchismo
  39. is:Stjórnleysisstefna
  40. it:Anarchismo
  41. he:אנרכיזם
  42. jv:Anarkisme
  43. kn:ಅರಾಜಕತಾವಾದ
  44. ka:ანარქიზმი
  45. kk:Анархизм
  46. sw:Utawala huria
  47. lad:Anarkizmo
  48. krc:Анархизм
  49. la:Anarchismus
  50. lv:Anarhisms
  51. lb:Anarchismus
  52. lt:Anarchizmas
  53. jbo:nonje’asi’o
  54. hu:Anarchizmus
  55. mk:Анархизам
  56. ml:അരാജകത്വവാദം
  57. mr:अराजकता
  58. arz:اناركيه
  59. ms:Anarkisme
  60. mwl:Anarquismo
  61. mn:Анархизм
  62. nl:Anarchisme
  63. ja:アナキズム
  64. no:Anarkisme
  65. nn:Anarkisme
  66. oc:Anarquisme
  67. pnb:انارکی
  68. ps:انارشيزم
  69. pl:Anarchizm
  70. pt:Anarquismo
  71. ro:Anarhism
  72. rue:Анархізм
  73. ru:Анархизм
  74. sah:Анархизм
  75. sco:Anarchism
  76. simple:Anarchism
  77. sk:Anarchizmus
  78. sl:Anarhizem
  79. ckb:ئانارکیزم
  80. sr:Анархизам
  81. sh:Anarhizam
  82. fi:Anarkismi
  83. sv:Anarkism
  84. tl:Anarkismo
  85. ta:அரசின்மை
  86. th:อนาธิปไตย
  87. tg:Анархизм
  88. tr:Anarşizm
  89. uk:Анархізм
  90. ur:فوضیت
  91. ug:ئانارخىزم
  92. za:Fouzcwngfujcujyi
  93. vec:Anarchismo
  94. vi:Chủ nghĩa vô chính phủ
  95. fiu-vro:Anarkism
  96. war:Anarkismo
  97. yi:אנארכיזם
  98. zh-yue:無政府主義
  99. diq:Anarşizm
  100. bat-smg:Anarkėzmos
  101. zh:无政府主义

Now you can imagine the “combinatorial explosion” that awaits the entry on anarchism in Wikipedia, with one hundred and two names (102, counting English) compared to my ten identifiers.

Except that Wikipedia leaves the relationships between all these identifiers for anarchism unspecified.

You can call them into existence, one to the other, as needed, but then you assume the burden of processing them. All the identifiers remain available to other users for their purposes as well.

Hmmm, with the language prefixes mapping to scopes, this looks like a good source for names and variant names for topics in a topic map.

What do you think?
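To put a number on the explosion, and to see what the scoping idea looks like, a short sketch (the list is truncated to four of the 101 entries above):

from math import comb          # Python 3.8+

raw = ["af:Anargisme", "de:Anarchismus", "ja:アナキズム", "zh:无政府主义"]   # ... and 97 more

scoped_names = [tuple(entry.split(":", 1)) for entry in raw]    # (scope, name) pairs
print(scoped_names[0])      # ('af', 'Anargisme')

print(comb(102, 2))         # 5151 pairwise mappings if every identifier must map to every other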


According to my software, this is post #4,000. Looking for ways to better deliver information about topic maps and their construction. Suggestions (not to mention support) welcome!

February 21, 2012

Making sense of Wikipedia categories

Filed under: Annotation,Classification,Wikipedia — Patrick Durusau @ 8:00 pm

Making sense of Wikipedia categories

Hal Daume III writes:

Wikipedia’s category hierarchy forms a graph. It’s definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that “Chicago Stags coaches” are a subcategory of “Natural sciences”? If you don’t believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

(subcategories omitted)

I guess it kind of makes sense. There are some other fun ones, like “Rhaeto-Romance languages”, “American World War I flying aces” and “1911 films”. Of course, these are all quite deep in the “hierarchy” (all of those are at depth 15 or higher).

Hal examines several strategies and concludes by asking:

Has anyone else tried and succeeded at using the Wikipedia category structure?

Some other questions:

Is Hal right that hand annotation doesn’t “scale?”

I have heard that more times than I can count but never seen any studies cited to support it.

After all, Wikipedia was manually edited and produced. Yes? No automated process created its content. So, what is the barrier to hand annotation?

If you think about it, the same could be said about email, but most email (yes?) is written by hand, not produced by automated processes (well, except for spam). So why can’t it be hand annotated? Or at least, why can’t we capture the semantics of email at the point of composition and annotate it there by automated means?

Hand annotation may not scale for sensor data or financial data streams, but is hand annotation needed for such sources?

Hand annotation may not scale for, say, Twitter posts by non-English speakers, but only for agencies with very short-sighted, if not actively bigoted, hiring/contracting practices.

Has anyone loaded the Wikipedia categories into a graph database? What sort of interface would you suggest for trial arrangement of the categories?
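As a partial answer to the first question: the category graph loads easily into any graph library, never mind a full graph database. A small sketch with networkx, using the cycle Hal mentions (the “Category:Life” edge is made up for illustration):

import networkx as nx

edges = [                                            # (subcategory, parent category) pairs
    ("Category:Ethology", "Category:Behavior"),
    ("Category:Behavior", "Category:Ethology"),      # the cycle Hal mentions
    ("Category:Behavior", "Category:Life"),          # illustrative only
]

g = nx.DiGraph(edges)
print(list(nx.simple_cycles(g)))    # e.g. [['Category:Ethology', 'Category:Behavior']]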

PS: If you are interested in discussing how to establish assisted annotation for Twitter, email or other data streams, with or without user awareness, send me a note.

January 23, 2012

Semantic Web – Sweet Spot(s) and ‘Gold Standards’

Filed under: OWL,RDF,UMBEL,Wikipedia,WordNet — Patrick Durusau @ 7:43 pm

Mike Bergman posted a two-part series on how to make the Semantic Web work:

Seeking a Semantic Web Sweet Spot

In Search of ‘Gold Standards’ for the Semantic Web

Both are worth your time to read, but the second sets the bar for “Gold Standards” for the Semantic Web as:

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

There is one criterion that Mike leaves out: choice by a majority of users.

Use by a majority of users is a sweet spot that brooks no argument.
