Archive for the ‘Wikipedia’ Category

Similar Pages for Wikipedia – Lateral – Indexing Practices

Saturday, April 23rd, 2016

Similar Pages for Wikipedia (Chrome extension)

I started looking at this software with a mis-impression that I hope you can avoid.

I installed the extension and as advertised, if I am on a Wikipedia page, it recommends “similar” Wikipedia pages.

Unless I’m billing time, plowing through page after page of tangentially related material isn’t my idea of a good time.

Ah, but I confused “document” with “page.”

I discovered that error while reading Adding Documents at Lateral, which gives the following example:


Ah! So “document” means as much or as little text as I choose to use when I add the document.

Which means if I were creating a document store of graph papers, I would capture only the new material and not the inevitable a “graph consists of nodes and edges….”

There are pre-populatd data sets, News 350,000+ news and blog articles, updated every 15 mins; arXiv 1M+ papers (all), updated daily; PubMed 6M+ medical journals from before July 2014; SEC 6,000+ yearly financial reports / 10-K filings from 2014; Wikipedia 463,000 pages which had 20+ page views in 2013.

I suspect the granularity on the pre-populated data sets is “document” in the usual sense size.

Glad to see the option to define a “document” to be an arbitrary span of text.

I don’t need to find more “documents” (in the usual sense) but more relevant snippets that are directly on point.

Hmmm, perhaps indexing at the level of paragraphs instead of documents (usual sense)?

Which makes me wonder why we index at the level of documents (usual sense) anyway? Is it simply tradition from when indexes were prepared by human indexers? And indexes were limited by physical constraints?

Most misinformation inserted into Wikipedia may persist [Read Responsibly]

Tuesday, April 14th, 2015

Experiment concludes: Most misinformation inserted into Wikipedia may persist by Gregory Kohs.

A months-long experiment to deliberately insert misinformation into thirty different Wikipedia articles has been brought to an end, and the results may surprise you. In 63% of cases, the phony information persisted not for minutes or hours, but for weeks and months. Have you ever heard of Ecuadorian students dressed in formal three-piece suits, leading hiking tours of the Galapagos Islands? Did you know that during the testing of one of the first machines to make paper bags, two thumbs and a toe were lost to the cutting blade? And would it surprise you to learn that pain from inflammation is caused by the human body’s release of rhyolite, an igneous, volcanic rock?

None of these are true, but Wikipedia has been presenting these “facts” as truth now for more than six weeks. And the misinformation isn’t buried on seldom-viewed pages, either. Those three howlers alone have been viewed by over 125,000 Wikipedia readers thus far.

The second craziest thing of all may be that when I sought to roll back the damage I had caused Wikipedia, after fixing eight of the thirty articles, my User account was blocked by a site administrator. The most bizarre thing is what happened next: another editor set himself to work restoring the falsehoods, following the theory that a blocked editor’s edits must be reverted on sight.

Alex Brown tweeted this story along with the comment:

Wikipedia’s purported “self-correcting” prowess is more myth than reality

True, but not to pick on Wikipedia, the same is true for the benefits of peer review in general. A cursory survey of the posts at Retraction Watch will leave you wondering what peer reviewers are doing because it certainly isn’t reading assigned papers. At least not closely.

For historical references on peer review, see: Three myths about scientific peer review by Michael Nielsen.

Peer review is also used in grant processes, prompting the Wall Street Journal to call for lotteries to award NIH grants.

There are literally hundreds of other sources and accounts that demonstrate whatever functions peer review may have, quality assurance isn’t one of them. I suspect “gate keeping,” by academics who are only “gate keepers,” is its primary function.

The common thread running through all of these accounts is that you and only you can choose to read responsibly.

As a reader: Read critically! Do the statements in an article, post, etc., fit with what you know about the subject? Or with general experience? What sources did the author cite? Simply citing Pompous Work I does not mean Pompous Work I said anything about the subject. Check the citations by reading the citations. (You will be very surprised in some cases.) After doing your homework, if you still have doubts, such as with reported experiments, contact the author and explain what you have done thus far and your questions (nicely).

Even agreement between Pompous Work I and the author doesn’t mean you don’t have a good question. Pompous works are corrected year in and year out.

As an author: Do not cite papers you have not read. Do not cite papers because another author said a paper said. Verify your citations do exist and that they in fact support your claims. Post all of your data publicly. (No caveats, claims without supporting evidence are simply noise.)

Crawling the WWW – A $64 Question

Saturday, January 24th, 2015

Have you ever wanted to crawl the WWW? To make a really comprehensive search? Waiting for a private power facility and server farm? You need wait no longer!

Ross Fairbanks details in WikiReverse data pipeline details the creation of Wikireverse:

WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites.

You can browse the data at WikiReverse or downloaded from S3 as a torrent.

The first thought that struck me was the data set would be useful for deciding which Wikipedia links are the default subject identifiers for particular subjects.

My second thought was what a wonderful starting place to find links with similar content strings, for the creation of topics with multiple subject identifiers.

My third thought was, $64 to search a CommonCrawl data set!

You can do a lot of searches at $64 per before you get to the cost of a server farm, much less a server farm plus a private power facility.

True, it won’t be interactive but then few searches at the NSA are probably interactive. 😉

The true upside being you are freed from the tyranny of page-rank and hidden algorithms by which vendors attempt to guess what is best for them and secondarily, what is best for you.

Take the time to work through Ross’ post and develop your skills with the CommonCrawl data.

Wikipedia in Python, Gephi, and Neo4j

Thursday, January 8th, 2015

Wikipedia in Python, Gephi, and Neo4j: Vizualizing relationships in Wikipedia by Matt Krzus.

From the introduction:


We have had a bit of a stretch here where we used Wikipedia for a good number of things. From Doc2Vec to experimenting with word2vec layers in deep RNNs, here are a few of those cool visualization tools we’ve used along the way.

Cool things you will find in this post:

  • Building relationship links between Categories and Subcategories
  • Visualization with Networkx (think Betweenness Centrality and PageRank)
  • Neo4j and Cypher (the author thinks avoiding the Giraph learning curve is a plus, I leave that for you to decide)
  • Visualization with Gephi


Extracting SVO Triples from Wikipedia

Saturday, November 1st, 2014

Extracting SVO Triples from Wikipedia by Sujit Pal.

From the post:

I recently came across this discussion (login required) on LinkedIn about extracting (subject, verb, object) (SVO) triples from text. Jack Park, owner of the SolrSherlock project, suggested using ReVerb to do this. I remembered an entertaining Programming Assignment from when I did the Natural Language Processing Course on Coursera, that involved finding spouse names from a small subset of Wikipedia, so I figured I it would be interesting to try using ReVerb against this data.

This post describes that work. As before, given the difference between this and the “preferred” approach that the automatic grader expects, results are likely to be wildly off the mark. BTW, I highly recommend taking the course if you haven’t already, there are lots of great ideas in there. One of the ideas deals with generating “raw” triples, then filtering them using known (subject, object) pairs to find candidate verbs, then turning around and using the verbs to find unknown (subject, object) pairs.

So in order to find the known (subject, object) pairs, I decided to parse the Infobox content (the “semi-structured” part of Wikipedia pages). Wikipedia markup is a mini programming language in itself, so I went looking for some pointers on how to parse it (third party parsers or just ideas) on StackOverflow. Someone suggested using DBPedia instead, since they have already done the Infobox extraction for you. I tried both, and somewhat surprisingly, manually parsing Infobox gave me better results in some cases, so I describe both approaches below.

As Sujit points out, you will want to go beyond Wikipedia with this technique but it is a good place to start!

If somebody does leak the Senate Report on CIA Torture, that would be a great text (hopefully the full version) to mine with such techniques.

Remembering that anonymity = no accountability.

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are usually based on only one language, usually english. However, it would be better to be able to process document in any language. For example, a case in a recommender system would be a user that speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the french term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would also be able to merge easily into the same keyword to build a language-independent user profile that will be used for accurate French and English recommendations.

This articles shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.

The NSA can hire translators so I would not bother sharing this technique for harnessing the thousands of expert hours in Wikipedia with them.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.


Tuesday, August 19th, 2014


From the webpage:

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

I haven’t tried this, yet, but utilities for major data sources are always welcome!

Wikipedia Usage Statistics

Sunday, June 22nd, 2014

Wikipedia Usage Statistics by Paul Houle.

From the post:

The Wikimedia Foundation publishes page view statistics for Wikimedia projects here; this serveris rate-limited so it took roughly a month to transfer this 4 TB data set into S3 Storage in the AWS cloud. The photo on the left is of a hard drive containing a copy of the data that was produced with AWS Import/Export.

Once in S3, it is easy to process this data with Amazon Map/Reduce using the Open Source telepath software.

The first product developed from this is SubjectiveEye3D.

It’s your turn

Future projects require that this data be integrated with semantic data from :BaseKB and that has me working on tools such as RDFeasy. In the meantime, a mirror of the Wikipedia pagecounts from Jan 2008 to Feb 2014 is available in a requester pays bucket in S3 , which means you can use it in the Amazon Cloud for free and download data elsewhere for the cost of bulk network transfer.

Interesting isn’t it?

That “open” data can be so difficult to obtain and manipulate that it may as well not be “open” at all for the average user.

Something to keep in mind when big players talk about privacy. Do they mean private from their prying eyes or yours?

I think you will find in most cases that “privacy” means private from you and not the big players.

If you want to do a good deed for this week, support this data set at Gittip.

I first saw this in a tweet by Gregory Piatetsky.

Evaluating Entity Linking with Wikipedia

Monday, April 28th, 2014

Evaluating Entity Linking with Wikipedia by Ben Hachey, et al.


Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets.

We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.

A very deep survey of entity linking literature (including record linkage) and implementation of three complete entity linking systems for comparison.

At forty-eight (48) pages it isn’t a quick read but should be your starting point for pushing the boundaries on entity linking research.

I first saw this in a tweet by Alyona Medelyan.

Wikidata: A Free Collaborative Knowledge Base

Thursday, March 20th, 2014

Wikidata: A Free Collaborative Knowledge Base by Denny Vrandečić and Markus Krötzsch.


Unnoticed by most of its readers, Wikipedia is currently undergoing dramatic changes, as its sister project Wikidata introduces a new multilingual ‘Wikipedia for data’ to manage the factual information of the popular online encyclopedia. With Wikipedia’s data becoming cleaned and integrated in a single location, opportunities arise for many new applications.

In this article, we provide an extended overview of Wikidata, including its essential design choices and data model. Based on up-to-date statistics, we discuss the project’s development so far and outline interesting application areas for this new resource.

Denny Vrandečić, Markus Krötzsch. Wikidata: A Free Collaborative Knowledge Base. In Communications of the ACM (to appear). ACM 2014.

If you aren’t already impressed by Wikidata, this article should be the cure!

Wikibase DataModel released!

Friday, January 3rd, 2014

Wikibase DataModel released! by Jeroen De Dauw.

From the post:

I’m happy to announce the 0.6 release of Wikibase DataModel. This is the first real release of this component.


Wikibase is the software behind At its core, this software is about describing entities. Entities are collections of claims, which can have qualifiers, references and values of various different types. How this all fits together is described in the DataModel document written by Markus and Denny at the start of the project. The Wikibase DataModel component contains (PHP) domain objects representing entities and their various parts, as well as associated domain logic.

I wanted to draw your attention to this discussion of “items:”

Items are Entities that are typically represented by a Wikipage (at least in some Wikipedia languages). They can be viewed as “the thing that a Wikipage is about,” which could be an individual thing (the person Albert Einstein), a general class of things (the class of all Physicists), and any other concept that is the subject of some Wikipedia page (including things like History of Berlin).

The IRI of an Item will typically be closely related to the URL of its page on Wikidata. It is expected that Items store a shorter ID string (for example, as a title string in MediaWiki) that is used in both cases. ID strings might have a standardized technical format such as “wd1234567890” and will usually not be seen by users. The ID of an Item should be stable and not change after it has been created.

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

However, it is intended that the information stored in Wikidata is generally about the topic of the Item. For example, the Item for History of Berlin should store data about this history (if there is any such data), not about Berlin (the city). It is not intended that data about one subject is distributed across multiple Wikidata Items: each Item fully represents one thing. This also helps for data integration across languages: many languages have no separate article about Berlin’s history, but most have an article about Berlin.

What do you make of the claim:

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

I may write an information system that fails to distinguish between a species of whales, a class of whales and a particular whale, but that is a design choice, not a foregone conclusion.

In the case of Wikipedia, which relies upon individuals repeating the task of extracting relevant information from loosely gathered data, that approach words quite well.

But there isn’t one degree of precision of identification that works for all cases.

My suspicion is that for more demanding search applications, such as drug interactions, less precise identifications could lead to unfortunate, even fatal, results.


DBpedia 3.9 released…

Monday, September 23rd, 2013

DBpedia 3.9 released, including wider infobox coverage, additional type statements, and new YAGO and Wikidata links by Christopher Sahnwaldt.

From the post:

we are happy to announce the release of DBpedia 3.9.

The most important improvements of the new release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent Ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The main changes between DBpedia 3.8 and 3.9 are described below. For additional, more detailed information please refer to the Change Log.

Almost like an early holiday present isn’t it? 😉

I continue to puzzle over the notion of “extraction.”

Not that I have an alternative but extracting data only kicks the data can one step down the road.

When someone wants to use my extracted data, they are going to extract data from my extraction. And so on.

That seems incredibly wasteful and error-prone.

Enough money is spend doing the ETL shuffle every year that research on ETL avoidance should be a viable proposition.

Mapping Wikipedia – The Schema

Friday, July 19th, 2013

I haven’t finished documenting the issues I encountered with SQLFairy in parsing the MediaWiki schema but I was able to create a png diagram of the schema.


Should be easier than reading the schema but otherwise I’m not all that impressed.

Some modeling issues to note up front.

SQL Identifiers:

The INT datatype in MySQL is defined as:

Type Storage Minimum Value Maximum Value
INT 4 -2147483648 2147483647

Whereas, the XML syntax for topic maps defines the item identifiers datatype as xsd:ID.

XSD:ID is defined as:

The type xsd:ID is used for an attribute that uniquely identifies an element in an XML document. An xsd:ID value must be an xsd:NCName. This means that it must start with a letter or underscore, and can only contain letters, digits, underscores, hyphens, and periods.

Opps! “[M]ust start with a letter or underscore….”

That leaves out all the INT type IDs that you find in SQL databases.

And it rules out all identifiers that don’t start with a letter or underscore.

One good reason to have an alternative (to XML) syntax for topic maps. The name limitation arose more than twenty years ago and should not trouble us now.

SQL Tables/Rows

Wikipedia summarizes relational database tables in part as:

A relation is defined as a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints….

As the articles notes: “A tuple usually represents an object and information about that object.” (Read subject for object.)

Converting a database to a topic map begins with deciding what subject every row of each table represents. And recording what information has been captured for each subject.

As you work through the MediaWiki tables, ask yourself what information about a subject must be matched for it to be the same subject?


From Wikipedia:

Normalization was first proposed by Codd as an integral part of the relational model. It encompasses a set of procedures designed to eliminate nonsimple domains (non-atomic values) and the redundancy (duplication) of data, which in turn prevents data manipulation anomalies and loss of data integrity. The most common forms of normalization applied to databases are called the normal forms.

True enough but the article glosses over the shortfall of then current databases to handle “non-atomic values” and to lack the performance to tolerate duplication of data.

I say “…then current databases…” but the limitations of “non-atomic values” and non-duplication of data persist to this day. Normalization, an activity by the user, is meant to compensate for poor hardware/software performance.

From a topic map perspective, normalization means you will find data about a subject in more than one table.

Next Week

I will start with the “user” table in MediaWiki-1.21.1-tables.sql next Monday.

Question: Which other tables, if any, should we look at while modeling the subject from rows in the user table?

Mapping Wikipedia – Update

Thursday, July 18th, 2013

I have spent a good portion of today trying to create an image of the MediaWiki table structure.

While I think the SQLFairy (aka SQL Translator) is going to work quite well, it has rather cryptic error messages.

For instance, if the SQL syntax isn’t supported by its internal parser, the error message references the start of the table.

Which means, of course, that you have to compare statements in the table to the subset of SQL that is supported.

I am rapidly losing my SQL parsing skills as the night wears on so I am stopping with a little over 50% of the MediaWiki schema parsing.

Hopefully will finish correcting the SQL file tomorrow and will post the image of the MediaWiki schema.

Plus notes on what I found to not be recognized in SQLFairy to ease your use of it on other SQL schemas.

Mapping Wikipedia

Wednesday, July 17th, 2013

Carl Lemp, commented in the XTM group at LinkedIn, potential redesign of topic maps discussion:

2. There are only a few tools to help build a Topic Map.
3. There is almost nothing to help translate familiar information structures to Topic Map structures.
Getting through 2 and 3 is a bitch.

I can’t help with #2 but I may be able to help with #3.

I suggest mapping the MediaWiki structure that is used for Wikipedia into a topic map.

As a demonstration it has the following advantages:

  1. Conversion from SQL dump to topic map scripts.
  2. Large enough to test alternative semantics.
  3. Sub-sets of Wikipedia good for starter maps.
  4. Useful to merge with other data sets.
  5. Well known data set.
  6. Widespread data format (SQL).

The MediaWiki schema MediaWiki-1.21.1-tables.sql.

The base output format will be CTM.

When we want to test alternative semantics, I suggest that we use “.” followed by “0tm” (zero followed by “tm”) as the file extension. Comments at the head of the file should reference or document the semantics to be applied in processing the file.

In terms of sigla for annotating the SQL, are there any strong feelings against? (Drawn from the TMDM vocabulary section):

A association representation of a relationship between one or more subjects
Ar association role representation of the involvement of a subject in a relationship represented by an association
Art association role type subject describing the nature of the participation of an association role player in an association
At association type subject describing the nature of the relationship represented by associations of that type
Ir information resource a representation of a resource as a sequence of bytes; it could thus potentially be retrieved over a network
Ii item identifier locator assigned to an information item in order to allow it to be referred to
O occurrence representation of a relationship between a subject and an information resource
Ot occurrence type subject describing the nature of the relationship between the subjects and information resources linked by the occurrences of that type
S scope context within which a statement is valid
Si subject identifier locator that refers to a subject indicator
Sl subject locator locator that refers to the information resource that is the subject of a topic
T topic symbol used within a topic map to represent one, and only one, subject, in order to allow statements to be made about the subject
Tn topic name name for a topic, consisting of the base form, known as the base name, and variants of that base form, known as variant names
Tnt topic name type subject describing the nature of the topic names of that type
Tf topic type subject that captures some commonality in a set of subjects
Vn variant name alternative form of a topic name that may be more suitable in a certain context than the corresponding base name

The first step I would suggest is creating a visualization of the MediaWiki schema.

We will still have to iterate over the tables but getting an over all view of the schema will be helpful.

Suggestions on your favorite schema visualization tool?

Distributing the Edit History of Wikipedia Infoboxes

Thursday, May 30th, 2013

Distributing the Edit History of Wikipedia Infoboxes by Enrique Alfonseca.

From the post:

Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.


For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.

How much data do you need beyond the infoboxes of Wikipedia?

And knowing what values were in the past … isn’t that like knowing prior identifiers for subjects?

Automatically Acquiring Synonym Knowledge from Wikipedia

Monday, May 27th, 2013

Automatically Acquiring Synonym Knowledge from Wikipedia by Koji Sekiguchi.

From the post:

Synonym search sure is convenient. However, in order for an administrator to allow users to use these convenient search functions, he or she has to provide them with a synonym dictionary (CSV file) described above. New words are created every day and so are new synonyms. A synonym dictionary might have been prepared by a person in charge with huge effort but sometimes will be left unmaintained as time goes by or his/her position is taken over.

That is a reason people start longing for an automatic creation of synonym dictionary. That request has driven me to write the system I will explain below. This system learns synonym knowledge from “dictionary corpus” and outputs “original word – synonym” combinations of high similarity to a CSV file, which in turn can be applied to the SynonymFilter of Lucene/Solr as is.

This “dictionary corpus” is a corpus that contains entries consisting of “keywords” and their “descriptions”. An electronic dictionary exactly is a dictionary corpus and so is Wikipedia, which you are familiar with and is easily accessible.

Let’s look at a method to use the Japanese version of Wikipedia to automatically get synonym knowledge.

Complex representation of synonyms, which includes domain or scope would be more robust.

On the other hand, some automatic generation of synonyms is better than no synonyms at all.

Take this as a good place to start but not as a destination for synonym generation.

The Wikidata revolution is here:…

Friday, April 26th, 2013

The Wikidata revolution is here: enabling structured data on Wikipedia by Tilman Bayer.

From the post:

A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions,” said Wikimedia Foundation Executive Director Sue Gardner. “Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, can automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.”

This is a great source of curated data!

TSDW:… [Enterprise Disambiguation]

Monday, April 22nd, 2013

TSDW: Two-stage word sense disambiguation using Wikipedia by Chenliang Li, Aixin Sun, Anwitaman Datta. (Li, C., Sun, A. and Datta, A. (2013), TSDW: Two-stage word sense disambiguation using Wikipedia. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22829)


The semantic knowledge of Wikipedia has proved to be useful for many tasks, for example, named entity disambiguation. Among these applications, the task of identifying the word sense based on Wikipedia is a crucial component because the output of this component is often used in subsequent tasks. In this article, we present a two-stage framework (called TSDW) for word sense disambiguation using knowledge latent in Wikipedia. The disambiguation of a given phrase is applied through a two-stage disambiguation process: (a) The first-stage disambiguation explores the contextual semantic information, where the noisy information is pruned for better effectiveness and efficiency; and (b) the second-stage disambiguation explores the disambiguated phrases of high confidence from the first stage to achieve better redisambiguation decisions for the phrases that are difficult to disambiguate in the first stage. Moreover, existing studies have addressed the disambiguation problem for English text only. Considering the popular usage of Wikipedia in different languages, we study the performance of TSDW and the existing state-of-the-art approaches over both English and Traditional Chinese articles. The experimental results show that TSDW generalizes well to different semantic relatedness measures and text in different languages. More important, TSDW significantly outperforms the state-of-the-art approaches with both better effectiveness and efficiency.

TSDW works because Wikipedia is a source of unambiguous phrases, that can also be used to disambiguate phrases that one first pass are not unambiguous.

But Wikipedia did not always exist and was built out of the collaboration of thousands of users over time.

Does that offer a clue as to building better search tools for enterprise data?

What if statistically improbable phrases are mined from new enterprise documents and links created to definitions for those phrases?

Thinking picking a current starting point avoids a “…boil the ocean…” scenario before benefits can be shown.

Current content is also more likely to be a search target.

Domain expertise and literacy required.

Expertise in logic or ontologies not.

WikiSynonyms: Find synonyms using Wikipedia redirects

Tuesday, February 26th, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects by Panos Ipeirotis.

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaced for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam’s idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

This rocks!

Talk about an easy path to populating variant names for a topic map!

Complete with examples, code, suggestions on hacking Wikipedia data sets (downloaded).

Wikipedia and Legislative Data Workshop

Tuesday, February 26th, 2013

Wikipedia and Legislative Data Workshop

From the post:

Interested in the bills making their way through Congress?

Think they should be covered well in Wikipedia?

Well, let’s do something about it!

On Thursday and Friday, March 14th and 15th, we are hosting a conference here at the Cato Institute to explore ways of using legislative data to enhance Wikipedia.

Our project to produce enhanced XML markup of federal legislation is well under way, and we hope to use this data to make more information available to the public about how bills affect existing law, federal agencies, and spending, for example.

What better way to spread knowledge about federal public policy than by supporting the growth of Wikipedia content?

Thursday’s session is for all comers. Starting at 2:30 p.m., we will familiarize ourselves with Wikipedia editing and policy, and at 5:30 p.m. we’ll have a Sunshine Week reception. (You don’t need to attend in the afternoon to come to the reception. Register now!)

On Friday, we’ll convene experts in government transparency, in Wikipedia editorial processes and decisions, and in MediaWiki technology to think things through and plot a course.

I remain unconvinced about greater transparency into the “apparent” legislative process.

On the other hand, it may provide the “hook” or binding point to make who wins and who loses more evident.

If the Cato representatives mention their ideals being founded in the 18th century, you might want to remember that infant mortality was greater than 40% in foundling hospitals of the time.

People who speak glowingly of the 18th century didn’t live in the 18th century. And imagine themselves as landed gentry of the time.

I first saw this at the Legal Informatics Blog.

Strong components of the Wikipedia graph

Friday, January 18th, 2013

Strong components of the Wikipedia graph

From the post:

I recently covered strong connectivity analysis in my graph algorithms class, so I’ve been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here’s one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I’m not quite sure how the three universities got in there. Maybe from their famous faculty members?

One of responses to this post suggest grabbing the entire Wikipedia dataset for purposes of trying out algorithms.

A good suggestion for algorithms, perhaps even algorithms meant to reduce visual clutter, but at what point does a graph become too “busy” for visual analysis?

Recalling the research that claims people can only remember seven or so things at one time.

Wikipedia:Database download

Tuesday, November 20th, 2012

Wikipedia:Database download

From the webpage:

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

I know you are already aware of this as a data source but every time I want to confirm something about it, I have a devil of a time finding it at Wikipedia.

If I remember that I wrote about it here, perhaps it will be easier to find. 😉

What I need to do is get one of those multi-terabyte network appliances for Christmas. Then copy large data sets that I don’t need updated as often as I need to consult their structures. (Like the next one I am about to mention.)

Mining a multilingual association dictionary from Wikipedia…

Saturday, November 17th, 2012

Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval by Zheng Ye, Jimmy Xiangji Huang, Ben He, Hongfei Lin.


Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality.

Is there a lesson here about using Wikipedia as a starter set of topics across languages?

Not the final product but a starting place other than ground zero for creation of a multi-lingual topic map.

Parsing Wikipedia Articles with Node.js and jQuery

Friday, August 31st, 2012

Parsing Wikipedia Articles with Node.js and jQuery by Ben Coe.

From the post:

For some NLP research I’m currently doing, I was interested in parsing structured information from Wikipedia articles. I did not want to use a full-featured MediaWiki parser. WikiFetch Crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON-representation of the page.

Harvesting of content (unless you are authoring all of it) is a major part of any topic map project.

Does this work for you?

Other small utilities or scripts you would recommend?

I first saw this at: DZone.

Open Source at Netflix [Open Source Topic Maps Are….?]

Friday, July 20th, 2012

Open Source at Netflix by Ruslan Meshenberg.

A great plug for open source (among others):

Improved code and documentation quality – we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.

A question as much to myself as anyone: Where are the open source topic maps?

There have been public dump sites for topic maps but have you seen an active community maintaining a public topic map?

Is it a technology/interface issue?

A control/authorship issue?

Something else?

Wikipedia works, although uneven. And there are a number of other similar efforts that are more or less successful.

Suggestions on what sets them apart?

Or suggestions you think should be tried? It isn’t possible to anticipate success. If the opposite were true, we would all be very successful. (Or at least that’s what I would wish for, your mileage may vary.)

Take it as given that any effort at a public topic map tool, a public topic map community or even a particular public topic map, or some combination thereof, is likely to fail.

But, we simply have to dust ourselves off and try other subject or combination of those things or others.

Graphity source code and wikipedia raw data

Monday, July 9th, 2012

Graphity source code and wikipedia raw data is online (neo4j based social news stream framework) René Pickhardt.

From the post:

8 months ago I posted the results of my research about fast retrieval of social news feeds and in particular my graph index graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that the system is independent of the node degree or network size. Therefor it scales to graphs of arbitrary size.

Today I am pleased to anounce that our joint work was accepted as a full research paper at IEEE SocialCom conference 2012. The conference will take place in early September 2012 in Amsterdam. As promised before I will now open the source code of Graphity to the community. Its documentation could / and might be improved in future also I am sure that one is even able to use a better data structure for our implementation of the priority queue.

Still the attention from the developer community for Graphity was quite high so maybe the source code is of help to anyone. The source code consists of the entire evaluation framework that we used for our evaluation against other baselines which will also help anyone to reproduce our evaluation.

There is some nice things one can learn in setting up multthreading for time measurements and also how to set up a good logging mechanism.

Just in case you are interested in all the changes ever made to the German entries in Wikipedia.

That’s one use case. 😉

Deeply awesome work!

Please take a close look! This looks important!

Stability as Illusion

Monday, July 9th, 2012

In A Visual Way to See What is Changing Within Wikipedia, Jennifer Shockley writes:

Wikipedia is a go to source for quick answers outside the classroom, but many don’t realize Wiki is an ever evolving information source. Geekosystem’s article “Wikistats Show You What Parts Of Wikipedia Are Changing” provides a visual way to see what is changing within Wikipedia.

Is there any doubt that all of our information sources are constantly evolving?

Whether by edits to the sources or in our reading of those sources?

I wonder, have there been recall/precision studies done chronologically?

That is to say, studies of user evaluation of precision/recall on a given data set that repeat the evaluation with users at five (5) year intervals?

To learn if user evaluations of precision/recall change over time for the same queries on the same body of material?

My suspicion, without attributing a cause, is yes.

Suggestions or pointers welcome!

Mapping Research With WikiMaps

Tuesday, July 3rd, 2012

Mapping Research With WikiMaps

From the post:

An international research team has developed a dynamic tool that allows you to see a map of what is “important” on Wikipedia and the connections between different entries. The tool, which is currently in the “alpha” phase of development, displays classic musicians, bands, people born in the 1980s, and selected celebrities, including Lady Gaga, Barack Obama, and Justin Bieber. A slider control, or play button, lets you move through time to see how a particular topic or group has evolved over the last 3 or 4 years. The desktop version allows you to select any article or topic.

Wikimaps builds on the fact that Wikipedia contains a vast amount of high-quality information, despite the very occasional spot of vandalism and the rare instances of deliberate disinformation or inadvertent misinformation. It also carries with each article meta data about the page’s authors and the detailed information about every single contribution, edit, update and change. This, Reto Kleeb, of the MIT Center for Collective Intelligence, and colleagues say, “…opens new opportunities to investigate the processes that lie behind the creation of the content as well as the relations between knowledge domains.” They suggest that because Wikipedia has such a great amount of underlying information in the metadata it is possible to create a dynamic picture of the evolution of a page, topic or collection of connections.

See the demo version:

For some very cutting edge thinking, see: Intelligent Collaborative Knowledge Networks (MIT) which has a download link to “Condor,” a local version of the wikimaps software.

Wikimaps builds upon a premise similar to the original premise of the WWW. Links break, deal with it. Hypertext systems prior to the WWW had tremendous overhead to make sure links remained viable. So much overhead that none of them could scale. The WWW allowed links to break and to be easily created. That scales. (The failure of the Semantic Web can be traced to the requirement that links not fail. Just the opposite of what made the WWW workable.)

Wikimaps builds upon the premise that the “facts we have may be incomplete, incorrect, partial or even contradictory. All things that most semantic systems posit as verboten. An odd requirements since our information is always incomplete, incorrect (possibly), partial or even contradictory. We have set requirements for our information systems that we can’t meet working by hand. Not surprising that our systems fail and fail to scale.

How much information failure can you tolerate?

A question that should be asked of every information system at the design stage. If the answer is none, move onto a project with some chance of success.

I was surprised at the journal reference, not one I would usually scan. Recent origin, expensive, not in library collections I access.

Journal reference:

Reto Kleeb et al. Wikimaps: dynamic maps of knowledge. Int. J. Organisational Design and Engineering, 2012, 2, 204-224


We introduce Wikimaps, a tool to create a dynamic map of knowledge from Wikipedia contents. Wikimaps visualise the evolution of links over time between articles in different subject areas. This visualisation allows users to learn about the context a subject is embedded in, and offers them the opportunity to explore related topics that might not have been obvious. Watching a Wikimap movie permits users to observe the evolution of a topic over time. We also introduce two static variants of Wikimaps that focus on particular aspects of Wikipedia: latest news and people pages. ‘Who-works-with-whom-on-Wikipedia’ (W5) links between two articles are constructed if the same editor has worked on both articles. W5 links are an excellent way to create maps of the most recent news. PeopleMaps only include links between Wikipedia pages about ‘living people’. PeopleMaps in different-language Wikipedias illustrate the difference in emphasis on politics, entertainment, arts and sports in different cultures.

Just in case you are interested: International Journal of Organisational Design and Engineering, Editor in Chief: Prof. Rodrigo Magalhaes, ISSN online: 1758-9800, ISSN print: 1758-9797.

Creating a Semantic Graph from Wikipedia

Sunday, June 3rd, 2012

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.


With the continued need to organize and automate the use of data, solutions are needed to transform unstructred text into structred information. By treating dependency grammar functions as programming language functions, this process produces \property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy that has a happy result.

Printing the thesis now for a close read.

(Source: Jack Park)