Archive for the ‘Semantic Annotation’ Category

Auto Tagging Articles using Semantic Analysis and Machine Learning

Wednesday, May 2nd, 2012

The idea is to implement an auto tagging feature that suggests tags to the user automatically, based on the content of the post. The tags will be populated as soon as focus leaves the content text area, or via ajax at the press of a button. I’ll be using semantic analysis and topic modeling techniques to judge the topic of the article and to extract keywords from it. Based on an algorithm and a ranking mechanism, the user will be given a list of tags from which to select those that best describe the article. Those selections will also train a user-content specific semi-supervised machine learning model in the background.
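The ranking step can be pictured with a very small sketch. This is not the sandbox project's algorithm: it assumes plain term-frequency scoring and a tiny hand-written stopword list, and the names `suggest_tags` and `STOPWORDS` are hypothetical. Topic modeling and the semi-supervised training loop are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be far longer.
STOPWORDS = {
    "the", "and", "for", "with", "that", "this", "from", "uses",
    "are", "was", "will", "can", "has", "have", "not", "but", "in",
}


def suggest_tags(text, max_tags=5):
    """Return up to max_tags candidate tags, most frequent words first."""
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_tags]]


if __name__ == "__main__":
    post = "Drupal modules extend Drupal. Tagging in Drupal uses taxonomy; tagging helps."
    print(suggest_tags(post, max_tags=2))
```

The user would then pick from the returned list, and those picks are the feedback a learning model could train on.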

A Drupal sandbox for work on auto tagging posts.

Or, topic map authoring without being “in your face.”

Depends on how you read “tags.”

Paper Review: “Recovering Semantic Tables on the WEB”

Saturday, March 17th, 2012

Paper Review: “Recovering Semantic Tables on the WEB”

Sean Golliher writes:

A paper entitled “Recovering Semantics of Tables on the Web” was presented at the 37th Conference on Very Large Databases in Seattle, WA. The paper’s authors included 6 Google engineers along with professor Petros Venetis of Stanford University and Gengxin Miao of UC Santa Barbara. The paper summarizes an approach for recovering the semantics of tables by adding annotations beyond those the author of a table has provided. The paper is of interest to developers working on the semantic web because it gives insight into how programmers can use semantic data (database of triples) and Open Information Extraction (OIE) to enhance unstructured data on the web. In addition, the authors compare a “maximum-likelihood” model, used to assign class labels to tables, with a “database of triples” approach. The authors show that their method for labeling tables is capable of labeling “an order of magnitude more tables on the web than is possible using Wikipedia/YAGO and many more than freebase.”

The authors claim that “the Web offers approximately 100 million tables but the meaning of each table is rarely explicit from the table itself”. Tables on the Web are embedded within HTML, which makes extracting meaning from them a difficult task. Since tables are embedded in HTML, search engines typically treat them like any other text in the document. In addition, authors of tables usually have labels that are specific to their own labeling style, and assigned attributes are usually not meaningful. As the authors state: “Every creator of a table has a particular Schema in mind”. In this paper the authors describe a system where they automatically add additional annotations to a table in order to extract meaningful relationships between the entities in the table and other columns within the table. The authors reference the table example shown below in Table 1.1. The table has no row or column labels and there is no title associated with it. To extract the meaning from this table, using text analysis, a search engine would have to relate the table entries to the text surrounding the document and/or analyze the text entries in the table.

The annotation process, first with class/instance and then out of a triple database, reminds me of Newcomb’s “conferral” of properties. That is, some content in the text (or in a subject representative/proxy) causes additional key/value pairs to be assigned/conferred. Nothing particularly remarkable about that process.
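To make the two-step conferral concrete, here is a minimal sketch. It assumes a simple majority vote rather than the paper's maximum-likelihood model, and the `ISA` and `TRIPLES` data, along with the function names, are invented for illustration only.

```python
from collections import Counter

# Hypothetical isA database: instance -> classes it belongs to.
ISA = {
    "paris": {"city", "capital"},
    "berlin": {"city", "capital"},
    "lyon": {"city"},
    "france": {"country"},
    "germany": {"country"},
}

# Hypothetical triple database: (subject, relation, object).
TRIPLES = {
    ("paris", "capital_of", "france"),
    ("berlin", "capital_of", "germany"),
    ("paris", "located_in", "france"),
    ("berlin", "located_in", "germany"),
    ("lyon", "located_in", "france"),
}


def _most_voted(votes):
    """Pick the label with the most votes; ties break alphabetically."""
    if not votes:
        return None
    return min(votes, key=lambda label: (-votes[label], label))


def label_column(cells):
    """Confer a class label on a column by majority vote over its cells."""
    votes = Counter()
    for cell in cells:
        votes.update(ISA.get(cell.strip().lower(), ()))
    return _most_voted(votes)


def label_relation(rows):
    """Confer a relation label on a pair of columns, row by row."""
    votes = Counter()
    for left, right in rows:
        s, o = left.strip().lower(), right.strip().lower()
        votes.update(rel for subj, rel, obj in TRIPLES if subj == s and obj == o)
    return _most_voted(votes)


if __name__ == "__main__":
    rows = [("Paris", "France"), ("Berlin", "Germany"), ("Lyon", "France")]
    print(label_column([r[0] for r in rows]))
    print(label_column([r[1] for r in rows]))
    print(label_relation(rows))
```

The cell values are the content that triggers the conferral; the class and relation labels are the key/value pairs being conferred on an otherwise unlabeled table.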

I am not suggesting that the ISA/triple database strategy will work equally for all areas. What annotation/conferral strategy works best for you will depend on your data and the requirements imposed upon a solution. I would like to hear from you about annotation/conferral strategies that work with particular data sets.

Wikimeta Project’s Evolution…

Monday, February 6th, 2012

Wikimeta Project’s Evolution Includes Commercial Ambitions and Focus On Text-Mining, Semantic Annotation Robustness by Jennifer Zaino.

From the post:

Wikimeta, the semantic tagging and annotation architecture for incorporating semantic knowledge within documents, websites, content management systems, blogs and applications, this month is incorporating itself as a company called Wikimeta Technologies. Wikimeta, which has a heritage linked with the NLGbAse project, last year was provided as its own web service.

The Semantic Web Blog interviews Dr. Eric Charton about Wikimeta and its future plans.

More interesting than the average interview piece. I have a weakness for academic projects and Wikimeta certainly has the credentials in that regard.

On the other hand, when I read statements like:

So when we said Wikimeta makes over 94 percent of good semantic annotation in the three first ranked suggested annotations, this is tested, evaluated, published, peer-reviewed and reproducible by third parties.

I have to wonder what standard for “…good semantic annotation…” was in play and for what application would 94 percent be acceptable?

Annotation of nuclear power plant documentation? Drug interaction documentation? Jet engine repair manual? Chemical reaction warning on product? None of those sound like 94% right situations.

That isn’t a criticism of this project but of the notion that “correctness” of semantic annotation can be measured separate and apart from some particular use case.

It could be the case that 94% correct is unnecessary if we are talking about the content of Access Hollywood.

And your particular use case may lie somewhere in between those two extremes.
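One way to read the quoted figure is as a top-3 hit rate: the share of items where a correct annotation shows up among the first three suggestions. That reading is my assumption, not something the interview spells out. A minimal sketch, with a hypothetical function name and made-up data:

```python
def top_k_hit_rate(suggestions, gold, k=3):
    """Fraction of items whose correct annotation is in the top k suggestions.

    suggestions: list of ranked suggestion lists, one per item.
    gold: list of correct annotations, aligned with suggestions.
    """
    if len(suggestions) != len(gold):
        raise ValueError("suggestions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item to score")
    hits = sum(1 for ranked, correct in zip(suggestions, gold)
               if correct in ranked[:k])
    return hits / len(gold)


if __name__ == "__main__":
    # Made-up ranked suggestions and gold annotations.
    suggestions = [["a", "b", "c", "d"], ["x", "y", "z"], ["p", "q", "r", "s"]]
    gold = ["c", "w", "s"]
    print(top_k_hit_rate(suggestions, gold, k=3))
```

Note what the number hides: who decided the gold annotations, and whether a correct answer at rank three is any use to the application at hand.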

Do read the interview, as this sounds like it will be an interesting project, whatever your thoughts on “correctness.”