Paper Review: “Recovering Semantics of Tables on the Web”
Sean Golliher writes:
A paper entitled “Recovering Semantics of Tables on the Web” was presented at the 37th International Conference on Very Large Data Bases (VLDB 2011) in Seattle, WA. The paper’s authors included six Google engineers along with Petros Venetis of Stanford University and Gengxin Miao of UC Santa Barbara. The paper summarizes an approach for recovering the semantics of tables by adding annotations beyond those the author of a table has provided. The paper is of interest to developers working on the semantic web because it gives insight into how programmers can use semantic data (a database of triples) and Open Information Extraction (OIE) to enhance unstructured data on the web. In addition, the authors compare a “maximum-likelihood” model, used to assign class labels to tables, against a “database of triples” approach. The authors show that their method for labeling tables is capable of labeling “an order of magnitude more tables on the web than is possible using Wikipedia/YAGO and many more than freebase.”
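To make the labeling step concrete, here is a minimal sketch of maximum-likelihood class labeling over an isA database, assuming a toy database of per-value class counts such as might be mined from “C such as V” patterns. The ISA_DB contents, the smoothing constant, and the best_class_label function are illustrative stand-ins, not the paper’s actual data or scoring model, which is more involved.

```python
from collections import defaultdict
from math import log

# Hypothetical isA database: maps a cell value to candidate classes
# with occurrence counts. Illustrative data only -- not from the paper.
ISA_DB = {
    "paris":  {"city": 120, "person": 3},
    "london": {"city": 150},
    "berlin": {"city": 90, "band": 12},
}

def best_class_label(column_values, isa_db, smoothing=1e-6):
    """Pick the class label that maximizes a naive log-likelihood of the
    column's values -- a simplified stand-in for the paper's
    maximum-likelihood scoring."""
    scores = defaultdict(float)
    classes = {c for counts in isa_db.values() for c in counts}
    for cls in classes:
        for value in column_values:
            counts = isa_db.get(value.lower(), {})
            total = sum(counts.values()) or 1
            p = counts.get(cls, 0) / total
            scores[cls] += log(p + smoothing)  # smoothing avoids log(0)
    return max(scores, key=scores.get) if scores else None

print(best_class_label(["Paris", "London", "Berlin"], ISA_DB))  # -> city
```

The point of scoring the whole column at once is that a few ambiguous cells (Berlin the band) are outvoted by the rest of the column.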
The authors claim that “the Web offers approximately 100 million tables but the meaning of each table is rarely explicit from the table itself.” Tables on the Web are embedded within HTML, which makes extracting meaning from them difficult: search engines typically treat them like any other text in the document. In addition, authors of tables usually choose labels specific to their own labeling style, and the attributes they assign are usually not meaningful. As the authors state: “Every creator of a table has a particular schema in mind.” In this paper the authors describe a system that automatically adds annotations to a table in order to extract meaningful relationships between the entities in the table and the other columns within the table. The authors reference the example shown below in Table 1.1. The table has no row or column labels, and there is no title associated with it. To extract the meaning of this table using text analysis, a search engine would have to relate the table entries to the surrounding text in the document and/or analyze the text entries in the table itself.
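To see how little explicit meaning such a table carries, consider pulling the cells out of a header-less HTML table with Python’s built-in html.parser. The sample markup and values below are invented for illustration; all the parser recovers is raw strings, with nothing to say that the first column holds cities.

```python
from html.parser import HTMLParser

# A header-less table in the spirit of the paper's Table 1.1: nothing
# in the markup says what the columns mean.
HTML = """
<table>
  <tr><td>Paris</td><td>2.1M</td></tr>
  <tr><td>London</td><td>8.9M</td></tr>
</table>
"""

class CellExtractor(HTMLParser):
    """Collect raw cell text; the output has values but no semantics."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = CellExtractor()
parser.feed(HTML)
print(parser.rows)  # [['Paris', '2.1M'], ['London', '8.9M']]
```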
The annotation process, first with class/instance labels and then from a triple database, reminds me of Newcomb’s “conferral” of properties. That is, some content in the text (or in a subject representative/proxy) causes additional key/value pairs to be assigned, or conferred. Nothing particularly remarkable about that process.
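As a rough sketch of what conferral might look like in code, the hypothetical confer function below attaches an isA-style triple to every value in a column once a class label has been assigned. The function name and triple shape are my own, not Newcomb’s or the paper’s.

```python
# A minimal sketch of "conferral": once a column is labeled, every
# value in it is conferred an extra key/value pair, here expressed as
# an RDF-style triple. Illustrative only.
def confer(column_values, class_label):
    return [(value, "isA", class_label) for value in column_values]

for triple in confer(["Paris", "London", "Berlin"], "city"):
    print(triple)  # e.g. ('Paris', 'isA', 'city')
```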
I am not suggesting that the isA/triple database strategy will work equally well in all areas. Which annotation/conferral strategy works best for you will depend on your data and the requirements imposed on a solution. I would like to hear from you about annotation/conferral strategies that work with particular data sets.