Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 22, 2012

Gnip Introduces Historical PowerTrack for Twitter [Gnip Feed Misses What?]

Filed under: Semantics,Tweets — Patrick Durusau @ 4:31 am

Gnip Introduces Historical PowerTrack for Twitter

From the post:

Gnip, the largest provider of social data to the world, is launching Historical PowerTrack for Twitter, which makes available every public Tweet since the launch of Twitter in March of 2006.

People use Twitter to connect with and share information on the things they care about. To date, analysts have had incomplete access to historical Tweets. Starting today, companies can now analyze a full six years of discussion around their brands and product launches to better understand the impact of these conversations. Political reporters can compare Tweets around the 2008 Election to the activity we are seeing around this year’s Election. Financial firms can backtest their trading algorithms to model how incorporating Twitter data generates additional signal. Business Intelligence companies can incorporate six years of Tweets into their data offerings so their customers can identify correlation with key business metrics like inventory and revenue.

“We’ve been developing Historical PowerTrack for Twitter for more than a year,” said Chris Moody, President and COO of Gnip. “During our early access phase, we’ve given companies like Esri, Brandwatch, Networked Insights, Union Metrics, Waggener Edstrom Worldwide and others the opportunity to take advantage of this amazing new data. With today’s announcement, we’re making this data fully available to the entire data ecosystem.” (emphasis added)

Can you name one thing that Gnip’s “PowerTrack for Twitter” is not capturing?

Think about it for a minute. I am sure they have all the “text” of tweets, along with whatever metadata was in the stream.

So what is Gnip missing and cannot deliver to you?

In a word, semantics.

The one thing that makes one message valuable and another irrelevant.

Example: In a 1950’s episode of “I Love Lucy,” Lucy says to Ricky over the phone, “There’s a man here making passionate love to me.” Didn’t have the same meaning in the 1950’s as it does now (and Ricky was in on the joke).

A firehose of tweets may be impressive, but so is an open fire plug in the summer.

Without direction (read semantics), the water just runs off into the sewer.

September 14, 2012

ESWC 2013 : 10th Extended Semantic Web Conference

Filed under: BigData,Linked Data,Semantic Web,Semantics — Patrick Durusau @ 1:24 pm

ESWC 2013 : 10th Extended Semantic Web Conference

Important Dates:

Abstract submission: December 5th, 2012

Full paper submission: December 12th, 2012

Authors’ rebuttals: February 11th-12th, 2013

Acceptance Notification: February 22nd, 2013

Camera ready: March 9th, 2013

Conference: May 26th-30th, 2013

From the call for papers:

ESWC is the premier European-based annual conference for researchers and practitioners in the field of semantic technologies. ESWC is the ideal venue for the discussion of the latest scientific insights and novel applications of semantic technologies.

The leading motto of the 10th edition of ESWC will be “Semantics and Big Data”. A crucial challenge that will guide the efforts of many scientific communities in the years to come is the one of making sense of large volumes of heterogeneous and complex data. Application-relevant data often has to be processed in real time and originates from diverse sources such as Linked Data, text and speech, images, videos and sensors, communities and social networks, etc. ESWC, with its focus on semantics, can offer an important contribution to global challenge.

ESWC 2013 will feature nine thematic research tracks (see below) as well as an in-use and industrial track. In line with the motto “Semantics and Big Data”, the conference will feature a special track on “Semantic Technologies for Big Data Analytics in Real Time”. In order to foster the interaction with other disciplines, this year’s edition will also feature a special track on “Cognition and Semantic Web”.

For the research and special tracks, we welcome the submission of papers describing theoretical, analytical, methodological, empirical, and application research on semantic technologies. For the In-Use and Industrial track we solicit the submission of papers describing the practical exploitation of semantic technologies in different domains and sectors. Submitted papers should describe original work, present significant results, and provide rigorous, principled, and repeatable evaluation. We strongly encourage and appreciate the submission of papers including links to data sets and other material used for the evaluation as well as to live demos or source code for tool implementations.

Submitted papers will be judged based on originality, awareness of related work, potential impact on the Semantic Web field, technical soundness of the proposed methods, and readability. Each paper will be reviewed by at least three program committee members in addition to one track chair. This year a rebuttal phase has been introduced in order to give authors the opportunity to provide feedback to reviewers’ questions. The authors’ answers will support reviewers and track chairs in their discussion and in taking final decisions regarding acceptance.

I would call your attention to:

A crucial challenge that will guide the efforts of many scientific communities in the years to come is the one of making sense of large volumes of heterogeneous and complex data.

Sounds like they are playing the topic map song!

Ping me if you are able to attend and would like to collaborate on a paper.

September 13, 2012

…Milton Friedman’s thermostat [Perils of Observation]

Filed under: Semantics — Patrick Durusau @ 4:42 pm

Why are (almost all) economists unaware of Milton Friedman’s thermostat?

Skipping past a long introduction, here’s the beef:

Everybody knows that if you press down on the gas pedal the car goes faster, other things equal, right? And everybody knows that if a car is going uphill the car goes slower, other things equal, right?

But suppose you were someone who didn’t know those two things. And you were a passenger in a car watching the driver trying to keep a constant speed on a hilly road. You would see the gas pedal going up and down. You would see the car going downhill and uphill. But if the driver were skilled, and the car powerful enough, you would see the speed stay constant.

So, if you were simply looking at this particular “data generating process”, you could easily conclude: “Look! The position of the gas pedal has no effect on the speed!”; and “Look! Whether the car is going uphill or downhill has no effect on the speed!”; and “All you guys who think that gas pedals and hills affect speed are wrong!”

And no, you can not get around this problem by doing a multivariate regression of speed on gas pedal and hill. That’s because gas pedal and hill will be perfectly colinear. And no, you do not get around this problem simply by observing an unskilled driver who is unable to keep the speed perfectly constant. That’s because what you are really estimating is the driver’s forecast errors of the relationship between speed gas and hill, and not the true structural relationship between speed gas and hill. And it really bugs me that people who know a lot more econometrics than I do think that you can get around the problem this way, when you can’t. And it bugs me even more that econometricians spend their time doing loads of really fancy stuff that I can’t understand when so many of them don’t seem to understand Milton Friedman’s thermostat. Which they really need to understand.

If the driver is doing his job right, and correctly adjusting the gas pedal to the hills, you should find zero correlation between gas pedal and speed, and zero correlation between hills and speed. Any fluctuations in speed should be uncorrelated with anything the driver can see. They are the driver’s forecast errors, because he can’t see gusts of headwinds coming. And if you do find a correlation between gas pedal and speed, that correlation could go either way. A driver who over-estimates the power of his engine, or who under-estimates the effects of hills, will create a correlation between gas pedal and speed with the “wrong” sign. He presses the gas pedal down going uphill, but not enough, and the speed drops.
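The point is easy to check numerically. Here is a minimal simulation sketch (my own illustration, not from the post): a driver who exactly offsets every hill makes the pedal and the hill perfectly collinear, and the measured correlation between pedal and speed comes out near zero.

```python
# Sketch only: a "skilled driver" who perfectly cancels hills.
import random
import statistics

def correlation(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 10_000
hills = [random.uniform(-1, 1) for _ in range(n)]  # uphill > 0
wind = [random.gauss(0, 0.1) for _ in range(n)]    # unseen gusts

pedal = hills[:]                                   # exact compensation
speed = [60 + p - h - w for p, h, w in zip(pedal, hills, wind)]

print(correlation(pedal, speed))  # ~0: "the pedal has no effect!"
print(correlation(hills, speed))  # ~0: "hills have no effect!"
```

Because pedal and hill are identical series here, a regression of speed on both has no unique answer, which is exactly the collinearity problem the post describes.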

What you “observe” is dependent upon information you have learned outside the immediate situation.

And that, in turn, changes the semantics of the statements you make to others about your observations. Their interpretations of your statements are also dependent upon other information.

There is no cure-all for this type of issue, but being aware of it improves our chances of avoiding it. Maybe. 😉

The US poverty map in 2011 [Who Defines Poverty?]

Filed under: Data,Semantics — Patrick Durusau @ 4:18 pm

The US poverty map in 2011 by Simon Rogers.

From the post:

New figures from the US census show that 46.2 million Americans live in poverty and another 48.6m have no health insurance. In Maryland, the median income is $68,876, in Kentucky it is $39,856, some $10,054 below the US average. Click on each state below to see the data – or use the dropdown to see the map change

As always an interesting presentation of data (along with access to the raw data).

But what about “poverty” in the United States versus “poverty” in other places?

The World Bank’s “Poverty” page reports in part:

  • Poverty headcount ratio at $1.25 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa
  • Poverty headcount ratio at $2 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa

What area is missing from this list?

Can you say: “North America?”

The poverty rate per day for North America is an important comparison point in discussions of global trade, environment and similar issues.

Can you point me towards more comprehensive comparison data?


PS: $2 per day is $730 annual. $1.25 per day is $456.25 annual.
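The conversion in the PS, plus a rough sense of scale against the state median household incomes quoted above, as a quick sketch (illustrative only; PPP dollars per person and household income are not directly comparable):

```python
# Daily-to-annual conversion from the PS, plus a rough scale comparison.
DAYS = 365

for daily in (1.25, 2.00):
    print(f"${daily:.2f}/day = ${daily * DAYS:,.2f}/year")

# Against the median household incomes quoted above (illustrative only).
for state, median in (("Maryland", 68_876), ("Kentucky", 39_856)):
    print(f"{state}: $2/day annualized is {2 * DAYS / median:.1%} of median")
```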

Hafslund SESAM – Semantic integration in practice

Filed under: Integration,Semantics — Patrick Durusau @ 10:57 am

Hafslund SESAM – Semantic integration in practice by Lars Marius Garshol.

Lars has posted his slides from a practical implementation of semantic integration, and what he saw along the way.

I particularly liked the line:

Generally, archive systems are glorified trash cans – putting it in the archive effectively means hiding it

BTW, Lars mentions he has a paper on this project. If you are looking for publishable semantic integration content, you might want to ping him.

larsga@bouvet.no
http://twitter.com/larsga

September 5, 2012

IWCS 2013 Workshop: Towards a formal distributional semantics

Filed under: Conferences,Semantics — Patrick Durusau @ 4:43 pm

IWCS 2013 Workshop: Towards a formal distributional semantics

When Mar 19, 2013 – Mar 22, 2013
Where Potsdam, Germany
Submission Deadline Nov 30, 2012
Notification Due Jan 4, 2013
Final Version Due Jan 25, 2013

From the call for papers:

The Tenth International Conference for Computational Semantics (IWCS) will be held March 20–22, 2013 in Potsdam, Germany.

The aim of the IWCS conference is to bring together researchers interested in the computation, annotation, extraction, and representation of meaning in natural language, whether this is from a lexical or structural semantic perspective. IWCS embraces both symbolic and statistical approaches to computational semantics, and everything in between.

Topics of Interest

Areas of special interest for the conference will be computational aspects of meaning of natural language within written, spoken, or multimodal communication. Papers are invited that are concerned with topics in these and closely related areas, including the following:

  • representation of meaning
  • syntax-semantics interface
  • representing and resolving semantic ambiguity
  • shallow and deep semantic processing and reasoning
  • hybrid symbolic and statistical approaches to representing semantics
  • alternative approaches to compositional semantics
  • inference methods for computational semantics
  • recognizing textual entailment
  • learning by reading
  • methodologies and practices for semantic annotation
  • machine learning of semantic structures
  • statistical semantics
  • computational aspects of lexical semantics
  • semantics and ontologies
  • semantic web and natural language processing
  • semantic aspects of language generation
  • semantic relations in discourse and dialogue
  • semantics and pragmatics of dialogue acts
  • multimodal and grounded approaches to computing meaning
  • semantics-pragmatics interface

Definitely sounds like a topic map sort of meeting!

September 3, 2012

Legal Rules, Text and Ontologies Over Time [The eternal “now?”]

Filed under: Legal Informatics,Ontology,Semantics — Patrick Durusau @ 3:06 pm

Legal Rules, Text and Ontologies Over Time by Monica Palmirani, Tommaso Ognibene and Luca Cervone.

Abstract:

The current paper presents the “Fill the gap” project that aims to design a set of XML standards for modelling legal documents in the Semantic Web over time. The goal of the project is to design an information system using XML standards able to store in an XML-native database legal resources and legal rules in an integrated way for supporting legal knowledge engineers and end-users (e.g., public administrative officers, judges, citizens).

It was refreshing to read:

The law changes over time and consequently change the rules and the ontological classes (e.g., the definition of EU citizenship changed in 2004 with the annexation of 10 new member states in the European Community). It is also fundamental to assign dates to the ontology and to the rules, based on an analytical approach, to the text, and analyze the relationships among sets of dates. The semantic web cake recommends that content, metadata should be modelled and represented in separate and clean layers. This recommendation is not widely followed from too many XML schemas, including those in the legal domain. The layers of content and rules are often confused to pursue a short annotation syntax, or procedural performance parameters or simply because a neat analysis of the semantic and abstract components is missing.

Not being mindful of time, of the effective date of changes to laws, the dates of events/transactions, can be hazardous to your pocketbook and/or your freedom!

Does your topic map account for time or does it exist in an eternal “now?” like the WWW?
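One minimal way to account for time is to give every assertion a validity interval and filter by date. A rough sketch (my illustration, not the paper's XML standards; the dates and labels simply echo the EU citizenship example from the quote):

```python
# Time-scoped assertions: (subject, predicate, object, valid_from, valid_to)
from datetime import date

statements = [
    ("EU citizenship", "defined_by", "pre-enlargement definition",
     date(1993, 11, 1), date(2004, 4, 30)),
    ("EU citizenship", "defined_by", "post-enlargement definition",
     date(2004, 5, 1), None),          # None = still in force
]

def valid_at(statements, when):
    """Return only the statements in force on a given date."""
    return [s for s in statements
            if s[3] <= when and (s[4] is None or when <= s[4])]

print(valid_at(statements, date(2000, 1, 1)))  # the old definition
print(valid_at(statements, date(2010, 1, 1)))  # the current one
```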

I first saw this at Legal Informatics.

September 2, 2012

HTML [Lessons in Semantic Interoperability – Part 3]

Filed under: HTML,Interoperability,Semantics — Patrick Durusau @ 12:06 pm

If HTML is an example of semantic interoperability, are there parts of HTML that can be re-used for more semantic interoperability?

Some three (3) year old numbers on usage of HTML elements:

Element Percentage
a 21.00
td 15.63
br 9.08
div 8.23
tr 8.07
img 7.12
option 4.90
li 4.48
span 3.98
table 3.15
font 2.80
b 2.32
p 1.98
input 1.79
script 1.77
strong 0.97
meta 0.95
link 0.66
ul 0.65
hr 0.37
http://webmasters.stackexchange.com/questions/11406/recent-statistics-on-html-usage-in-the-wild
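If you want to reproduce that kind of count for pages of your own, a standard-library sketch is enough (the page.html filename is just a placeholder; this is not how the linked survey was gathered):

```python
# Count start tags in a local HTML file and print the top ten by share.
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

counter = TagCounter()
with open("page.html", encoding="utf-8") as f:
    counter.feed(f.read())

total = sum(counter.counts.values())
for tag, n in counter.counts.most_common(10):
    print(f"{tag:8s} {100 * n / total:5.2f}%")
```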

Assuming they still hold true, the <a> element is by far the most popular.

Implications for a semantic interoperability solution that leverages the <a> element?

Leave the syntax the hell alone!

As we saw in parts 1 and 2 of this series, the <a> element has:

  • simplicity
  • immediate feedback

If you don’t believe me, teach someone who doesn’t know HTML at all how to create an <a> element and verify its presence in a browser. (I’ll wait.)
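If you want the exercise spelled out, here is one way to do it in a couple of lines (a sketch; the filename and link target are arbitrary):

```python
# Write a one-element page and open it in the default browser.
import pathlib
import webbrowser

page = pathlib.Path("hello.html")
page.write_text('<a href="https://example.com/">A link. Click me.</a>',
                encoding="utf-8")
webbrowser.open(page.resolve().as_uri())
```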

Back so soon? 😉

To summarize: The <a> element is simple, has immediate feedback and is in widespread use.

All of which makes it a likely candidate to leverage for semantic interoperability. But how?

And what of all the other identifiers in the world? What happens to them?

September 1, 2012

HTML [Lessons in Semantic Interoperability – Part 2]

Filed under: HTML,Interoperability,Semantics,Web Server — Patrick Durusau @ 10:11 am

While writing Elli (Erlang Web Server) [Lessons in Semantic Interoperability – Part 1], I got distracted by the realization that web servers produce semantically interoperable content every day. Lots of it. For hundreds of millions of users.

My question: What makes the semantics of HTML different?

The first characteristic that came to mind was simplicity. Unlike some markup languages, ;-), HTML did not have to await the creation of WYSIWYG editors to catch on. In part, I suspect, because after a few minutes with it, most users (not all) could begin to author HTML documents.

Think about the last time you learned something new. What is the one thing that brings closure to the learning experience?

Feedback, knowing if your attempt at an answer is right or wrong. If right, you will attempt the same solution under similar circumstances in the future. If wrong, you will try again (hopefully).

When HTML appeared, so did primitive (in today’s terms) web browsers.

Any user learning HTML could get immediate feedback on their HTML authoring efforts.

Not:

  • After installing additional validation software
  • After debugging complex syntax or configurations
  • After millions of other users do the same thing
  • After new software appears to take advantage of it

Immediate feedback means just that: immediate feedback.

The second characteristic is immediate feedback.

You can argue that such feedback was an environmental factor and not a characteristic of HTML proper.

Possibly, possibly. But if such a distinction is possible and meaningful, how does it help with the design/implementation of the next successful semantic interoperability language?

I would argue that, by whatever means, any successful semantic interoperability language is going to include immediate feedback, however you classify it.

Elli (Erlang Web Server) [Lessons in Semantic Interoperability – Part 1]

Filed under: Erlang,Interoperability,Semantics,Web Server — Patrick Durusau @ 8:04 am

Elli

From the post:

My name is Knut, and I want to show you something really cool that I built to solve some problems we are facing here at Wooga.

Having several very successful social games means we have a large number of users. In a single game, they can generate around ten thousand HTTP requests per second to our backend systems. Building and operating the software required to service these games is a big challenge that sometimes requires creative solutions.

As developers at Wooga, we are responsible for the user experience. We want to make our games not only fun and enjoyable but accessible at all times. To do this we need to understand and control the software and hardware we rely on. When we see an area where we can improve the user experience, we go for it. Sometimes this means taking on ambitious projects. An example of this is Elli, a webserver which has become one of the key building blocks of our successful backends.

Having used many of the big Erlang webservers in production with great success, we still found ourselves thinking of how we could improve. We want a simple and robust core with no errors or edge cases causing problems. We need to measure the performance to help us optimize our network and user code. Most importantly, we need high performance and low CPU usage so our servers can spend their resources running our games.

I started this post about Elli to point out the advantages of having a custom web server application, if your needs aren’t met by one of the standard ones.

Something clicked and I realized that web servers, robust and fast as well as lame and slow, churn out semantically interoperable content every day.

For hundreds of millions of users.

Rather than starting from the perspective of the “semantic interoperability” we want, why not examine the “semantic interoperability” we have already, for clues on what may or may not work to increase it?

When I say “semantic interoperability” on the web, I am speaking of the interpretation of HTML markup, the <a>, <p>, <ol>, <ul>, <div>, <h1-6> elements that make up most pages.

What characteristics do those markup elements share that might be useful in creating more semantic interoperability?

The first characteristic is simplicity.

You don’t need a lot of semantic overhead machinery or understanding to use any of them.

A plain text editor and knowledge that some text has a general presentation is enough.

It takes a few minutes for a user to learn enough HTML to produce meaningful (to them and others) results.

At least in the case of HTML, that simplicity has led to a form of semantic interoperability.

HTML was defined with interoperable semantics but unadopted interoperable semantics are like no interoperable semantics at all.

If HTML has simplicity of semantics, what else does it have that led to widespread adoption?

August 30, 2012

Applied and implied semantics in crystallographic publishing

Filed under: Publishing,Semantics — Patrick Durusau @ 10:54 am

Applied and implied semantics in crystallographic publishing by Brian McMahon. Journal of Cheminformatics 2012, 4:19 doi:10.1186/1758-2946-4-19.

Abstract:

Background

Crystallography is a data-rich, software-intensive scientific discipline with a community that has undertaken direct responsibility for publishing its own scientific journals. That community has worked actively to develop information exchange standards allowing readers of structure reports to access directly, and interact with, the scientific content of the articles.

Results

Structure reports submitted to some journals of the International Union of Crystallography (IUCr) can be automatically validated and published through an efficient and cost-effective workflow. Readers can view and interact with the structures in three-dimensional visualization applications, and can access the experimental data should they wish to perform their own independent structure solution and refinement. The journals also layer on top of this facility a number of automated annotations and interpretations to add further scientific value.

Conclusions

The benefits of semantically rich information exchange standards have revolutionised the scholarly publishing process for crystallography, and establish a model relevant to many other physical science disciplines.

A strong reminder to authors and publishers of the costs and benefits of making semantics explicit. (And the trade-offs involved.)

August 26, 2012

Semantic University

Filed under: Semantic Web,Semantics — Patrick Durusau @ 2:18 pm

Semantic University

From the homepage:

Semantic University will be the single largest and most accessible source of educational material relating to semantic technologies. Moreover, it will fill several important gaps in current material by providing:

  • Lessons suitable to those brand new to the space.
  • Comparisons, both high-level and in-depth, with related technologies, such as NoSQL and Big Data.
  • Interactive, hands on tutorials.

Have you used these materials? Comparison to others?

Metric Spaces — A Primer [Semantic Metrics?]

Filed under: Distance,Metric Spaces,Semantics — Patrick Durusau @ 1:45 pm

Metric Spaces — A Primer by Jeremy Kun.

The Blessing of Distance

We have often mentioned the idea of a “metric” on this blog, and we briefly described a formal definition for it. Colloquially, a metric is simply the mathematical notion of a distance function, with certain well-behaved properties. Since we’re now starting to cover a few more metrics (and things which are distinctly not metrics) in the context of machine learning algorithms, we find it pertinent to lay out the definition once again, discuss some implications, and explore a few basic examples.

The most important thing to take away from this discussion is that not all spaces have a notion of distance. For a space to have a metric is a strong property with far-reaching mathematical consequences. Essentially, metrics impose a topology on a space, which the reader can think of as the contortionist’s flavor of geometry. We’ll explore this idea after a few examples.

On the other hand, from a practical standpoint one can still do interesting things without a true metric. The downside is that work relying on (the various kinds of) non-metrics doesn’t benefit as greatly from existing mathematics. This can often spiral into empirical evaluation, where justifications and quantitative guarantees are not to be found.

An enjoyable introduction to metric spaces.

Absolutely necessary for machine learning and computational tasks.

However, I am mindful that the mapping from semantics to a location in metric space is an arbitrary one. Our evaluations of metrics assigned to any semantic are wholly dependent upon that mapping.

Not that we can escape that trap, but it is reason to urge caution when claims are made on the basis of arbitrarily assigned metric locations. (A small voice should be asking: What if we change the assigned metric locations? What result then?)
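A toy sketch of that caution (my example, not from the primer): the Euclidean distance is a perfectly good metric under both assignments below, yet which term counts as “nearest” depends entirely on the locations we chose.

```python
# Same terms, two arbitrary coordinate assignments, different "nearest" term.
from math import dist  # Euclidean metric (Python 3.8+)

embedding_a = {"bank": (0.0, 0.0), "river": (1.0, 0.0), "loan": (5.0, 0.0)}
embedding_b = {"bank": (0.0, 0.0), "river": (5.0, 0.0), "loan": (1.0, 0.0)}

def nearest(embedding, term):
    others = [t for t in embedding if t != term]
    return min(others, key=lambda t: dist(embedding[term], embedding[t]))

print(nearest(embedding_a, "bank"))  # river
print(nearest(embedding_b, "bank"))  # loan
```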

August 19, 2012

Bi-directional semantic similarity….

Filed under: Bioinformatics,Biomedical,Semantics,Similarity — Patrick Durusau @ 6:32 pm

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background

Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods

This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion

The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Makes me curious what the experience with direction and identification has been with other ontologies.

August 12, 2012

Semantic physical science

Filed under: Science,Semantics — Patrick Durusau @ 7:56 pm

Semantic physical science by Peter Murray-Rust and Henry S Rzepa. (Journal of Cheminformatics 2012, 4:14 doi:10.1186/1758-2946-4-14)

Abstract:

The articles in this special issue arise from a workshop and symposium held in January 2012 (‘Semantic Physical Science’). We invited people who shared our vision for the potential of the web to support chemical and related subjects. Other than the initial invitations, we have not exercised any control over the content of the contributed articles.

There are pointers to videos and other materials for the following workshop presentations:

  • Introduction – Peter Murray-Rust [11]
  • Why we (PNNL) are supporting semantic science – Bill Shelton
  • Adventures in Semantic Materials Informatics – Nico Adams
  • Semantic Crystallographic Publishing – Brian McMahon [12]
  • Service-oriented science: why good code matters and why a fundamental change in thinking is required – Cameron Neylon [13]
  • On the use of CML in computational materials research – Martin Dove [14]
  • FoX, CML and semantic tools for atomistic simulation – Andrew Walker [15]
  • Semantic Physical Science: the CML roadmap – Marcus Hanwell [16]
  • CMLisattion of NWChem and development strategy for FoXification and dictionaries – Bert de Jong
  • NMR working group – Nancy Washton

A remarkable workshop with which I have only one minor difference:

There was remarkable and exciting unanimity that semantics should and could be introduced now and rapidly into the practice of large areas of chemistry. We agreed that we should concentrate on the three main areas of crystallography, computation and NMR spectroscopy. In crystallography, this is primarily a strategy of working very closely with the IUCr, being able to translate crystallographic data automatically into semantic form and exploring the value of semantic publication and repositories. The continued development of Chempound for crystal structures is Open and so can be fed back regularly into mainstream crystallography.

When computers were being introduced for indexing chemistry and other physical sciences in the 1950s/60s, the practitioners of the day were under the impression their data already had semantics. That it did not have to await the next turn of the century in order to have semantics.

Not to take anything away from the remarkable progress that CML and related efforts have made, but they are not the advent of semantics for chemistry.

Clarification of semantics, documentation of semantics, refinement of semantics, all true.

But chemistry (and data) has always had semantics.

August 1, 2012

Semantic Silver Bullets?

Filed under: Information Sharing,Marketing,Semantics — Patrick Durusau @ 1:46 pm

The danger of believing in silver bullets

Nick Wakeman writes in the Washington Technology Business Beat:

Whether it is losing weight, getting rich or managing government IT, it seems we can’t resist the lure of a silver bullet. The magic pill. The easy answer.

Ten or 12 years ago, I remember a lot of talk about leasing and reverse auctions, and how they were going to transform everything.

Since then, outsourcing and insourcing have risen and fallen from favor. Performance-based contracting was going to be the solution to everything. And what about the huge systems integration projects like Deepwater?

They start with a bang and end with a whimper, or in some cases, a moan and a whine. And of course, along the way, millions and even billions of dollars get wasted.

I think we are in the midst of another silver bullet phenomenon with all the talk around cloud computing and everything as a service.

I wish I could say that topic maps are a semantic silver bullet. Or better yet, a semantic hand grenade. One that blows other semantic approaches away.

Truthfully, topic maps are neither one.

Topic maps rely upon users, assisted by various technologies, to declare and identify subjects they want to talk about and, just as importantly, relationships between those subjects. Not to mention where information about those subjects can be found.

If you need evidence of the difficulty of those tasks, consider the near-idiotic results you get from search engines. Considering the task, they do pretty well, but pretty well still takes time and effort to sort through every time you search.

Topic maps aren’t easy, no silver bullet, but you can capture subjects of interest to you, define their relationships to other subjects and specify where more information can be found.

Once captured, that information can be shared, used and/or merged with information gathered by others.

Bottom line is that better semantic results, for sharing, for discovery, for navigation, all require hard work.

Are you ready?

July 3, 2012

Three Steps to Heaven: Semantic Publishing in a Real World Workflow

Filed under: Publishing,Semantics — Patrick Durusau @ 2:27 pm

Three Steps to Heaven: Semantic Publishing in a Real World Workflow by Phillip Lord, Simon Cockell, and Robert Stevens.

Abstract:

Semantic publishing offers the promise of computable papers, enriched visualisation and a realisation of the linked data ideal. In reality, however, the publication process contrives to prevent richer semantics while culminating in a ‘lumpen’ PDF. In this paper, we discuss a web-first approach to publication, and describe a three-tiered approach which integrates with the existing authoring tooling. Critically, although it adds limited semantics, it does provide value to all the participants in the process: the author, the reader and the machine.

With a touch of irony and gloom the authors write:

… There are significant barriers to the acceptance of semantic publishing as a standard mechanism for academic publishing. The web was invented around 1990 as a light-weight mechanism for publication of documents. It has subsequently had a massive impact on society in general. It has, however, barely touched most scientific publishing; while most journals have a website, the publication process still revolves around the generation of papers, moving from Microsoft Word or LaTeX [5], through to a final PDF which looks, feels and is something designed to be printed onto paper [4]. Adding semantics into this environment is difficult or impossible; the content of the PDF has to be exposed and semantic content retrofitted or, in all likelihood, a complex process of author and publisher interaction has to be devised and followed. If semantic data publishing and semantic publishing of academic narratives are to work together, then academic publishing needs to change.

4. This includes conferences dedicated to the web and the use of web technologies.

One could add “…includes papers about changing the publishing process” but I digress.

I don’t disagree that adding semantics to the current system has proved problematic.

I do disagree that changing the current system, which is deeply embedded in research, publishing and social practices, is likely to succeed.

At least if success is defined as a general solution to adding semantics to scientific research and publishing in general. Such projects may be successful in creating new methods of publishing scientific research but that just expands the variety of methods we must account for.

That doesn’t have a “solution like” feel to me. You?

July 1, 2012

The Case for Semantics-Based Methods in Reverse Engineering

Filed under: Reverse Engineering,Semantics — Patrick Durusau @ 4:47 pm

The Case for Semantics-Based Methods in Reverse Engineering by Rolf Rolles. (pdf – slides)

Jennifer Shockley quotes Rolf as saying:

“The goal of my RECON 2012 keynote speech was to introduce methods in academic program analysis and demonstrate — intuitively, without drawing too much on formalism — how they can be used to solve practical problems that are interesting to industrial researchers in the real world. Given that it was the keynote speech, and my goal of making the material as accessible as possible, I attempted to make my points with pictures instead of dense technical explanations.”

From his blog post: ‘RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering.’

Rolf also points to a reading list on program analysis.

Did someone say semantics? 😉

Anyone working on topic map based tools for reverse engineering?

Thinking that any improvement in sharing of results, even partial results, would improve response times.

The observational roots of reference of the semantic web

Filed under: Identity,Semantic Web,Semantics — Patrick Durusau @ 4:42 pm

The observational roots of reference of the semantic web by Simon Scheider, Krzysztof Janowicz, and Benjamin Adams.

Abstract:

Shared reference is an essential aspect of meaning. It is also indispensable for the semantic web, since it enables to weave the global graph, i.e., it allows different users to contribute to an identical referent. For example, an essential kind of referent is a geographic place, to which users may contribute observations. We argue for a human-centric, operational approach towards reference, based on respective human competences. These competences encompass perceptual, cognitive as well as technical ones, and together they allow humans to inter-subjectively refer to a phenomenon in their environment. The technology stack of the semantic web should be extended by such operations. This would allow establishing new kinds of observation-based reference systems that help constrain and integrate the semantic web bottom-up.

In arguing for recasting the problem of semantics as one of reference, the authors say:

Reference systems. Solutions to the problem of reference should transgress syntax as well as technology. They cannot solely rely on computers but must also rely on human referential competences. This requirement is met by reference systems [22]. Reference systems are different from ontologies in that they constrain meaning bottom-up [11]. Most importantly, they are not “yet another chimera” invented by ontology engineers, but already exist in various successful variants.

I rather like the “human referential competences….”

After all, useful semantic systems are about references that we recognize.

June 24, 2012

The Turing Digital Archive

Filed under: Computer Science,Semantics,Turing Machines — Patrick Durusau @ 8:18 pm

The Turing Digital Archive

From the webpage:

Alan Turing (1912-54) is best-known for helping decipher the code created by German Enigma machines in the Second World War, and for being one of the founders of computer science and artificial intelligence.

This archive contains many of Turing’s letters, talks, photographs and unpublished papers, as well as memoirs and obituaries written about him. It contains images of the original documents that are held in the Turing collection at King’s College, Cambridge. For more information about this digital archive and tips on using the site see About the archive.

I ran across this archive when I followed a reference to the original paper on Turing machines, http://www.turingarchive.org/viewer/?id=466&title=01a.

I will be returning to this original description in one or more posts on Turing machines and semantics.

June 13, 2012

On the value of being inexact

Filed under: Computation,Computer Science,Inexact,Semantics — Patrick Durusau @ 12:31 pm

Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling by Avinash Lingamneni, Kirthi Krishna Muntimadugu, Richard M. Karp, Krishna V. Palem, and Christian Piguet.

The following non-technical blurb caught my eye:

Researchers have unveiled an “inexact” computer chip that challenges the industry’s dogmatic 50-year pursuit of accuracy. The design improves power and resource efficiency by allowing for occasional errors. Prototypes unveiled this week at the ACM International Conference on Computing Frontiers in Cagliari, Italy, are at least 15 times more efficient than today’s technology.

[ads deleted]

The research, which earned best-paper honors at the conference, was conducted by experts from Rice University in Houston, Singapore’s Nanyang Technological University (NTU), Switzerland’s Center for Electronics and Microtechnology (CSEM) and the University of California, Berkeley.

“It is exciting to see this technology in a working chip that we can measure and validate for the first time,” said project leader Krishna Palem, who also serves as director of the Rice-NTU Institute for Sustainable and Applied Infodynamics (ISAID). “Our work since 2003 showed that significant gains were possible, and I am delighted that these working chips have met and even exceeded our expectations.” [From: Computing experts unveil superefficient ‘inexact’ chip, which I saw in a list of links by Greg Linden.]

Think about it. We are inexact and so are our semantics.

But we attempt to model our inexact semantics with increasingly exact computing platforms.

Does that sound like a modeling mis-match to you?

BTW, if you are interested in the details, see: Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling

Abstract:

Owing to a growing desire to reduce energy consumption and widely anticipated hurdles to the continued technology scaling promised by Moore’s law, techniques and technologies such as inexact circuits and probabilistic CMOS (PCMOS) have gained prominence. These radical approaches trade accuracy at the hardware level for significant gains in energy consumption, area, and speed. While holding great promise, their ability to influence the broader milieu of computing is limited due to two shortcomings. First, they were mostly based on ad-hoc hand designs and did not consider algorithmically well-characterized automated design methodologies. Also, existing design approaches were limited to particular layers of abstraction such as physical, architectural and algorithmic or more broadly software. However, it is well-known that significant gains can be achieved by optimizing across the layers. To respond to this need, in this paper, we present an algorithmically well-founded cross-layer co-design framework (CCF) for automatically designing inexact hardware in the form of datapath elements, specifically adders and multipliers, and show that significant associated gains can be achieved in terms of energy, area, and delay or speed. Our algorithms can achieve these gains without adding any additional hardware overhead. The proposed CCF framework embodies a symbiotic relationship between architecture and logic-layer design through the technique of probabilistic pruning combined with the novel confined voltage scaling technique introduced in this paper, applied at the physical layer. A second drawback of the state of the art with inexact design is the lack of physical evidence established through measuring fabricated ICs that the gains and other benefits that can be achieved are valid. Again, in this paper, we have addressed this shortcoming by using CCF to fabricate a prototype chip implementing inexact data-path elements; a range of 64-bit integer adders whose outputs can be erroneous. Through physical measurements of our prototype chip wherein the inexact adders admit expected relative error magnitudes of 10% or less, we have found that cumulative gains over comparable and fully accurate chips, quantified through the area-delay-energy product, can be a multiplicative factor of 15 or more. As evidence of the utility of these results, we demonstrate that despite admitting error while achieving gains, images processed using the FFT algorithm implemented using our inexact adders are visually discernible.
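To make “inexact” concrete, here is a toy sketch of one simple approximation idea, ignoring the low-order bits of an adder’s inputs and measuring the resulting relative error. It is my illustration of the general accuracy-for-resources trade-off, not the paper’s probabilistic-pruning/CCF design:

```python
# Toy "inexact" adder: drop the k least-significant bits of each input.
import random

def inexact_add(a, b, k=4):
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)

random.seed(0)
errors = []
for _ in range(100_000):
    a = random.randrange(1, 2**32)
    b = random.randrange(1, 2**32)
    exact = a + b
    errors.append(abs(exact - inexact_add(a, b)) / exact)

print(f"mean relative error: {sum(errors) / len(errors):.2e}")
```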

Why the link to the ACM Digital Library or to the “unofficial version” was not reported in any of the press stories I cannot say.

June 11, 2012

Scale, Structure, and Semantics

Filed under: Communication,Semantic Web,Semantics — Patrick Durusau @ 4:20 pm

Scale, Structure, and Semantics by Daniel Tunkelang.

From the post:

This morning I had the pleasure to present a keynote address at the Semantic Technology & Business Conference (SemTechBiz). I’ve had a long and warm relationship with the semantic technology community — especially with Marco Neumann and the New York Semantic Web Meetup.

To give you a taste of the slides:

1. Knowledge representation is overrated.

2. Computation is underrated.

3. We have a communication problem.

I find it helpful to think of search/retrieval as asynchronous conversation.

If I can’t continue or find my place in or know what a conversation is about, there is a communication problem.

June 5, 2012

Sourcing Semantics

Filed under: Semantics — Patrick Durusau @ 7:56 pm

Ancient Jugs Hold the Secret to Practical Mathematics in Biblical Times is a good illustration of the source of semantics.

From the post:

Archaeologists in the eastern Mediterranean region have been unearthing spherical jugs, used by the ancients for storing and trading oil, wine, and other valuable commodities. Because we’re used to the metric system, which defines units of volume based on the cube, modern archaeologists believed that the merchants of antiquity could only approximately assess the capacity of these round jugs, says Prof. Itzhak Benenson of Tel Aviv University’s Department of Geography.

Now an interdisciplinary collaboration between Prof. Benenson and Prof. Israel Finkelstein of TAU’s Department of Archaeology and Ancient Near Eastern Cultures has revealed that, far from relying on approximations, merchants would have had precise measurements of their wares — and therefore known exactly what to charge their clients.

The researchers discovered that the ancients devised convenient mathematical systems in order to determine the volume of each jug. They theorize that the original owners and users of the jugs measured their contents through a system that linked units of length to units of volume, possibly by using a string to measure the circumference of the spherical container to determine the precise quantity of liquid within.

The system, which the researchers believe was developed by the ancient Egyptians and used in the Eastern Mediterranean from about 1,500 to 700 BCE, was recently reported in the journal PLoS ONE. Its discovery was part of the Reconstruction of Ancient Israel project supported by the European Union.
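The geometry behind such a system is simple enough to state: for a sphere, circumference and volume are linked by V = C³ / (6π²), so a string around the widest point fixes the capacity. A quick sketch (my illustration of the relation, not the ancient unit system itself; real jugs are only roughly spherical):

```python
# Volume of a sphere from its circumference: C = 2*pi*r, V = (4/3)*pi*r**3,
# hence V = C**3 / (6 * pi**2).
from math import pi

def sphere_volume_from_circumference(c):
    return c ** 3 / (6 * pi ** 2)

c_cm = 60.0  # a jug measuring 60 cm around its widest point (hypothetical)
volume_cm3 = sphere_volume_from_circumference(c_cm)
print(f"{volume_cm3 / 1000:.1f} litres")  # about 3.6 litres
```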

The artifacts in question are between 2,700 and 3,500 years old.

When did they take on the semantic of being a standardized unit of measurement based on circumference?

A. When they were in common use, approximately 1,500 to 700 BCE?

B. When this discovery was made as per this article?

Understanding that the artifacts have not changed, was this semantic “lost” during the time period between A and B?

Or have we re-attributed to these artifacts the semantic of being a standardized unit of measurement based on circumference?

If you have some explanation other than our being the source of the measurement semantic, I am interested to hear about it.

That may seem like a trivial point but consider its implications carefully.

If we are the source of semantics, then we are the source of semantics for ontologies, classification systems, IR, etc.

Making those semantics subject to the same uncertainty, vagueness, competing semantics as any other.

Making them subject to being defined/disclosed to be as precise as necessary.

Not defining semantics for the ages. Defining semantics against particular requirements. Not the same thing.


The journal reference:

Elena Zapassky, Yuval Gadot, Israel Finkelstein, Itzhak Benenson. An Ancient Relation between Units of Length and Volume Based on a Sphere. PLoS ONE, 2012; 7 (3): e33895 DOI: 10.1371/journal.pone.0033895

June 3, 2012

Creating a Semantic Graph from Wikipedia

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.

Abstract:

With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces “property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy that has a happy result.
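As a rough sketch of the general idea as I read it (plain dictionaries, my illustration rather than Ryan’s implementation): each document yields a small “property map” of entity-to-fact snippets, and the maps are merged into one graph.

```python
# Merge per-document property maps into a single entity -> facts graph.
from collections import defaultdict

def merge(property_maps):
    graph = defaultdict(set)
    for pmap in property_maps:
        for entity, facts in pmap.items():
            graph[entity].update(facts)
    return graph

doc1 = {"Napoleon": {("born_in", "Corsica")},
        "Corsica": {("part_of", "France")}}
doc2 = {"Napoleon": {("exiled_to", "Elba")}}

graph = merge([doc1, doc2])
print(graph["Napoleon"])  # {('born_in', 'Corsica'), ('exiled_to', 'Elba')}
```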

Printing the thesis now for a close read.

(Source: Jack Park)

May 31, 2012

Large Heterogeneous Data 2012

Filed under: Conferences,Heterogeneous Data,Mapping,Semantics — Patrick Durusau @ 12:56 pm

Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012 (LHD-12)

Important Dates

  • Deadline for paper submission: July 31, 2012
  • Author notification: August 21, 2012
  • Deadline for camera-ready: September 10, 2012
  • Workshop date: November 11th or 12th, 2012

Take the time to read the workshop description.

A great summary of the need for semantic mappings, not more semantic fascism.

From the call for papers:

An interdisciplinary approach is necessary to discover and match meaning dynamically in a world of increasingly large data sources. This workshop aims to bring together practitioners from academia, industry and government for interaction and discussion. This will be a half-day workshop which primarily aims to initiate discussion and debate. It will involve

  • A panel discussion focussing on these issues from an industrial and governmental point of view. Membership to be confirmed, but we expect a representative from Scottish Government and from Google, as well as others.
  • Short presentations grouped into themed panels, to stimulate debate not just about individual contributions but also about the themes in general.

Workshop Description

The problem of semantic alignment – that of two systems failing to understand one another when their representations are not identical – occurs in a huge variety of areas: Linked Data, database integration, e-science, multi-agent systems, information retrieval over structured data; anywhere, in fact, where semantics or a shared structure are necessary but centralised control over the schema of the data sources is undesirable or impractical. Yet this is increasingly a critical problem in the world of large scale data, particularly as more and more of this kind of data is available over the Web.

In order to interact successfully in an open and heterogeneous environment, being able to dynamically and adaptively integrate large and heterogeneous data from the Web “on the go” is necessary. This may not be a precise process but a matter of finding a good enough integration to allow interaction to proceed successfully, even if a complete solution is impossible.

Considerable success has already been achieved in the field of ontology matching and merging, but the application of these techniques – often developed for static environments – to the dynamic integration of large-scale data has not been well studied.

Presenting the results of such dynamic integration to both end-users and database administrators – while providing quality assurance and provenance – is not yet a feature of many deployed systems. To make matters more difficult, on the Web there are massive amounts of information available online that could be integrated, but this information is often chaotically organised, stored in a wide variety of data-formats, and difficult to interpret.

This area has been of interest in academia for some time, and is becoming increasingly important in industry and – thanks to open data efforts and other initiatives – to government as well. The aim of this workshop is to bring together practitioners from academia, industry and government who are involved in all aspects of this field: from those developing, curating and using Linked Data, to those focusing on matching and merging techniques.

Topics of interest include, but are not limited to:

  • Integration of large and heterogeneous data
  • Machine-learning over structured data
  • Ontology evolution and dynamics
  • Ontology matching and alignment
  • Presentation of dynamically integrated data
  • Incentives and human computation over structured data and ontologies
  • Ranking and search over structured and semi-structured data
  • Quality assurance and data-cleansing
  • Vocabulary management in Linked Data
  • Schema and ontology versioning and provenance
  • Background knowledge in matching
  • Extensions to knowledge representation languages to better support change
  • Inconsistency and missing values in databases and ontologies
  • Dynamic knowledge construction and exploitation
  • Matching for dynamic applications (e.g., p2p, agents, streaming)
  • Case studies, software tools, use cases, applications
  • Open problems
  • Foundational issues

Applications and evaluations on data-sources that are from the Web and Linked Data are particularly encouraged.

Several years from now, how will you find this conference (and its proceedings)?

  • Large Heterogeneous Data 2012
  • Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012
  • LHD-12

Just curious.

Joint International Workshop on Entity-oriented and Semantic Search

Filed under: Entity Extraction,Entity Resolution,LOD,Semantic Search,Semantics — Patrick Durusau @ 7:32 am

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.

Topics

The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.

May 24, 2012

Visual and semantic interpretability of projections of high dimensional data for classification tasks

Filed under: Classification,High Dimensionality,Semantics,Visualization — Patrick Durusau @ 6:10 pm

Visual and semantic interpretability of projections of high dimensional data for classification tasks by Ilknur Icke and Andrew Rosenberg.

A number of visual quality measures have been introduced in visual analytics literature in order to automatically select the best views of high dimensional data from a large number of candidate data projections. These methods generally concentrate on the interpretability of the visualization and pay little attention to the interpretability of the projection axes. In this paper, we argue that interpretability of the visualizations and the feature transformation functions are both crucial for visual exploration of high dimensional labeled data. We present a two-part user study to examine these two related but orthogonal aspects of interpretability. We first study how humans judge the quality of 2D scatterplots of various datasets with varying number of classes and provide comparisons with ten automated measures, including a number of visual quality measures and related measures from various machine learning fields. We then investigate how the user perception on interpretability of mathematical expressions relate to various automated measures of complexity that can be used to characterize data projection functions. We conclude with a discussion of how automated measures of visual and semantic interpretability of data projections can be used together for exploratory analysis in classification tasks.
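For a concrete feel of what an automated projection-quality measure can look like, here is a minimal sketch of my own, not one of the ten measures the authors compare: project labeled data to 2D with PCA and score how well the classes separate in that view, using scikit-learn's silhouette score.

# Toy proxy for a "visual quality" measure: how well do the classes
# separate in a 2D PCA view of the data? Higher silhouette = cleaner view.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, y = load_iris(return_X_y=True)
view = PCA(n_components=2).fit_transform(X)   # one candidate 2D projection
print("class-separability of the view: %.3f" % silhouette_score(view, y))

The paper's point, as I read the abstract, is that a score like this says nothing about whether the projection axes themselves mean anything to a human, which is the second half of their study.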

A rather small group of test subjects (20), so I don't think you can say much beyond that more work is needed.

Then it occurred to me that I often speak of studies applying to “users” without stopping to remember that for many tasks, I fall into that self-same category. Subject to the same influences, fatigues and even mistakes.

Does anyone know of research by researchers that has been applied back to those same researchers?

May 6, 2012

Why Your Brain Isn’t A Computer

Filed under: Artificial Intelligence,Semantics,Subject Identity — Patrick Durusau @ 7:45 pm

Why Your Brain Isn’t A Computer by Alex Knapp.

Alex writes:

“If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.”
– Emerson M. Pugh

Earlier this week, io9 featured a primer, of sorts, by George Dvorsky regarding how an artificial human brain could be built. It’s worth reading, because it provides a nice overview of the philosophy that underlies some artificial intelligence research, while simultaneously – albeit unwittingly – demonstrating some of the fundamental flaws underlying artificial intelligence research based on the computational theory of mind.

The computational theory of mind, in essence, says that your brain works like a computer. That is, it takes input from the outside world, then performs algorithms to produce output in the form of mental state or action. In other words, it claims that the brain is an information processor where your mind is “software” that runs on the “hardware” of the brain.

Dvorsky explicitly invokes the computational theory of mind by stating “if brain activity is regarded as a function that is physically computed by brains, then it should be possible to compute it on a Turing machine, namely a computer.” He then sets up a false dichotomy by stating that “if you believe that there’s something mystical or vital about human cognition you’re probably not going to put too much credence” into the methods of developing artificial brains that he describes.

I don’t normally read Forbes but I made an exception in this case and am glad I did.

Not that I particularly care about which side of the AI debate you come out on.

I do think that the notion of “emergent” properties is an important one for judging subject identities, whether those subjects occur in text messages, intercepted phone calls, or signal “intel” of any sort.

Properties that identify subjects “emerge” from a person who speaks the language in question, who has social/intellectual/cultural experiences that give them a grasp of the matters under discussion and perhaps the underlying intent of the parties to the conversation.

A computer program can be trained to mindlessly sort through large amounts of data. It can even be trained to operate within acceptable levels of misreading and misinterpretation.

What will our evaluation be when it misses the one conversation prior to another 9/11? Because the context or language was not anticipated? Because the connection would only emerge out of a living understanding of cultural context?

Computers are deeply useful, but not when emergent properties are at issue, properties of the sort that identify subjects, targets and the like.

April 27, 2012

Making Search Hard(er)

Filed under: Identity,Searching,Semantics — Patrick Durusau @ 6:10 pm

Rafael Maia posts:

first R, now Julia… are programmers trying on purpose to come up with names for their languages that make it hard to google for info? 😛

I don’t know that two cases prove that programmers are responsible for all the semantic confusion in the world.

A search for FORTRAN produces FORTRAN Formula Translation/Translator.

But compare COBOL:


  • COBOL: Common Business-Oriented Language
  • COBOL: Completely Obsolete Business-Oriented Language 🙂
  • COBOL: Completely Over and Beyond Obvious Logic 🙂
  • COBOL: Compiles Only By Odd Luck 🙂
  • COBOL: Completely Obsolete Burdensome Old Language 🙂

Maybe there is something to programmers peeing in the semantic pool.

On the other hand, there are examples of the semantic overloading of strings that long predate programming.

Here is an interesting question:

Is a string overloaded, semantically speaking, when used or read?

Does your answer impact how you would build a search engine? Why/why not?
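One way to make the question concrete: the toy sketch below treats a string as overloaded at read time and picks a sense for "R" or "Julia" from the words around it. The sense inventories are invented for illustration, not drawn from any real search engine.

# Toy word-sense disambiguation: the string is treated as overloaded at
# read time and resolved against the words that surround it. The sense
# vocabularies are invented for illustration.
SENSES = {
    ("R", "programming language"): {"cran", "ggplot2", "dataframe", "regression", "plot"},
    ("R", "letter"): {"alphabet", "letter", "spelling"},
    ("Julia", "programming language"): {"julialang", "dispatch", "repl", "package"},
    ("Julia", "given name"): {"actress", "aunt", "name", "birthday"},
}

def read_sense(term, context_words):
    """Pick the sense whose vocabulary overlaps the context the most."""
    context = {w.lower().strip(".,") for w in context_words}
    scores = {sense: len(vocab & context)
              for (t, sense), vocab in SENSES.items() if t == term}
    best = max(scores, key=scores.get) if scores else None
    return best if scores and scores[best] > 0 else None

print(read_sense("R", "fit a regression on the dataframe and plot it".split()))
# -> "programming language"

Whether you run something like this at indexing time (the writer's use) or at query time (the reader's reading) is exactly where the question above starts to bite.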

March 17, 2012

Paper Review: “Recovering Semantics of Tables on the Web”

Filed under: Searching,Semantic Annotation,Semantics — Patrick Durusau @ 8:19 pm

Paper Review: “Recovering Semantics of Tables on the Web”

Sean Golliher writes:

A paper entitled “Recovering Semantics of Tables on the Web” was presented at the 37th Conference on Very Large Databases in Seattle, WA. The paper’s authors included 6 Google engineers along with professor Petros Venetis of Stanford University and Gengxin Miao of UC Santa Barbara. The paper summarizes an approach for recovering the semantics of tables with additional annotations other than what the author of a table has provided. The paper is of interest to developers working on the semantic web because it gives insight into how programmers can use semantic data (database of triples) and Open Information Extraction (OIE) to enhance unstructured data on the web. In addition they compare how a “maximum-likelihood” model, used to assign class labels to tables, compares to a “database of triples” approach. The authors show that their method for labeling tables is capable of labeling “an order of magnitude more tables on the web than is possible using Wikipedia/YAGO and many more than freebase.”

The authors claim that “the Web offers approximately 100 million tables but the meaning of each table is rarely explicit from the table itself”. Tables on the Web are embedded within HTML, which makes extracting meaning from them a difficult task. Since tables are embedded in HTML, search engines typically treat them like any other text in the document. In addition, authors of tables usually have labels that are specific to their own labeling style and assigned attributes are usually not meaningful. As the authors state: “Every creator of a table has a particular Schema in mind”. In this paper, the authors describe a system where they automatically add additional annotations to a table in order to extract meaningful relationships between the entities in the table and other columns within the table. The authors reference the table example shown below in Table 1.1. The table has no row or column labels and there is no title associated with it. To extract the meaning from this table using text analysis, a search engine would have to relate the table entries to the surrounding text in the document and/or analyze the text entries in the table.

The annotation process, first with class/instance and then out of a triple database, reminds me of Newcomb’s “conferral” of properties. That is, some content in the text (or in a subject representative/proxy) causes additional key/value pairs to be assigned/conferred. Nothing particularly remarkable about that process.
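To make the column-labeling side of the paper concrete, here is a minimal sketch of that kind of conferral: look each cell value up in an isA lookup and let the column take the majority class. The tiny isA dictionary is invented for illustration; the paper's system draws on a web-scale isA database extracted from text, plus the maximum-likelihood model mentioned above.

# Minimal sketch of majority-vote column labeling against an isA lookup.
# The dictionary here is a stand-in; the real system uses a web-scale
# isA database extracted from text.
from collections import Counter

ISA = {
    "paris": ["city", "given name"],
    "london": ["city"],
    "berlin": ["city"],
    "madrid": ["city"],
}

def label_column(cells):
    """Confer a class label on a column from the classes of its cell values."""
    votes = Counter()
    for cell in cells:
        votes.update(ISA.get(cell.strip().lower(), []))
    return votes.most_common(1)[0][0] if votes else None

print(label_column(["Paris", "London", "Berlin", "Madrid"]))  # -> "city"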

I am not suggesting that the isA/triple database strategy will work equally well for all areas. Which annotation/conferral strategy works best for you will depend on your data and the requirements imposed on a solution. I would like to hear from you about annotation/conferral strategies that work with particular data sets.
