Cry Me A River, But First Let’s Agree About What A River Is
The post starts off well enough:
How do you define a forest? How about deforestation? It sounds like it would be fairly easy to get agreement on those terms. But beyond the basics – that a definition for the first would reflect that a forest is a place with lots of trees and the second would reflect that it’s a place where there used to be lots of trees – it’s not so simple.
And that has consequences for everything from academic and scientific research to government programs. As explained by Krzysztof Janowicz, perfectly valid definitions for these and other geographic terms exist by the hundreds, in legal texts and government documents and elsewhere, and most of them don’t agree with each other. So, how can one draw good conclusions or make important decisions when the data informing those is all over the map, so to speak.
Having enough data isn’t the problem – there’s official data from the government, volunteer data, private organization data, and so on – but if you want to do a SPARQL query of it to discover all towns in the U.S., you’re going to wind up with results that include the places in Utah with populations of less than 5,000, and Los Angeles too – since California legally defines cities and towns as the same thing.
“So this clearly blows up your data, because your analysis is you thinking that you are looking at small rural places,” he says.
This Big Data challenge is not a new problem for the geographic-information sciences community. But it is one that’s getting even more complicated, given the tremendous influx of more and more data from more and more sources: Satellite data, rich data in the form of audio and video, smart sensor network data, volunteer location data from efforts like the Citizen Science Project and services like Facebook Places and Foursquare. “The heterogeneity of data is still increasing. Semantic web tools would help you if you had the ontologies but we don’t have them,” he says. People have been trying to build top-level global ontologies for a couple of decades, but that approach hasn’t yet paid off, he thinks. There needs to be more of a bottom-up take: “The biggest challenge from my perspective is coming up with the rules systems and ontologies from the data.”
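The towns example above can be made concrete in a few lines. This is a toy sketch with invented records and a hypothetical schema, not real census data, but it shows the failure mode: the label "town" drives the query, while the analyst's unstated definition ("small rural place") does not.

```python
# Toy illustration (hypothetical data): a naive "find all towns" query.
# California legally treats cities and towns as the same thing, so the
# same term covers both a hamlet and Los Angeles.
places = [
    {"name": "Alta, UT",        "state": "UT", "kind": "town", "population": 383},
    {"name": "Hildale, UT",     "state": "UT", "kind": "town", "population": 2972},
    {"name": "Los Angeles, CA", "state": "CA", "kind": "town", "population": 3_898_747},
]

# The analyst thinks "town" means "small rural place"...
towns = [p for p in places if p["kind"] == "town"]
print([p["name"] for p in towns])
# ...but Los Angeles comes back too, because the label, not the
# analyst's definition, is what the query actually tests.
```

The same thing happens with a SPARQL query over real data: the pattern matches whatever the publisher's legal definition labeled "town," so the result set silently mixes incompatible definitions.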
All true, and much of it is what objectors to the current Semantic Web approach have been saying for a very long time.
I am not sure about the line: “The heterogeneity of data is still increasing.”
In part because I don’t know of any reliable measure of heterogeneity by which such a comparison could be made. True, there is more data now than at some point X in the past, but that isn’t necessarily an indication of increased heterogeneity. But that is a minor point.
More serious is the “then a miracle occurs” statement that follows:
How to do it, he thinks, is to make very small and really local ontologies directly mined with the help of data mining or machine learning techniques, and then interlink them and use new kinds of reasoning to see how to reason in the presence of inconsistencies. “That approach is local ontologies that arrive from real application needs,” he says. “So we need ontologies and semantic web reasoning to have neater data that is human and also machine readable. And more effective querying based on analogy or similarity reasoning to find data sets that are relevant to our work and exclude data that may use the same terms but has different ontological assumptions underlying it.”
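At its crudest, "excluding data that has different ontological assumptions" might amount to comparing which properties two datasets attach to a term and keeping only those above a similarity cutoff. The following is a toy sketch with invented datasets and a made-up threshold; it is worth noticing how coarse the machine-usable version of an "ontological assumption" has to be.

```python
# Toy sketch (invented data): the crudest possible "same ontological
# assumptions?" test -- compare the property sets two datasets attach
# to the term "town" and keep only datasets above a similarity cutoff.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of two property sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Properties our reference definition of "town" assumes.
reference = {"population", "incorporated", "rural", "max_population"}

datasets = {
    "utah_gazetteer": {"population", "incorporated", "rural", "max_population"},
    "california_gis": {"population", "incorporated"},  # no size cap at all
}

# Keep only datasets whose assumptions mostly overlap with ours.
compatible = {name for name, props in datasets.items()
              if jaccard(props, reference) >= 0.75}
print(compatible)
```

Here the California dataset is excluded (similarity 0.5), which is the desired outcome for the towns example. But note what this "similarity reasoning" is actually comparing: property names, not what anyone means by them.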
Doesn’t that have the same feel as the original Semantic Web proposals that were going to eliminate semantic ambiguity from the top down? The very approach that is panned in this article?
And “new kinds of reasoning,” ones I assume have not been invented yet, are going “to reason in the presence of inconsistencies.” And excluding data that “…has different ontological assumptions underlying it.”
Since we are the source of the ontological assumptions that underlie our use of terms, I am really curious how those assumptions are going to become available to these yet-to-be-invented reasoning techniques.
Oh, that’s right, we are all going to specify our ontological assumptions at the bottom to percolate up. Except that to be useful for machine reasoning, they will have to be as crude as the ones that were going to be imposed from the top down.
I wonder why the indeterminate nature of semantics continues to elude Semantic Web researchers. A snapshot of semantics today may be slightly incorrect tomorrow, probably incorrect in some respect in a month and almost surely incorrect in a year or more.
Take Saddam Hussein, for example. One-time friend and confidant of Donald Rumsfeld (there are pictures). But over time those semantics changed, largely because Hussein slipped the leash and was no longer a proper vassal to the US. Suddenly the weapons of mass destruction, including nerve gas we had helped him acquire, became a concern, and Hussein became an enemy of the US. Same person, same facts. Different semantics.
There are less dramatic examples, but you get the idea.
We can capture even changing semantics, but we need to decide what semantics we want to capture, and at what cost. Perhaps that is a better way to frame my objection to most Semantic Web activities: they are not properly scoped. Yes?
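One minimal way to make "capturing changing semantics" concrete is to time-scope assertions, so the same subject and the same facts yield different classifications at different dates. This is a toy sketch with approximate dates, not a proposal; a real system would need provenance, context, and much more.

```python
# Toy sketch (approximate dates): time-scoped assertions, where later
# assertions supersede earlier ones, so "same person, same facts" can
# carry different semantics depending on when you ask.
from datetime import date

# (subject, relation, object, valid_from)
assertions = [
    ("Hussein", "relation_to_US", "ally",  date(1983, 12, 1)),
    ("Hussein", "relation_to_US", "enemy", date(1990, 8, 2)),
]

def relation_at(when: date) -> str:
    """Return the object of the most recent assertion valid at `when`."""
    valid = [a for a in assertions if a[3] <= when]
    return max(valid, key=lambda a: a[3])[2]

print(relation_at(date(1985, 6, 1)))   # -> ally
print(relation_at(date(1991, 1, 1)))   # -> enemy
```

Even this trivial version makes the scoping question unavoidable: which relations, at what granularity, recorded by whom, and at what cost?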