Hafslund Sesam — an archive on semantics by Lars Marius Garshol and Axel Borge.
Abstract:
Sesam is an archive system developed for Hafslund, a Norwegian energy company. It achieves the often-sought but rarely-achieved goal of automatically enriching metadata by using semantic technologies to extract and integrate business data from business applications. The extracted data is also indexed with a search engine together with the archived documents, allowing true enterprise search.
A curious paper that requires careful reading.
Since the paper makes technology choices, it’s only appropriate to start with the requirements:
The system must handle 1000 users, although not necessarily simultaneously.
Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.
The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.
To inherit metadata tags automatically requires running queries to achieve transitive closure. Assuming on average 10 queries for each document, the system must be able to handle 20 queries per second on 100 million statements.
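To make the tag-inheritance arithmetic concrete, here is a minimal sketch of the kind of transitive lookup the requirement describes. All names and the hierarchy shape are invented for illustration; they are not taken from the paper.

```python
# Hypothetical sketch: inheriting metadata tags by walking a parent
# hierarchy to its transitive closure. Each step up the hierarchy
# corresponds to one lookup (query), which is why tagging a single
# document can cost several queries.

# Each object points to the objects it inherits tags from.
parents = {
    "document-1": ["case-17"],
    "case-17": ["customer-42"],
    "customer-42": ["region-3"],
    "region-3": [],
}

tags = {
    "case-17": {"case:17"},
    "customer-42": {"customer:42"},
    "region-3": {"region:3"},
}

def inherited_tags(obj):
    """Collect tags from obj and everything reachable above it."""
    seen, stack, result = set(), [obj], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        result |= tags.get(current, set())
        stack.extend(parents.get(current, []))
    return result

print(sorted(inherited_tags("document-1")))
# → ['case:17', 'customer:42', 'region:3']
```

At 2 documents per second and roughly 10 such lookups per document, the paper's 20 queries per second figure falls out directly.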
In the next section, the authors concede that the fourth requirement, "RDF data integration," was unrealistic, and so it was dropped:
The canonical approach to RDF data integration is currently query federation of SPARQL queries against a set of heterogeneous data sources, often using R2RML. Given the size of the data set, the generic nature of the transitive closure queries, and the number of data sources to be supported, we considered achieving 20 queries per second with query federation unrealistic.
Which leaves only:
The system must handle 1000 users, although not necessarily simultaneously.
Initial calculations of data size assumed 1.4 million customers and 1 million electric meters with 30-50 properties each. Including various other data gave a rough estimate on the order of 100 million statements.
The archive must be able to receive up to 2 documents per second over an interval of many hours, in order to handle about 100,000 documents a day during peak periods. The documents would mostly be paper forms recording electric meter readings.
as the requirements to be met.
I mention that because of the following technology choice statement:
To write generic code we must use a schemaless data representation, which must also be standards-based. The only candidates were Topic Maps [ISO13250-2] and RDF. The available Topic Maps implementations would not be able to handle the query throughput at the data sizes required. Testing of the Virtuoso triple store indicated that it could handle the workload just fine. RDF thus appeared to be the only suitable technology.
But there is no query throughput requirement, at least not for the storage mechanism. For deduplication in the ERP system (section 3.5), the authors chose neither topic maps nor RDF but a much older technique: record linkage.
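Record linkage, in its simplest form, scores field-by-field similarity between two records and links them when a weighted combination of those scores passes a threshold. A minimal sketch follows; the field names, weights, threshold, and sample records are invented for illustration and are not drawn from the paper.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Invented weights: the name matters most, then address, then phone.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

def link_score(rec1, rec2):
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(rec1[f], rec2[f])
               for f, w in WEIGHTS.items())

# Two records that plausibly describe the same customer:
a = {"name": "Ola Nordmann", "address": "Storgata 1, Oslo",
     "phone": "22334455"}
b = {"name": "Ola Nordman", "address": "Storgata 1, 0155 Oslo",
     "phone": "22334455"}

score = link_score(a, b)
print(f"{score:.2f}", "link" if score > 0.85 else "no link")
```

The point of the sketch is that record linkage is a statistical comparison of field values; it needs no schemaless data model, no standards-based representation, and no query language at all.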
The other query mechanism is a Recommind search engine, which is reported to be unable to index and search at the same time (section 4.1).
If I am reading the paper correctly, data are stored as received from the various sources, and owl:sameAs statements are used to map them to the archive's schema.
I puzzle at that point, because RDF is simply a data format and OWL merely one means of stating such a mapping.
Given the semantic vagaries of owl:sameAs (Semantic Drift and Linked Data/Semantic Web), I have to wonder about the longer-term maintenance of owl:sameAs mappings.
There is no expression of a reason for any "sameAs" link, a reason that might prompt a future maintainer of the system to follow, or not follow, some particular "sameAs."
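The concern can be made concrete: owl:sameAs links collapse identifiers into equivalence classes, and a typical implementation merges them with something like union-find. The sketch below uses hypothetical identifiers to show why an unannotated link is hard to audit later.

```python
# Hypothetical sketch: owl:sameAs links merged via a simple
# union-find. The merged structure keeps no record of WHY any two
# identifiers were equated, which is the maintenance problem.

parent = {}

def find(x):
    """Return the canonical representative of x's class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def same_as(a, b):
    """Record an owl:sameAs link: merge the two classes."""
    parent[find(a)] = find(b)

# Links asserted by different integration jobs at different times:
same_as("erp:cust-42", "crm:person-9001")
same_as("crm:person-9001", "archive:party-77")

# All three identifiers now resolve to one canonical node...
assert find("erp:cust-42") == find("archive:party-77")
# ...but nothing records who asserted each link, when, or on what
# evidence, so a future maintainer cannot selectively revisit one.
```

Attaching provenance (source, date, evidence) to each link would require reifying the sameAs statements, which the paper does not report doing.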
Still, the project was successful and that counts for more than using any single technology to the exclusion of all others.
The comments on the performance of the Topic Maps options do make me mindful of the lack of benchmark data sets for topic maps.