The CMS Data Aggregation System
Meta-data plays a significant role in large modern enterprises, research experiments and digital libraries where it comes from many different sources and is distributed in a variety of digital formats. It is organized and managed by constantly evolving software using both relational and non-relational data sources. Even though we can apply an information retrieval approach to non-relational data sources, we can’t do so for relational ones, where information is accessed via a pre-established set of data-services.
Here we discuss a new data aggregation system which consumes, indexes and delivers information from different relational and non-relational data sources to answer cross data-service queries and explore meta-data associated with petabytes of experimental data. We combine the simplicity of keyword-based search with the precision of RDMS under the new system. The aggregated information is collected from various sources, allowing end-users to place dynamic queries, get precise answers and trigger information retrieval on demand. Based on the use cases of the CMS experiment, we have performed a set of detailed, large scale tests the results of which we present in this paper.
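To make the idea of a "cross data-service query" concrete, here is a minimal sketch of what keyword-style aggregation could look like. It is not the CMS system's code: the two in-memory "services", the `key=value` query syntax, the field names, and every function name below are assumptions made up for illustration. The sketch parses a keyword query, hands it to each service, and joins the answers on a shared key.

```python
from collections import defaultdict

# Hypothetical stand-ins for two independent data-services (illustrative data only).
BOOKKEEPING = [
    {"dataset": "/a/b/RAW", "events": 1000},
    {"dataset": "/c/d/RECO", "events": 250},
]
LOCATIONS = [
    {"dataset": "/a/b/RAW", "site": "site_one"},
    {"dataset": "/a/b/RAW", "site": "site_two"},
    {"dataset": "/c/d/RECO", "site": "site_one"},
]


def parse_query(text):
    """Turn a keyword query such as 'dataset=/a/b/RAW' into a dict of conditions."""
    conditions = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        conditions[key] = value
    return conditions


def make_service(rows):
    """Wrap a list of rows as a service that filters on the keys it knows about."""
    def service(conditions):
        return [
            row for row in rows
            if all(row[k] == v for k, v in conditions.items() if k in row)
        ]
    return service


def aggregate(text, services, join_key="dataset"):
    """Query every service, then inner-join the results on join_key."""
    conditions = parse_query(text)
    per_service = []
    for service in services:
        grouped = defaultdict(list)
        for row in service(conditions):
            grouped[row[join_key]].append(row)
        per_service.append(grouped)
    if not per_service:
        return []
    # Keep only join values that every service returned.
    common = set(per_service[0]).intersection(*per_service[1:])
    merged = []
    for value in sorted(common):
        record = {join_key: value}
        for grouped in per_service:
            for row in grouped[value]:
                for k, v in row.items():
                    if k == join_key:
                        continue
                    values = record.setdefault(k, [])
                    if v not in values:
                        values.append(v)
        merged.append(record)
    return merged


services = [make_service(BOOKKEEPING), make_service(LOCATIONS)]
result = aggregate("dataset=/a/b/RAW", services)
print(result)
```

The inner join is a deliberate simplification: a record survives only if every service has something to say about it, which keeps a condition that only one service understands (for example `site=site_two`) from leaking unrelated rows into the answer.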
When I first skimmed this article, I found it quite exciting: merging data from different domains, etc.
Unless I am really mistaken (always possible), the “integration” described here is highly specific to the data in question. Still a very impressive project.
Supplemental reading:
A multi-dimensional view on information retrieval of CMS data
CMS Data Aggregation System – Slides
Questions:
- Are the techniques described applicable to other data sets? (3-5 pages, citations)
- What, if anything, would you change about this system to make it more general? (3-5 pages, no citations)
- How would you extend (if necessary) this system to integrate it with resources in your collection?
PS: The software is supposed to be released as public domain software.