Querying Semi-Structured Data
The “Semi-structured data and P2P graph databases” post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.
From the Introduction:
The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.
As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.
The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.
As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.
The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.
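To make the “lightweight” labelled-graph idea in the quoted introduction a little more concrete, here is a minimal sketch in Python. It is not taken from the paper: the names Node, follow and build_example are hypothetical, and the second entry in the sample data is made up purely to show irregular structure. It only illustrates the general shape of such a model, where nodes hold either an atomic value or a set of labelled edges, and a query follows a path of labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Atom = Union[str, int, float]


@dataclass
class Node:
    """A node holds an atomic value, labelled edges to other nodes, or both."""
    value: Optional[Atom] = None
    edges: List[Tuple[str, "Node"]] = field(default_factory=list)

    def add(self, label: str, child: "Node") -> "Node":
        """Attach child under the given edge label and return the child."""
        self.edges.append((label, child))
        return child


def follow(node: Node, path: List[str]) -> List[Node]:
    """Return every node reached from node by following the labels in order."""
    frontier = [node]
    for label in path:
        frontier = [child
                    for current in frontier
                    for (edge_label, child) in current.edges
                    if edge_label == label]
    return frontier


def build_example() -> Node:
    """Build a tiny graph whose two 'paper' entries have different shapes."""
    root = Node()
    first = root.add("paper", Node())
    first.add("title", Node("Querying Semi-Structured Data"))
    first.add("author", Node("Abiteboul"))
    first.add("year", Node(1996))
    # Hypothetical second entry: no author and no year (irregular structure).
    second = root.add("paper", Node())
    second.add("title", Node("An untitled draft"))
    return root


if __name__ == "__main__":
    root = build_example()
    print([n.value for n in follow(root, ["paper", "title"])])
    print([n.value for n in follow(root, ["paper", "year"])])
```

Nothing here forces the two entries to share a schema, which is the point: the path query simply returns whatever matches, and entries that lack a label drop out of the result.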
A bit dated (1996), but still worth reading. Updating the paper would make a nice semester-size project.
BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.
I would really like to know what the people downloading it in Jan/Feb of 2011 were looking for. Perhaps they are working on an update of the paper, or would like to collaborate on updating it?
Seems like a small “feature” that would allow researchers to contact others without disclosure of email addresses (other than for the sender of course).
Formal publication data:
Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)