Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 22, 2014

CIDOC Conceptual Reference Model

Filed under: Conceptualizations,Heterogeneous Data,Integration,Museums,Semantic Diversity — Patrick Durusau @ 4:45 pm

CIDOC Conceptual Reference Model (pdf)

From the “Definition of the CIDOC Conceptual Reference Model:”

This document is the formal definition of the CIDOC Conceptual Reference Model (“CRM”), a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information. The CRM is the culmination of more than a decade of standards development work by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Work on the CRM itself began in 1996 under the auspices of the ICOM-CIDOC Documentation Standards Working Group. Since 2000, development of the CRM has been officially delegated by ICOM-CIDOC to the CIDOC CRM Special Interest Group, which collaborates with the ISO working group ISO/TC46/SC4/WG9 to bring the CRM to the form and status of an International Standard.

Objectives of the CIDOC CRM

The primary role of the CRM is to enable information exchange and integration between heterogeneous sources of cultural heritage information. It aims at providing the semantic definitions and clarifications needed to transform disparate, localised information sources into a coherent global resource, be it within a larger institution, in intranets or on the Internet. Its perspective is supra-institutional and abstracted from any specific local context. This goal determines the constructs and level of detail of the CRM.

More specifically, it defines and is restricted to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation in terms of a formal ontology. It does not define any of the terminology appearing typically as data in the respective data structures; however it foresees the characteristic relationships for its use. It does not aim at proposing what cultural institutions should document. Rather it explains the logic of what they actually currently document, and thereby enables semantic interoperability.

It intends to provide a model of the intellectual structure of cultural documentation in logical terms. As such, it is not optimised for implementation-specific storage and processing aspects. Implementations may lead to solutions where elements and links between relevant elements of our conceptualizations are no longer explicit in a database or other structured storage system. For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

The CRM aims to support the following specific functionalities:

  • Inform developers of information systems as a guide to good practice in conceptual modelling, in order to effectively structure and relate information assets of cultural documentation.
  • Serve as a common language for domain experts and IT developers to formulate requirements and to agree on system functionalities with respect to the correct handling of cultural contents.
  • To serve as a formal language for the identification of common information contents in different data formats; in particular to support the implementation of automatic data transformation algorithms from local to global data structures without loss of meaning. The latter being useful for data exchange, data migration from legacy systems, data information integration and mediation of heterogeneous sources.
  • To support associative queries against integrated resources by providing a global model of the basic classes and their associations to formulate such queries.
  • It is further believed, that advanced natural language algorithms and case-specific heuristics can take significant advantage of the CRM to resolve free text information into a formal logical form, if that is regarded beneficial. The CRM is however not thought to be a means to replace scholarly text, rich in meaning, by logical forms, but only a means to identify related data.

(emphasis in original)

Apologies for the long quote but this covers a number of important topic map issues.

For example:

For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

In topic map terms, I would say that the database omits a topic to represent the “birth event,” so there is nothing to play the event role in associations with the father, mother, birth date and birth place. Which subjects will have representatives in a topic map is always a concern for topic map authors.
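A minimal sketch of that point in Python. The `Topic` and `Association` classes, the association kinds and the role names are all hypothetical, made up for illustration; they are not taken from the CRM or from any topic map API. With a topic for the birth event, one lookup finds everything connected through it. Without that topic, the same lookup has nothing to start from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """Hypothetical representative for one subject."""
    name: str


@dataclass(frozen=True)
class Association:
    """Hypothetical typed relationship: (role name, topic) pairs."""
    kind: str
    roles: tuple

    def players(self):
        # The set of topics playing any role in this association.
        return {topic for _, topic in self.roles}


def linked_through(subject, associations):
    """Return every topic that shares an association with `subject`."""
    linked = set()
    for assoc in associations:
        if subject in assoc.players():
            linked |= assoc.players() - {subject}
    return linked


birth = Topic("birth event")
child, mother, father = Topic("child"), Topic("mother"), Topic("father")
place = Topic("birth place")

# The birth event is an explicit topic that plays a role everywhere.
with_event = [
    Association("brought-into-life", (("event", birth), ("child", child))),
    Association("by-mother", (("event", birth), ("mother", mother))),
    Association("from-father", (("event", birth), ("father", father))),
    Association("took-place-at", (("event", birth), ("place", place))),
]

# The event is omitted; only direct shortcuts between the others remain.
without_event = [
    Association("has-mother", (("child", child), ("mother", mother))),
    Association("has-father", (("child", child), ("father", father))),
    Association("born-at", (("child", child), ("place", place))),
]

everyone = linked_through(birth, with_event)
nobody = linked_through(birth, without_event)
```

The shortcut version still records who the mother is, but it cannot answer any question phrased in terms of the event itself, which is the trade-off the CRM text describes.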

Helpfully, CIDOC explicitly separates the semantics it documents from data structures.

Less helpfully:

Because the CRM’s primary role is the meaningful integration of information in an Open World, it aims to be monotonic in the sense of Domain Theory. That is, the existing CRM constructs and the deductions made from them must always remain valid and well-formed, even as new constructs are added by extensions to the CRM.

Which restricts integration using the CRM to systems where the CRM is the primary basis for integration, as opposed to being one way to integrate several data sets.
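One way to picture the monotonicity requirement is as a subset test: everything deducible before an extension must still be deducible after it. The sketch below assumes, purely for illustration, that "deductions" means the transitive closure of a subclass relation. The class labels and the helper names `closure` and `is_monotonic_extension` are my own, not part of the CRM definition.

```python
def closure(edges):
    """Transitive closure of a set of (subclass, superclass) pairs."""
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


def is_monotonic_extension(base, extended):
    """True if every deduction from `base` survives in `extended`."""
    return closure(base) <= closure(extended)


# Illustrative labels only, not a quotation of the CRM class hierarchy.
base = {
    ("Birth", "Beginning of Existence"),
    ("Beginning of Existence", "Event"),
}

# Adding a new construct leaves every earlier deduction intact.
added = base | {("Adoption", "Event")}

# Revising an existing construct drops an earlier deduction.
revised = {("Birth", "Event")}

keeps_old_deductions = is_monotonic_extension(base, added)
breaks_old_deductions = is_monotonic_extension(base, revised)
```

Under this reading, an integration that needs to revise an existing construct rather than add to it falls outside what the CRM permits.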

That may not seem important in “web time,” where 3 months equals 1 Internet year. But when you think of integrating data and integration practices as they evolve over decades if not centuries, the limitations of monotonic choices come to the fore.

To take one practical discussion under way: how to handle warnings about radioactive waste, which must endure anywhere from 10,000 to 1,000,000 years? A far simpler task than preserving semantics over centuries.

If you think that is easy, remember that lots of people saw the pyramids of Egypt being built. But it was such common knowledge that no one thought to write it down.

Preservation of semantics is a daunting task.

CIDOC merits a slow read by anyone interested in modeling, semantics, vocabularies, and preservation.

PS: CIDOC: Conceptual Reference Model as a Word file.

April 22, 2012

Big Data and the Coming Conceptual Model Revolution

Filed under: BigData,Conceptualizations — Patrick Durusau @ 7:07 pm

Big Data and the Coming Conceptual Model Revolution

Malcolm Chisholm writes:

Conceptual models must capture all business concepts and all relevant relationships. If instances of things are also part of the business reality, they must be captured too. Unfortunately, there is no standard methodology and notation to do this. Conceptual models that communicate business reality effectively require some degree of artistic imagination. They are products of analysis, not of design. (emphasis added)

That’s the trick, isn’t it? Developing a good conceptual model.

You can have system requirements for multiple Terabytes of data storage, Gigabytes of bandwidth, messages and processes galore, but if you don’t have a good conceptual model, it’s just so much hardware junk.

Are you planning your system based on hardware or software capabilities?

Or are you developing a conceptual model you want to implement in hardware and software?

Which one do you think will come closer to meeting your needs?

January 6, 2012

General Purpose Computer-Assisted Clustering and Conceptualization

Filed under: Clustering,Conceptualizations — Patrick Durusau @ 11:39 am

General Purpose Computer-Assisted Clustering and Conceptualization by Justin Grimmer and Gary King.

Abstract:

We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an “insightful” or “useful” way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.

Despite my misgivings about metric spaces for semantics, the central theme, that clustering (dare I say merging?) cannot be determined in advance of some user viewing the data, makes sense to me. Not every user will want or perhaps even need to do interactive clustering, but I think this theme represents a substantial advance in this area.
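To make "a metric space of partitions" a little more concrete, here is a small sketch of one well-known distance between two clusterings of the same objects, the variation of information. This is my choice for illustration; the abstract does not say which distance the authors use, and the function name is hypothetical.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Distance between two partitions given as per-object cluster labels.

    Computes H(A|B) + H(B|A) in nats. It is zero when the two partitions
    group the objects identically, whatever the label names are.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    distance = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        distance -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return distance


# Same grouping under different label names.
same = variation_of_information([0, 0, 1, 1], ["x", "x", "y", "y"])

# Two groupings that cut the four objects in unrelated ways.
different = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A distance like this lets clusterings be laid out and explored as points in a space, but it says nothing about which clustering is insightful, which is exactly the judgment the authors leave to the user.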

The publication appeared in the Proceedings of the National Academy of Sciences of the United States of America, and the authors are from Stanford and Harvard, respectively, institutions that value technical and scientific brilliance.
