Archive for the ‘Data Models’ Category

OMOP Common Data Model V5.0

Friday, February 19th, 2016

OMOP Common Data Model V5.0

From the webpage:

The Observational Medical Outcomes Partnership (OMOP) was a public-private partnership established to inform the appropriate use of observational healthcare databases for studying the effects of medical products. Over the course of the 5-year project and through its community of researchers from industry, government, and academia, OMOP successfully achieved its aims to:

  1. Conduct methodological research to empirically evaluate the performance of various analytical methods on their ability to identify true associations and avoid false findings
  2. Develop tools and capabilities for transforming, characterizing, and analyzing disparate data sources across the health care delivery spectrum, and
  3. Establish a shared resource so that the broader research community can collaboratively advance the science.

The results of OMOP's research have been widely published and presented at scientific conferences, including annual symposia.

The OMOP Legacy continues…

The community is actively using the OMOP Common Data Model for their various research purposes. Those tools will continue to be maintained and supported, and information about this work is available in the public domain.

The OMOP Research Lab, a central computing resource developed to facilitate methodological research, has been transitioned to the Reagan-Udall Foundation for the FDA under the Innovation in Medical Evidence Development and Surveillance (IMEDS) Program, and has been re-branded as the IMEDS Lab. Learn more at

Observational Health Data Sciences and Informatics (OHDSI) has been established as a multi-stakeholder, interdisciplinary collaborative to create open-source solutions that bring out the value of observational health data through large-scale analytics. The OHDSI collaborative includes all of the original OMOP research investigators, and will develop its tools using the OMOP Common Data Model. Learn more at

The OMOP Common Data Model will continue to be an open-source, community standard for observational healthcare data. The model specifications and associated work products will be placed in the public domain, and the entire research community is encouraged to use these tools to support everybody's own research activities.

One of the many data models that will no doubt be in play as work begins on searching for a common cancer research language.

Every data model has a constituency; the trick is to find two or more where cross-mapping has semantic and, hopefully, financial ROI.

I first saw this in a tweet by Christophe Lalanne.

Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Thursday, July 2nd, 2015

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse (

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.

Kirk goes on to say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model elements/terms of common data models, taxonomies and ontologies? That would enable users to search across such elements/terms by the properties of those topics, possibly discovering topics that represent the same subject under a different term or element.

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms, with a wealth of information, including previous elements/terms for the same subject, present at each topic.
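
A minimal sketch of searching terms by topic properties, in Python. The `Topic` class, the vocabulary names and the sample terms are all hypothetical, chosen only for illustration; no particular topic map engine or standard is implied.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One subject, with every term and property recorded for it (hypothetical)."""
    terms: dict                                   # vocabulary name -> term used there
    properties: dict = field(default_factory=dict)


def find_by_properties(topics, **wanted):
    """Return the topics whose properties include every requested key/value pair."""
    return [t for t in topics
            if all(t.properties.get(k) == v for k, v in wanted.items())]


# Illustrative data only: two vocabularies that name the same subject differently.
topics = [
    Topic(terms={"vocab_old": "Automobile", "vocab_new": "Car"},
          properties={"kind": "vehicle", "wheels": 4}),
    Topic(terms={"vocab_new": "Bicycle"},
          properties={"kind": "vehicle", "wheels": 2}),
]

# Search by properties rather than by term; the match carries both terms.
matches = find_by_properties(topics, kind="vehicle", wheels=4)
for topic in matches:
    print(topic.terms)
```

A user who only knows the older term and a user who only knows the newer one land on the same topic, because the lookup goes through the properties and not through either term.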

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

Domain Modeling: Choose your tools

Sunday, June 28th, 2015

Kirk Borne posted to Twitter:

Great analogy by @wcukierski at #GEOINT2015 on #DataScience Domain Modeling > bulldozers: toy model versus the real thing.



Does your tool adapt to the data? (The real bulldozer above.)

Or, do you adapt your data to the tool? (The toy bulldozer above.)

No, I’m not going there. That is like a “the best editor” flame war. You have to decide that question for yourself and your project.

Good luck!

NoSQL Data Modelling (Jan Steemann)

Monday, December 22nd, 2014

From the description:

Learn about data modelling in a NoSQL environment in this half-day class.

Even though most NoSQL databases follow the “schema-free” data paradigma, what a database is really good at is determined by its underlying architecture and storage model.

It is therefore important to choose a matching data model to get the best out of the underlying database technology. Application requirements such as consistency demands also need to be considered.

During the half-day, attendees will get an overview of different data storage models available in NoSQL databases. There will also be hands-on examples and experiments using key/value, document, and graph data structures.

No prior knowledge of NoSQL databases is required. Some basic experience with relational databases (like MySQL) or data modelling will be helpful but is not essential. Participants will need to bring their own laptop (preferably Linux or MacOS). Installation instructions for the required software will be sent out prior to the class.

Great lecture on beginning data modeling for NoSQL.
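
To make the three storage models named in the description concrete, here is a rough sketch that holds the same small fact in key/value, document and graph form. It uses plain Python structures only; the keys, field names and sample users are my own assumptions and are not taken from the lecture or from any particular database.

```python
import json

# Key/value: an opaque value behind a single key; lookups are by key only.
kv_store = {"user:42": '{"name": "Ada", "follows": ["user:7"]}'}

# Document: the value is structured, so individual fields can be queried.
documents = [
    {"_id": "user:42", "name": "Ada", "follows": ["user:7"]},
    {"_id": "user:7", "name": "Grace", "follows": []},
]

# Graph: relationships are stored as edges between nodes.
nodes = {"user:42": {"name": "Ada"}, "user:7": {"name": "Grace"}}
edges = [("user:42", "follows", "user:7")]

# Key/value: fetch by key, then decode the value in the application.
name_from_kv = json.loads(kv_store["user:42"])["name"]

# Document: filter on a field inside the documents.
names_from_docs = [d["name"] for d in documents if "user:7" in d["follows"]]

# Graph: walk the edges that point at a node.
followers_of_7 = [nodes[src]["name"] for src, rel, dst in edges
                  if rel == "follows" and dst == "user:7"]
```

The question "who follows user 7?" is awkward in the key/value form (every value must be decoded), a field filter in the document form, and an edge traversal in the graph form, which is the sense in which the storage model determines what a database is good at.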

What I haven’t encountered is a war story approach to data modeling. That is, a book or series of lectures that iterates over data modeling problems encountered in practice, the considerations taken into account and the solution decided upon. A continuing series of annual volumes with great indexing would make a must-have series for any SQL or NoSQL DBA.

Jan mentions a nearly comprehensive NoSQL database information site. And it nearly is. Nearly, because it currently omits Weaver (Graph Store) under graph databases. If you notice other omissions, please forward them to the site's maintainers. Maintaining a current list of resources is exhausting work.

Data Modelling: The Thin Model [Entities with only identifiers]

Monday, October 27th, 2014

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users‘.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the meta data or attributes almost as an after thought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

First complex, then simple

Saturday, July 19th, 2014

First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)


At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from wrong end.

I would have titled this article: “Data First, Models Later.”

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, then generate simple models to explain the signals.

I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.
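
The abstract's remark that one-variable-at-a-time analysis is "easily defeated with simple examples" can be illustrated with a toy case of my own (it is not taken from the paper): an outcome that is the exclusive-or of two binary variables. Each variable alone shows zero correlation with the outcome, yet the pair determines it completely.

```python
from itertools import product
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# All four combinations of two binary inputs; the outcome is their exclusive-or.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
y = [r[2] for r in rows]

# One-variable-at-a-time view: neither input is associated with the outcome.
marginal_1 = pearson(x1, y)
marginal_2 = pearson(x2, y)

# Joint view: a lookup on the pair (x1, x2) recovers the outcome every time.
joint_lookup = {(a, b): out for a, b, out in rows}
joint_accuracy = sum(joint_lookup[(a, b)] == out for a, b, out in rows) / len(rows)

print(marginal_1, marginal_2, joint_accuracy)
```

A screening step that kept only variables with a marginal association would discard both inputs here and never see the compound effect.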

FoundationDB: Developer Recipes

Thursday, April 24th, 2014

FoundationDB: Developer Recipes

From the webpage:

Learn how to build new data models, indexes, and more on top of the FoundationDB key-value store API.

I was musing the other day about how to denormalize a data structure for indexing.

This is the reverse of that process but still should be instructive.

Graphistas should note that FoundationDB also implements the Blueprints API (blueprints-foundationdb-graph).
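
For a feel of what building an index on top of a key-value store involves, here is a rough sketch. It does not use the FoundationDB client API: a plain Python dict with sorted-key scans stands in for an ordered key-value store, and the key layout (`"user"`, `"index"`, `"city"`) is a hypothetical one of my own.

```python
# A plain dict stands in for an ordered key-value store (assumption: keys are
# tuples and a range read returns matching keys in sorted order).
store = {}


def put_user(user_id, name, city):
    """Write the primary record plus an index entry keyed by city."""
    store[("user", user_id)] = (name, city)
    # Index entry: the key itself carries the data, so the value stays empty.
    # Updating a user's city would also require deleting the old index entry,
    # which this sketch does not handle.
    store[("index", "city", city, user_id)] = ""


def range_read(prefix):
    """Return (key, value) pairs whose key starts with prefix, in key order."""
    n = len(prefix)
    return [(k, store[k]) for k in sorted(store) if k[:n] == prefix]


def users_in_city(city):
    """Find user ids through the index, then fetch each primary record."""
    ids = [key[-1] for key, _ in range_read(("index", "city", city))]
    return [store[("user", uid)][0] for uid in ids]


put_user(1, "alice", "Oslo")
put_user(2, "bob", "Lima")
put_user(3, "carol", "Oslo")
print(users_in_city("Oslo"))
```

In a real store both writes in `put_user` would need to happen in one transaction so that the record and its index entry cannot drift apart.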

Data Modeling – FoundationDB

Saturday, February 15th, 2014

Data Modeling – FoundationDB

From the webpage:

FoundationDB’s core provides a simple data model coupled with powerful transactions. This combination allows building richer data models and libraries that inherit the scalability, performance, and integrity of the database. The goal of data modeling is to design a mapping of data to keys and values that enables effective storage and retrieval. Good decisions will yield an extensible, efficient abstraction. This document covers the fundamentals of data modeling with FoundationDB.

Great preparation for these tutorials using the tuple layer of FoundationDB:

The Class Scheduling tutorial introduces the fundamental concepts needed to design and build a simple application using FoundationDB, beginning with basic interaction with the database and walking through a few simple data modeling techniques.

The Enron Email Corpus tutorial introduces approaches to loading data in FoundationDB and further illustrates data modeling techniques using a well-known, publicly available data set.

The Managing Large Values and Blobs tutorial discusses approaches to working with large data objects in FoundationDB. It introduces the blob layer and illustrates its use to build a simple file library.

The Lightweight Query Language tutorial discusses a layer that allows Datalog to be used as an interactive query language for FoundationDB. It describes both the FoundationDB binding and the use of the query language itself.


Wikibase DataModel released!

Friday, January 3rd, 2014

Wikibase DataModel released! by Jeroen De Dauw.

From the post:

I’m happy to announce the 0.6 release of Wikibase DataModel. This is the first real release of this component.


Wikibase is the software behind Wikidata. At its core, this software is about describing entities. Entities are collections of claims, which can have qualifiers, references and values of various different types. How this all fits together is described in the DataModel document written by Markus and Denny at the start of the project. The Wikibase DataModel component contains (PHP) domain objects representing entities and their various parts, as well as associated domain logic.

I wanted to draw your attention to this discussion of “items:”

Items are Entities that are typically represented by a Wikipage (at least in some Wikipedia languages). They can be viewed as “the thing that a Wikipage is about,” which could be an individual thing (the person Albert Einstein), a general class of things (the class of all Physicists), and any other concept that is the subject of some Wikipedia page (including things like History of Berlin).

The IRI of an Item will typically be closely related to the URL of its page on Wikidata. It is expected that Items store a shorter ID string (for example, as a title string in MediaWiki) that is used in both cases. ID strings might have a standardized technical format such as “wd1234567890” and will usually not be seen by users. The ID of an Item should be stable and not change after it has been created.

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

However, it is intended that the information stored in Wikidata is generally about the topic of the Item. For example, the Item for History of Berlin should store data about this history (if there is any such data), not about Berlin (the city). It is not intended that data about one subject is distributed across multiple Wikidata Items: each Item fully represents one thing. This also helps for data integration across languages: many languages have no separate article about Berlin’s history, but most have an article about Berlin.

What do you make of the claim:

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

I may write an information system that fails to distinguish between a species of whales, a class of whales and a particular whale, but that is a design choice, not a foregone conclusion.

In the case of Wikipedia, which relies upon individuals repeating the task of extracting relevant information from loosely gathered data, that approach works quite well.

But there isn’t one degree of precision of identification that works for all cases.

My suspicion is that for more demanding search applications, such as drug interactions, less precise identifications could lead to unfortunate, even fatal, results.
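
As a sketch of the other design choice, here is one way a system could keep the species, the class and the individual apart as three separately identified subjects. The `Subject` structure, its field names and the ids are hypothetical; this is not how Wikibase DataModel represents Items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subject:
    """A separately identified subject (hypothetical structure and ids)."""
    subject_id: str
    label: str
    kind: str                            # "species", "class" or "individual"
    member_of: Optional[str] = None      # id of a class, for individuals
    binomial_name: Optional[str] = None  # a statement about the species concept


orca_species = Subject("s1", "Orca (species)", kind="species",
                       binomial_name="Orcinus orca (Linnaeus, 1758)")
orca_class = Subject("c1", "All orca whales", kind="class")
keiko = Subject("i1", "Keiko", kind="individual", member_of="c1")

subjects = {s.subject_id: s for s in (orca_species, orca_class, keiko)}


def members_of(class_id):
    """List labels of individuals recorded as members of the given class."""
    return [s.label for s in subjects.values() if s.member_of == class_id]


print(members_of("c1"))
```

Whether the extra precision is worth its cost depends on the application, which is the point: the degree of precision is chosen, not given.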


Modeling data with functional programming in R

Sunday, October 27th, 2013

Modeling data with functional programming in R by Brian Lee Rowe.

From the post:

As some of you know, I’ve been writing a book (to be published by CRC Press/Chapman & Hall and released in late 2014) for the past year and a half. It’s one of those books that spans multiple disciplines so is both unique and also niche. In essence it’s a manifesto of sorts on using functional programming for mathematical modeling and analysis, which is based on my R package lambda.r. It spans the lambda calculus, traditional mathematical analysis, and set theory to 1) develop a mathematical model for the R language, and 2) show how to use this formalism to prove the equivalence of programs to their underlying model. I try to keep the book focused on applications, so I discuss financial trading systems, some NLP/document classification, and also web analytics.

The book started off as a more practical affair, but one thing that I’ve learned through this process is how to push ideas to the limit. So now it delves into quite a bit of theory, which makes it a more compelling read. In some ways it reminds me of ice climbing, where you’re scaling a waterfall and really pushing yourself in ways you didn’t think possible. Three chapters into the process, and it’s been that same combination of exhilarating challenge that results in conflicting thoughts racing through your head: “Holy crap — what am I doing?!” versus “This is so fun — wheeeee!” versus “I can’t believe I did it!”

Brian says some of the images, proofs and examples need work but that should not diminish your reading of the draft.

Do take the time to return comments while you are reading the draft.

Rowe – Modeling data with functional programming.

I first saw this in a tweet from StatFact.

A different take on data skepticism

Thursday, April 25th, 2013

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?

Beau makes several good points on questioning data methods.
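
The k-means point is easy to see in a toy run. Below is a small pure-Python, one-dimensional k-means of my own; the evenly spaced data and the deterministic choice of starting centers are simplifying assumptions (standard implementations usually start from random centers). The data has no cluster structure at all, yet the method returns as many buckets as it is asked for.

```python
def kmeans_1d(points, k, iterations=50):
    """Plain Lloyd's algorithm on numbers; returns (centers, clusters)."""
    pts = sorted(points)
    # Deterministic start (an assumption of this sketch): spread the initial
    # centers evenly across the sorted data.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Undifferentiated data: evenly spaced values with no natural groups.
data = list(range(100))

for k in (2, 4, 7):
    centers, clusters = kmeans_1d(data, k)
    print(k, [len(c) for c in clusters])
```

Nothing in the output signals that the split is arbitrary; that question has to be asked outside the method.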

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute with regard to those designs: what choices were made, by whom and for what reason.

Giving voice to what can be known about methods and data falls to human users.

Practical tools for exploring data and models

Wednesday, April 17th, 2013

Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)

From the introduction:

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.

The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).

Has a focus on R tools, including ggplot2 and Wilkinson’s The Grammar of Graphics.

The “…never believing we have found the best model” approach works for me!


I first saw this at Data Scholars.

Data Governance needs Searchers, not Planners

Wednesday, March 6th, 2013

Data Governance needs Searchers, not Planners by Jim Harris.

From the post:

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance. A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible. Change management efforts are resisted when they impose new methods by emphasizing bad business and technical processes, as well as bad data-related employee behaviors, while ignoring unheralded processes and employees whose existing methods are preventing other problems from happening.

If you don’t remember any line from any post you read here or elsewhere, remember this one:

“…they rely on their own common sense to reason about the behavior of people who are different from them.”

Whenever you encounter a situation where that description fits, you will find failed projects, waste and bad morale.

Data models for version management…

Sunday, March 3rd, 2013

Data models for version management of legislative documents by María Hallo Carrasco, M. Mercedes Martínez-González, and Pablo de la Fuente Redondo.


This paper surveys the main data models used in projects including the management of changes in digital normative legislation. Models have been classified based on a set of criteria, which are also proposed in the paper. Some projects have been chosen as representative for each kind of model. The advantages and problems of each type are analysed, and future trends are identified.

I first saw this at Legal Informatics, which had already assembled the following resources:

The legislative metadata models discussed in the paper include:

Useful as models of change tracking should you want to express that in a topic map.

To say nothing of overcoming the semantic impedance between these models.

Starting Data Analysis with Assumptions

Friday, January 11th, 2013

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore-even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the patterns in the data, which is being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews of a few taxi drivers would have dispelled the original assumption of high demand for taxis. It would also have led to the cause of the patterns Senn recognized.

That is, the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.

Data modeling … with graphs

Wednesday, November 7th, 2012

Data modeling … with graphs by Peter Bell.

Nothing surprising for topic map users but a nice presentation on modeling for graphs.

For Neo4j, unlike topic maps, you have to normalize your data before entering it into the graph.

That is, if you want one node per subject.

Whether that is worthwhile depends on your circumstances.

Amazing things have been done with normalized data in relational databases.

Assuming you want to pay the cost of normalization, which can include a lack of interoperability with others, errors in conversion, brittleness in the face of changing models, etc.
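
A rough sketch of what "normalize before entering" means in practice. It uses plain Python structures and no Neo4j API; the sample records and the rule that a lower-cased email address identifies a person are assumptions made up for the example.

```python
# Hypothetical input: the same person appears under two spellings.
records = [
    {"name": "J. Smith", "email": "jsmith@example.com",
     "knows": "a.jones@example.com"},
    {"name": "John Smith", "email": "JSmith@example.com",
     "knows": "b.lee@example.com"},
]


def subject_key(record):
    """Normalization rule assumed here: lower-cased email identifies a person."""
    return record["email"].lower()


nodes = {}     # one node per subject key
edges = set()  # (source key, relationship, target key)

for rec in records:
    key = subject_key(rec)
    node = nodes.setdefault(key, {"names": set()})
    node["names"].add(rec["name"])   # keep every name seen for the subject
    edges.add((key, "KNOWS", rec["knows"].lower()))

print(len(nodes), sorted(edges))
```

The merge rule lives in the loading code, outside the graph, so anyone who loads the same data with a different rule ends up with a different graph. That is one source of the interoperability cost mentioned above.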

Topic Map Modeling of Sequestration Data (Help Pls!)

Saturday, September 29th, 2012

With the political noise in the United States over presidential and other elections, it is easy to lose sight of a looming “sequestration” that on January 2, 2013 will result in:

10.0% reduction non-exempt defense mandatory funding
9.4% reduction non-exempt defense discretionary funding
8.2% reduction non-exempt nondefense discretionary funding
7.6% reduction non-exempt nondefense mandatory funding
2.0% reduction Medicare

The report is not a model of clarity/transparency. See: U.S. Sequestration Report – Out of the Shadows/Into the Light?.

Report caveats make it clear cited amounts are fanciful estimates that can change radically as more information becomes available.

Be that as it may, a topic map based on the reported accounts as topics can capture the present day conjectures. To say nothing of capturing future revelations of exact details.

Whether from sequestration or from efforts to avoid sequestration.

Tracking/transparency has to start somewhere and it may as well be here.

In evaluating the data for creation of a topic map, I have encountered an entry with a topic map modeling issue.

I could really use your help.

Here is the entry in question:

Department of Health and Human Services, Health Resources and Services Administration, 009-15-0350, Health Resources and Services, Nondefense Function, Mandatory (page 80 of Appendix A, page 92 of the pdf of the report):

BA Type                         BA Amount  Sequester Percentage  Sequester Amount
Sequestrable BA                       514                   7.6                39
Sequestrable BA – special rule       1352                   2.0                27
Exempt BA                              10
Total Gross BA                       1876
Offsets                               -16
Net BA                               1860

If it read as follows, no problem.

Example: Not Accurate

BA Type                         BA Amount  Sequester Percentage  Sequester Amount
Sequestrable BA                       514                   7.6                39
Sequestrable BA – special rule       1352                   2.0                27
Total Gross BA                       1876

Because there is no relationship between “Exempt BA” and “Offsets” to either “Sequestrable BA” or “Sequestrable BA – special rule,” I just report both of them with the percentages and total amounts to be withheld.

True, the percentages don’t change, nor does the amount to be withheld change, because of the “Exempt BA” or the “Offsets.” (Trusting soul that I am, I did verify the calculations. 😉 )
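
For anyone who wants to repeat the check, the arithmetic for the entry quoted above can be sketched as follows. The figures are the ones in the entry; rounding to the nearest whole unit is my assumption about how the report arrived at its amounts.

```python
# Figures from the Appendix A entry quoted above (amounts as printed).
sequestrable_ba, sequestrable_pct, sequestrable_reported = 514, 7.6, 39
special_rule_ba, special_rule_pct, special_rule_reported = 1352, 2.0, 27
exempt_ba, total_gross_ba, offsets, net_ba = 10, 1876, -16, 1860


def sequester(amount, percent):
    """Amount withheld, rounded to the nearest whole unit (rounding assumed)."""
    return round(amount * percent / 100)


checks = {
    "sequestrable amount": sequester(sequestrable_ba, sequestrable_pct)
                           == sequestrable_reported,
    "special rule amount": sequester(special_rule_ba, special_rule_pct)
                           == special_rule_reported,
    "total gross BA": sequestrable_ba + special_rule_ba + exempt_ba
                      == total_gross_ba,
    "net BA": total_gross_ba + offsets == net_ba,
}

for name, ok in checks.items():
    print(name, "OK" if ok else "MISMATCH")
```

Note that “Exempt BA” and “Offsets” enter only the totals; neither appears in the two sequester calculations.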

Problem: How do I represent the relationship between the “Exempt BA” and “Offsets” to either/or/both “Sequestrable BA,” “Sequestrable BA – special rule?”

Of the 1318 entries in Appendix A of this report, including this one, it is the only entry with this issue. (A number of accounts are split into discretionary/mandatory parts. I am counting each part as a separate “entry.”)

If I ignore “Exempt BA” and “Offsets” in this case, my topic map is an incomplete representation of Appendix A.

It is also the case that I want to represent the information “as written.” There may be some external explanation that clarifies this entry, but that would be an “addition” to the original topic map.
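
One possible starting point, offered as a sketch and not as an answer to the modeling question: keep every printed row, in printed order, under the account, and assert nothing beyond co-occurrence in the same entry. The class and field names below are hypothetical; the values are the ones in the entry:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One row of the entry, kept as printed in Appendix A."""
    ba_type: str
    ba_amount: int
    sequester_percentage: Optional[float] = None
    sequester_amount: Optional[int] = None


@dataclass
class Entry:
    """All rows printed under one account heading, in printed order."""
    account_id: str
    lines: List[Line]


entry = Entry(
    account_id="009-15-0350",
    lines=[
        Line("Sequestrable BA", 514, 7.6, 39),
        Line("Sequestrable BA - special rule", 1352, 2.0, 27),
        Line("Exempt BA", 10),
        Line("Total Gross BA", 1876),
        Line("Offsets", -16),
        Line("Net BA", 1860),
    ],
)

# The only relationship asserted is co-occurrence in the same printed entry.
sequestered = [line.ba_type for line in entry.lines if line.sequester_amount is not None]
not_sequestered = [line.ba_type for line in entry.lines if line.sequester_amount is None]
print(sequestered)
print(not_sequestered)
```

This keeps the map complete “as written” while leaving any stronger relationship between the rows to be added later, once there is a source for it.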


Modeling vs Mining?

Wednesday, May 16th, 2012

Steve Miller writes in Politics of Data Models and Mining:

I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by a couple of academic econometricians.

The data mining “attitudes” range from the extremes that DM techniques are to be avoided like the plague, to one where “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.

Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”

In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses for testing with regression models. ML, on the other hand, obsesses on discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.

Watching other communities fight over odd questions is always more enjoyable than serious disputes of grave concern in our own. (See Using “Punning” to Answer httpRange-14 for example.)

I mention the economists’ dispute, not simply to make jests at the expense of “econometricians.” (Do topic map supporters need a difficult name? TopicMapologists? Too short.)

The economists’ debate is missing an understanding that modeling requires some knowledge of the domain (mining, whether formal or informal) and mining requires some idea of an output (models, whether spoken or unspoken). It is a failing all too common across modeling/mining domains.

To put it another way:

We never stumble upon data that is “untouched by human hands.”

We never build models without knowledge of the data we are modeling.

The relevant question is: Does the model or data mining provide a useful result?

(Typically measured by your client’s joy or sorrow over your results.)

Data and Reality

Thursday, March 15th, 2012

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality,” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

NoSQL Data Modeling Techniques

Sunday, March 4th, 2012

NoSQL Data Modeling Techniques by Ilya Katsov.

From the post:

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like CAP theorem are well applicable to the NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks of systematic theory like in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

To explore data modeling techniques, we have to start with some more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

Very complete and readable coverage of NoSQL data modeling techniques!

A must read if you are interested in making good choices between NoSQL solutions.

This post could profitably be turned into a book-length treatment with longer and more varied examples.

Modelling with Graphs

Tuesday, November 22nd, 2011

Modelling with Graphs by Alistair Jones at NoSQL Br 2011

From the description:

Neo4j is a powerful and expressive tool for storing, querying and manipulating data. However modelling data as graphs is quite different from modelling data with relational databases. In this talk we’ll cover modelling business domains using graphs and show how they can be persisted and queried in the popular open source graph database Neo4j. We’ll contrast this approach with the relational model, and discuss the impact on complexity, flexibility and performance. We’ll also discuss strategies for deciding how to proceed when a graph allows multiple ways to represent the same concept, and explain the trade-offs involved. As a side-effect, we’ll examine some of the new tools for how to query graph data in Neo4j, and discuss architectures for using Neo4j in enterprise applications.

Alistair is a Software Engineer with Neo Technology, the company behind the popular open source graph database Neo4j.

Alistair has extensive experience as a developer, technical lead and architect for teams building enterprise software across a range of industries. He has a particular focus on Domain Driven Design, and is an expert on Agile methodologies. Alistair often writes and presents on applying Agile principles to the discipline of performance testing.

Excellent presentation!

Anyone care to suggest a book on modeling or modeling with graphs?

Connecting the Dots: An Introduction

Wednesday, November 9th, 2011

Connecting the Dots: An Introduction

A new series of posts by Rick Sherman who writes:

In the real world the situations I discuss or encounter in enterprise BI, data warehousing and MDM implementations lead me to the conclusion that many enterprises simply do not connect the dots. These implementations potentially involve various disciplines such as data modeling, business and data requirements gathering, data profiling, data integration, data architecture, technical architecture, BI design, data governance, master data management (MDM) and predictive analytics. Although many BI project teams have experience in each of these disciplines they’re not applying the knowledge from one discipline to another.

The result is knowledge silos where the best practices and experience from one discipline is not applied in the other disciplines.

The impact is a loss in productivity for all, higher long-term costs and poorly constructed solutions. This often results in solutions that are difficult to change as the business changes, don’t scale as the data volumes or numbers of uses increase, or is costly to maintain and operate.

Imagine that, knowledge silos in the practice of eliminating knowledge silos.

I suspect that reflects the reality that each of us is a model of a knowledge silo. There are areas we like better than others, areas we know better than others, areas where we simply don’t have the time to learn. But when asked for an answer to our part of a project, we have to have some answer, so we give the one we know. Hard to imagine us doing otherwise.

We can try to offset that natural tendency by reading broadly, looking for new areas or opportunities to learn new techniques, or at least having team members or consultants who make a practice of surveying knowledge techniques broadly.

Rick promises to show how data modeling is disconnected from the other BI disciplines in the next Connecting the Dots post.

< NIEM > National Information Exchange Model

Monday, September 5th, 2011

< NIEM > National Information Exchange Model

From the technical introduction:

NIEM provides a common vocabulary for consistent, repeatable exchanges of information between agencies and domains. The model is represented in a number of forms, including a data dictionary and a reference schema, and includes the body of concepts and rules that underlie its structure, maintain its consistency, and govern its use.

NIEM is a comprehensive resource for organizations to successfully exchange information, offering tools, terminology, help, training, governance, and an active community of users.

NIEM uses extensible markup language (XML), which allows the structure and meaning of data to be defined through simple but carefully defined syntax rules and provides a common framework for information exchange.

The model’s unique architecture enables data components to be constrained, extended, and augmented as necessary to formulate XML exchange schemas, and XML instance documents defining the information payloads for data exchange. These exchange-defining documents are packaged in information exchange package documentation (IEPDs) that are reusable, modifiable, and extendable.

It’s Labor Day and I have yet to get the “tools” link to work. Must be load on the site. 😉

It’s a large effort and site so it will take some time to explore it.

If you are participating in < NIEM > please give a shout.

PS: I encountered < NIEM > following a link to the 2011 National Training Event videos. Registration is required but free.

If You Have Too Much Data, then “Good Enough” Is Good Enough

Sunday, June 12th, 2011

If You Have Too Much Data, then “Good Enough” Is Good Enough by Pat Helland.

This is a must read article where the author concludes:

The database industry has benefited immensely from the seminal work on data theory started in the 1970s. This work changed the world and continues to be very relevant, but it is apparent now that it captures only part of the problem.

We need a new theory and taxonomy of data that must include:

  • Identity and versions. Unlocked data comes with identity and optional versions.
  • Derivation. Which versions of which objects contributed to this knowledge? How is their schema interpreted? Changes to the source would drive a recalculation just as in Excel. If a legal reason means the source data may not be used, you should forget about using the knowledge derived from it.
  • Lossyness of the derivation. Can we invent a bounding that describes the inaccuracies introduced by derived data? Is this a multidimensional inaccuracy? Can we differentiate loss from the inaccuracies caused by sheer size?
  • Attribution by pattern. Just like a Mulligan stew, patterns can be derived from attributes that are derived from patterns (and so on). How can we bound taint from knowledge that we are not legally or ethically supposed to have?
  • Classic locked database data. Let’s not forget that any new theory and taxonomy of data should include the classic database as a piece of the larger puzzle.

The example of data relativity, a local “now” in data systems, which may not be consistent with the state at some other location, was particularly good.

If the TMRM is a Data Model…

Wednesday, May 4th, 2011

Whenever I hear the TMRM referred to or treated like a data model, I feel like saying in a Darth Vader type voice:

If the TMRM is a data model, then where are its data types?

It is my understanding that data models, legends in TMRM-speak, define data types on which they base declarations of equivalence (in terms of the subjects represented).

Being somewhat familiar with the text of the TMRM, or at least the current draft, I don’t see any declaration of data types in the TMRM.

Nor do I see any declarations of where the recursion of keys ends. Another important aspect of legends.

Nor do I see any declarations of equivalence (on the absent data types).

Yes, there is an abstraction of a path language, which would depend upon the data types and recursion through keys and values, but that is only an abstraction of a path language. It awaits declaration of data types, etc., in order to be an implementable path language.

There is a reason for the TMRM being written at that level of abstraction. To support any number of legends, written with any range of data types and choices with regard to the composition of those data types and subsequently the paths supported.

Any legend is going to make those choices and they are all equally valid if not all equally useful for some use cases. Every legend closes off some choices and opens up others.

For example, in bioinformatics, why would I want to do the subjectIdentifier/subjectLocator shuffle when I am concerned with standard identifiers for genes?
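
As a rough illustration of how much a legend has to decide, here is a minimal sketch in Python. None of the names (`Legend`, `shares_value_for`, `merge`) come from the TMRM, the gene identifiers are made-up placeholders, and ending the recursion of keys and values at plain strings is my assumption for the example:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# A proxy is modeled as keys mapped to sets of string values.
# Stopping at strings is a choice this hypothetical legend makes;
# the TMRM itself leaves that choice open.
Proxy = Dict[str, Set[str]]


@dataclass
class Legend:
    """Hypothetical legend: the keys it recognizes plus its equivalence rule."""
    name: str
    keys: Set[str]
    equivalent: Callable[[Proxy, Proxy], bool]


def shares_value_for(key: str) -> Callable[[Proxy, Proxy], bool]:
    """Equivalence rule: two proxies match when they share a value for `key`."""
    def rule(a: Proxy, b: Proxy) -> bool:
        return bool(a.get(key, set()) & b.get(key, set()))
    return rule


def merge(legend: Legend, proxies: List[Proxy]) -> List[Proxy]:
    """Single-pass merge: fold each proxy into the first equivalent one seen so far."""
    merged: List[Proxy] = []
    for proxy in proxies:
        known = {k: set(v) for k, v in proxy.items() if k in legend.keys}
        for existing in merged:
            if legend.equivalent(existing, known):
                for k, v in known.items():
                    existing.setdefault(k, set()).update(v)
                break
        else:
            merged.append(known)
    return merged


# Placeholder data, not real gene identifiers or symbols.
proxies = [
    {"gene_id": {"G-0001"}, "symbol": {"ABC1"}},
    {"gene_id": {"G-0001"}, "symbol": {"abc-1"}},
    {"gene_id": {"G-0002"}, "symbol": {"ABC1"}},
]
by_id = Legend("by-gene-id", {"gene_id", "symbol"}, shares_value_for("gene_id"))
by_symbol = Legend("by-symbol", {"gene_id", "symbol"}, shares_value_for("symbol"))

merged_by_id = merge(by_id, proxies)
merged_by_symbol = merge(by_symbol, proxies)
print(merged_by_id)
print(merged_by_symbol)
```

The same three proxies group differently under the two legends, which is the point: the keys, the value types and the equivalence rule all live in the legend, not in the TMRM.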

BTW, before anyone rushes out to write the legend syntax, realize that its writing results in subjects that could also be the targets of topic maps with suitable legends.

It is important that syntaxes be subjects, for a suitable legend, because syntaxes come and go out of fashion.

The need to merge subjects represented by those syntaxes, however, awaits only the next person with a brilliant insight.

IC Bias: If it’s measurable, it’s meaningful

Thursday, April 21st, 2011

Drew Conway writes in Data Science in the U.S. Intelligence Community [1] about modeling assumptions:

For example, it is common for an intelligence analyst to measure the relationship between two data sets as they pertain to some ongoing global event. Consider, therefore, in the recent case of the democratic revolution in Egypt that an analyst had been asked to determine the relationship between the volume of Twitter traffic related to the protests and the size of the crowds in Tahrir Square. Assuming the analyst had the data hacking skills to acquire the Twitter data, and some measure of crowd density in the square, the next step would be to decide how to model the relationship statistically.

One approach would be to use a simple linear regression to estimate how Tweets affect the number of protests, but would this be reasonable? Linear regression assumes an independent distribution of observations, which is violated by the nature of mining Twitter. Also, these events happen in both time (over the course of several hours) and space (the square), meaning there would be considerable time- and spatial-dependent bias in the sample. Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC.

His central point, that “Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC,” cannot be overemphasized.
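
The independence assumption in the quoted example can be checked rather than taken on faith. A minimal sketch in Python, using synthetic data only (nothing here is Twitter or crowd data) and the Durbin-Watson statistic, a standard diagnostic for serial correlation that the article itself does not mention. The function names and simulation parameters are my own choices:

```python
import random


def ols_residuals(x, y):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def durbin_watson(residuals):
    """Near 2 for uncorrelated residuals; toward 0 for positive serial correlation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den


def simulate(n, rho, rng):
    """Synthetic y = 1 + 2*x + e, where e is an AR(1) process with coefficient rho."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    errors, previous = [], 0.0
    for _ in range(n):
        previous = rho * previous + rng.gauss(0, 1)
        errors.append(previous)
    y = [1 + 2 * xi + ei for xi, ei in zip(x, errors)]
    return x, y


rng = random.Random(0)  # fixed seed so the run is repeatable
x_ind, y_ind = simulate(500, 0.0, rng)  # independent errors
x_dep, y_dep = simulate(500, 0.9, rng)  # strongly time-dependent errors

dw_independent = durbin_watson(ols_residuals(x_ind, y_ind))
dw_dependent = durbin_watson(ols_residuals(x_dep, y_dep))
print(f"independent errors:    DW = {dw_independent:.2f}")
print(f"time-dependent errors: DW = {dw_dependent:.2f}")
```

A statistic well below 2 on the second series is the warning sign: the regression still produces a slope, but the usual standard errors for it can no longer be trusted.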

The example of Twitter traffic reveals a deeper bias in the intelligence community: if it’s measurable, it’s meaningful.

No doubt Twitter facilitated communication within communities that already existed but that does not make it an enabling technology.

The revolution was made possible by community organizers working over decades and by trade unions.

And the revolution continued after Twitter and then cell phones were turned off.

Understanding such events requires investment in human intell and analysis, not over reliance on SIGINT. [2]

[1] Spring (2011) issue of In-Q-Tel’s quarterly journal, IQT Quarterly

[2] That a source is technical or has lights and bells, does not make it reliable or even useful.

PS: The Twitter traffic, such as it was, may have primarily been from: “Twitter, I think, is being used by news media people with computer connections, through those kind of means.” (From Facebook, Twitter, and the Middle East, IEEE Spectrum, in which Steve Cherry interviews Ben Zhao, expert on social networking performance.)

Are we really interested in how news people use Twitter, even in a social movement context?

OSCON Data 2011 Call for Participation

Wednesday, March 2nd, 2011

OSCON Data 2011 Call for Participation

Deadline: 11:59pm 03/14/2011 PDT

From the website:

The O’Reilly OSCON Data conference is the first of its kind: bringing together open source culture and data hackers to cover data management at a very practical level. From disks and databases through to big data and analytics, OSCON Data will have instruction and inspiration from the people who actually do the work.

OSCON Data will take place July 25-27, 2011, in Portland, Oregon. We’ll be co-located with OSCON itself.

Proposals should include as much detail about the topic and format for the presentation as possible. Vague and overly broad proposals don’t showcase your skills and knowledge, and our volunteer reviewers aren’t mind readers. The more you can tell us, the more likely the proposal will be selected.

Proposals that seem like a “vendor pitch” will not be considered. The purpose of OSCON Data is to enlighten, not to sell.

Submit a proposal.

Yes, it is right before Balisage but I think worth considering if you are on the West Coast and can’t get to Balisage this year or if you are feeling really robust. 😉

Hmmm, I wonder how a proposal that merges the indexes of the different NoSQL volumes from O’Reilly would be received? You are aware that O’Reilly is re-creating the X-Windows problem that was the genesis of both topic maps and DocBook?

I will have to write that up in detail at some point. I wasn’t there but have spoken to some of the principals who were. Plus I have the notes, etc.

…a grain of salt

Friday, February 25th, 2011

Benjamin Bock asked me recently about how I would model a mole of salt in a topic map.

That is a good question but I think we had better start with a single grain of salt and then work our way up from there.

At first blush, and only at first blush, do many subjects look quite easy to represent in a topic map.

A grain of salt looks simple at first glance: just create a PSI (Published Subject Identifier), put that as the subjectIdentifier on a topic and be done with it.

Well…, except that I don’t want to talk about a particular grain of salt, I want to talk about salt more generally.

OK, one of those, I see.

Alright, same answer as before, except make the PSI for salt in general, not some particular grain of salt.

Well…, except that when I go to the Wikipedia article on salt, Salt, I find that salt is a compound of chlorine and sodium.

A compound, oh, that means something made up of more than one subject. In a particular type of relationship.

Sounds like an association to me.

Of a particular type, an ionic association. (I looked it up, see: Ionic Compound)

And this association between chlorine and sodium has several properties reported by Wikipedia, here are just a few of them:

  • Molar mass: 58.443 g/mol
  • Appearance: Colorless/white crystalline solid
  • Odor: Odorless
  • Density: 2.165 g/cm³
  • Melting point: 801 °C, 1074 K, 1474 °F
  • Boiling point: 1413 °C, 1686 K, 2575 °F
  • … and several others.

    If you are interested in scientific/technical work, please be aware of CAS, a work product of the American Chemical Society, with a very impressive range of unique identifiers. (56 million organic and inorganic substances, 62 million sequences and they have a counter that increments while you are on the page.)

    Note that unlike my suggestion, CAS takes the assign-a-unique-identifier view for the substances, sequences and chemicals that they curate.

    Oh, sorry, got interested in the CAS as a source for subject identification. In fact, that is a nice segue to consider how to represent the millions and millions of compounds.

    We could create associations with the various components being role players but then we would have to reify those associations in order to hang additional properties off of them. Well, technically speaking in XTM we would create non-occurrence occurrences and type those to hold the additional properties.

    Sorry, I was presuming the decision to represent compounds as associations. Shout out when I start to presume that sort of thing. 😉

    The reason I would represent compounds as associations is that the components of the associations are then subjects I can talk about and even add additional properties to, or create mappings between.
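
To make the shape of that choice concrete, here is a minimal sketch in Python. It is not XTM and not any topic map engine's API; the class names are hypothetical, and the example.org identifiers are placeholders, not published PSIs. The property values are the Wikipedia figures listed above:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Topic:
    """A subject we want to talk about, identified by a subject identifier."""
    subject_identifier: str
    names: List[str] = field(default_factory=list)


@dataclass
class Role:
    """A role player and the role it plays in an association."""
    role_type: str
    player: Topic


@dataclass
class Association:
    """A typed relationship among role players."""
    association_type: str
    roles: List[Role]


@dataclass
class ReifyingTopic:
    """A topic that stands for an association so properties can hang off it."""
    subject_identifier: str
    reifies: Association
    occurrences: Dict[str, str] = field(default_factory=dict)


# Placeholder identifiers (example.org), not real PSIs.
sodium = Topic("http://example.org/psi/sodium", ["sodium"])
chlorine = Topic("http://example.org/psi/chlorine", ["chlorine"])

ionic_bond = Association(
    association_type="ionic-compound",
    roles=[Role("component", sodium), Role("component", chlorine)],
)

salt = ReifyingTopic(
    subject_identifier="http://example.org/psi/salt",
    reifies=ionic_bond,
    occurrences={
        "molar-mass": "58.443 g/mol",
        "density": "2.165 g/cm3",
        "melting-point": "801 °C",
    },
)

component_names = [role.player.names[0] for role in salt.reifies.roles]
print(component_names, salt.occurrences["molar-mass"])
```

Sodium and chlorine stay first-class subjects that can carry their own properties or mappings, while the properties of the compound hang off the topic that reifies the association. That extra topic is the reification overhead in question.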

    I suspect that CAS has chemistry from the 1800’s fairly well covered but what about older texts? Substances before then may not be of interest to commercial chemists but certainly would be of interest to historians and other scholars.

    Use of a topic map plus the CAS identifiers would enable scholars studying older materials to effectively share information about older texts, which have different designations for substances than CAS would record.

    You could argue that I could use a topic for compounds, much as CAS does, and rely upon searching in order to discover relationships.

    Tis true, tis true, but my modeling preference is for relationships seen as subjects, although I must confess I would prefer a next generation syntax that avoids the reification overhead of XTM.

    Given the prevalence of complex relationships/associations, as you see from the CAS index, I think a simplification of the representation of associations is warranted.

    Sorry, I never did quite reach Benjamin’s question about a mole of salt but I will take up that gauge again tomorrow.

    We will see that measurements (which figured into his questions about recipes as well) are an interesting area of topic map design.

    PS: Comments and/or suggestions on areas to post about are most welcome. Subject analysis for topic maps is not unlike cataloging in library science to a degree, except that what classification you assign is entirely the work product of your experience, reading and analysis. There are no fixed answers, only the ones that you find the most useful.

    Apache OODT – Top Level Project

    Friday, January 7th, 2011

    Apache OODT is the first NASA-developed software to achieve ASF Top Level Project status.

    From the website:

    Just what is Apache™ OODT?

    It’s metadata for middleware (and vice versa):

    • Transparent access to distributed resources
    • Data discovery and query optimization
    • Distributed processing and virtual archives

    But it’s not just for science! It’s also a software architecture:

    • Models for information representation
    • Solutions to knowledge capture problems
    • Unification of technology, data, and metadata

    Looks like a project that could benefit from having topic maps as part of its tool kit.

    Check out the 0.1 OODT release and see what you think.

    Developing High Quality Data Models – Book

    Thursday, December 9th, 2010

    Developing High Quality Data Models by Dr. Matthew West is due out in January of 2011. (Pre-order: Elsevier, Amazon)

    From the website:

    Anyone charged with developing a data model knows that there is a wide variety of potential problems likely to arise before achieving a high quality data model. With dozens of attributes and millions of rows, data modelers are always in danger of inconsistency and inaccuracy. The development of the data model itself could result in difficulties presenting accurate data. The need to improve data models begins in getting it right in the first place.

    Developing High Quality Data Models uses real-world examples to show you how to identify a number of data modeling principles and analysis techniques that will enable you to develop data models that consistently meet business requirements. A variety of generic data model patterns that exemplify the principles and techniques discussed build upon one another to give a powerful and integrated generic data model with wide applicability across many disciplines. The principles and techniques outlined in this book are applicable in government and industry, including but not limited to energy exploration, healthcare, telecommunications, transportation, military defense, transportation and so on.

    Table of Contents:

    Chapter 1- Introduction
    Chapter 2- Entity Relationship Model Basics
    Chapter 3- Some types and uses of data models
    Chapter 4- Data models and enterprise architecture
    Chapter 5- Some observations on data models and data modeling
    Chapter 6- Some General Principles for Conceptual, Integration and Enterprise Data Models
    Chapter 7- Applying the principles for attributes
    Chapter 8- General principles for relationships
    Chapter 9- General principles for entity types
    Chapter 10- Motivation and overview for an ontological framework
    Chapter 12- Classes
    Chapter 13- Intentionally constructed objects
    Chapter 14- Systems and system components
    Chapter 15- Requirements specifications
    Chapter 16- Concluding Remarks
    Chapter 17- The HQDM Framework Schema

    I first became familiar with the work of Dr. West from Ontolog. You can visit his publications page to see why I am looking forward to this book.

    Citation of and comments on this work will follow as soon as access and time allow.