Archive for the ‘Data Models’ Category

A different take on data skepticism

Thursday, April 25th, 2013

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?

Beau make several good points on questioning data methods.

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute with regard to those designs, what choices were made, by who and for what reason?

Giving voice what can be known about methods and data falls to human users.

Practical tools for exploring data and models

Wednesday, April 17th, 2013

Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)

From the introduction:

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.

The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we
have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).

Has a focus on R tools, including ggplot2 and Wilkinson’s The Grammar of Graphics.

The “…never believing we have found the best model” approach works for me!

You?

I first saw this at Data Scholars.

Data Governance needs Searchers, not Planners

Wednesday, March 6th, 2013

Data Governance needs Searchers, not Planners by Jim Harris.

From the post:

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance. A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible. Change management efforts are resisted when they impose new methods by emphasizing bad business and technical processes, as well as bad data-related employee behaviors, while ignoring unheralded processes and employees whose existing methods are preventing other problems from happening.

If you don’t remember any line from any post you read here or elsewhere, remember this one:

“…they rely on their own common sense to reason about the behavior of people who are different from them.”

Whenever you encounter a situation where that description fits, you will find failed projects, waste and bad morale.

Data models for version management…

Sunday, March 3rd, 2013

Data models for version management of legislative documents by María Hallo Carrasco, M. Mercedes Martínez-González, and Pablo de la Fuente Redondo.

Abstract:

This paper surveys the main data models used in projects including the management of changes in digital normative legislation. Models have been classified based on a set of criteria, which are also proposed in the paper. Some projects have been chosen as representative for each kind of model. The advantages and problems of each type are analysed, and future trends are identified.

I first saw this at Legal Informatics, which had already assembled the following resources:

The legislative metadata models discussed in the paper include:

Useful as models of change tracking should you want to express that in a topic map.

To say nothing of overcoming the semantic impedance between these model.

Starting Data Analysis with Assumptions

Friday, January 11th, 2013

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore-even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the patterns in the data, which is being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews of a few taxi drivers would have dispelled the original assumption of high demand for taxis. It would also have led to the cause of the patterns Senn recognized.

That is the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.

Data modeling … with graphs

Wednesday, November 7th, 2012

Data modeling … with graphs by Peter Bell.

Nothing surprising for topic map users but a nice presentation on modeling for graphs.

For Neo4j, unlike topic maps, you have to normalize your data before entering it into the graph.

That is if you want one node per subject.

Depends on your circumstances if that is worthwhile.

Amazing things have been done with normalized data in relational databases.

Assuming you want to pay the cost of normalization, which can include a lack of interoperability with others, errors in conversion, brittleness in the face of changing models, etc.

Topic Map Modeling of Sequestration Data (Help Pls!)

Saturday, September 29th, 2012

With the political noise in the United States over presidential and other elections, it is easy to lose sight of a looming “sequestration” that on January 2, 2013 will result in:

10.0% reduction non-exempt defense mandatory funding
9.4% reduction non-exempt defense discretionary funding
8.2% reduction non-exempt nondefense discretionary funding
7.6% reduction non-exempt nondefense mandatory funding
2.0% reduction Medicare

The report is not a model of clarity/transparency. See: U.S. Sequestration Report – Out of the Shadows/Into the Light?.

Report caveats make it clear cited amounts are fanciful estimates that can change radically as more information becomes available.

Be that as it may, a topic map based on the reported accounts as topics can capture the present day conjectures. To say nothing of capturing future revelations of exact details.

Whether from sequestration or from efforts to avoid sequestration.

Tracking/transparency has to start somewhere and it may as well be here.

In evaluating the data for creation of a topic map, I have encountered an entry with a topic map modeling issue.

I could really use your help.

Here is the entry in question:

Department of Health and Human Services, Health Resources and Services Administration, 009-15-0350, Health Resources and Services, Nondefense Function, Mandatory (page 80 of Appendix A, page 92 of the pdf of the report):

BA Type BA Amount Sequester Percentage Sequester Amount
Sequestrable BA 514 7.6 39
Sequestrable BA
– special rule
1352 2.0 27
Exempt BA 10
Total Gross BA 1876
Offsets -16
Net BA 1860

If it read as follows, no problem.

Example: Not Accurate

BA Type BA Amount Sequester Percentage Sequester Amount
Sequestrable BA 514 7.6 39
Sequestrable BA
– special rule
1352 2.0 27
Total Gross BA 1876

Because there is no relationship between “Exempt BA” and “Offsets” to either “Sequestrable BA” or “Sequestrable BA – special rule.” I just report both of them with the percentages and total amounts to be withheld.

True, the percentages don’t change, nor does the amount to be withheld change, because of the “Exempt BA” or the “Offsets.” (Trusting soul that I am, I did verify the calculations. ;-) )

Problem: How do I represent the relationship between the “Exempt BA” and “Offsets” to either/or/both “Sequestrable BA,” “Sequestrable BA – special rule?”

Of the 1318 entries in Appendix A of this report, including this one, it is the only entry with this issue. (A number of accounts are split into discretionary/mandatory parts. I am counting each part as a separate “entry.”)

If I ignore “Exempt BA” and “Offsets” in this case, my topic map is an incomplete representation of Appendix A.

It is also the case that I want to represent the information “as written.” There may be some external explanation that clarifies this entry, but that would be an “addition” to the original topic map.

Suggestions?

Modeling vs Mining?

Wednesday, May 16th, 2012

Steve Miller writes in Politics of Data Models and Mining:

I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by couple of academic econometricians.

The data mining “attitudes” range from the extremes that DM techniques are to be avoided like the plague, to one where “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.

Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”

In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses for testing with regression models. ML, on the other hand, obsesses on discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.

Watching other communities fight over odd questions is always more enjoyable than serious disputes of grave concern in our own. (See Using “Punning” to Answer httpRange-14 for example.)

I mention the economist’s dispute, not simply to make jests at the expense of “econometricians.” (Do topic map supporters need a difficult name? TopicMapologists? Too short.)

The economist’s debate is missing an understanding that modeling requires some knowledge of the domain (mining whether formal or informal) and mining requires some idea of an output (models whether spoken or unspoken). A failing that is all too common across modeling/mining domains.

To put it another way:

We never stumble upon data that is “untouched by human hands.”

We never build models without knowledge of the data we are modeling.

The relevant question is: Does the model or data mining provide a useful result?

(Typically measured by your client’s joy or sorrow over your results.)

Data and Reality

Thursday, March 15th, 2012

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

NoSQL Data Modeling Techniques

Sunday, March 4th, 2012

NoSQL Data Modeling Techniques by Ilya Katsov.

From the post:

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like CAP theorem are well applicable to the NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks of systematic theory like in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

To explore data modeling techniques, we have to start with some more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

Very complete and readable coverage of NoSQL data modeling techniques!

A must read if you are interested in making good choices between NoSQL solutions.

This post could profitably turned into a book length treatment with longer and a greater variety of examples.

Modelling with Graphs

Tuesday, November 22nd, 2011

Modelling with Graphs by Alistair Jones at NoSQL Br 2011

From the description:

Neo4j is a powerful and expressive tool for storing, querying and manipulating data. However modelling data as graphs is quite different from modelling data under with relational databases. In this talk we’ll cover modelling business domains using graphs and show how they can be persisted and queried in the popular open source graph database Neo4j. We’ll contrast this approach with the relational model, and discuss the impact on complexity, flexibility and performance. We’ll also discuss strategies for deciding how to proceed when a graph allows multiple ways to represent the same concept, and explain the trade-offs involved. As a side-effect, we’ll examine some of the new tools for how to query graph data in Neo4j, and discuss architectures for using Neo4j in enterprise applications.

Alistair is a Software Engineer with Neo Technology, the company behind the popular open source graph database Neo4j.

Alistair has extensive experience as a developer, technical lead and architect for teams building enterprise software across a range of industries. He has a particular focus Domain Driven Design, and is an expert on Agile methodologies. Alistair often writes and presents on applying Agile principles to the discipline of performance testing.

Excellent presentation!

Anyone care to suggest a book on modeling or modeling with graphs?

Connecting the Dots: An Introduction

Wednesday, November 9th, 2011

Connecting the Dots: An Introduction

A new series of posts by Rick Sherman who writes:

In the real world the situations I discuss or encounter in enterprise BI, data warehousing and MDM implementations lead me to the conclusion that many enterprises simply do not connect the dots. These implementations potentially involve various disciplines such as data modeling, business and data requirements gathering, data profiling, data integration, data architecture, technical architecture, BI design, data governance, master data management (MDM) and predictive analytics. Although many BI project teams have experience in each of these disciplines they’re not applying the knowledge from one discipline to another.

The result is knowledge silos where the the best practices and experience from one discipline is not applied in the other disciplines.

The impact is a loss in productivity for all, higher long-term costs and poorly constructed solutions. This often results in solutions that are difficult to change as the business changes, don’t scale as the data volumes or numbers of uses increase, or is costly to maintain and operate.

Imagine that, knowledge silos in the practice of eliminating knowledge silos.

I suspect that reflects the reality that each of us is a model of a knowledge silo. There are areas we like better than others, areas we know better than others, areas where we simply don’t have the time to learn. But when asked for an answer to our part of a project, we have to have some answer, so we give the one we know. Hard to imagine us doing otherwise.

We can try to offset that natural tendency by reading broadly, looking for new areas or opportunities to learn new techniques, or at least have team members or consultants who make a practice out of surveying knowledge techniques broadly.

Rick promises to show how data modeling is disconnected from the other BI disciplines in the next Connecting the Dots post.

< NIEM > National Information Exchange Model

Monday, September 5th, 2011

< NIEM > National Information Exchange Model

From the technical introduction:

NIEM provides a common vocabulary for consistent, repeatable exchanges of information between agencies and domains. The model is represented in a number of forms, including a data dictionary and a reference schema, and includes the body of concepts and rules that underlie its structure, maintain its consistency, and govern its use.

NIEM is a comprehensive resource for organizations to successfully exchange information, offering tools, terminology, help, training, governance, and an active community of users.

NIEM uses extensible markup language (XML), which allows the structure and meaning of data to be defined through simple but carefully defined syntax rules and provides a common framework for information exchange.

The model’s unique architecture enables data components to be constrained, extended, and augmented as necessary to formulate XML exchange schemas, and XML instance documents defining the information payloads for data exchange. These exchange-defining documents are packaged in information exchange package documentation (IEPDs) that are reusable, modifiable, and extendable.

It’s Labor Day and I have yet to get the “tools” link to work. Must be load on the site. ;-)

It’s a large effort and site so it will take some time to explore it.

If you are participating in < NIEM > please give a shout.

PS: I encountered < NIEM > following a link to the 2011 National Training Event videos. Registration is required but free.

If You Have Too Much Data, then “Good Enough” Is Good Enough

Sunday, June 12th, 2011

If You Have Too Much Data, then “Good Enough” Is Good Enough by Pat Helland.

This is a must read article where the author concludes:

The database industry has benefited immensely from the seminal work on data theory started in the 1970s. This work changed the world and continues to be very relevant, but it is apparent now that it captures only part of the problem.

We need a new theory and taxonomy of data that must include:

  • Identity and versions. Unlocked data comes with identity and optional versions.
  • Derivation. Which versions of which objects contributed to this knowledge? How is their schema interpreted? Changes to the source would drive a recalculation just as in Excel. If a legal reason means the source data may not be used, you should forget about using the knowledge derived from it.
  • Lossyness of the derivation. Can we invent a bounding that describes the inaccuracies introduced by derived data? Is this a multidimensional inaccuracy? Can we differentiate loss from the inaccuracies caused by sheer size?
  • Attribution by pattern. Just like a Mulligan stew, patterns can be derived from attributes that are derived from patterns (and so on). How can we bound taint from knowledge that we are not legally or ethically supposed to have?
  • Classic locked database data. Let’s not forget that any new theory and taxonomy of data should include the classic database as a piece of the larger puzzle.

The example of data relativity, a local “now” in data systems, which may not be consistent with the state at some other location, was particularly good.

If the TMRM is a Data Model…

Wednesday, May 4th, 2011

Whenever I hear the TMRM referred to or treated like a data model, I feel like saying in a Darth Vader type voice:

If the TMRM is a data model, then where are its data types?

It is my understanding that data models, legends in TMRM-speak, define data types on which they base declarations of equivalence (in terms of the subjects represented).

Being somewhat familiar with the text of the TMRM, or at least the current draft, I don’t see any declaration of data types in the TMRM.

Nor do I see any declarations of where the recursion of keys ends. Another important aspect of legends.

Nor do I see any declarations of equivalence (on the absent data types).

Yes, there is an abstraction of a path language, which would depend upon the data types and recursion through keys and values, but that is only an abstraction of a path language. It awaits declaration of data types, etc., in order to be an implementable path language.

There is a reason for the TMRM being written at that level of abstraction. To support any number of legends, written with any range of data types and choices with regard to the composition of those data types and subsequently the paths supported.

Any legend is going to make those choices and they are all equally valid if not all equally useful for some use cases. Every legend closes off some choices and opens up others.

For example, in bioinformatics, why would I want to do the subjectIdentifier/subjectLocator shuffle when I am concerned with standard identifiers for genes for example?

BTW, before anyone rushes out to write the legend syntax, realize that its writing results in subjects that could also be the targets of topic maps with suitable legends.

It is important that syntaxes be subjects, for a suitable legend, because syntaxes come and go out of fashion.

The need to merge subjects represented by those syntaxes, however, awaits only the next person with a brilliant insight.

IC Bias: If it’s measurable, it’s meaningful

Thursday, April 21st, 2011

Dean Conway writes in Data Science in the U.S. Intelligence Community [1] about modeling assumptions:

For example, it is common for an intelligence analyst to measure the relationship between two data sets as they pertain to some ongoing global event. Consider, therefore, in the recent case of the democratic revolution in Egypt that an analyst had been asked to determine the relationship between the volume of Twitter traffic related to the protests and the size of the crowds in Tahrir Square. Assuming the analyst had the data hacking skills to acquire the Twitter data, and some measure of crowd density in the square, the next step would be to decide how to model the relationship statistically.

One approach would be to use a simple linear regression to estimate how Tweets affect the number of protests, but would this be reasonable? Linear regression assumes an independent distribution of observations, which is violated by the nature of mining Twitter. Also, these events happen in both time (over the course of several hours) and space (the square), meaning there would be considerable time- and spatial-dependent bias in the sample. Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC.

His central point that: Understanding how modeling assumptions impact the interpretations of analytical results is critical to data science, and this is particularly true in the IC. cannot be over emphasized.

The example of Twitter traffic reveals a deeper bias in the intelligence community, if it’s measurable, it’s meaningful.

No doubt Twitter facilitated communication within communities that already existed but that does not make it an enabling technology.

The revolution was made possible by community organizers working over decades (http://english.aljazeera.net/news/middleeast/2011/02/2011212152337359115.html) and trade unions (http://www.guardian.co.uk/commentisfree/2011/feb/10/trade-unions-egypt-tunisia).

And the revolution continued after Twitter and then cell phones were turned off.

Understanding such events requires investment in human intell and analysis, not over reliance on SIGINT. [2]


[1] Spring (2011) issue of I-Q-Tel’s quarterly journal, IQT Quarterly

[2] That a source is technical or has lights and bells, does not make it reliable or even useful.

PS: The Twitter traffic, such as it was, may have primarily been from: Twitter, I think, is being used by news media people with computer connections, through those kind of means. Facebook, Twitter, and the Middle East, IEEE Spectrum, Steve Cherry interviews Ben Zhao, expert on social networking performance.

Are we really interested in how news people use Twitter, even in a social movement context?

OSCON Data 2011 Call for Participation

Wednesday, March 2nd, 2011

OSCON Data 2011 Call for Participation

Deadline: 11:59pm 03/14/2011 PDT

From the website:

The O’Reilly OSCON Data conference is the first of its kind: bringing together open source culture and data hackers to cover data management at a very practical level. From disks and databases through to big data and analytics, OSCON Data will have instruction and inspiration from the people who actually do the work.

OSCON Data will take place July 25-27, 2011, in Portland, Oregon. We’ll be co-located with OSCON itself.

Proposals should include as much detail about the topic and format for the presentation as possible. Vague and overly broad proposals don’t showcase your skills and knowledge, and our volunteer reviewers aren’t mind readers. The more you can tell us, the more likely the proposal will be selected.

Proposals that seem like a “vendor pitch” will not be considered. The purpose of OSCON Data is to enlighten, not to sell.

Submit a proposal.

Yes, it is right before Balisage but I think worth considering if you are on the West Coast and can’t get to Balisage this year or if you are feeling really robust. ;-)

Hmmm, I wonder how a proposal that merges the indexes of the different NoSQL volumes from O’Reilly would be received? You are aware that O’Reilly is re-creating the X-Windows problem that was the genesis of both topic maps and DocBook?

I will have to write that up in detail at some point. I wasn’t there but have spoken to some of the principals who were. Plus I have the notes, etc.

…a grain of salt

Friday, February 25th, 2011

Benjamin Bock asked me recently about how I would model a mole of salt in a topic map.

That is a good question but I think we had better start with a single grain of salt and then work our way up from there.

At first blush, and only at first blush, do many subjects look quite easy to represent in a topic map.

A grain of salt looks simple to at first glance, just create a PSI (Published Subject Identifier), put that as the subjectIdentifier on a topic and be done with it.

Well…, except that I don’t want to talk about a particular grain of salt, I want to talk about salt more generally.

OK, one of those, I see.

Alright, same answer as before, except make the PSI for salt in general, not some particular grain of salt.

Well,…., except that when I go to the Wikipedia article on salt, Salt, I find that salt is a compound of chlorine and sodium.

A compound, oh, that means something made up of more than one subject. In a particular type of relationship.

Sounds like an association to me.

Of a particular type, an ionic association. (I looked it up, see: Ionic Compound)

And this association between chlorine and sodium has several properties reported by Wikipedia, here are just a few of them:

  • Molar mass: 58.443 g/mol
  • Appearance: Colorless/white crystalline solid
  • Odor: Odorless
  • Density: 2.165 g/cm3
  • Melting point: 801 °C, 1074 K, 1474 °F
  • Boiling point: 1413 °C, 1686 K, 2575 °F
  • … and several others.

    If you are interested in scientific/technical work, please be aware of CAS, a work product of the American Chemical Society, with a very impressive range unique identifiers. (56 million organic and inorganic substances, 62 million sequences and they have a counter that increments while you are on the page.)

    Note that unlike my suggestion, CAS takes the assign a unique identifier view for the substances, sequences and chemicals that they curate.

    Oh, sorry, got interested in the CAS as a source for subject identification. In fact, that is a nice segway to consider how to represent the millions and millions of compounds.

    We could create associations with the various components being role players but then we would have to reify those associations in order to hang additional properties off of them. Well, technically speaking in XTM we would create non-occurrence occurrences and type those to hold the additional properties.

    Sorry, I was presuming the decision to represent compounds as associations. Shout out when I start to presume that sort of thing. ;-)

    The reason I would represent compounds as associations is that the components of the associations are then subjects I can talk about and even add addition properties to, or create mappings between.

    I suspect that CAS has chemistry from the 1800′s fairly well covered but what about older texts? Substances before then may not be of interest to commercial chemists but certainly would be of interest to historians and other scholars.

    Use of a topic map plus the CAS identifiers would enable scholars studying older materials to effectively share information about older texts, which have different designations for substances than CAS would record.

    You could argue that I could use a topic for compounds, much as CAS does, and rely upon searching in order to discover relationships.

    Tis true, tis true, but my modeling preference is for relationships seen as subjects, although I must confess I would prefer a next generation syntax that avoids the reification overhead of XTM.

    Given the prevalent of complex relationships/associations as you see from the CAS index, I think a simplification of the representation of associations is warranted.

    Sorry, I never did quite reach Benjamin’s question about a mole of salt but I will take up that gauge again tomorrow.

    We will see that measurements (which figured into his questions about recipes as well) is an interesting area of topic map design.
    *****

    PS: Comments and/or suggestions on areas to post about are most welcome. Subject analysis for topic maps is not unlike cataloging in library science to a degree, except that what classification you assign is entirely the work product of your experience, reading and analysis. There are no fixed answers, only the ones that you find the most useful.

    Apache OODT – Top Level Project

    Friday, January 7th, 2011

    Apache OODT is the first ASF Top Level Project status for NASA developed software.

    From the website:

    Just what is Apache™ OODT?

    It’s metadata for middleware (and vice versa):

    • Transparent access to distributed resources
    • Data discovery and query optimization
    • Distributed processing and virtual archives

    But it’s not just for science! It’s also a software architecture:

    • Models for information representation
    • Solutions to knowledge capture problems
    • Unification of technology, data, and metadata

    Looks like a project that could benefit from having topic maps as part of its tool kit.

    Check out the 0.1 OODT release and see what you think.

    Developing High Quality Data Models – Book

    Thursday, December 9th, 2010

    Developing High Quality Data Models by Dr. Matthew West is due out in January of 2011. (Pre-order: Elsevier, Amazon)

    From the website:

    Anyone charged with developing a data model knows that there is a wide variety of potential problems likely to arise before achieving a high quality data model. With dozens of attributes and millions of rows, data modelers are in always danger of inconsistency and inaccuracy. The development of the data model itself could result in difficulties presenting accurate data. The need to improve data models begins in getting it right in the first place.

    Developing High Quality Data Models uses real-world examples to show you how to identify a number of data modeling principles and analysis techniques that will enable you to develop data models that consistently meet business requirements. A variety of generic data model patterns that exemplify the principles and techniques discussed build upon one another to give a powerful and integrated generic data model with wide applicability across many disciplines. The principles and techniques outlined in this book are applicable in government and industry, including but not limited to energy exploration, healthcare, telecommunications, transportation, military defense, transportation and so on.

    Table of Contents:

    Preface
    Chapter 1- Introduction
    Chapter 2- Entity Relationship Model Basics
    Chapter 3- Some types and uses of data models
    Chapter 4- Data models and enterprise architecture
    Chapter 5- Some observations on data models and data modeling
    Chapter 6- Some General Principles for Conceptual, Integration and Enterprise Data Models
    Chapter 7- Applying the principles for attributes
    Chapter 8- General principles for relationships
    Chapter 9- General principles for entity types
    Chapter 10- Motivation and overview for an ontological framework
    Chapter 12- Classes
    Chapter 13- Intentionally constructed objects
    Chapter 14- Systems and system components
    Chapter 15- Requirements specifications
    Chapter 16- Concluding Remarks
    Chapter 17- The HQDM Framework Schema

    I first became familiar with the work of Dr. West from Ontolog. You can visit his publications page to see why I am looking forward to this book.

    Citation of and comments on this work will follow as soon as access and time allow.