Archive for the ‘Modeling’ Category

Become a Super Modeler

Thursday, May 9th, 2013

Become a Super Modeler (Webinar)

Thursday, May 16th
11am PDT / 2pm EDT / 7pm BST / 8pm CEST

Sure you can do some time series modeling. Maybe some user profiles. What’s going to make you a super modeler? Let’s take a look at some great techniques taken from real world applications where we exploit the Cassandra big table model to its fullest advantage. We’ll cover some of the new features in CQL 3 as well as some tried and true methods. In particular, we will look at fast indexing techniques to get data faster at scale. You’ll be jet setting through your data like a true super modeler in no time.

Speaker: Patrick McFadin, Principal Solutions Architect at DataStax

Looks interesting and I have neglected to look closely at CQL 3.

Could be some incentive to read up before the webinar.

Successful PROV Tutorial at EDBT

Friday, April 5th, 2013

Successful PROV Tutorial at EDBT by Paul Groth.

From the post:

On March 20th, 2013 members of the Provenance Working Group gave a tutorial on the PROV family of specifications at the EDBT conference in Genova, Italy. EDBT (“Extending Database Technology”) is widely regarded as one of the prime venues in Europe for dissemination of data management research.

The 1.5 hours tutorial was attended by about 26 participants, mostly from academia. It was structured into three parts of approximately the same length. The first two parts introduced PROV as a relational data model with constraints and inference rules, supported by a (nearly) relational notation (PROV-N). The third part presented known extensions and applications of PROV, based on the extensive PROV implementation report and implementations known to the presenter at the time.

All the presentation material is available here.

As the first part of the tutorial notes:

  • Provenance is not a new subject
    • workflow systems
    • databases
    • knowledge representation
    • information retrieval
  • Existing community-grown vocabularies
    • Open Provenance Model (OPM)
    • Dublin Core
    • Provenir ontology
    • Provenance vocabulary
    • SWAN provenance ontology
    • etc.

The existence of “other” vocabularies isn’t an issue for topic maps.

You can query on “your” vocabulary and obtain results from “other” vocabularies.

Enriches your information and that of others.

You will need to know about the vocabularies of others and their oddities.
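
A minimal sketch of that idea in Python, assuming a hand-built mapping table. The property names `dc:creator` and `prov:wasAttributedTo` come from Dublin Core and PROV; treating them as naming the same subject is a simplifying assumption for illustration, and the subject label, sample statements and function name are all hypothetical.

```python
# Hypothetical mapping: each vocabulary-specific property points to one
# shared subject. The equivalence is an assumption made for illustration.
SAME_SUBJECT = {
    "dc:creator": "attribution",
    "prov:wasAttributedTo": "attribution",
}

# Made-up statements recorded in two different vocabularies.
STATEMENTS = [
    ("report.pdf", "dc:creator", "Alice"),
    ("dataset.csv", "prov:wasAttributedTo", "Bob"),
    ("report.pdf", "dc:title", "Annual Report"),
]


def query(prop, statements=STATEMENTS, mapping=SAME_SUBJECT):
    """Return statements whose property names the same subject as prop."""
    subject = mapping.get(prop, prop)
    return [s for s in statements if mapping.get(s[1], s[1]) == subject]


# Asking in "your" vocabulary also returns the statement made in the "other" one.
results = query("dc:creator")
for item, prop, value in results:
    print(item, prop, value)
```

The mapping table is where knowledge of other vocabularies and their oddities has to be written down; the query itself stays in whichever vocabulary the user prefers.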

For the W3C work on provenance, follow this tutorial and the others it mentions.

Data Points: Preview

Thursday, April 4th, 2013

Data Points: Preview by Nathan Yau.

As you already know, Nathan is a rich source for interesting graphics and visualizations, some of which I have the good sense to point to.

What you may not know is that Nathan has a new book out: Data Points: Visualizations That Mean Something.

Not a book about coding to visualize data but rather:

Data Points is all about process from a non-programming point of view. Start with the data, really understand it, and then go from there. Data Points is about looking at your data from different perspectives and how it relates to real life. Then design accordingly.

That’s the hard part isn’t it?

Like the ongoing discussion here about modeling for topic maps.

Unless you understand the data, models and visualizations alike are going to be meaningless.

Check out Nathan’s new book to increase your chances of models and visualizations that mean something.

Design Pattern Sources?

Monday, April 1st, 2013

To continue the thread on the need for topic map design patterns, what sources would you suggest for design patterns?

Thinking that it would be more efficient to start from commonly known patterns and then when necessary, to branch out into new or unique ones.

Not to mention that starting with familiar patterns, as opposed to esoteric ones, will provide some comfort level for users.

Sources that I have found useful include:

Data Model Patterns: Conventions of Thought by David C. Hay.

Domain-Driven Design: Tackling Complexity in the Heart of Software by Eric Evans.

Developing High Quality Data Models by Matthew West. (Think Shell Oil. Serious enterprise context.)

Do you have any favorites you would suggest?

After a day or two of favorites, the next logical step would be to choose a design pattern and, with an eye on Kal’s Design Pattern Examples, attempt to fashion a design template.

Just one, without bothering to specify what comes next.

Working one bite at a time will make the task seem manageable.

Yes?

Topic Map Design Patterns For Information Architecture

Monday, April 1st, 2013

Topic Map Design Patterns For Information Architecture by Kal Ahmed.

Abstract:

Software design patterns give programmers a high level language for discussing the design of software applications. For topic maps to achieve widespread adoption and improved interoperability, a set of topic map design patterns are needed to codify existing practices and make them available to a wider audience. Combining structured descriptions of design patterns with Published Subject Identifiers would enable not only the reuse of design approaches but also encourage the use of common sets of PSIs. This paper presents the arguments for developing and publishing topic map design patterns and a proposed notation for diagramming design patterns based on UML. Finally, by way of examples, the paper presents some design patterns for representation of traditional classification schemes such as thesauri, hierarchical and faceted classification.

Kal used UML to model the design patterns and their constraints. (TMCL, the Topic Map Constraint Language, had yet to be written.)

For visual modeling purposes, are there any constraints in TMCL that cannot be modeled in UML?

I ask because I have not compared TMCL to UML.

Using UML to express the generic constraints in TMCL would be a first step towards answering the need for topic maps design patterns.

Topic Map Design Patterns

Monday, April 1st, 2013

A recent comment on topic map design patterns reads in part:

The second problem, and the one I’m working through now, is that information modeling with topic maps is a new paradigm for me (and most people I’m sure) and the information on topic map models is widely dispersed. Techquila had some design patterns that were very useful and later those were put in a paper by A. Kal but, in general, it is a lot more difficult to figure out the information model with topic maps than it is with SQL or NoSQL or RDF because those other technologies have a lot more open discussions of designs to cover specific use cases. If those discussions existed for topic maps, it would make it easier for non-experts like me to connect the high-level this-is-how-topic-maps-work type information (that is plentiful) with the this-is-the-problem-and-this-is-the-model-that-solves-it type information (that is hard to find for topic maps).

Specifically, the problem I’m trying to solve and many other real world problems need a semi-structured information model, not just an amorphous blob of topics and associations. There are multiple dimensions of hierarchies and sequences that need to be modeled so that the end user can query the system with OLAP type queries where they drill up and down or pan forward and back through the information until they find what they need.

Do you know of any books of Topic Maps use cases and/or design patterns?

Unfortunately I had to say that I knew of no “Topic Maps use cases and/or design patterns” books.

There is XML Topic Maps: Creating and Using Topic Maps for the Web by Sam Hunting and Jack Park, but it isn’t what I would call a design pattern book.

While searching for the Hunting/Park book I did find: Topic Maps: Semantische Suche im Internet (Xpert.press) (German Edition) [Paperback] by Richard Widhalm (Author), Thomas Mück, with a 2012 publication date. Don’t be deceived. This is a reprint of the 2002 edition.

Any books that I have missed on topic maps modeling in particular?

The comment identifies a serious lack of resources on use cases and design patterns for topic maps.

My suggestion is that we all refresh our memories of Kal’s work on topic map design patterns (which I will cover in a separate post) and start to correct this deficiency.

What say you all?

Learning Grounded Models of Meaning

Friday, March 29th, 2013

Learning Grounded Models of Meaning

Schedule and readings for seminar by Katrin Erk and Jason Baldridge:

Natural language processing applications typically need large amounts of information at the lexical level: words that are similar in meaning, idioms and collocations, typical relations between entities, lexical patterns that can be used to draw inferences, and so on. Today such information is mostly collected automatically from large amounts of data, making use of regularities in the co-occurrence of words. But documents often contain more than just co-occurring words, for example illustrations, geographic tags, or a link to a date. Just like co-occurrences between words, these co-occurrences of words and extra-linguistic data can be used to automatically collect information about meaning. The resulting grounded models of meaning link words to visual, geographic, or temporal information. Such models can be used in many ways: to associate documents with geographic locations or points in time, or to automatically find an appropriate image for a given document, or to generate text to accompany a given image.

In this seminar, we discuss different types of extra-linguistic data, and their use for the induction of grounded models of meaning.

Very interesting reading that should keep you busy for a while! ;-)

MetaNetX.org…

Saturday, March 16th, 2013

MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks by Mathias Ganter, Thomas Bernard, Sébastien Moretti, Joerg Stelling and Marco Pagni. (Bioinformatics (2013) 29 (6): 815-816. doi: 10.1093/bioinformatics/btt036)

Abstract:

MetaNetX.org is a website for accessing, analysing and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways. It consistently integrates data from various public resources and makes the data accessible in a standardized format using a common namespace. Currently, it provides access to hundreds of GSMs and pathways that can be interactively compared (two or more), analysed (e.g. detection of dead-end metabolites and reactions, flux balance analysis or simulation of reaction and gene knockouts), manipulated and exported. Users can also upload their own metabolic models, choose to automatically map them into the common namespace and subsequently make use of the website’s functionality.

http://metanetx.org.

The authors are addressing a familiar problem:

Genome-scale metabolic networks (GSMs) consist of compartmentalized reactions that consistently combine biochemical, genetic and genomic information. When also considering a biomass reaction and both uptake and secretion reactions, GSMs are often used to study genotype–phenotype relationships, to direct new discoveries and to identify targets in metabolic engineering (Karr et al., 2012). However, a major difficulty in GSM comparisons and reconstructions is to integrate data from different resources with different nomenclatures and conventions for both metabolites and reactions. Hence, GSM consolidation and comparison may be impossible without detailed biological knowledge and programming skills. (emphasis added)

For which they propose an uncommon solution:

MetaNetX.org is implemented as a user-friendly and self-explanatory website that handles all user requests dynamically (Fig. 1a). It allows a user to access a collection of hundreds of published models, browse and select subsets for comparison and analysis, upload or modify new models and export models in conjunction with their results. Its functionality is based on a common namespace defined by MNXref (Bernard et al., 2012). In particular, all repository or user uploaded models are automatically translated with or without compartments into the common namespace; small deviations from the original model are possible due to the automatic reconciliation steps implemented by Bernard et al. (2012). However, a user can choose not to translate his model but still make use of the website’s functionalities. Furthermore, it is possible to augment the given reaction set by user-defined reactions, for example, for model augmentation.
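
To make the common-namespace idea concrete, here is a small Python sketch. It is not how MetaNetX or MNXref work internally; every identifier below is invented for illustration, and the `translate` helper is hypothetical.

```python
# Hypothetical identifiers: none of these are real database or MNXref IDs.
TO_COMMON = {
    "sourceA:glc": "common:glucose",
    "sourceB:C001": "common:glucose",
    "sourceA:atp": "common:atp",
    "sourceB:C002": "common:atp",
}


def translate(model, mapping=TO_COMMON):
    """Map each metabolite ID to the common namespace; keep unknown IDs as-is."""
    return {mapping.get(m, m) for m in model}


model_a = {"sourceA:glc", "sourceA:atp"}
model_b = {"sourceB:C001", "sourceB:C002", "sourceB:C999"}

# Compared under their own nomenclatures, the two models share nothing.
shared_raw = model_a & model_b
# After translation, the overlap becomes visible.
shared_common = translate(model_a) & translate(model_b)

print(sorted(shared_raw))
print(sorted(shared_common))
```

Untranslated identifiers pass through unchanged, which loosely mirrors the option of not translating a model at all.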

The bioinformatics community recognizes the intellectual poverty of lock step models.

Wonder when the intelligence community is going to have that “a ha” moment?

Model Matters: Graphs, Neo4j and the Future

Friday, March 8th, 2013

Model Matters: Graphs, Neo4j and the Future by Tareq Abedrabbo.

From the post:

As part of our work, we often help our customers choose the right datastore for a project. There are usually a number of considerations involved in that process, such as performance, scalability, the expected size of the data set, and the suitability of the data model to the problem at hand.

This blog post is about my experience with graph database technologies, specifically Neo4j. I would like to share some thoughts on when Neo4j is a good fit but also what challenges Neo4j faces now and in the near future.

I would like to focus on the data model in this blog post, which for me is the crux of the matter. Why? Simply because if you don’t choose the appropriate data model, there are things you won’t be able to do efficiently and other things you won’t be able to do at all. Ultimately, all the considerations I mentioned earlier influence each other and it boils down to finding the most acceptable trade-off rather than picking a database technology for one specific feature one might fancy.

So when is a graph model suitable? In a nutshell when the domain consists of semi-structured, highly connected data. That being said, it is important to understand that semi-structured doesn’t imply an absence of structure; there needs to be some order in your data to make any domain model purposeful. What it actually means is that the database doesn’t enforce a schema explicitly at any given point in time. This makes it possible for entities of different types to cohabit – usually in different dimensions – in the same graph without the need to make them all fit into a single rigid structure. It also means that the domain can evolve and be enriched over time when new requirements are discovered, mostly with no fear of breaking the existing structure.

Effectively, you can start taking a more fluid view of your domain as a number of superimposed layers or dimensions, each one representing a slice of the domain, and each layer can potentially be connected to nodes in other layers.

More importantly, the graph becomes the single place where the full domain representation can be consolidated in a meaningful and coherent way. This is something I have experienced on several projects, because modeling for the graph gives developers the opportunity to think about the domain in a natural and holistic way. The alternative is often a data-centric approach, that usually results from integrating different data flows together into a rigidly structured form which is convenient for databases but not for the domain itself.
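
The "layers without an enforced schema" point can be sketched in a few lines of Python. This uses plain dictionaries rather than Neo4j or its API, and the node names, layer names and relation types are invented for illustration.

```python
from collections import defaultdict

# Nodes carry arbitrary properties; no schema is enforced.
nodes = {
    "alice": {"layer": "people", "name": "Alice"},
    "acme": {"layer": "organizations", "founded": 1990},
    "london": {"layer": "places", "country": "UK"},
}

edges = defaultdict(list)


def connect(source, relation, target):
    """Add a directed, typed edge between two existing nodes."""
    if source not in nodes or target not in nodes:
        raise KeyError("both endpoints must exist")
    edges[source].append((relation, target))


connect("alice", "WORKS_FOR", "acme")
connect("acme", "LOCATED_IN", "london")

# A new requirement arrives later: add a new kind of node and link it in
# without touching the existing nodes or edges.
nodes["graph-db"] = {"layer": "skills", "level": "expert"}
connect("alice", "HAS_SKILL", "graph-db")


def neighbors(node, layer=None):
    """Targets reachable in one hop, optionally limited to one layer."""
    return [t for _, t in edges[node]
            if layer is None or nodes[t]["layer"] == layer]


print(neighbors("alice"))
print(neighbors("alice", layer="skills"))
```

Nothing here stops two nodes from describing the same subject with different properties; keeping order in the data remains the modeler's job.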

Interesting review of the current and some projected capabilities of Neo4j.

I am particularly sympathetic to starting with the data users have as opposed to starting with a model written in software and shoehorning the user’s data to fit the model.

Can be done, has been done (for decades), and works quite well in some cases.

But not all cases.

Using molecular networks to assess molecular similarity

Friday, February 15th, 2013

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto others, less well known domains.
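
One way to picture "documented and domain specific tests" is a small registry, sketched here in Python. The two tests, their documentation strings and the `is_same` helper are hypothetical examples, not anything taken from the research described above.

```python
def same_molecule_formula(a, b):
    """Chemistry-flavored test: identical molecular formula (ignores structure)."""
    return a["formula"] == b["formula"]


def same_person(a, b):
    """People-flavored test: same normalized name and same birth year."""
    return (a["name"].strip().lower() == b["name"].strip().lower()
            and a["born"] == b["born"])


# Each test is registered with a note documenting what it checks and ignores.
TESTS = {
    "chemistry": (same_molecule_formula,
                  "equal molecular formula; isomers are conflated"),
    "people": (same_person, "case-insensitive name plus birth year"),
}


def is_same(domain, a, b):
    """Apply the documented test for a domain; refuse to guess for others."""
    if domain not in TESTS:
        raise ValueError(f"no documented sameness test for domain {domain!r}")
    test, _doc = TESTS[domain]
    return test(a, b)


# Ethanol and dimethyl ether share the formula C2H6O, so this deliberately
# crude test calls them "the same". The documentation says so up front.
ethanol = {"formula": "C2H6O"}
dimethyl_ether = {"formula": "C2H6O"}
print(is_same("chemistry", ethanol, dimethyl_ether))
```

There is no fallback test for unknown domains: asking about a domain without a documented test raises an error instead of projecting one domain's notion of sameness onto another.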

The Evolution of Regression Modeling… [Webinar]

Wednesday, February 6th, 2013

The Evolution of Regression Modeling: From Classical Linear Regression to Modern Ensembles by Mikhail Golovnya and Illia Polosukhin.

Dates/Times:

Part 1: Friday, March 1, 10 am, PST

Part 2: Friday, March 15, 10 am, PST

Part 3: Friday, March 29, 10 am, PST

Part 4: Friday, April 12, 10 am, PST

From the webpage:

Class Description: Regression is one of the most popular modeling methods, but the classical approach has significant problems. This webinar series addresses these problems. Are you working with larger datasets? Is your data challenging? Does your data include missing values, nonlinear relationships, local patterns and interactions? This webinar series is for you! We will cover improvements to conventional and logistic regression, and will include a discussion of classical, regularized, and nonlinear regression, as well as modern ensemble and data mining approaches. This series will be of value to any classically trained statistician or modeler.

Details:

Part 1: March 1 – Regression methods discussed

  • Classical Regression
  • Logistic Regression
  • Regularized Regression: GPS Generalized Path Seeker
  • Nonlinear Regression: MARS Regression Splines

Part 2: March 15 – Hands-on demonstration of concepts discussed in Part 1

  • Step-by-step demonstration
  • Datasets and software available for download
  • Instructions for reproducing demo at your leisure
  • For the dedicated student: apply these methods to your own data (optional)

Part 3: March 29 – Regression methods discussed
*Part 1 is a recommended pre-requisite

  • Nonlinear Ensemble Approaches: TreeNet Gradient Boosting; Random Forests; Gradient Boosting incorporating RF
  • Ensemble Post-Processing: ISLE; RuleLearner

Part 4: April 12 – Hands-on demonstration of concepts discussed in part 3

  • Step-by-step demonstration
  • Datasets and software available for download
  • Instructions for reproducing demo at your leisure
  • For the dedicated student: apply these methods to your own data (optional)

Salford Systems offers other introductory videos, webinars, tutorials and case studies.

Regression modeling is a tool you will encounter in data analysis and is likely to be an important part of your exploration toolkit.
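
For anyone who wants a feel for the difference between classical and regularized regression before the webinar, here is a minimal Python sketch with a single predictor. Ridge regression stands in for "regularized regression" generally; it is not the GPS method listed above, and the data and the `fit_line` function are made up for illustration.

```python
def fit_line(xs, ys, penalty=0.0):
    """Fit y = intercept + slope * x by least squares.

    penalty=0 gives classical least squares; penalty>0 gives ridge
    regression with the penalty on the slope only (intercept unpenalized).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

ols = fit_line(xs, ys)                   # classical fit recovers slope 2
ridge = fit_line(xs, ys, penalty=5.0)    # penalty shrinks the slope toward 0

print(ols)
print(ridge)
```

The penalty trades some bias for stability; on noiseless data like this it only hurts, which is why it earns its keep on messier data.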

I first saw this at KDNuggets.

…[D]emocratization of modeling, simulations, and predictions

Sunday, January 27th, 2013

Technical engine for democratization of modeling, simulations, and predictions by Justyna Zander and Pieter J. Mosterman. (Justyna Zander and Pieter J. Mosterman. 2012. Technical engine for democratization of modeling, simulations, and predictions. In Proceedings of the Winter Simulation Conference (WSC ’12). Winter Simulation Conference, Article 228, 14 pages.)

Abstract:

Computational science and engineering play a critical role in advancing both research and daily-life challenges across almost every discipline. As a society, we apply search engines, social media, and selected aspects of engineering to improve personal and professional growth. Recently, leveraging such aspects as behavioral model analysis, simulation, big data extraction, and human computation is gaining momentum. The nexus of the above facilitates mass-scale users in receiving awareness about the surrounding and themselves. In this paper, an online platform for modeling and simulation (M&S) on demand is proposed. It allows an average technologist to capitalize on any acquired information and its analysis based on scientifically-founded predictions and extrapolations. The overall objective is achieved by leveraging open innovation in the form of crowd-sourcing along with clearly defined technical methodologies and social-network-based processes. The platform aims at connecting users, developers, researchers, passionate citizens, and scientists in a professional network and opens the door to collaborative and multidisciplinary innovations. An example of a domain-specific model of a pick and place machine illustrates how to employ the platform for technical innovation and collaboration.

It is an interesting paper but when speaking of integration of models the authors say:

The integration is performed in multiple manners. Multi-domain tools that become accessible from one common environment using the cloud-computing paradigm serve as a starting point. The next step of integration happens when various M&S execution semantics (and models of computation (cf., Lee and Sangiovanni-Vincentelli 1998; Lee 2010) are merged and model transformations are performed.

That went by too quickly for me. You?

The question of effective semantic integration is an important one.

The U.S. federal government publishes enough data to map where some of the dark data is waiting to be found.

The good, bad or irrelevant data churned out every week makes the amount of effort required an ever increasing barrier to its use by the public.

Perhaps that is by design?

What do you think?

The Music Encoding Conference 2013

Saturday, November 10th, 2012

The Music Encoding Conference 2013

22-24 May, 2013
Mainz Academy for Literature and Sciences, Mainz, Germany

Important dates:
31 December 2012: Deadline for abstract submissions
31 January 2013: Notification of acceptance/rejection of submissions
21-24 May 2013: Conference
31 July 2013: Deadline for submission of full papers for conference proceedings
December 2013: Publication of conference proceedings

From the email announcement of the conference:

You are cordially invited to participate in the Music Encoding Conference 2013 – Concepts, Methods, Editions, to be held 22-24 May, 2013, at the Mainz Academy for Literature and Sciences in Mainz, Germany.

Music encoding is now a prominent feature of various areas in musicology and music librarianship. The encoding of symbolic music data provides a foundation for a wide range of scholarship, and over the last several years, has garnered a great deal of attention in the digital humanities. This conference intends to provide an overview of the current state of data modeling, generation, and use, and aims to introduce new perspectives on topics in the fields of traditional and computational musicology, music librarianship, and scholarly editing, as well as in the broader area of digital humanities.

As the conference has a dual focus on music encoding and scholarly editing in the context of the digital humanities, the Program Committee is also happy to announce keynote lectures by Frans Wiering (Universiteit Utrecht) and Daniel Pitti (University of Virginia), both distinguished scholars in their respective fields of musicology and markup technologies in the digital humanities.

Proposals for papers, posters, panel discussions, and pre-conference workshops are encouraged. Prospective topics for submissions include:

  • theoretical and practical aspects of music, music notation models, and scholarly editing
  • rendering of symbolic music data in audio and graphical forms
  • relationships between symbolic music data, encoded text, and facsimile images
  • capture, interchange, and re-purposing of music data and metadata
  • ontologies, authority files, and linked data in music encoding
  • additional topics relevant to music encoding and music editing

I know Daniel Pitti from the TEI (Text Encoding Initiative). His presence assures me this will be a great conference for markup, modeling and music enthusiasts.

I can recognize music because it comes in those little plastic boxes. ;-) If you want to talk about the markup/encoding/mapping side, ping me.

Data modeling … with graphs

Wednesday, November 7th, 2012

Data modeling … with graphs by Peter Bell.

Nothing surprising for topic map users but a nice presentation on modeling for graphs.

For Neo4j, unlike topic maps, you have to normalize your data before entering it into the graph.

That is if you want one node per subject.

Depends on your circumstances if that is worthwhile.

Amazing things have been done with normalized data in relational databases.

Assuming you want to pay the cost of normalization, which can include a lack of interoperability with others, errors in conversion, brittleness in the face of changing models, etc.
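
What "normalize before entering" means in practice can be sketched in Python. This is not Neo4j code; the records, the choice of `email` as the identifying key and the `merge_by_subject` helper are hypothetical.

```python
def merge_by_subject(records, key="email"):
    """Collapse records that share an identifying key into one node each."""
    nodes = {}
    for record in records:
        subject_id = record[key].strip().lower()  # normalization step
        node = nodes.setdefault(subject_id, {})
        for field, value in record.items():
            node.setdefault(field, value)  # first value wins; later ones are dropped
    return nodes


records = [
    {"email": "Pat@Example.com", "name": "Pat Smith"},
    {"email": "pat@example.com ", "phone": "555-0100"},
    {"email": "lee@example.com", "name": "Lee Wong"},
]

nodes = merge_by_subject(records)
print(len(nodes))
print(nodes["pat@example.com"])
```

The "first value wins" rule is one of the costs mentioned above: conflicting values are silently discarded, and anyone who normalizes on a different key ends up with a different graph.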

Make your own buckyball

Wednesday, October 31st, 2012

Make your own buckyball by John D. Cook.

From the post:

This weekend a couple of my daughters and I put together a buckyball from a Zometool kit. The shape is named for Buckminster Fuller of geodesic dome fame. Two years after Fuller’s death, scientists discovered that the shape appears naturally in the form of a C60 molecule, named Buckminsterfullerene in his honor. In geometric lingo, the shape is a truncated icosahedron. It’s also the shape of many soccer balls.

Don’t be embarrassed to use these at the office.

According to the PR, Roger Penrose does.

Modeling Question: What Happens When Dots Don’t Connect?

Saturday, October 13th, 2012

Working with a data set and have run across a different question than vagueness/possibility of relationships. (See Topic Map Modeling of Sequestration Data (Help Pls!) if you want to help with that one.)

What if when analyzing the data I determine there is no association between two subjects?

I am assuming that if there is no association, there are no roles at play.

How do I record the absence of the association?

I don’t want to trust the next user will “notice” the absence of the association.
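
One possible approach, sketched in Python and not drawn from any topic map standard, is to record the finding itself, so that "examined and found nothing" is distinguishable from "never examined". The status names, subjects and helper functions are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    ASSERTED = "asserted"      # evidence supports the association
    SUSPECTED = "suspected"    # believed but unproven
    NONE_FOUND = "none found"  # examined the data, found no association


# Keyed by an unordered pair of subjects so lookup order does not matter.
findings = {}


def record(a, b, status, note=""):
    """Store the outcome of examining the pair (a, b)."""
    findings[frozenset((a, b))] = (status, note)


def lookup(a, b):
    """Return the recorded finding, or None if the pair was never examined."""
    return findings.get(frozenset((a, b)))


record("XYZ Corp", "Agent A", Status.NONE_FOUND, "no agent named in the charge")
record("Spouse 1", "Third party", Status.SUSPECTED, "no proof")

print(lookup("Agent A", "XYZ Corp"))   # an explicit negative finding
print(lookup("XYZ Corp", "Agent B"))   # never examined
```

The next user no longer has to notice an absence: a negative finding is a record like any other, with a note saying who looked and why nothing was found.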

A couple of use cases come to mind:

I suspect there is an association but have no proof. The cheating husband/wife scenario. (I suppose there I would know the “roles.”)

What about corporations or large organizations? Allegations are made but no connection to identifiable actors.

Corporations act only through agents. A charge that names the responsible agents is different from a general allegation.

How do I distinguish those? Or make it clear no agent has been named?

Wouldn’t that be interesting?

We read now: XYZ corporation pleaded guilty to government contract fraud.

We could read: A, B, and C, XYZ corporation and L, M, N, government contract officers managed the XYZ government contract. XYZ pleaded guilty to contract fraud and was fined $.

Could keep better score on private and public employees that keep turning up in contract fraud cases.

One test for transparency is accountability.

No accountability, no transparency.

PostgreSQL Database Modeler

Thursday, October 4th, 2012

PostgreSQL Database Modeler

From the readme file at github:

PostgreSQL Database Modeler, or simply, pgModeler is an open source tool for modeling databases that merges the classical concepts of entity-relationship diagrams with specific features that only PostgreSQL implements. The pgModeler translates the models created by the user to SQL code and apply them onto database clusters from version 8.0 to 9.1.

Other modeling tools you have or are likely to encounter writing topic maps?

When the output of diverse modeling tools or diverse output from the same modeling tool needs semantic reconciliation, I would turn to topic maps.

I first saw this at DZone.

Topic Map Modeling of Sequestration Data (Help Pls!)

Saturday, September 29th, 2012

With the political noise in the United States over presidential and other elections, it is easy to lose sight of a looming “sequestration” that on January 2, 2013 will result in:

10.0% reduction non-exempt defense mandatory funding
9.4% reduction non-exempt defense discretionary funding
8.2% reduction non-exempt nondefense discretionary funding
7.6% reduction non-exempt nondefense mandatory funding
2.0% reduction Medicare

The report is not a model of clarity/transparency. See: U.S. Sequestration Report – Out of the Shadows/Into the Light?

Report caveats make it clear cited amounts are fanciful estimates that can change radically as more information becomes available.

Be that as it may, a topic map based on the reported accounts as topics can capture the present day conjectures. To say nothing of capturing future revelations of exact details.

Whether from sequestration or from efforts to avoid sequestration.

Tracking/transparency has to start somewhere and it may as well be here.

In evaluating the data for creation of a topic map, I have encountered an entry with a topic map modeling issue.

I could really use your help.

Here is the entry in question:

Department of Health and Human Services, Health Resources and Services Administration, 009-15-0350, Health Resources and Services, Nondefense Function, Mandatory (page 80 of Appendix A, page 92 of the pdf of the report):

BA Type                         BA Amount  Sequester Percentage  Sequester Amount
Sequestrable BA                       514                   7.6                39
Sequestrable BA – special rule       1352                   2.0                27
Exempt BA                              10
Total Gross BA                       1876
Offsets                               -16
Net BA                               1860

If it read as follows, no problem.

Example: Not Accurate

BA Type                         BA Amount  Sequester Percentage  Sequester Amount
Sequestrable BA                       514                   7.6                39
Sequestrable BA – special rule       1352                   2.0                27
Total Gross BA                       1876

Because there is no relationship between “Exempt BA” and “Offsets” to either “Sequestrable BA” or “Sequestrable BA – special rule.” I just report both of them with the percentages and total amounts to be withheld.

True, the percentages don’t change, nor does the amount to be withheld change, because of the “Exempt BA” or the “Offsets.” (Trusting soul that I am, I did verify the calculations. ;-) )
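
For anyone who wants to repeat the check, a short Python sketch follows. The figures are copied from the entry quoted earlier; rounding sequester amounts to the nearest whole unit is my assumption about how the report arrived at its numbers.

```python
# Figures copied from the entry: (label, BA amount, percentage, reported amount).
rows = [
    ("Sequestrable BA", 514, 7.6, 39),
    ("Sequestrable BA - special rule", 1352, 2.0, 27),
]
exempt_ba = 10
total_gross_ba = 1876
offsets = -16
net_ba = 1860

# Assumption: sequester amounts are rounded to the nearest whole unit.
for label, amount, percent, reported in rows:
    computed = round(amount * percent / 100)
    print(label, computed, computed == reported)

# Gross = both sequestrable lines plus exempt; net = gross plus (negative) offsets.
gross_ok = sum(amount for _, amount, _, _ in rows) + exempt_ba == total_gross_ba
net_ok = total_gross_ba + offsets == net_ba
print(gross_ok, net_ok)
```

The sequester amounts depend only on the two sequestrable lines, while "Exempt BA" and "Offsets" enter only the gross and net totals, which is exactly why their relationship to the sequestrable lines is hard to model.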

Problem: How do I represent the relationship between the “Exempt BA” and “Offsets” to either/or/both “Sequestrable BA,” “Sequestrable BA – special rule?”

Of the 1318 entries in Appendix A of this report, including this one, it is the only entry with this issue. (A number of accounts are split into discretionary/mandatory parts. I am counting each part as a separate “entry.”)

If I ignore “Exempt BA” and “Offsets” in this case, my topic map is an incomplete representation of Appendix A.

It is also the case that I want to represent the information “as written.” There may be some external explanation that clarifies this entry, but that would be an “addition” to the original topic map.

Suggestions?

“how hard can this be?” (Data and Reality)

Saturday, September 8th, 2012

Books that Influenced my Thinking: Kent’s Data and Reality by Thomas Redman.

From the post:

It was the rumor that Steve Hoberman (Technics Publications) planned to reissue Data and Reality by William Kent that led me to use this space to review books that had influenced my thinking about data and data quality. My plan had been to do the review of Data and Reality as soon as it came out. I completely missed the boat – it has been out for some six months.

I first read Data and Reality as we struggled at Bell Labs to develop a definition of data that would prove useful for data quality. While I knew philosophers had debated the merits of various approaches for thousands of years, I still thought “how hard can this be?” About twenty minutes with Kent’s book convinced me. This is really tough.
….

Amazon reports Data and Reality (3rd edition) as 200 pages long.

Looking at a hard copy I see:

  • Prefaces 17-34
  • Chapter 1 Entities 35-54
  • Chapter 2 The Nature of an Information System 55-67
  • Chapter 3 Naming 69-86
  • Chapter 4 Relationships 87-98
  • Chapter 5 Attributes 99-107
  • Chapter 6 Types and Categories and Sets 109-117
  • Chapter 7 Models 119-123
  • Chapter 8 The Record Model 125-137
  • Chapter 9 Philosophy 139-150
  • Bibliography 151-159
  • Index 161-162

Way less than the 200 pages promised by Amazon.

To ask a slightly different question:

“How hard can it be” to teach building data models?

A hard problem with no fixed solution?

Suggestions?

Does category theory make you a better programmer?

Thursday, August 2nd, 2012

Does category theory make you a better programmer? by Debasish Ghosh.

From the post:

How much of category theory knowledge should a working programmer have ? I guess this depends on what kind of language the programmer uses in his daily life. Given the proliferation of functional languages today, specifically typed functional languages (Haskell, Scala etc.) that embeds the typed lambda calculus in some form or the other, the question looks relevant to me. And apparently to a few others as well. In one of his courses on Category Theory, Graham Hutton mentioned the following points when talking about the usefulness of the theory :

  • Building bridges – exploring relationships between various mathematical objects, e.g., Products and Function
  • Unifying ideas – abstracting from unnecessary details to give general definitions and results, e.g., Functors
  • High level language – focusing on how things behave rather than what their implementation details are, e.g., specification vs implementation
  • Type safety – using types to ensure that things are combined only in sensible ways, e.g., (f: A -> B, g: B -> C) => (g o f: A -> C)
  • Equational proofs – performing proofs in a purely equational style of reasoning

Many of the above points can be related to the experience that we encounter while programming in a functional language today. We use Product and Sum types, we use Functors to abstract our computation, we marry types together to encode domain logic within the structures that we build and many of us use equational reasoning to optimize algorithms and data structures.

But how much do we need to care about how category theory models these structures and how that model maps to the ones that we use in our programming model ?

Read the post for Debasish’s answer for programmers.
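The “type safety” point in the quoted list can be made concrete. This is a minimal sketch in Python type hints, not code from Debasish’s post or Hutton’s course; `compose`, `parse` and `is_even` are hypothetical names chosen for illustration.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    """Return g o f: apply f first, then g."""
    def composed(x: A) -> C:
        return g(f(x))
    return composed


# Hypothetical example functions, chosen only for illustration.
def parse(text: str) -> int:        # f: str -> int
    return int(text)


def is_even(number: int) -> bool:   # g: int -> bool
    return number % 2 == 0


# g o f: str -> bool. The output type of f matches the input type of g.
parse_then_check = compose(is_even, parse)
print(parse_then_check("42"))
```

Python does not enforce these hints at run time. The “only in sensible ways” part comes from a static type checker, which should reject `compose(parse, is_even)` because the output type of `is_even` does not match the input type of `parse`.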

For topic map authors, remember category theory began as an effort to find commonalities between abstract mathematical structures.

Commonalities? That sounds a lot like subject sameness, doesn’t it?

With category theory you can describe, model, uncover commonalities in mathematical structures and commonalities in other areas as well.

A two for one as it were. Sounds worthwhile to me.

I first saw this at DZone.

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review

Tuesday, July 31st, 2012

Ignorance by Stuart Firestein; It’s Not Rocket Science by Ben Miller – review by Adam Rutherford

From the review, speaking of “Ignorance” by Stuart Firestein, Adam writes:

Stuart Firestein, a teacher and neuroscientist, has written a splendid and admirably short book about the pleasure of finding things out using the scientific method. He smartly outlines how science works in reality rather than in stereotype. His MacGuffin – the plot device to explore what science is – is ignorance, on which he runs a course at Columbia University in New York. Although the word “science” is derived from the Latin scire (to know), this misrepresents why it is the foundation and deliverer of civilisation. Science is to not know but have a method to find out. It is a way of knowing.

Firestein is also quick to dispel the popular notion of the scientific method, more often than not portrayed as a singular thing enshrined in stone. The scientific method is more of a utility belt for ignorance. Certainly, falsification and inductive reasoning are cornerstones of converting unknowns to knowns. But much published research is not hypothesis-driven, or even experimental, and yet can generate robust knowledge. We also invent, build, take apart, think and simply observe. It is, Firestein says, akin to looking for a black cat in a darkened room, with no guarantee the moggy is even present. But the structure of ignorance is crucial, and not merely blind feline fumbling.

The size of your questions is important, and will be determined by how much you know. Therein lies a conundrum of teaching science. Questions based on pure ignorance can be answered with knowledge. Scientific research has to be born of informed ignorance, otherwise you are not finding new stuff out. Packed with real examples and deep practical knowledge, Ignorance is a thoughtful introduction to the nature of knowing, and the joy of curiosity.

Not to slight “It’s Not Rocket Science,” but I am much more sympathetic to discussions of the “…structure of ignorance…” and how we model those structures.

If you are interested in such arguments, consider the Oxford Handbook of Skepticism. I don’t have a copy (you can fix that if you like) but it is reported to have good coverage of the subject of ignorance.

Cambridge Advanced Modeller (CAM)

Tuesday, July 24th, 2012

Cambridge Advanced Modeller (CAM)

From the webpage:

Cambridge Advanced Modeller is a software tool for modelling and analysing the dependencies and flows in complex systems – such as products, processes and organisations. It provides a diagrammer, a simulation tool, and a DSM tool.

CAM is free for research, teaching and evaluation. We only require that you cite our work if you use CAM in support of published work. Commercial evaluation is allowed. Commercial use is subject to non-onerous conditions.

Toolboxes provide several modelling notations and analysis methods. CAM can be configured to develop new modelling notations by specifying the types of element and connection allowed. A modular architecture allows new functionality, such as simulation codes, to be added.

One of the research toolboxes is topic maps! Cool!

Have you used CAM?

Wrinkling Time

Monday, July 23rd, 2012

The post by Dan Brickley that I mentioned earlier today, Dilbert schematics, made me start thinking about more complex time scenarios than serial assignment of cubicles.

Like Hermione Granger and Harry Potter’s adventure in the Prisoner of Azkaban.

For those of you who are vague on the story, Hermione uses a “Time-Turner” to go back in time several hours. As a result, she and Harry must avoid being seen by themselves (and others). Works quite well in the story but what if I wanted to model that narrative in a topic map?

Some issues/questions that occurred to me:

Harry and Hermione are the same subjects they were during the prior time interval. Or are they?

Does a linear notion of time mean they are different subjects?

How would I model their interactions with others, such as Buckbeak, who interacted with both versions (for lack of a better term) of Harry?

Is there a time line running parallel to the “original” time line?

Just curious, what happens if the Time-Turner fails and Harry and Hermione don’t return to the present, ever? That is, their “current” present is forever 3 hours behind their “real” present.

What other time issues, either in literature or elsewhere, seem difficult to model to you?
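One possible reading of the questions above, sketched in Python rather than in a topic map: keep Harry and Hermione as single subjects, but record each of their “appearances” with both a position in their own experience and an interval on the original time line. Everything here is an assumption made for illustration, including the `Appearance` class, its field names and the clock hours. It does not settle whether the two versions should be one subject or two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appearance:
    subject: str         # who appears, e.g. "Harry"
    personal_order: int  # position in the subject's own experience
    world_start: float   # start hour on the "original" time line
    world_end: float     # end hour on the "original" time line


def overlaps(a: Appearance, b: Appearance) -> bool:
    # Two appearances share world time if their intervals intersect.
    return a.world_start < b.world_end and b.world_start < a.world_end


def concurrent_versions(appearances):
    # Pairs of appearances of the same subject that share world time.
    pairs = []
    for i, first in enumerate(appearances):
        for second in appearances[i + 1:]:
            if first.subject == second.subject and overlaps(first, second):
                pairs.append((first, second))
    return pairs


# Hypothetical clock hours; the second pass covers the same three hours.
appearances = [
    Appearance("Harry", 1, 18.0, 21.0),
    Appearance("Harry", 2, 18.0, 21.0),
    Appearance("Hermione", 1, 18.0, 21.0),
    Appearance("Hermione", 2, 18.0, 21.0),
    Appearance("Buckbeak", 1, 18.0, 21.0),
]

pairs = concurrent_versions(appearances)
doubled = sorted({first.subject for first, _ in pairs})
print(doubled)
```

In this sketch Buckbeak has one appearance and can be associated with either appearance of Harry. Whether those associations belong to one subject or two is still the open question.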

neo4j: Handling optional relationships

Wednesday, June 27th, 2012

neo4j: Handling optional relationships by Mark Needham.

From the post:

On my ThoughtWorks neo4j there are now two different types of relationships between people nodes – they can either be colleagues or one can be the sponsor of the other.

Getting the information/relationships “in” wasn’t a problem. Getting the required information back out, that was a different story.

A useful illustration of how establishing the desired result (output in this case) can clarify what needs to be asked.

Don’t jump to the solution. Read the post and write down how you would get the desired results.

I first saw this at DZone’s Neo4j page.

How Do You Define Failure?

Wednesday, June 6th, 2012

… business intelligence implementations are often called failures when they fail to meet the required objectives, lack user acceptance or are only implemented after numerous long delays.

Called failures? Sounds like failures to me. You?

News: The cause of such failures has been discovered:

…an improperly modeled repository not adhering to basic dimensional modeling principles

Really?

I would have said that not having a shared semantic, one shared by all the stakeholders in the project, would be the root cause for most project failures.

I’m not particular about how you achieve that shared semantic. You could use white boards, sticky notes or have people physically act out the system. The important thing is to avoid the assumption that other stakeholders “know what I mean by….” They probably don’t. And several months into the building of data structures, interfaces, etc., is a bad time to find out you assumed incorrectly.

The lack of a shared semantic can result in an “…improperly modeled repository…” but that is much later in the process.

Quotes from: Oracle Expert Shares Implementation Key

Role Modeling

Friday, May 25th, 2012

Role Modeling

From the webpage:

Roles are about objects and how they interact to achieve some purpose. For thirty years I have tried to get them into the main stream, but haven’t succeeded. I believe the reason is that our programming languages are class oriented rather than object oriented. So why model in terms of objects when you cannot program them?

Almost all my documents are about role modeling in one form or another. There are two very useful abstractions on objects. One abstraction classifies objects according to their properties. The other studies how objects work together to achieve one or more of the users’ goals. I have for the past 30 years tried to make our profession aware of this important dichotomy, but have met with very little success. The Object Management Group (OMG) has standardized the Unified Modeling Language, UML. We were members of the core team defining this language and our role modeling became part of the language under the name of Collaborations. Initially, very few people seemed to appreciate the importance of the notion of Collaborations. I thought that this would change when Ivar Jacobson came out with his Use Cases because a role model shows how a system of interacting objects realizes a use case, but it is still heavy going. There are encouraging signs in the concept of Components in the emerging UML version 2.0. Even more encouraging is the ongoing work with Web Services where people and components are in the center of interest while classes are left to the specialists. My current project, BabyUML, binds it all together: algorithms coded as classes + declaration of semantic model + coding of object interaction as collaborations/role models.

The best reference is my book Working With Objects. Out of print, but is still available from some bookshops including Amazon as of January 2010.

You can download the pdf of Working with Objects (version before publication). A substantial savings over the Amazon “new” price of $100+ US.

This webpage has links to a number of resources from Trygve M. H. Reenskaug on role modeling.

I saw this reference in a tweet by Inge Henriksen.

Automated science, deep data and the paradox of information – Data As Story

Saturday, March 31st, 2012

Automated science, deep data and the paradox of information…

Bradley Voytek writes:

A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, “What’s different here? What’s special about these outliers and what do they tell us about our models and assumptions?”

The reason that big data proponents are so excited about the burgeoning data revolution isn’t just because of the math. Don’t get me wrong, the math is fun, but we’re excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That’s big data.

Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It’s not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

I reformulate Bradley’s question into:

We use data to tell stories about ourselves and the universe in which we live.

Which means that his rules of statistical methods:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

are sources of other stories “about ourselves and the universe in which we live.”

If you prefer Bradley’s original question:

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

I would answer: And the difference would be?

“All Models are Right, Most are Useless”

Sunday, March 11th, 2012

“All Models are Right, Most are Useless”

A counter by Thad Tarpey to George Box’s saying: “all models are wrong, some are useful.” Pointer to slides for the presentation.

Covers the fallacy of “reification” (in the modeling sense) among other amusements.

Useful to remember that maps are approximations as well.

Munging, Modeling and Visualizing Data with R

Sunday, January 29th, 2012

Munging, Modeling and Visualizing Data with R by Xavier Léauté.

With a title like that, how could I resist?

From the post:

Yesterday evening Romy Misra from visual.ly invited us to teach an introductory workshop to R for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at Trulia headquarters.

R can be a little daunting for beginners, so I wanted to give everyone a quick overview of its capabilities and enough material to get people started. Most importantly, the objective of this interactive session was to give everyone some time to try out some simple examples that would be useful in the future.

I hope everyone enjoyed learning some fun and easy ways to slice, model and visualize data, and that I piqued their interest enough to start exploring datasets on their own.

Slides and sample scripts follow.

First seen at Christophe Lalanne’s Bag of Tweets for January 2012.

A Task-based Model of Search

Wednesday, December 14th, 2011

A Task-based Model of Search by Tony Russell-Rose.

From the post:

A little while ago I posted an article called Findability is just So Last Year, in which I argued that the current focus (dare I say fixation) of the search community on findability was somewhat limiting, and that in my experience (of enterprise search, at least), there are a great many other types of information-seeking behaviour that aren’t adequately accommodated by the ‘search as findability’ model. I’m talking here about things like analysis, sensemaking, and other problem-solving oriented behaviours.

Now, I’m not the first person to have made this observation (and I doubt I’ll be the last), but it occurs to me that one of the reasons the debate exists in the first place is that the community lacks a shared vocabulary for defining these concepts, and when we each talk about “search tasks” we may actually be referring to quite different things. So to clarify how I see the landscape, I’ve put together the short piece below. More importantly, I’ve tried to connect the conceptual (aka academic) material to current design practice, so that we can see what difference it might make if we had a shared perspective on these things. As always, comments & feedback welcome.

High marks for a start on what are complex and intertwined issues.

Not so much so that we will reach a common vocabulary, but so that we can be clearer about where we get confused when moving from one paradigm to another.