Archive for the ‘Integration’ Category

The challenge of combining 176 x #otherpeoplesdata…

Wednesday, June 10th, 2015

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice but it does mean that we cannot use the original names as recorded as search terms into literature that existed at the time of the original observations.

Normalization of data often, though not necessarily, leads to loss of information.
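Here is a minimal sketch of one way to normalize without losing the name as recorded: keep the original value in its own column and put the converted name beside it, with the conversion table held as data rather than code. The column names, the `NAME_MAP_CSV` table and the sample rows are hypothetical illustrations of mine, not taken from BAAD.

```python
import csv
import io

# Hypothetical conversion table, held as data (it could equally be a CSV file on disk).
NAME_MAP_CSV = """recorded_name,accepted_name
Pinus densiflora S. et Z.,Pinus densiflora
Abies sachalinensis Mast.,Abies sachalinensis
"""

def load_name_map(text):
    """Read the recorded -> accepted name table from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["recorded_name"]: row["accepted_name"] for row in reader}

def normalize(records, name_map):
    """Return new records; the raw records are never modified."""
    out = []
    for rec in records:
        new = dict(rec)  # copy, so the raw input stays untouched
        new["species_recorded"] = rec["species"]  # keep the name as recorded
        # Fall back to the recorded name when no mapping is known.
        new["species"] = name_map.get(rec["species"], rec["species"])
        out.append(new)
    return out

# Illustrative rows only, not real measurements.
raw = [
    {"species": "Pinus densiflora S. et Z.", "height_m": "12.4"},
    {"species": "Quercus serrata", "height_m": "9.1"},
]
clean = normalize(raw, load_name_map(NAME_MAP_CSV))
```

With `species_recorded` carried through the pipeline, the original name remains available as a search term even after conversion.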

I first saw this in a tweet by Dr. Mike Whitfield.

Federal Data Integration: Dengue Fever

Tuesday, April 7th, 2015

The White House issued a press release today (April 7, 2015) titled: FACT SHEET: Administration Announces Actions To Protect Communities From The Impacts Of Climate Change.

That press release reads in part:

Unleashing Data: As part of the Administration’s Predict the Next Pandemic Initiative, in May 2015, an interagency working group co-chaired by OSTP, the CDC, and the Department of Defense will launch a pilot project to simulate efforts to forecast epidemics of dengue – a mosquito-transmitted viral disease affecting millions of people every year, including U.S. travelers and residents of the tropical regions of the U.S. such as Puerto Rico. The pilot project will consolidate data sets from across the federal government and academia on the environment, disease incidence, and weather, and challenge the research and modeling community to develop predictive models for dengue and other infectious diseases based on those datasets. In August 2015, OSTP plans to convene a meeting to evaluate resulting models and showcase this effort as a “proof-of-concept” for similar forecasting efforts for other infectious diseases.

I tried finding more details on earlier workshops in this effort, but limiting the search to “Predict the Next Pandemic Initiative” and the domain to “.gov” returned only two “hits,” one of which was the press release I cite above.

I sent a message (webform) to the White House Office of Science and Technology Policy office and will update you with any additional information that arrives.

Of course my curiosity is about the means used to integrate the data sets. Once integrated, such data sets can be re-used, at least until it is time to integrate additional data sets. Bearing in mind that dirty data can lead to poor decision making, I would rather not duplicate the cleaning of data time after time.

Linked Data Integration with Conflicts

Tuesday, November 11th, 2014

Linked Data Integration with Conflicts by Jan Michelfeit, Tomáš Knap, Martin Nečaský.


Linked Data have emerged as a successful publication format and one of its main strengths is its fitness for integration of data from multiple sources. This gives them a great potential both for semantic applications and the enterprise environment where data integration is crucial. Linked Data integration poses new challenges, however, and new algorithms and tools covering all steps of the integration process need to be developed. This paper explores Linked Data integration and its specifics. We focus on data fusion and conflict resolution: two novel algorithms for Linked Data fusion with provenance tracking and quality assessment of fused data are proposed. The algorithms are implemented as part of the ODCleanStore framework and evaluated on real Linked Open Data.

Conflicts in Linked Data? The authors explain:

From the paper:

The contribution of this paper covers the data fusion phase with conflict resolution and a conflict-aware quality assessment of fused data. We present new algorithms that are implemented in ODCleanStore and are also available as a standalone tool ODCS-FusionTool.

Data fusion is the step where actual data merging happens – multiple records representing the same real-world object are combined into a single, consistent, and clean representation [3]. In order to fulfill this definition, we need to establish a representation of a record, purge uncertain or low-quality values, and resolve identity and other conflicts. Therefore we regard conflict resolution as a subtask of data fusion.

Conflicts in data emerge during the data fusion phase and can be classified as schema, identity, and data conflicts. Schema conflicts are caused by different source data schemata – different attribute names, data representations (e.g., one or two attributes for name and surname), or semantics (e.g., units). Identity conflicts are a result of different identifiers used for the same real-world objects. Finally, data conflicts occur when different conflicting values exist for an attribute of one object.

Conflict can be resolved on entity or attribute level by a resolution function. Resolution functions can be classified as deciding functions, which can only choose values from the input such as the maximum value, or mediating functions, which may produce new values such as average or sum [3].
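To make the deciding/mediating distinction concrete, here is a minimal sketch of my own. It is not the ODCS-FusionTool API; the function names and the sample values are hypothetical.

```python
from statistics import mean

# Deciding functions: the result is always one of the input values.
def decide_max(values):
    return max(values)

def decide_most_frequent(values):
    # Ties are broken by first occurrence in the input.
    return max(values, key=values.count)

# Mediating functions: the result may be a value absent from the input.
def mediate_average(values):
    return mean(values)

def mediate_sum(values):
    return sum(values)

def resolve(values, function):
    """Apply a resolution function to the conflicting values of one attribute."""
    if not values:
        raise ValueError("nothing to resolve")
    return function(values)

# Hypothetical conflicting values for a single attribute of one object.
populations = [3400000, 3500000, 3400000]

largest = resolve(populations, decide_max)
most_common = resolve(populations, decide_most_frequent)
averaged = resolve(populations, mediate_average)
```

Note that `largest` and `most_common` are both drawn from the input list, while `averaged` is a new value that no source ever asserted.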

Oh, so the semantic diversity of data simply flowed into Linked Data representation.

Hmmm, watch for a basis in the data for resolving schema, identity and data conflicts.

The related work section is particularly rich with references to non-Linked Data conflict resolution projects. Definitely worth a close read and chasing the references.

To examine the data fusion and conflict resolution algorithm the authors start by restating the problem:

  1. Different identifying URIs are used to represent the same real-world entities.
  2. Different schemata are used to describe data.
  3. Data conflicts emerge when RDF triples sharing the same subject and predicate have inconsistent values in place of the object.
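The third point can be sketched in a few lines. This is my own illustration, not the authors' algorithm: it assumes the identity and schema conflicts from points 1 and 2 have already been resolved, and the `ex:` identifiers and values are hypothetical.

```python
from collections import defaultdict

def find_data_conflicts(triples):
    """Group objects by (subject, predicate) and keep groups with more than one distinct object."""
    objects = defaultdict(set)
    for subject, predicate, obj in triples:
        objects[(subject, predicate)].add(obj)
    return {key: vals for key, vals in objects.items() if len(vals) > 1}

# Hypothetical triples, written as plain tuples rather than real RDF.
triples = [
    ("ex:Berlin", "ex:population", "3400000"),
    ("ex:Berlin", "ex:population", "3500000"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
]
conflicts = find_data_conflicts(triples)
```

A check this simple also flags predicates that are legitimately multi-valued, so something outside the triples has to say which predicates are expected to carry a single value.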

I am skipping all the notation manipulation for the quads, etc., mostly because of the inputs into the algorithm:


As a result of human intervention, the different identifying URIs have been mapped together. Not to mention the weighting of the metadata and the desired resolution for data conflicts (location data).

With that intervention, the complex RDF notation and manipulation becomes irrelevant.

Moreover, as I am sure you are aware, there is more than one “Berlin” listed in DBpedia. Several dozen as I recall.

I mention that because the process as described does not say where the authors of the rules/mappings obtained the information necessary to distinguish one Berlin from another.

That is critical for another author to evaluate the correctness of their mappings.

At the end of the day, after the “resolution” proposed by the authors, we are in no better position to map their result to another than we were at the outset. We have bald statements with no additional data on which to evaluate those statements.

Give Appendix A. List of Conflict Resolution Functions, a close read. The authors have extracted conflict resolution functions from the literature. Should be a time saver as well as suggestive of other needed resolution functions.

PS: If you look for ODCS-FusionTool you will find LD-Fusion Tool (GitHub), which was renamed to ODCS-FusionTool a year ago. See also the official LD-FusionTool webpage.

Build Roads not Stagecoaches

Friday, July 18th, 2014

Build Roads not Stagecoaches by Martin Fenner.

Describing Eric Hysen’s keynote, Martin says:

In his keynote he described how travel from Cambridge to London in the 18th and early 19th century improved mainly as a result of better roads, made possible by changes in how these roads are financed. Translated to today, he urged the audience to think more about the infrastructure and less about the end products:

Ecosystems, not apps

— Eric Hysen

On Tuesday at csv,conf, Nick Stenning – Technical Director of the Open Knowledge Foundation – talked about data packages, an evolving standard to describe data that are passed around between different systems. He used the metaphor of containers, and how they have dramatically changed the transportation of goods in the last 50 years. He argued that the cost of shipping was in large part determined by the cost of loading and unloading, and the container has dramatically changed that equation. We are in a very similar situation with datasets, where most of the time is spent translating between different formats, joining things together that use different names for the same thing [emphasis added], etc.

…different names for the same thing.

Have you heard that before? 😉

But here is the irony:

When I thought more about this I realized that these building blocks are exactly the projects I get most excited about, i.e. projects that develop standards or provide APIs or libraries. Some examples would be

  • ORCID: unique identifiers for scholarly authors

OK, but many authors already have unique identifiers in DBLP, Library of Congress, Twitter, and at places I have not listed.

Nothing against ORCID, but adding yet another identifier isn’t all that helpful.

A mapping between identifiers, so having one means I can leverage the others, now that is what I call infrastructure.
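A minimal sketch of what I mean, assuming someone has already done the work of recording which identifiers belong to the same author. The scheme names, the record and every identifier in it are made up for illustration.

```python
def build_crosswalk(records):
    """Index every (scheme, identifier) pair to the full record it appears in."""
    index = {}
    for record in records:
        for scheme, identifier in record.items():
            index[(scheme, identifier)] = record
    return index

def lookup(index, scheme, identifier, wanted):
    """Given one identifier, return the identifier in another scheme, or None."""
    record = index.get((scheme, identifier))
    return record.get(wanted) if record else None

# Hypothetical author record; none of these identifiers is real.
records = [
    {"orcid": "0000-0000-0000-0000", "dblp": "x/JaneExample", "twitter": "@jane_example"},
]
index = build_crosswalk(records)

orcid_from_twitter = lookup(index, "twitter", "@jane_example", "orcid")
dblp_from_orcid = lookup(index, "orcid", "0000-0000-0000-0000", "dblp")
```

The lookup works from any identifier to any other, so no single scheme has to be treated as the primary one.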


…[S]emantically enriched open pharmacological space…

Wednesday, July 16th, 2014

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)


Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

Slaying Data Silos?

Tuesday, July 1st, 2014

Krishnan Subramanian’s Modern Enterprise: Slaying the Silos with Data Virtualization keeps coming up in my Twitter feed.

In speaking of breaking down data silos, Krishnan says:

A much better approach to solving this problem is abstraction through data virtualization. It is a powerful tool, well suited for the loose coupling approach prescribed by the Modern Enterprise Model. Data virtualization helps applications retrieve and manipulate data without needing to know technical details about each data store. When implemented, organizational data can be easily accessed using a simple REST API.

Data Virtualization (or an abstracted Database as a Service) plugs into the Modern Enterprise Platform as a higher-order layer, offering the following advantages:

  • Better business decisions due to organization wide accessibility of all data
  • Higher organizational agility
  • Loosely coupled services making future proofing easier
  • Lower cost

I find that troubling because there is no mention of data integration.

In fact, in more balanced coverage of data virtualization, which recites the same advantages as Krishnan, we read:

For some reason there are those who sell virtualization software and cloud computing enablement platforms who imply that data integration is something that comes along for the ride. However, nothing gets less complex and data integration still needs to occur between the virtualized data stores as if they existed on their own machines. They are still storing data in different physical data structures, and the data must be moved or copied, and the difference with the physical data structures dealt with, as well as data quality, data integrity, data validation, data cleaning, etc. (The Pros and Cons of Data Virtualization)

Krishnan begins his post:

There’s a belief that cloud computing breaks down silos inside enterprises. Yes, the use of cloud and DevOps breaks down organizational silos between different teams but it only solves part of the problem. The bigger problem is silos between data sources. Data silos, as I would like to refer the problem, is the biggest bottlenecks enterprises face as they try to modernize their IT infrastructure. As I advocate the Modern Enterprise Model, many people ask me what problems they’ll face if they embrace it. Today I’ll do a quick post to address this question at a more conceptual level, without getting into the details.

If data silos are the biggest bottleneck enterprises face, why is the means to address that, data integration, a detail?

Every hand-waving approach to data integration fuels unrealistic expectations, even among people who should know better.

There are no free lunches and there are no free avenues for data integration.

Archive integration at Mattilsynet

Saturday, June 21st, 2014

Archive integration at Mattilsynet by Lars Marius Garshol (slides)

In addition to being on the path to become a prominent beer expert, Lars Marius has long been involved in integration technologies in general and topic maps in particular.

These slides give a quick overview of a current integration project.

There is one point Lars makes that merits special attention:

No hard bindings from code to data model

  • code should have no knowledge of the data model
  • all data model-specific logic should be configuration
  • makes data changes much easier to handle

(slide 4)

Keep that in mind when evaluating ETL solutions. What is being hard coded?
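As a rough sketch of the slide's point (my own, not code from Lars's project), the transform below knows nothing about the data model: field names and types live entirely in a configuration. The `CONFIG` contents, source field names and sample row are hypothetical.

```python
import json

# Hypothetical configuration: all knowledge of the data model lives here.
CONFIG = json.loads("""
{
  "fields": {
    "id":    {"source": "ArchiveID", "type": "str"},
    "title": {"source": "Tittel",    "type": "str"},
    "pages": {"source": "Sider",     "type": "int"}
  }
}
""")

# The only fixed vocabulary in the code is the set of supported type names.
CASTS = {"str": str, "int": int}

def transform(row, config):
    """Map a source row to the target shape using only the configuration."""
    result = {}
    for target, spec in config["fields"].items():
        cast = CASTS[spec["type"]]
        result[target] = cast(row[spec["source"]])
    return result

# Made-up source row.
row = {"ArchiveID": "A-17", "Tittel": "Inspection report", "Sider": "12"}
converted = transform(row, CONFIG)
```

Renaming a source field or adding a new one then means editing the configuration, not the code, which is a useful test to apply to any ETL tool.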

PS: I was amused that Lars describes RDF as “Essentially a graph database….” True but the W3C started marketing that claim only after graph databases had a surge in popularity.

Markup editors are manipulating directed acyclic graphs so I suppose they are graph editors as well. 😉

Talend 5.5 (DIY Data Integration)

Tuesday, June 3rd, 2014

Talend Increases Big Data Integration Performance and Scalability by 45 Percent

From the post:

Only Talend 5.5 allows developers to generate high performance Hadoop code without needing to be an expert in MapReduce or Pig

(BUSINESS WIRE)–Hadoop Summit — Talend, the global big data integration software leader, today announced the availability of Talend version 5.5, the latest release of the only integration platform optimized to deliver the highest performance on all leading Hadoop distributions.

Talend 5.5 enhances Talend’s performance and scalability on Hadoop by an average of 45 percent. Adoption of Hadoop is skyrocketing and companies large and small are struggling to find enough knowledgeable Hadoop developers to meet this growing demand. Only Talend 5.5 allows any data integration developer to use a visual development environment to generate native, high performance and highly scalable Hadoop code. This unlocks a large pool of development resources that can now contribute to big data projects. In addition, Talend is staying on the cutting edge of new developments in Hadoop that allow big data analytics projects to power real-time customer interactions.


Version 5.5 of all Talend open source products is available for immediate download from Talend’s website. Experimental support for Spark code generation is also available immediately and can be downloaded from the Talend Exchange. Version 5.5 of the commercial subscription products will be available within 3 weeks and will be provided to all existing Talend customers as part of their subscription agreement. Products can also be procured through the usual Talend representatives and partners.

To learn more about Talend 5.5 with 45 percent faster Big Data integration Performance register here for our June 10 webinar.

When you think of the centuries it took to go from a movable type press to modern word processing and near professional printing/binding capabilities, enabling users to perform data processing/integration is nothing short of amazing.

Data scientists need not fear DIY data processing/integration any more than your local bar association fears “How to Avoid Probate” books on the news stand.

I don’t doubt people will be able to get some answer out of data crunching software but did they get a useful answer? Or an answer sufficient to set company policy? Or an answer that will increase their bottom line?

Encourage the use of open source software. Non-clients who use it poorly will fail. Make sure they can’t say the same about your clients.

BTW, the webinar appears to be scheduled for thirty (30) minutes. Thirty minutes on Talend 5.5? You will be better off spending that thirty minutes with Talend 5.5.

CIDOC Conceptual Reference Model

Saturday, February 22nd, 2014

CIDOC Conceptual Reference Model (pdf)

From the “Definition of the CIDOC Conceptual Reference Model:”

This document is the formal definition of the CIDOC Conceptual Reference Model (“CRM”), a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information. The CRM is the culmination of more than a decade of standards development work by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Work on the CRM itself began in 1996 under the auspices of the ICOM-CIDOC Documentation Standards Working Group. Since 2000, development of the CRM has been officially delegated by ICOM-CIDOC to the CIDOC CRM Special Interest Group, which collaborates with the ISO working group ISO/TC46/SC4/WG9 to bring the CRM to the form and status of an International Standard.

Objectives of the CIDOC CRM

The primary role of the CRM is to enable information exchange and integration between heterogeneous sources of cultural heritage information. It aims at providing the semantic definitions and clarifications needed to transform disparate, localised information sources into a coherent global resource, be it within a larger institution, in intranets or on the Internet. Its perspective is supra-institutional and abstracted from any specific local context. This goal determines the constructs and level of detail of the CRM.

More specifically, it defines and is restricted to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation in terms of a formal ontology. It does not define any of the terminology appearing typically as data in the respective data structures; however it foresees the characteristic relationships for its use. It does not aim at proposing what cultural institutions should document. Rather it explains the logic of what they actually currently document, and thereby enables semantic interoperability.

It intends to provide a model of the intellectual structure of cultural documentation in logical terms. As such, it is not optimised for implementation-specific storage and processing aspects. Implementations may lead to solutions where elements and links between relevant elements of our conceptualizations are no longer explicit in a database or other structured storage system. For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

The CRM aims to support the following specific functionalities:

  • Inform developers of information systems as a guide to good practice in conceptual modelling, in order to effectively structure and relate information assets of cultural documentation.
  • Serve as a common language for domain experts and IT developers to formulate requirements and to agree on system functionalities with respect to the correct handling of cultural contents.
  • To serve as a formal language for the identification of common information contents in different data formats; in particular to support the implementation of automatic data transformation algorithms from local to global data structures without loss of meaning. The latter being useful for data exchange, data migration from legacy systems, data information integration and mediation of heterogeneous sources.
  • To support associative queries against integrated resources by providing a global model of the basic classes and their associations to formulate such queries.
  • It is further believed, that advanced natural language algorithms and case-specific heuristics can take significant advantage of the CRM to resolve free text information into a formal logical form, if that is regarded beneficial. The CRM is however not thought to be a means to replace scholarly text, rich in meaning, by logical forms, but only a means to identify related data.

(emphasis in original)

Apologies for the long quote but this covers a number of important topic map issues.

For example:

For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

In topic map terms I would say that the database omits a topic to represent “birth event” and therefore there is no role player for an association with the various role players. What subjects will have representatives in a topic map is always a concern for topic map authors.

Helpfully, CIDOC explicitly separates the semantics it documents from data structures.

Less helpfully:

Because the CRM’s primary role is the meaningful integration of information in an Open World, it aims to be monotonic in the sense of Domain Theory. That is, the existing CRM constructs and the deductions made from them must always remain valid and well-formed, even as new constructs are added by extensions to the CRM.

Which restricts integration using CRM to systems where CRM is the primary basis for integration, as opposed to being one way to integrate several data sets.

That may not seem important in “web time,” where 3 months equals 1 Internet year. But when you think of integrating data and integration practices as they evolve over decades if not centuries, the limitations of monotonic choices come to the fore.

To take one practical discussion under way: how to handle warnings about radioactive waste, which must endure anywhere from 10,000 to 1,000,000 years? A far simpler task than preserving semantics over centuries.

If you think that is easy, remember that lots of people saw the pyramids of Egypt being built. But it was such common knowledge, that no one thought to write it down.

Preservation of semantics is a daunting task.

CIDOC merits a slow read by anyone interested in modeling, semantics, vocabularies, and preservation.

PS: CIDOC: Conceptual Reference Model as a Word file.

Big Data: Main Research/Business Challenges Ahead?

Wednesday, November 20th, 2013

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner by Roberto V. Zicari.

In case you don’t know, Jochen L. Leidner has the title: “Lead Scientist, of the London R&D at Thomson Reuters.”

Which goes a long way to explaining the importance of this Q&A exchange:

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?

Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need to have a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and to different parties).

How are you planning to address:

  1. The required information is not available in the system. A semantic 404 as it were. To be distinguished from the case where it is there but the wrong search terms are in use.
  2. Low findability.
  3. Information integration (not normalization)
  4. System adaptability/personalization, but to users and not developers.
  5. Search consistency, same result tomorrow as today.


The rest of the interview is more than worth your time.

I singled out the research/business challenges as a possible map forward.

We all know where we have been.

Integrating the Biological Universe

Monday, November 4th, 2013

Integrating the Biological Universe by Yasset Perez-Riverol & Roberto Vera.

From the post:

Integrating biological data is perhaps one of the most daunting tasks any bioinformatician has to face. From a cursory look, it is easy to see two major obstacles standing in the way: (i) the sheer amount of existing data, and (ii) the staggering variety of resources and data types used by the different groups working in the field (reviewed at [1]). In fact, the topic of data integration has a long-standing history in computational biology and bioinformatics. A comprehensive picture of this problem can be found in recent papers [2], but this short comment will serve to illustrate some of the hurdles of data integration and as a not-so-shameless plug for our contribution towards a solution.

“Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue.” [1]

Which led me to:

JBioWH: an open-source Java framework for bioinformatics data integration:


The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at A MySQL server is available for demonstration purposes at




Integrating with Apache Camel

Thursday, September 26th, 2013

Integrating with Apache Camel by Charles Moulliard.

From the post:

Since its creation by the Apache community in 2007, the open source integration framework Apache Camel has become a developer favourite. It is recognised as a key technology to design SOA / Integration projects and address complex enterprise integration use cases. This article, the first part of a series, will reveal how the framework generates, from the Domain Specific Language, routes where exchanges take place, how they are processed according to the patterns chosen, and finally how integration occurs.

This series will be a good basis to continue on to ‘Enterprise Integration Patterns’ and compare that to topic maps.

How should topic maps be modified (if at all) to fit into enterprise integration patterns?

Apache Camel tunes its core with new release

Friday, September 20th, 2013

Apache Camel tunes its core with new release by Lucy Carey.

From the post:

The community around open-source integration framework Apache Camel is a prolific little hub, and in the space of just four and a half months, has put together a shiny new release – Apache Camel 2.12 – the 53rd Camel version to date.

On the menu for developers is a total of 17 new components, four new examples, and souped-up performance in simple or bean languages and general routing. More than three hundred JIRA tickets have been solved, and a lot of bug swatting and general fine tuning has taken place. Reflecting the hugely active community around the platform, around half of these new components come courtesy of external contributors, and the rest from Camel team developers.

Fulltime Apache Camel committer Claus Ibsen notes in his blog that this is the first release where steps have been taken to “allow Camel components documentation in the source code which gets generated and included in the binaries.” He also writes that “a Camel component can offer endpoint completion which allows tooling to offer smart completion”, citing the hawtio web console as an example of the ways in which this enables functions like auto completion for JMS queue names, file directory names, bean names in the registry.

Camel homepage.

If you are looking for a variety of explanations about Camel, the Camel homepage recommends a discussion at Stack Overflow.

Not quite the blind men with the elephant but enough differences in approaches to be amusing.

Interactive Entity Resolution in Relational Data… [NG Topic Map Authoring]

Wednesday, June 5th, 2013

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation by Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, Louis Licamele.


Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to underlying real-world entities, is an important data cleaning process required before accurate visualization or analysis of the data is possible. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during the entity resolution process; it is useful both for the algorithms which determine likely database references to be resolved and for visual analytic tools which support the entity resolution process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe effectively combines relational entity resolution algorithms with a novel network visualization that enables users to make use of an entity’s relational context for making resolution decisions. Since resolution decisions often are interdependent, D-Dupe facilitates understanding this complex process through animations which highlight combined inferences and a history mechanism which allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed the benefits of the relational context visualization on the performance of entity resolution tasks in relational data in terms of time as well as users’ confidence and satisfaction.

Talk about a topic map authoring tool!

Even chains entity resolution decisions together!

Not to be greedy, but interactive data deduplication and integration in Hadoop would be a nice touch. 😉

Software: D-Dupe: A Novel Tool for Interactive Data Deduplication and Integration.
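
To make the "relational context" point concrete, here is a minimal sketch of scoring a candidate match by blending name similarity with overlap in related entities. This is not D-Dupe's algorithm; the record layout, the `alpha` weight, and the sample references are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity of two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_overlap(a, b):
    """Jaccard overlap of two sets of related entities (here, co-authors)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(ref_a, ref_b, alpha=0.5):
    """Blend attribute similarity with relational similarity.

    alpha is a hypothetical weight: 1.0 uses names only, 0.0 uses context only.
    """
    attr = name_similarity(ref_a["name"], ref_b["name"])
    rel = context_overlap(ref_a["coauthors"], ref_b["coauthors"])
    return alpha * attr + (1 - alpha) * rel


# Invented references: 1 and 2 share a co-author, 3 shares none.
refs = [
    {"id": 1, "name": "L. Getoor", "coauthors": {"H. Kang", "M. Bilgic"}},
    {"id": 2, "name": "Lise Getoor", "coauthors": {"H. Kang", "B. Shneiderman"}},
    {"id": 3, "name": "L. Gator", "coauthors": {"A. Smith"}},
]

same_person = match_score(refs[0], refs[1])
different_person = match_score(refs[0], refs[2])
```

With these toy records the shared co-author pulls the first pair ahead of the second, which is the sort of evidence a person reviewing the match would want to see alongside the names.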

Take DMX-h ETL Pre-Release for a Test Drive!

Sunday, June 2nd, 2013

Take DMX-h ETL Pre-Release for a Test Drive! by Keith Kohl.

From the post:

Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition. Several Blog posts last week included why I thought the announcement was cool and also some Hadoop benchmarks on both TeraSort and also running ETL.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive. The test drive is a trial download of our DMX-h ETL software. We have installed our software on our partner Cloudera’s VM (VMware) image complete with the use case accelerators, sample data, documentation and even videos. While the download is a little large (ok, it’s over 3GB), it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).


Then Keith asks later in the post:

The test drive is not your normal download. This is actually a pre-release of our DMX-h ETL product offering. While we have announced our product, it is not generally available (GA) yet…scheduled for end of June. We are offering a download of a product that isn’t even available yet…how many vendors do that?!

Err, lots of them? It’s called a beta/candidate/etc. release?


Marketing quibbles aside, it does sound quite interesting.

In some ways I would like to see the VM release model become more common.

Test driving software should not be an install/configuration learning experience.

That should come after users are interested in the software.

BTW, interesting approach, at least reading the webpages/documentation.

Doesn’t generate code for conversion/ETL so there is no code to maintain. Written against the DMX-h engine.

Need to think about what that means in terms of documenting semantics.

Or reconciling different ETL approaches in the same enterprise.

More to follow.

From data to analysis:… [Data Integration For a Purpose]

Friday, May 24th, 2013

From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language by Wibe A de Jong, Andrew M Walker and Marcus D Hanwell. (Journal of Cheminformatics 2013, 5:25 doi:10.1186/1758-2946-5-25)



Multidisciplinary integrated research requires the ability to couple the diverse sets of data obtained from a range of complex experiments and computer simulations. Integrating data requires semantically rich information. In this paper an end-to-end use of semantically rich data in computational chemistry is demonstrated utilizing the Chemical Markup Language (CML) framework. Semantically rich data is generated by the NWChem computational chemistry software with the FoX library and utilized by the Avogadro molecular editor for analysis and visualization.


The NWChem computational chemistry software has been modified and coupled to the FoX library to write CML compliant XML data files. The FoX library was expanded to represent the lexical input files and molecular orbitals used by the computational chemistry software. Draft dictionary entries and a format for molecular orbitals within CML CompChem were developed. The Avogadro application was extended to read in CML data, and display molecular geometry and electronic structure in the GUI allowing for an end-to-end solution where Avogadro can create input structures, generate input files, NWChem can run the calculation and Avogadro can then read in and analyse the CML output produced. The developments outlined in this paper will be made available in future releases of NWChem, FoX, and Avogadro.


The production of CML compliant XML files for computational chemistry software such as NWChem can be accomplished relatively easily using the FoX library. The CML data can be read in by a newly developed reader in Avogadro and analysed or visualized in various ways. A community-based effort is needed to further develop the CML CompChem convention and dictionary. This will enable the long-term goal of allowing a researcher to run simple “Google-style” searches of chemistry and physics and have the results of computational calculations returned in a comprehensible form alongside articles from the published literature.

Aside from its obvious importance for cheminformatics, I think there is another lesson in this article.

Integration of data required “…semantically rich information…,” but just as importantly, integration was not a goal in and of itself.

Integration was only part of a workflow that had other goals.

No doubt some topic maps are useful as end products of integrated data, but what of cases where integration is part of a workflow?

Think of the non-reusable data integration mappings that are offered by many enterprise integration packages.

DCAT Application Profile for Data Portals in Europe – Final Draft

Wednesday, May 22nd, 2013

DCAT Application Profile for Data Portals in Europe – Final Draft

From the post:

The DCAT Application profile for data portals in Europe (DCAT-AP) is a specification based on the Data Catalogue vocabulary (DCAT) for describing public sector datasets in Europe. Its basic use case is to enable a cross-data portal search for data sets and make public sector data better searchable across borders and sectors. This can be achieved by the exchange of descriptions of data sets among data portals.

This final draft is open for public review until 10 June 2013. Members of the public are invited to download the specification and post their comments directly on this page. To be able to do so you need to be registered and logged in.

If you are interested in integration of data from European data portals, it is worth the time to register, etc.

Not all the data you are going to need to integrate a data set but at least a start in the right direction.

Talend Improves Usability of Big Data…

Wednesday, May 8th, 2013

Talend Improves Usability of Big Data with Release of Integration Platform

Talend today announced the availability of version 5.3 of its next-generation integration platform, a unified environment that scales the integration of data, application and business processes. With version 5.3, Talend allows any integration developer to develop on big data platforms without requiring specific expertise in these areas.

“Hadoop and NoSQL are changing the way people manage and analyze data, but up until now, it has been difficult to work with these technologies. The general lack of skillsets required to manage these new technologies continues to be a significant barrier to mainstream adoption,” said Fabrice Bonan, co-founder and chief technical officer, Talend. “Talend v5.3 delivers on our vision of providing innovative tools that hide the underlying complexity of big data, turning anyone with integration skills into expert big data developers.”

User-Friendly Tools for 100 Percent MapReduce Code

Talend v5.3 generates native Hadoop code and runs data transformations directly inside Hadoop for scalability. By leveraging MapReduce’s architecture for highly distributed data processing, data integration developers can build their jobs on Hadoop without the need for specialist programming skills.

Graphical Mapper for Complex Processes

The new graphical mapping functionality targeting big data, and especially the Pig language, allows developers to graphically build data flows to take source data and transform it using a visual mapper. For Hadoop developers familiar with Pig Latin, this mapper enables them to develop, test and preview their data jobs within a GUI environment.

Additional NoSQL Support

Talend 5.3 adds support for NoSQL databases in its integration solutions, Talend Platform for Big Data and Talend Open Studio for Big Data, with a new set of connectors for Couchbase, CouchDB and Neo4j. Built on Talend’s open source integration technology, Talend Open Studio for Big Data is a powerful and versatile open source solution for big data integration that natively supports Apache Hadoop, including connectors for Hadoop Distributed File System (HDFS), HCatalog, Hive, Oozie, Pig, Sqoop, Cassandra, Hbase and MongoDB – in addition to the more than 450 connectors included natively in the product. The integration of these platforms into Talend’s big data solution enables customers to use these new connectors to migrate and synchronize data between NoSQL databases and all other data stores and systems.

Of particular interest is their data integration package, which reportedly sports 450+ connectors to various data sources.

Unless you are interested in coding all new connectors for the same 450+ data sources.

Increasing Interoperability of Data for Social Good [$100K]

Saturday, March 23rd, 2013

Increasing Interoperability of Data for Social Good

March 4, 2013 through May 7, 2013 11:30 AM PST

Each Winner to Receive $100,000 Grant

Got your attention? Good!

From the notice:

The social sector is full of passion, intuition, deep experience, and unwavering commitment. Increasingly, social change agents from funders to activists, are adding data and information as yet one more tool for decision-making and increasing impact.

But data sets are often isolated, fragmented and hard to use. Many organizations manage data with multiple systems, often due to various requirements from government agencies and private funders. The lack of interoperability between systems leads to wasted time and frustration. Even those who are motivated to use data end up spending more time and effort on gathering, combining, and analyzing data, and less time on applying it to ongoing learning, performance improvement, and smarter decision-making.

It is the combining, linking, and connecting of different “data islands” that turns data into knowledge – knowledge that can ultimately help create positive change in our world. Interoperability is the key to making the whole greater than the sum of its parts. The Bill & Melinda Gates Foundation, in partnership with Liquidnet for Good, is looking for groundbreaking ideas to address this significant, but solvable, problem. See the website for more detail on the challenge and application instructions. Each challenge winner will receive a grant of $100,000.

From the details website:

Through this challenge, we’re looking for game-changing ideas we might never imagine on our own and that could revolutionize the field. In particular, we are looking for ideas that might provide new and innovative ways to address the following:

  • Improving the availability and use of program impact data by bringing together data from multiple organizations operating in the same field and geographical area;
  • Enabling combinations of data through application programming interface (APIs), taxonomy crosswalks, classification systems, middleware, natural language processing, and/or data sharing agreements;
  • Reducing inefficiency for users entering similar information into multiple systems through common web forms, profiles, apps, interfaces, etc.;
  • Creating new value for users trying to pull data from multiple sources;
  • Providing new ways to access and understand more than one data set, for example, through new data visualizations, including mashing up government and other data;
  • Identifying needs and barriers by experimenting with increased interoperability of multiple data sets;
  • Providing ways for people to access information that isn’t normally accessible (for example, using natural language processing to pull and process stories from numerous sources) and combining that information with open data sets.

Successful Proposals Will Include:

  • Identification of specific data sets to be used;
  • Clear, compelling explanation of how the solution increases interoperability;
  • Use case;
  • Description of partnership or collaboration, where applicable;
  • Overview of how solution can be scaled and/or adapted, if it is not already cross-sector in nature;
  • Explanation of why the organization or group submitting the proposal has the capacity to achieve success;
  • A general approach to ongoing sustainability of the effort.

I could not have written a more topic map oriented challenge. You?
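
One of the bullets above mentions taxonomy crosswalks. As a rough sketch of what that means in practice, the following maps each organization's local category codes onto a shared vocabulary before combining counts. The organizations, codes, and numbers are invented for illustration, and the `CROSSWALK` table is a hypothetical stand-in for whatever mapping the partners actually agree on.

```python
# Hypothetical crosswalk: each organization's local category code -> shared term.
CROSSWALK = {
    "org_a": {"HSG": "housing", "EDU": "education"},
    "org_b": {"shelter": "housing", "schools": "education", "food": "nutrition"},
}


def to_shared(org, record):
    """Return a copy of record with its category mapped to the shared vocabulary.

    Unmapped categories are kept and flagged rather than silently dropped.
    """
    local = record["category"]
    shared = CROSSWALK.get(org, {}).get(local)
    return {
        "source": org,
        "local_category": local,
        "category": shared if shared is not None else local,
        "mapped": shared is not None,
        "people_served": record["people_served"],
    }


def combine(datasets):
    """Total people served per shared category, counting mapped records only."""
    totals = {}
    for org, records in datasets.items():
        for record in records:
            row = to_shared(org, record)
            if row["mapped"]:
                totals[row["category"]] = totals.get(row["category"], 0) + row["people_served"]
    return totals


# Invented sample data from two organizations working in the same area.
datasets = {
    "org_a": [{"category": "HSG", "people_served": 120}, {"category": "EDU", "people_served": 80}],
    "org_b": [{"category": "shelter", "people_served": 45}, {"category": "clinics", "people_served": 30}],
}

totals = combine(datasets)
```

Note that the crosswalk itself is where the semantics live. If it is buried in code like this, nobody else can reuse or question it, which is rather the point of the challenge.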

They suggest the usual social data sites:

Integrating Structured and Unstructured Data

Thursday, February 21st, 2013

Integrating Structured and Unstructured Data by David Loshin.

It’s a checklist report but David comes up with useful commentary on the following seven points:

  1. Document clearly defined business use cases.
  2. Employ collaborative tools for the analysis, use, and management of semantic metadata.
  3. Use pattern-based analysis tools for unstructured text.
  4. Build upon methods to derive meaning from content, context, and concept.
  5. Leverage commodity components for performance and scalability.
  6. Manage the data life cycle.
  7. Develop a flexible data architecture.

It’s not going to save you planning time but may keep you from overlooking important issues.

My only quibble is that David doesn’t call out data structures as needing defined and preserved semantics.

Data is a no brainer but the containers of data, dare I say “Hadoop silos,” need to have semantics defined as well.

Data or data containers without defined and preserved semantics are much more costly in the long run.

Both in lost opportunity costs and after the fact integration costs.

Why Most BI Programs Under-Deliver Value

Sunday, February 10th, 2013

Why Most BI Programs Under-Deliver Value by Steve Dine.

From the post:

Business intelligence initiatives have been undertaken by organizations across the globe for more than 25 years, yet according to industry experts between 60 and 65 percent of BI projects and programs fail to deliver on the requirements of their customers.

The impact of this failure reaches far beyond the project investment, from unrealized revenue to increased operating costs. While the exact reasons for failure are often debated, most agree that a lack of business involvement, long delivery cycles and poor data quality lead the list. After all this time, why do organizations continue to struggle with delivering successful BI? The answer lies in the fact that they do a poor job at defining value to the customer and how that value will be delivered given the resource constraints and political complexities in nearly all organizations.

BI is widely considered an umbrella term for data integration, data warehousing, performance management, reporting and analytics. For the vast majority of BI projects, the road to value definition starts with a program or project charter, which is a document that defines the high level requirements and capital justification for the endeavor. In most cases, the capital justification centers on cost savings rather than value generation. This is due to the level of effort required to gather and integrate data across disparate source systems and user developed data stores.

As organizations mature, the number of applications that collect and store data increases. These systems usually contain few common unique identifiers to help identify related records and are often referred to as data silos. They also can capture overlapping data attributes for common organizational entities, such as product and customer. In addition, the data models of these systems are usually highly normalized, which can make them challenging to understand and difficult for data extraction. These factors make cost savings, in the form of reduced labor for data collection, easy targets. Unfortunately, most organizations don’t eliminate employees when a BI solution is implemented; they simply work on different, hopefully more value added, activities. From the start, the road to value is based on a flawed assumption and is destined to under deliver on its proposition.

This post merits a close read, several times.

In particular I like the focus on delivery of value to the customer.

Err, that would be the person paying you to do the work.

Steve promises a follow-up on “lean BI” that focuses on delivering more value than it costs to deliver.

I am inherently suspicious of “lean” or “agile” approaches. I sat on a committee that was assured by three programmers they had improved upon IBM’s programming methodology but declined to share the details.

Their requirements document for a content management system, to be constructed on top of Subversion, was a paragraph in an email.

Fortunately the committee prevailed upon management to tank the project. The programmers persist, management being unable or unwilling to correct past mistakes.

I am sure there are many agile/lean programming projects that deliver well documented, high quality results.

But I don’t start with the assumption that agile/lean or other methodology projects are well documented.

That is a question of fact. One that can be answered.

Refusal to answer due to time or resource constraints is a very bad sign.

I first saw this in a top ten tweets list from KDNuggets.

Seamless Astronomy

Thursday, February 7th, 2013

Seamless Astronomy: Linking scientific data, publications, and communities

From the webpage:

Seamless integration of scientific data and literature

Astronomical data artifacts and publications exist in disjointed repositories. The conceptual relationship that links data and publications is rarely made explicit. In collaboration with ADS and ADSlabs, and through our work in conjunction with the Institute for Quantitative Social Science (IQSS), we are working on developing a platform that allows data and literature to be seamlessly integrated, interlinked, mutually discoverable.


  • ADS All-Sky Survey (ADSASS)
  • Astronomy Dataverse
  • WorldWide Telescope (WWT)
  • Viz-e-Lab
  • Glue
  • Study of the impact of social media and networking sites on scientific dissemination
  • Network analysis and visualization of astronomical research communities
  • Data citation practices in Astronomy
  • Semantic description and annotation of scientific resources

A project with large amounts of data for integration.

Moreover, unlike the U.S. Intelligence Community, they are working towards data integration, not resisting it.

I first saw this in Four short links: 6 February 2013 by Nat Torkington.

ToxPi GUI [Data Recycling]

Sunday, February 3rd, 2013

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)


Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available.


Very cool!

Although, like being able to have a Ford automobile in any color so long as the color was black, you can integrate any data source so long as the format is CSV and the values are numbers. Other restrictions apply as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.
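
A minimal sketch of the recycling idea, assuming a numeric CSV of the general kind the application accepts: the source data stays untouched, and additional identifications live in a separate layer keyed by the name the CSV already uses. The column names, compound names, and registry values below are placeholders invented for illustration, not ToxPi's actual file layout.

```python
import csv
import io

# Stand-in for a numeric CSV file; names and values are invented.
SOURCE_CSV = """chemical,assay_1,assay_2
compound-1,0.42,0.10
compound-2,0.05,0.77
"""

# Identifications kept outside the CSV, keyed by the name the CSV uses.
# The "registry_id" values are placeholders, not real identifiers.
EXTRA_IDS = {
    "compound-1": {"synonyms": ["alpha"], "registry_id": "PLACEHOLDER-1"},
    "compound-2": {"synonyms": [], "registry_id": "PLACEHOLDER-2"},
}


def load_rows(text):
    """Parse the CSV and convert every non-key column to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row.pop("chemical")
        rows.append({"chemical": name, "values": {k: float(v) for k, v in row.items()}})
    return rows


def enrich(rows, extra):
    """Attach extra identifications to each row without changing the source data."""
    return [{**row, "ids": extra.get(row["chemical"], {})} for row in rows]


def find(rows, label):
    """Look a row up by its CSV name or by any recorded synonym."""
    for row in rows:
        if label == row["chemical"] or label in row["ids"].get("synonyms", []):
            return row
    return None


enriched = enrich(load_rows(SOURCE_CSV), EXTRA_IDS)
by_synonym = find(enriched, "alpha")
```

The application keeps reading its CSV as before. Anything that needs to find the same subject under another name consults the extra layer instead of a re-curated copy of the data.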

The Power of Visual Thinking?

Saturday, February 2nd, 2013

The Power of Visual Thinking? by Chuck Hollis.

In describing an infographic about transformation of IT, Chuck says:

It’s an interesting representation of the “IT transformation journey”. Mark’s particular practice involves conducting workshops for transforming IT teams. He needed some better tools, and here you have it.

While there are those out there who might quibble on the details, there’s no argument about its communicative power.

In one simple graphic, it appears to be a very efficient tool to get a large number of stakeholders to conceptualize a shared set of concepts and sequences. And, when it comes to IT transformation, job #1 appears to be getting everyone on the same page on what lies ahead 🙂

Download the graphic here.

I must be one of those people who quibble about details.

A great graphic but such loaded categories that disagreement would be akin to voting against bacon.

Who wants to be plowing with oxen or attacked by tornadoes? Space ships and floating cities await those who adopt a transformed IT infrastructure.

Still, could be useful to give c-suite types as a summary of any technical presentation you make. Call it a “high level” view. 😉

Camel Essential Components

Wednesday, January 16th, 2013

Camel Essential Components by Christian Posta. (new Dzone Refcard)

From the webpage:

What is Apache Camel?

Camel is an open-source, lightweight, integration library that allows your applications to accomplish intelligent routing, message transformation, and protocol mediation using the established Enterprise Integration Patterns and out-of-the-box components with a highly expressive Domain Specific Language (Java, XML, or Scala). With Camel you can implement integration solutions as part of an overarching ESB solution, or as individual routes deployed to any container such as Apache Tomcat, Apache ServiceMix, JBoss AS, or even a stand-alone java process.

Why use Camel?

Camel simplifies systems integrations with an easy-to-use DSL to create routes that clearly identify the integration intentions and endpoints. Camel’s out of the box integration components are modeled after the Enterprise Integration Patterns cataloged in Gregor Hohpe and Bobby Woolf’s book. You can use these EIPs as pre-packaged units, along with any custom processors or external adapters you may need, to easily assemble otherwise complex routing and transformation routes. For example, this route takes an XML message from a queue, does some processing, and publishes to another queue:

No explicit handling of subject identity but that’s what future releases are for. 😉
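
For readers who have not seen the pattern, here is a rough stand-in, in plain Python rather than Camel’s DSL, for a route that takes an XML message from a queue, does some processing, and publishes to another queue. The in-memory queues, the `process` step, and the element names are all assumptions for illustration; none of this is Camel API.

```python
import queue
import xml.etree.ElementTree as ET


def process(xml_text):
    """Example processing step: mark the root element as processed."""
    root = ET.fromstring(xml_text)
    root.set("status", "processed")
    return ET.tostring(root, encoding="unicode")


def run_route(inbound, outbound):
    """Drain the inbound queue, transform each message, publish to outbound."""
    handled = 0
    while True:
        try:
            message = inbound.get_nowait()
        except queue.Empty:
            return handled
        outbound.put(process(message))
        handled += 1


# In-memory queues stand in for real message broker endpoints.
inbound = queue.Queue()
outbound = queue.Queue()
inbound.put("<order id='1'><item>widget</item></order>")

count = run_route(inbound, outbound)
result = outbound.get_nowait()
```

Nothing in the sketch says what an `order` is or how `id='1'` relates to an order id elsewhere, which is the subject identity gap noted above.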

Open Source CRM [Lack of TMs – The New Acquisition “Poison Pill”?]

Tuesday, December 25th, 2012

Zurmo sets out to enchant the open source CRM space

From the post:

Being “fed up with the existing open source CRM applications”, the team at Zurmo have released their own open source customer relationship management (CRM) software – Zurmo 1.0. The CRM software, which has been in development for two years, includes deal tracking features, contact and activity management, and has scores and badges that can be managed through a built-in gamification system.

Zurmo 1.0 has been translated into ten languages and features a RESTful API to further integration with other applications. Location data is provided by Google Maps and Geocode. The application’s permission system supports roles for individual users and groups, and allows administrators to create ad-hoc teams. The application is designed to be modern and easy to use and integrates social-network-like functionality at its centre, which functions to distribute tasks, solicit advice, and publish accomplishments.

Describing what led the company to create another CRM system, Zurmo Co-Founder Ray Stoeckicht said: “We believe in CRM, but users continue to perceive it as a clunky, burdensome tool that wastes their time and only provides value to management. This space needs a major disruption and user adoption needs to be the focus.” He goes on to describe the application as “enchanting” and says that a major focus in the development of Zurmo 1.0 was the gamification aspects, which are designed to get the users to follow CRM best practices and to make correct use of the system more enjoyable. One example of gamification is “Missions“, where an employee can challenge another in exchange for a reward.

If two or more CRM systems are integrated with other applications, separately, what do you think happens if those CRM systems attempt to merge? (Without topic map capabilities.)

Not that the merging need be automatic, but if the semantics of the “other” applications and its data are defined by a topic map, doesn’t that ease future merging of CRM systems?

Assuming that every possessor of a CRM system is eyeing other possessors of CRM systems as possible acquisitions. 😉

Will the lack of data systems capable of rapid and reliable integration become the new “poison pill” for 2013?

Will the lack of data systems capable of rapid and reliable integration be a mark against management of a purchaser?

Either? Both?

Geospatial Intelligence Forum

Monday, December 24th, 2012

Geospatial Intelligence Forum: The Magazine of the National Intelligence Community

Apologies but I could not afford a magazine subscription for every reader of this blog.

The next best thing is a free magazine that may be useful in your data integration/topic map practice.

Defense intelligence has been a hot topic for the last decade and there are no signs that is going to change any time soon.

I was browsing through Geospatial Intelligence Forum (GIF) when I encountered:

Closing the Interoperability Gap by Cheryl Gerber.

From the article:

The current technology gaps can be frustrating for soldiers to grapple with, particularly in the middle of battlefield engagements. “This is due, in part, to stovepiped databases forcing soldiers who are working in tactical operations centers to perform many work-arounds or data translations to present the best common operating picture to the commander,” said Dr. Joseph Fontanella, AGC director and Army geospatial information officer.

Now there is a use case for interoperability, being “…in the middle of battlefield engagements.”

Cheryl goes on to identify five (5) gaps in interoperability.

GIF looks like a good place to pick up riffs, memes, terminology and even possible contacts.


Building superior integrated applications with open source Apache Camel (Webinar)

Tuesday, October 30th, 2012

Webinar – Building superior integrated applications with open source Apache Camel by Claus Ibsen.

From the post:

I am scheduled to host a free webinar on building integrated applications using Apache Camel.

Date: November 6th, 2012 (moved due to Hurricane Sandy)
Time: 3:00 PM (Central European Time) – 10:00 AM (EDT)
Duration: 1h15m

This webinar will show you how to build integrated applications with open source Apache Camel. Camel is one of the most frequently downloaded projects, and it is changing the way teams approach integration. The webinar will start with the basics, continue with examples and how to get started, and conclude with live demo. We will cover

  • Enterprise Integration Patterns
  • Domain Specific Languages
  • Maven and Eclipse tooling
  • Java, Spring, OSGi Blueprint, Scala and Groovy
  • Deployment options
  • Extending Camel by building custom Components
  • Q and A

Before we open for QA at the end of the session, we will share links where you can go and read and learn more about Camel. Don’t miss this informative session!

You can register for the webinar at this link.

Definitely on my list to attend.


Apache Camel 2.11 – Neo4j and more new components

Tuesday, October 30th, 2012

Apache Camel 2.11 – Neo4j and more new components by Claus Ibsen.

From the post:

As usual the Camel community continues to be very active. For the upcoming Camel 2.11 release we have already five new components in the works

All five components started by members of the community, and not by people from the Camel team. For example the camel-neo4j, and camel-couchdb components is kindly donated to ASF by Stephen Samuel. Bilgin Ibryam contributed the camel-cmis component. And Cedric Vidal donated the camel-elasticsearch component. And lastly Scott Sullivan donated the camel-sjms component.

Just in case you live in a world where Enterprise Integration Patterns are relevant. 😉

If you are not familiar with Camel, see Camel in Action, Chapter 1 (direct link), the free first chapter of the Camel in Action book.

I first saw this at DZone.

Hafslund SESAM – Semantic integration in practice

Thursday, September 13th, 2012

Hafslund SESAM – Semantic integration in practice by Lars Marius Garshol.

Lars has posted his slides from a practical implementation of semantic integration, and what he saw along the way.

I particularly liked the line:

Generally, archive systems are glorified trash cans – putting it in the archive effectively means hiding it

BTW, Lars mentions he has a paper on this project. If you are looking for publishable semantic integration content, you might want to ping him.