## Archive for the ‘Data Integration’ Category

### Apache Camel 2.11.0 Release

Wednesday, May 1st, 2013

Apache Camel 2.11.0 Release by Christian Mueller.

From the post:

The Apache Camel project is a powerful open source integration framework based on known Enterprise Integration Patterns.

The Camel community announces the immediate availability of a new minor release camel-2.11.0. This release is issued after 9 months of intense efforts. During this period the camel community continued to support previous versions and issued various patch releases as well.

The camel-2.11.0 release comes with an impressive 679 issues fixed. Camel is the open source integration framework with the largest support of protocols and data formats on the market. This release adds another 12 components, supporting technologies like cmis, couchdb, elasticsearch, redis, rx and “Springless” JMS integration.

The artifacts are published and ready for you to download either from the Apache mirrors or from the Central Maven repository.

For more details please take a look at the release notes.

Many thanks to the Camel community for making this release possible.

Spring time upgrades are underway!
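Camel routes are written against real endpoints in a Java (or XML) DSL; the Enterprise Integration Patterns it implements are simple ideas underneath. As a toy illustration, here is a content-based router sketched in plain Python. All names are illustrative and this is nothing like the Camel API itself:

```python
# Toy sketch of the Content-Based Router pattern, one of the Enterprise
# Integration Patterns frameworks like Camel implement. Illustrative only.

def route(message, routes, default=None):
    """Deliver a message to the first destination whose predicate matches."""
    for predicate, destination in routes:
        if predicate(message):
            destination.append(message)
            return destination
    if default is not None:
        default.append(message)
        return default
    raise ValueError("no route for message")

orders, invoices = [], []
routes = [
    (lambda m: m["type"] == "order", orders),
    (lambda m: m["type"] == "invoice", invoices),
]

route({"type": "order", "id": 1}, routes)
route({"type": "invoice", "id": 2}, routes)
```

In Camel the predicates, destinations and message model are all first-class components; the sketch only shows the shape of the pattern.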

### SnapLogic

Saturday, April 27th, 2013

SnapLogic

From What We Do:

SnapLogic is the only cloud integration solution built on modern web standards and “containerized” Snaps, allowing you to easily connect any combination of Cloud, SaaS or On-premise applications and data sources.

We’ve now entered an era in which the Internet is the network, much of the information companies need to coordinate is no longer held in relational databases, and the number of new, specialized cloud applications grows each day. Today, organizations are demanding a faster and more modular way to interoperate with all these new cloud applications and data sources.

Prefab mapping components are offered for data sources such as Salesforce, Oracle’s PeopleSoft and SAP (all for sale), along with free components for Google Spreadsheet, HDFS, Hive and others.

Two observations:

First, the “snaps” are all for data sources and not data sets, although I don’t see any reason why data sets could not be the subject of snaps.

Second, the mapping examples I saw (caveat: I did not see them all) did not provide for recording the basis for data operations (read: subject identity).

With regard to the second observation, my impression is that snaps can be extended to provide capabilities such as we would associate with a topic map.

Something to consider even if you are fielding your own topic map application.

I am going to be reading more about SnapLogic and its products.

Sing out if you have pointers or suggestions.

### …Cloud Integration is Becoming a Bigger Issue

Wednesday, April 10th, 2013

Survey Reports that Cloud Integration is Becoming a Bigger Issue by David Linthicum.

David cites a survey by KPMG that found thirty-three percent of executives complained of higher than expected costs for data integration in cloud projects.

One assumes those were the brighter thirty-three percent of those surveyed. The remainder apparently did not recognize data integration issues in their cloud projects.

David writes:

Part of the problem is that data integration itself has never been sexy, and thus seems to be an issue that enterprise IT avoids until it can’t be ignored. However, data integration should be the life-force of the enterprise architecture, and there should be a solid strategy and foundational technology in place.

Cloud computing is not the cause of this problem, but it’s shining a much brighter light on the lack of data integration planning. Integrating cloud-based systems is a bit more complex and laborious. However, the data integration technology out there is well proven and supports cloud-based platforms as the source or the target in an integration chain. (emphasis added)

The more diverse data sources become, the larger data integration issues will loom.

Topic maps offer data integration efforts in cloud projects a choice:

1) You can integrate one off, either with inhouse or third-party tools, only to redo all that work with each new data source, or

2) You can integrate using a topic map (for integration or to document integration) and re-use the expertise from prior data integration efforts.
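The difference between the two options shows up even in a toy field-mapping sketch (all source and field names here are hypothetical). Under option 2, the mapping is captured as data and each new source only adds entries to the table, rather than triggering a fresh round of integration code:

```python
# Reusable mapping from source-specific field names to canonical names.
# Each new source adds rows here; the integration logic is written once.

FIELD_MAP = {
    "crm": {"cust_nm": "customer_name", "cust_id": "customer_id"},
    "erp": {"CustomerName": "customer_name", "CustNo": "customer_id"},
}

def to_canonical(source, record):
    """Rename a record's fields to the canonical vocabulary."""
    return {FIELD_MAP[source].get(k, k): v for k, v in record.items()}

a = to_canonical("crm", {"cust_nm": "Acme", "cust_id": 42})
b = to_canonical("erp", {"CustomerName": "Acme", "CustNo": 42})
```

Both records come out identical, so downstream code sees one vocabulary regardless of source.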

Suggest pitching topic maps as a value-add proposition.

### Open PHACTS

Sunday, April 7th, 2013

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma.

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

### 5 Pitfalls To Avoid With Hadoop

Monday, March 25th, 2013

5 Pitfalls To Avoid With Hadoop by Syncsort, Inc.

From the registration page:

Hadoop is a great vehicle to extract value from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve a complete ETL solution can hinder success.

Syncsort has worked with early adopter Hadoop customers to identify and solve the most common pitfalls organizations face when deploying ETL on Hadoop.

1. Hadoop is not a data integration tool
2. MapReduce programmers are hard to find
3. Most data integration tools don’t run natively within Hadoop
4. Hadoop may cost more than you think
5. Elephants don’t thrive in isolation

Before you give up your email and phone number for the “free ebook,” be aware it is a promotional piece for Syncsort DMX-h.

Which isn’t a bad thing but if you are expecting something different, you will be disappointed.

The observations are trivially true and amount to Hadoop not having a user-facing interface, pre-written routines for data integration, or the tools that data integration users normally expect.

OK, but a hammer doesn’t come with blueprints, nails, wood, etc. Those aren’t “pitfalls.”

It’s the nature of a hammer that those “extras” need to be supplied.

You can either do that piecemeal or you can use a single source (the equivalent of Syncsort DMX-h).

Syncsort should be on your short list of data integration options to consider but let’s avoid loose talk about Hadoop. There is enough of that in the uninformed mainstream media.

### …Cloud Computing is Changing Data Integration

Monday, March 25th, 2013

More Evidence that Cloud Computing is Changing Data Integration by David Linthicum.

From the post:

In a recent Sand Hill article, Jeff Kaplan, the managing director of THINKstrategies, reports on the recent and changing state of data integration with the addition of cloud computing. “One of the ongoing challenges that continues to frustrate businesses of all sizes is data integration, and that issue has only become more complicated with the advent of the cloud. And, in the brave new world of the cloud, data integration must morph into a broader set of data management capabilities to satisfy the escalating needs of today’s business.”

In the article, Jeff reviews a recent survey conducted with several software vendors, concluding:

• Approximately 90 percent of survey respondents said integration is important in their ability to win new customers.
• Eighty-four percent of the survey respondents reported that integration has become a difficult task that is getting in the way of business.
• A quarter of the respondents said they’ve still lost customers because of integration issues.

It’s interesting to note that these issues affect legacy software vendors, as well as Software-as-a-Service (SaaS) vendors. No matter if you sell software in the cloud or deliver it on-demand, the data integration issues are becoming a hindrance.

If cloud computing and/or big data are bringing data integration into the limelight, that sounds like good news for topic maps.

Particularly topic maps of data sources that enable quick and reliable data integration without a round of exploration and testing first.

### Integrating Structured and Unstructured Data

Thursday, February 21st, 2013

Integrating Structured and Unstructured Data by David Loshin.

It’s a checklist report but David comes up with useful commentary on the following seven points:

1. Document clearly defined business use cases.
2. Employ collaborative tools for the analysis, use, and management of semantic metadata.
3. Use pattern-based analysis tools for unstructured text.
4. Build upon methods to derive meaning from content, context, and concept.
5. Leverage commodity components for performance and scalability.
6. Manage the data life cycle.
7. Develop a flexible data architecture.

It’s not going to save you planning time but may keep you from overlooking important issues.

My only quibble is that David doesn’t call out data structures as needing defined and preserved semantics.

Data is a no-brainer but the containers of data, dare I say “Hadoop silos,” need to have semantics defined as well.

Data or data containers without defined and preserved semantics are much more costly in the long run.

Both in lost opportunity costs and after the fact integration costs.

### Hadoop silos need integration…

Thursday, February 21st, 2013

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stored in semantically diverse data stores. Data which, if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

### Leveraging Ontologies for Better Data Integration

Thursday, February 21st, 2013

Leveraging Ontologies for Better Data Integration by David Linthicum.

From the post:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

Just to be clear: You must understand the data to define the proper integration flows and transformation scenarios, and provide service-oriented frameworks to your data integration domain, meaning levels of abstraction. This is applicable both in the movement of data from source to target systems, as well as the abstraction of the data using data virtualization approaches and technology, such as technology for the host of this blog.

This is where many data integration projects fall down. Most data integration occurs at the information level. So, you must always deal with semantics and how to describe semantics relative to a multitude of information systems. There is also a need to formalize this process, putting some additional methodology and technology behind the management of metadata, as well as the relationships therein.

Many in the world of data integration have begun to adopt the notion of ontology (or the instances of ontology: ontologies). Ontology is a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related.

Why should we care? Ontologies are important to data integration solutions because they provide a shared and common understanding of data that exists within the business domain. Moreover, ontologies illustrate how to facilitate communication between people and information systems. You can think of ontologies as the understanding of everything, and how everything should interact to reach a common objective. In this case the optimization of the business. (emphasis added)

The two bolded lines I wanted to call to your attention:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

I wasn’t aware understanding the “meaning of data” as a prerequisite to data integration was ever contested?

You?

I am equally unsure that having a “…common and shared understanding of data…” qualifies as an ontology.

Which is a restatement of the first point.

What interests me is how to go from non-common and non-shared understandings of data to capturing all the currently known understandings of the data?

Repeating what is uncontested or already agreed upon, isn’t going to help with that task.

### Why Most BI Programs Under-Deliver Value

Sunday, February 10th, 2013

Why Most BI Programs Under-Deliver Value by Steve Dine.

From the post:

Business intelligence initiatives have been undertaken by organizations across the globe for more than 25 years, yet according to industry experts between 60 and 65 percent of BI projects and programs fail to deliver on the requirements of their customers.

The impact of this failure reaches far beyond the project investment, from unrealized revenue to increased operating costs. While the exact reasons for failure are often debated, most agree that a lack of business involvement, long delivery cycles and poor data quality lead the list. After all this time, why do organizations continue to struggle with delivering successful BI? The answer lies in the fact that they do a poor job at defining value to the customer and how that value will be delivered given the resource constraints and political complexities in nearly all organizations.

BI is widely considered an umbrella term for data integration, data warehousing, performance management, reporting and analytics. For the vast majority of BI projects, the road to value definition starts with a program or project charter, which is a document that defines the high level requirements and capital justification for the endeavor. In most cases, the capital justification centers on cost savings rather than value generation. This is due to the level of effort required to gather and integrate data across disparate source systems and user developed data stores.

As organizations mature, the number of applications that collect and store data increase. These systems usually contain few common unique identifiers to help identify related records and are often referred to as data silos. They also can capture overlapping data attributes for common organizational entities, such as product and customer. In addition, the data models of these systems are usually highly normalized, which can make them challenging to understand and difficult for data extraction. These factors make cost savings, in the form of reduced labor for data collection, easy targets. Unfortunately, most organizations don’t eliminate employees when a BI solution is implemented; they simply work on different, hopefully more value added, activities. From the start, the road to value is based on a flawed assumption and is destined to under deliver on its proposition.

This post merits a close read, several times.

In particular I like the focus on delivery of value to the customer.

Err, that would be the person paying you to do the work.

Steve promises a follow-up on “lean BI” that focuses on delivering more value than it costs to deliver.

I am inherently suspicious of “lean” or “agile” approaches. I sat on a committee that was assured by three programmers they had improved upon IBM’s programming methodology but declined to share the details.

Their requirements document for a content management system, to be constructed on top of subversion, was a paragraph in an email.

Fortunately the committee prevailed upon management to tank the project. The programmers persist, management being unable or unwilling to correct past mistakes.

I am sure there are many agile/lean programming projects that deliver well documented, high quality results.

But I don’t start with the assumption that agile/lean or other methodology projects are well documented.

That is a question of fact. One that can be answered.

Refusal to answer due to time or resource constraints, is a very bad sign.

I first saw this in a top ten tweets list from KDNuggets.

### Rx 2.1 and ActorFx V0.2

Friday, February 8th, 2013

Rx 2.1 and ActorFx V0.2 by Claudio Caldato.

From the post:

Today Microsoft Open Technologies, Inc., is releasing updates to improve two cloud programming projects from our MS Open Tech Hub: Rx and ActorFx.

Reactive Extensions (Rx) is a programming model that allows developers to use a common interface for writing applications that interact with diverse data sources, like stock quotes, Tweets, computer events, and Web service requests. Since Rx was open-sourced by MS Open Tech in November 2012, it has become an important under-the-hood component of several high-availability multi-platform applications, including Netflix and GitHub.

Rx 2.1 is available now via the Rx CodePlex project and includes support for Windows Phone 8, various bug fixes and contributions from the community.

ActorFx provides a non-prescriptive, language-independent model of dynamic distributed objects for highly available data structures and other logical entities via a standardized framework and infrastructure. ActorFx is based on the idea of the mathematical Actor Model, which was adapted by Microsoft’s Eric Meijer for cloud data management.

ActorFx V0.2 is available now at the CodePlex ActorFx project, originally open sourced in December 2012. The most significant new feature in our early prototype is Actor-to-Actor communication.

The Hub engineering program has been a great place to collaborate on these projects, as these assignments give us the agility and resources to work with the community. Stay tuned for more updates soon!

With each step towards better access to diverse data sources, the semantic impedance between data systems becomes more evident.

To say nothing of the semantics of the data you obtain.

The question to ask is:

Will new data make sense when combined with data I already have?

If you don’t know or if the answer is no, you may need a topic map.
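Rx's core idea, a uniform push-based interface over diverse event sources, can be caricatured in a few lines. This is a toy observer, not the Rx API:

```python
# Toy push-based source: subscribe once, receive events as they arrive.
# The uniform interface is the point; this is not the Rx API.

class Source:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, on_next):
        self._subscribers.append(on_next)

    def push(self, event):
        for on_next in self._subscribers:
            on_next(event)

quotes = Source()          # could equally be tweets, logs, sensor data
seen = []
quotes.subscribe(seen.append)
quotes.push({"symbol": "MSFT", "price": 27.5})
quotes.push({"symbol": "MSFT", "price": 27.6})
```

Note the sketch says nothing about what the events *mean*, which is exactly the semantic gap the post points at.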

### Seamless Astronomy

Thursday, February 7th, 2013

Seamless Astronomy: Linking scientific data, publications, and communities

From the webpage:

Seamless integration of scientific data and literature

Astronomical data artifacts and publications exist in disjointed repositories. The conceptual relationship that links data and publications is rarely made explicit. In collaboration with ADS and ADSlabs, and through our work in conjunction with the Institute for Quantitative Social Science (IQSS), we are working on developing a platform that allows data and literature to be seamlessly integrated, interlinked, mutually discoverable.

Projects:

• ADS All-Sky Survey (ADSASS)
• Astronomy Dataverse
• WorldWide Telescope (WWT)
• Viz-e-Lab
• Glue
• Study of the impact of social media and networking sites on scientific dissemination
• Network analysis and visualization of astronomical research communities
• Data citation practices in Astronomy
• Semantic description and annotation of scientific resources

A project with large amounts of data for integration.

Moreover, unlike the U.S. Intelligence Community, they are working towards data integration, not resisting it.

I first saw this in Four short links: 6 February 2013 by Nat Torkington.

### Data Integration Is Now A Business Problem – That’s Good

Tuesday, January 8th, 2013

Data Integration Is Now A Business Problem – That’s Good by John Schmidt.

From the post:

Since the advent of middleware technology in the mid-1990s, data integration has been primarily an IT-led technical problem. Business leaders had their hands full focusing on their individual silos and were happy to delegate the complex task of integrating enterprise data and creating one version of the truth to IT. The problem is that there is now too much data that is highly fragmented across myriad internal systems, customer/supplier systems, cloud applications, mobile devices and automatic sensors. Traditional IT-led approaches whereby a project is launched involving dozens (or hundreds) of staff to address every new opportunity are just too slow.

The good news is that data integration challenges have become so large, and the opportunities for competitive advantage from leveraging data are so compelling, that business leaders are stepping out of their silos to take charge of the enterprise integration task. This is good news because data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem. It also shifts the emphasis for financial justification of integration initiatives from IT cost-saving activities to revenue-generating and business process improvement initiatives. (emphasis added)

I think the key point for me is the bolded line: data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem.

Data integration never was a technical problem, not really. It just wasn’t important enough for leaders to create agreements to solve it.

Like a lack of sharing between U.S. intelligence agencies. Which is still the case, twelve years this next September 11th as a matter of fact.

Topic maps can capture data integration agreements, but only if users have the business leadership to reach them.

Could be a very good year!

### Biomedical Knowledge Integration

Tuesday, January 8th, 2013

Biomedical Knowledge Integration by Philip R. O. Payne.

Abstract:

The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. 
In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps.

Well, but it does include use cases at the end of the chapter that are biomedical specific.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.

Yes?

### Teiid (8.2 Final Released!) [Component for TM System]

Thursday, November 22nd, 2012

Teiid

From the homepage:

Teiid is a data virtualization system that allows applications to use data from multiple, heterogeneous data stores.

Teiid is comprised of tools, components and services for creating and executing bi-directional data services. Through abstraction and federation, data is accessed and integrated in real-time across distributed data sources without copying or otherwise moving data from its system of record.

Teiid Parts

• Query Engine: The heart of Teiid is a high-performance query engine that processes relational, XML, XQuery and procedural queries from federated data sources. Features include support for homogeneous schemas, heterogeneous schemas, transactions, and user defined functions.
• Embedded: An easy-to-use JDBC Driver that can embed the Query Engine in any Java application. (as of 7.0 this is not supported, but on the roadmap for future releases)
• Server: An enterprise ready, scalable, manageable runtime for the Query Engine that runs inside JBoss AS and provides additional security, fault-tolerance, and administrative features.
• Connectors: Teiid includes a rich set of Translators and Resource Adapters that enable access to a variety of sources, including most relational databases, web services, text files, and LDAP. Need data from a different source? Custom translators and resource adapters can easily be developed.
• Tools:

Teiid 8.2 final was released on November 20, 2012.

Like most integration services, not strong on integration between integration services.

Would make one helluva component for a topic map system.

A system with an inter-integration solution mapping layer in addition to the capabilities of Teiid.
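What Teiid does at industrial scale can be caricatured in a few lines: a federated lookup pulls from two heterogeneous sources in place and joins the results, without copying either system of record. The sources and names below are illustrative and bear no resemblance to Teiid's actual engine:

```python
# Caricature of data federation: join across two heterogeneous sources
# on a shared key, leaving each source of record where it is.

sql_rows = [                       # stands in for a relational source
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Beta"},
]
ldap_entries = {                   # stands in for an LDAP source
    1: {"email": "ops@acme.example"},
    2: {"email": "it@beta.example"},
}

def federated_lookup(cust_id):
    """Answer a query by combining both sources at request time."""
    row = next(r for r in sql_rows if r["cust_id"] == cust_id)
    return {**row, **ldap_entries[cust_id]}

result = federated_lookup(1)
```

The real engine adds query planning, pushdown, transactions and security; the sketch only shows the access-without-copying idea.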

### Standards and Infrastructure for Innovation Data Exchange [#6000]

Saturday, October 13th, 2012

Standards and Infrastructure for Innovation Data Exchange by Laurel L. Haak, David Baker, Donna K. Ginther, Gregg J. Gordon, Matthew A. Probus, Nirmala Kannankutty and Bruce A. Weinberg. (Science 12 October 2012: Vol. 338 no. 6104 pp. 196-197 DOI: 10.1126/science.1221840)

Appropriate that post number six thousand (6000) should report an article on data exchange standards.

But the article seems to be at war with itself.

Consider:

There is no single database solution. Data sets are too large, confidentiality issues will limit access, and parties with proprietary components are unlikely to participate in a single-provider solution. Security and licensing require flexible access. Users must be able to attach and integrate new information.

Unified standards for exchanging data could enable a Web-based distributed network, combining local and cloud storage and providing public-access data and tools, private workspace “sandboxes,” and versions of data to support parallel analysis. This infrastructure will likely concentrate existing resources, attract new ones, and maximize benefits from coordination and interoperability while minimizing resource drain and top-down control.

As quickly as the authors say “[t]here is no single database solution,” they take a deep breath and outline the case for a uniform data-sharing infrastructure.

If there is no “single database solution,” it stands to reason there is no single infrastructure for sharing data. The same diversity that blocks the single database, impedes the single exchange infrastructure.

We need standards, but rather than unending quests for enlightened permanence, we should focus on temporary standards, to be replaced by other temporary standards, when circumstances or needs change.

A narrow range of adoption needed to demonstrate benefits from temporary standards is a plus as well. A standard enabling data integration between departments at a hospital, one department at a time, will show benefits (if there are any to be had) far sooner than a standard that requires universal adoption before any benefits appear.

The Topic Maps Data Model (TMDM) is an example of a narrow range standard.

While the TMDM can be extended, in its original form subjects are reliably identified using IRIs (along with data about those subjects). All that is required is that one or more parties use IRIs as identifiers, and not even the same IRIs.

The TMDM framework enables one or more parties to use their own IRIs and data practices, without prior agreement, and still have reliable merging of their data.

I think it is the without prior agreement part that distinguishes the Topic Maps Data Model from other data interchange standards.

We can skip all the tiresome discussion about who has the better name/terminology/taxonomy/ontology for subject X and get down to data interchange.
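The TMDM merging rule is mechanical enough to sketch: topics merge when they share at least one subject identifier (IRI), and neither party needed to know the other's IRIs in advance. A single-pass toy version (real merging is transitive and covers more than names):

```python
# Sketch of TMDM-style merging: topics that share any subject identifier
# (IRI) are the same subject, so their identifiers and names are unioned.

def merge_topics(topics):
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["ids"] & topic["ids"]:   # any IRI in common
                existing["ids"] |= topic["ids"]
                existing["names"] |= topic["names"]
                break
        else:
            merged.append({"ids": set(topic["ids"]),
                           "names": set(topic["names"])})
    return merged

# Two parties describe the same subject with overlapping identifiers.
party_a = {"ids": {"http://example.org/camel"}, "names": {"Apache Camel"}}
party_b = {"ids": {"http://example.org/camel",
                   "http://example.net/eip-framework"}, "names": {"Camel"}}
result = merge_topics([party_a, party_b])
```

The one shared IRI is enough: both parties' identifiers and names end up on a single merged topic.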

Data interchange is interesting, but what we find following data interchange is even more so.

More on that to follow, sooner rather than later, in the next six thousand posts.

(See the Donations link. Your encouragement would be greatly appreciated.)

### Lacking Data Integration, Cloud Computing Suffers

Friday, October 12th, 2012

Lacking Data Integration, Cloud Computing Suffers by David Linthicum.

From the post:

The findings of the Cloud Market Maturity study, a survey conducted jointly by Cloud Security Alliance (CSA) and ISACA, show that government regulations, international data privacy, and integration with internal systems dominate the top 10 areas where trust in the cloud is at its lowest.

The Cloud Market Maturity study examines the maturity of cloud computing and helps identify market changes. In addition, the report provides detailed information on the adoption of cloud services at all levels within global companies, including senior executives.

Study results reveal that cloud users from 50 countries expressed the lowest level of confidence in the following (ranked from most reliable to least reliable):

• Government regulations keeping pace with the market
• Exit strategies
• International data privacy
• Legal issues
• Contract lock in
• Data ownership and custodian responsibilities
• Longevity of suppliers
• Integration of cloud with internal systems
• Credibility of suppliers
• Testing and assurance

Questions:

As “big data” gets “bigger,” will cloud integration issues get better or worse?

Do you prefer disposable data integration or reusable data integration? (For bonus points, why?)

### Mule ESB 3.3.1

Thursday, September 13th, 2012

Mule ESB 3.3.1 by Ramiro Rinaudo.

I got the “memo” on 4 September 2012 but it got lost in my inbox. Sorry.

From the post:

Mule ESB 3.3.1 represents a significant amount of effort on the back of Mule ESB 3.3 and our happiness with the result is multiplied by the number of products that are part of this release. We are releasing new versions with multiple enhancements and bug fixes to all of the major stack components in our Enterprise Edition. This includes:

### Misconceptions holding back use of data integration tools [Selling tools or data integration?]

Tuesday, August 28th, 2012

Misconceptions holding back use of data integration tools by Rick Sherman.

From the post:

There’s no question that data integration technology is a good thing. So why aren’t businesses using it as much as they should be?

Data integration software has evolved significantly from the days when it primarily consisted of extract, transform and load (ETL) tools. The technologies available now can automate the process of integrating data from source systems around the world in real time if that’s what companies want. Data integration tools can also increase IT productivity and make it easier to incorporate new data sources into data warehouses and business intelligence (BI) systems for users to analyze.

But despite tremendous gains in the capabilities and performance of data integration tools, as well as expanded offerings in the marketplace, much of the data integration projects in corporate enterprises are still being done through manual coding methods that are inefficient and often not documented. As a result, most companies haven’t gained the productivity and code-reuse benefits that automated data integration processes offer. Instead, they’re deluged with an ever-expanding backlog of data integration work, including the need to continually update and fix older, manually coded integration programs.

Rick’s first sentence captures the problem with promoting data integration:

“There’s no question that data integration technology is a good thing.”

Hypothetical survey of Fortune 1,000 CEO’s:

| Question | Agree | Disagree |
| --- | --- | --- |
| Data integration may be a good thing | 100% | 0% |
| Data integration technology is a good thing | 0.001% | 99.999% |

Data integration may be a good thing. Depends on what goal or mission is furthered by data integration.

Data integration, by hand, manual coding or data mining, isn’t an end unto itself. Only a means to an end.

Specific data integration, tied to a mission or goal of an organization, has a value to be evaluated against the cost of the tool or service.

Otherwise, we are selling tools of no particular value for some unknown purpose.

Sounds like a misconception of the sales process to me.

### Data Integration Services & Hortonworks Data Platform

Thursday, June 28th, 2012

From the post:

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Jim covers four advantages of using Talend:

• Bridge the skills gap
• HCatalog Integration
• Connect to the entire enterprise
• Graphic Pig Script Creation

Definitely something to keep in mind.

### Getting Started with Apache Camel

Saturday, June 16th, 2012

Getting Started with Apache Camel

6/28/2012 10:00 AM EST

From the webpage:

Description:

This session will teach you how to get a good start with Apache Camel. It will cover the basic concepts of Camel such as Enterprise Integration Patterns and Domain Specific Languages, all explained with simple examples demonstrating the theory applied in practice using code. We will then discuss how you can get started developing with Camel and how to set up a new project from scratch—using Maven and Eclipse tooling. This session includes live demos that show how to build Camel applications in Java, Spring, OSGi Blueprint and alternative languages such as Scala and Groovy. We demonstrate how to build custom components and we will share highlights of the upcoming Apache Camel 2.10 release.
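For readers new to the Enterprise Integration Patterns the session covers, here is a rough, framework-free sketch of one of them, the content-based router. Everything here (the class, the predicates, the endpoint names) is invented for illustration; Camel itself expresses the same pattern in its Java, Spring, Scala or Groovy DSLs.

```python
class ContentBasedRouter:
    """Toy content-based router: deliver each message to the first
    endpoint whose predicate matches, else to a default endpoint."""

    def __init__(self, default):
        self.routes = []          # list of (predicate, endpoint) pairs
        self.default = default

    def when(self, predicate, endpoint):
        self.routes.append((predicate, endpoint))
        return self               # allow fluent, DSL-like chaining

    def send(self, message):
        for predicate, endpoint in self.routes:
            if predicate(message):
                return endpoint(message)
        return self.default(message)

# Invented endpoints that just tag the message with where it went.
router = (ContentBasedRouter(default=lambda m: ("dead-letter", m))
          .when(lambda m: m.get("format") == "xml",
                lambda m: ("xml-queue", m))
          .when(lambda m: m.get("format") == "csv",
                lambda m: ("csv-queue", m)))

dest, _ = router.send({"format": "csv", "body": "a,b,c"})
```

The fluent `when(...)` chaining is the point: Camel's DSLs read much the same way, just with real endpoints instead of lambdas.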

Speaker:

Claus Ibsen has worked on Apache Camel for years and he shares a great deal of his expertise as a co-author of Manning’s Camel in Action book. He is a principal engineer working for FuseSource specializing in the enterprise integration space. He lives in Sweden near Malmo with his wife and dog.

Data integration talents are great, but coupled with integration tools, they are even better! See you at the webinar!

### Mozilla Ignite [Challenge - $15,000]

Friday, June 15th, 2012

Mozilla Ignite

From the webpage:

Calling all developers, network engineers and community catalysts. Mozilla and the National Science Foundation (NSF) invite designers, developers and everyday people to brainstorm and build applications for the faster, smarter Internet of the future. The goal: create apps that take advantage of next-generation networks up to 250 times faster than today, in areas that benefit the public — like education, healthcare, transportation, manufacturing, public safety and clean energy.

Designing for the internet of the future

The challenge begins with a “Brainstorming Round” where anyone can submit and discuss ideas. The best ideas will receive funding and support to become a reality. Later rounds will focus specifically on application design and development. All are welcome to participate in the brainstorming round.

BRAINSTORM

What would you do with 1 Gbps? What apps would you create for deeply programmable networks 250x faster than today? Now through August 23rd, let’s brainstorm. $15,000 in prizes.

The challenge is focused specifically on creating public benefit in the U.S. The deadline for idea submissions is August 23, 2012.

Here is the entry website.

I assume the 1Gbps is actual and not as measured by the marketing department of the local cable company.

That would have to be from a source that can push 1 Gbps to you, and you would have to be capable of handling it. (Upstream limitations are what choke my local speed down.)

I went looking for an example of what that would mean and came up with: “…[you] can download 23 episodes of 30 Rock in less than two minutes.”

On the whole, I would rather not.

What other uses would you suggest for 1Gbps network speeds?

Assuming you have the capacity to push back at the same speed, I wonder what that means in terms of querying/viewing data as a topic map?

Transformation to a topic map for only a subset of data?

Looking forward to seeing your entries!

### Breaking Silos – Carrot or Stick?

Thursday, June 7th, 2012

Alex Popescu, in Silos Are Built for a Reason quotes Greg Lowe saying:

In a typical large enterprise, there are competitions for resources and success, competing priorities and lots of irrelevant activities that are happening that can become distractions from accomplishing the goals of the teams.

Another reason silos are built has to do with affiliation. This is by choice, not by edict. By building groups where you share a shared set of goals, you effectively have an area of focus with a group of people interested in the same area and/or outcome.

There are many more reasons and impacts of why silos are built, but I simply wanted to establish that silos are built for a purpose with legitimate business needs in mind.

Alex then responds:

Legitimate? Maybe. Productive? I don’t really think so.

Greg’s original post is: Breaking down silos, what does that mean?

Greg asks about the benefits of breaking down silos:

• Are the silos mandatory?
• What would breaking down silos enable in the business?
• What do silos do to your business today?
• What incentive is there for these silos to go away?
• Is your company prepared for transparency?
• How will leadership deal with “Monday morning quarterbacks?”

As you can see, there are many benefits to silos as well as challenges. By developing a deeper understanding of the silos and why they get created, you can then have a better handle on whether the silos are beneficial or detrimental to the organization.

I would add to Greg’s question list:

• Which stakeholders benefit from the silos?
• What is that benefit?
• Is there a carrot or stick that outweighs that benefit? (in the view of the stakeholder)
• Do you have the political capital to take the stakeholders on and win?

If your answers are:

• List of names
• List of benefits
• Yes, list of carrots/sticks
• No

Then you are in good company.

Intelligence silos persist despite the United States being at war with identifiable terrorist groups.

A generalized benefit, or penalty for failure, isn’t a winning argument for breaking a data silo.

Specific benefits and penalties must matter to stakeholders. Then you have a chance to break a data silo.

Good luck!

### Cascading 2.0

Thursday, June 7th, 2012

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

http://www.cascading.org/downloads/

This release includes a number of new features. Specifically:

• Apache 2.0 Licensing
• Support for Hadoop 1.0.2
• Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
• HashJoin pipe for “map side joins”
• Merge pipe for “map side merges”
• Simple Checkpointing for capturing intermediate data as a file
• Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

https://github.com/Cascading

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
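The "define once, run anywhere" idea behind Cascading's local and Hadoop planner modes can be sketched very loosely, in Python rather than Cascading's Java API, as a pipeline definition kept separate from the engine that executes it. (The class and function names below are invented for illustration, not Cascading's own.)

```python
def word_count_pipeline(lines):
    """Engine-independent pipeline definition: read lines, emit counts."""
    words = (w.lower() for line in lines for w in line.split())
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

class LocalPlanner:
    """Runs the pipeline in memory -- analogous to Cascading local mode,
    useful for testing code against local files before cluster deployment."""
    def run(self, pipeline, source):
        return pipeline(source)

# A hypothetical HadoopPlanner would translate the same pipeline
# definition into MapReduce jobs; the pipeline code would not change.
result = LocalPlanner().run(word_count_pipeline, ["a b a", "b c"])
```

The payoff Cascading claims is exactly this separation: you test the assembly locally, then hand the unchanged definition to the Hadoop planner.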

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

### Apache Camel Tutorial

Wednesday, June 6th, 2012

If you haven’t seen Apache Camel Tutorial Business Partners (other tutorials here), you need to give it a close look:

So there’s a company, which we’ll call Acme. Acme sells widgets, in a fairly unusual way. Their customers are responsible for telling Acme what they purchased. The customer enters into their own systems (ERP or whatever) which widgets they bought from Acme. Then at some point, their systems emit a record of the sale which needs to go to Acme so Acme can bill them for it. Obviously, everyone wants this to be as automated as possible, so there needs to be integration between the customer’s system and Acme.

Sadly, Acme’s sales people are, technically speaking, doormats. They tell all their prospects, “you can send us the data in whatever format, using whatever protocols, whatever. You just can’t change once it’s up and running.”

The result is pretty much what you’d expect. Taking a random sample of 3 customers:

• Customer 1: XML over FTP
• Customer 2: CSV over HTTP
• Customer 3: Excel via e-mail

Now on the Acme side, all this has to be converted to a canonical XML format and submitted to the Acme accounting system via JMS. Then the Acme accounting system does its stuff and sends an XML reply via JMS, with a summary of what it processed (e.g. 3 line items accepted, line item #2 in error, total invoice 123.45). Finally, that data needs to be formatted into an e-mail, and sent to a contact at the customer in question (“Dear Joyce, we received an invoice on 1/2/08. We accepted 3 line items totaling 123.45, though there was an error with line items #2 [invalid quantity ordered]. Thank you for your business. Love, Acme.”).
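Camel would handle this scenario with its components and transforms, but the core normalization step is easy to picture in plain code. Here is a hedged Python sketch using only the standard library; the field names and the canonical shape are invented, and the Excel-via-email customer is left out for brevity.

```python
import csv
import io
import xml.etree.ElementTree as ET

def from_csv(text):
    """Customer 2 style: CSV rows of sku,qty over HTTP."""
    return [{"sku": row["sku"], "qty": int(row["qty"])}
            for row in csv.DictReader(io.StringIO(text))]

def from_xml(text):
    """Customer 1 style: <order><item sku="..." qty="..."/></order> over FTP."""
    root = ET.fromstring(text)
    return [{"sku": item.get("sku"), "qty": int(item.get("qty"))}
            for item in root.findall("item")]

def to_canonical_xml(line_items):
    """Canonical XML for the (hypothetical) accounting system's JMS queue."""
    order = ET.Element("order")
    for li in line_items:
        ET.SubElement(order, "item", sku=li["sku"], qty=str(li["qty"]))
    return ET.tostring(order, encoding="unicode")

csv_feed = "sku,qty\nW-100,3\nW-200,1\n"
xml_feed = ('<order><item sku="W-100" qty="3"/>'
            '<item sku="W-200" qty="1"/></order>')

# Different wire formats, identical canonical result.
canonical = to_canonical_xml(from_csv(csv_feed))
```

Camel's value is that the per-customer `from_*` adapters become routes over its FTP, HTTP and mail components, leaving only the mapping to maintain.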

You don’t have to be a “doormat” to take data as you find it.

Intercepted communications are unlikely to use your preferred terminology for locations or actions. Ditto for web/blog pages.

If you are thinking about normalization of data streams by producing subject-identity enhanced data streams, then you are thinking what I am thinking about Apache Camel.

For further information:

Apache Camel Documentation

Apache Camel homepage

### Uncertainty Principle for Serendipity?

Tuesday, May 22nd, 2012

Curt Monash writes in Cool analytic stories

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

How will we “know” when better data display/mining techniques enable more serendipity?

Anecdotal stories about serendipity abound.

Measuring serendipity requires knowing: (rate of serendipitous discoveries × importance of serendipitous discoveries) / opportunity for serendipitous discoveries.

Need to add in a multiplier effect for the impact that one serendipitous discovery may have in creating opportunities for other serendipitous discoveries (a serendipitous criticality point), and probably some other things I have overlooked.
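As a back-of-the-envelope sketch, the proposed metric is a one-liner. Every term here (including the multiplier) is speculation from the post, not an established measure:

```python
def serendipity_score(rate, importance, opportunity, multiplier=1.0):
    """Speculative metric: (rate x importance) / opportunity, scaled by
    a multiplier for discoveries that spawn further discoveries."""
    if opportunity == 0:
        raise ValueError("no opportunity for serendipity, nothing to measure")
    return multiplier * (rate * importance) / opportunity

# 3 discoveries rated importance 5, out of 100 opportunities,
# with a modest 1.2x criticality multiplier.
score = serendipity_score(rate=3, importance=5, opportunity=100,
                          multiplier=1.2)
```

The hard part, of course, is not the arithmetic but estimating the inputs, especially the denominator.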

What would you add to the equation?

Realizing that we may be staring at the “right” answer and never realize it.

How’s that for an uncertainty principle?

### Talend Updates

Sunday, May 20th, 2012

Talend updates data tools to 5.1.0

From the post:

Talend has updated all the applications that run on its Open Studio unified platform to version 5.1.0. Talend’s Open Studio is an Eclipse-based environment that hosts the company’s Data Integration, Big Data, Data Quality, MDM (Master Data Management) and ESB (Enterprise Service Bus) products. The system allows a user to, using the Data Integration as an example, use a GUI to define processes that can extract data from the web, databases, files or other resources, process that data, and feed it on to other systems. The resulting definition can then be compiled into a production application.

In the 5.1.0 update, Open Studio for Data Integration has, according to the release notes, been given enhanced XML mapping and support for XML documents in its SOAP, JMS, File and MOM components. A new component has also been added to help manage Kerberos security. Open Studio for Data Quality has been enhanced with new ways to apply an analysis on multiple files, and the ability to drill down through business rules to see the invalid, as well as valid, records selected by the rules.

Upgrading following a motherboard failure so I will be throwing the latest version of software on the new box.

Comments or suggestions on the Talend updates?

### Identifying And Weighting Integration Hypotheses On Open Data Platforms

Wednesday, May 16th, 2012

Identifying And Weighting Integration Hypotheses On Open Data Platforms by Julian Eberius, Katrin Braunschweig, Maik Thiele, and Wolfgang Lehner.

Abstract:

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as new way of dealing with these problems. However, these methods still require input in form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data Platforms, and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation using one of the largest Open Data platforms.

This is interesting work on Open Data platforms but it is marred by claims such as:

Open Data Platforms have some unique integration problems that do not appear in classical integration scenarios and which can only be identified using a global view on the level of datasets. These problems include partial- or duplicated datasets, partitioned datasets, versioned datasets and others, which will be described in detail in Section 4.

Really?

Would come as a surprise to the World Data Centre for Aerosols which had Synthesis and INtegration of Global Aerosol Data Sets. Contract No. ENV4-CT98-0780 (DG 12 –EHKN) produced on data sets from 1999 to 2001. One of the specific issues they addressed were duplicate data sets.

More than a decade ago counts for a “classical integration scenario” I think.

Another quibble. Cited sources do not support the text.

New forms of data management such as dataspaces and pay-as-you-go data integration [2, 6] are a hot topic in database research. They are strongly related to Open Data Platforms in that they assume large sets of heterogeneous data sources lacking a global or mediated schemata, which still should be queried uniformly.

2 M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34:27–33, December 2005.

6 J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR-07, 2007.

Articles written seven (7) and five (5) years ago do not justify a “hot topic(s) in database research” claim today.

There are other issues, major and minor but for all that, this is important work.

I want to see reports that do justice to its importance.

### ETL 2.0 – Data Integration Comes of Age

Monday, May 14th, 2012

ETL 2.0 – Data Integration Comes of Age by Robin Bloor PhD & Rebecca Jozwiak.

Well…, sort of.

It is a “white paper” and all that implies but when you read:

Versatility of Transformations and Scalability

All ETL products provide some transformations but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values accordingly, applying validations (and rejecting records that fail) and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere – in the database or in an application.

It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources “in flight,” eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options since its role is to eliminate the need for direct coding all such data transformations.
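The transformations the paper lists (look-ups, substitutions, validations with rejected records) are easy to picture as code. Here is a hedged Python sketch of one reusable transformation step; the field names, the look-up table and the validation rule are all invented for illustration:

```python
def transform(records, lookup, required=("id", "country")):
    """Apply a look-up substitution and a validation in one pass,
    returning (accepted, rejected) -- the 'reject records that fail' flow."""
    accepted, rejected = [], []
    for rec in records:
        # Validation: reject records missing any required field.
        if any(rec.get(f) in (None, "") for f in required):
            rejected.append(rec)
            continue
        out = dict(rec)
        # Look-up: substitute the coded value with its looked-up value.
        out["country"] = lookup.get(rec["country"], rec["country"])
        accepted.append(out)
    return accepted, rejected

iso = {"DE": "Germany", "US": "United States"}
good, bad = transform(
    [{"id": 1, "country": "DE"}, {"id": 2, "country": ""}],
    lookup=iso)
```

This is exactly the kind of step that, once written and documented, should be reusable across requests rather than hand-coded from zero each time, which is the point the white paper misses.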

you start to lose what little respect you had for industry “white papers.”

Not once in this white paper is the term “semantics” used. It is also innocent of using the term “documentation.”

Don’t you think an ETL 2.0 application should enable re-use of “useful transformations?”

Wouldn’t that be a good thing?

Instead of IT staff starting from zero with every transformation request?

Failure to capture the semantics of data leaves you at ETL 2.0, while everyone else is at ETL 3.0.

Where does your business sense tell you about that choice?

(ETL 3.0 – Documented, re-usable, semantics for data and data structures. Enables development of transformation modules for particular data sources.)

### Graphical Data Mapping with Mule

Monday, April 23rd, 2012

Graphical Data Mapping with Mule

May 3, 2012

From the announcement:

Do you struggle to transform data as part of your integration efforts? Has data transformation become a major pain? Your life is about to become a whole lot simpler!

See the new data mapping capabilities of Mule 3.3 in action! Fully integrated with Mule Studio at design time and Mule ESB at run time, Mule’s data mapping empowers developers to build data transformations through a graphical interface without writing custom code.

Join Mateo Almenta Reca, MuleSoft’s Director of Product Management, for a demo-focused preview of:

• An overview of data mapping capabilities in Mule 3.3
• Design considerations and deployment of applications that utilize data mapping
• Several live demonstrations of building various data transformations