Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 23, 2012

Graphical Data Mapping with Mule

Filed under: Data Integration,Mule — Patrick Durusau @ 5:58 pm

Graphical Data Mapping with Mule

Webinar date: May 3, 2012

From the announcement:

Do you struggle to transform data as part of your integration efforts? Has data transformation become a major pain? Your life is about to become a whole lot simpler!

See the new data mapping capabilities of Mule 3.3 in action! Fully integrated with Mule Studio at design time and Mule ESB at run time, Mule’s data mapping empowers developers to build data transformations through a graphical interface without writing custom code.

Join Mateo Almenta Reca, MuleSoft’s Director of Product Management, for a demo-focused preview of:

  • An overview of data mapping capabilities in Mule 3.3
  • Design considerations and deployment of applications that utilize data mapping
  • Several live demonstrations of building various data transformations

April 16, 2012

Working with your Data: Easier and More Fun

Filed under: Data Fusion,Data Integration,Fusion Tables — Patrick Durusau @ 7:15 pm

Working with your Data: Easier and More Fun by Rebecca Shapley.

From the post:

The Fusion Tables team has been a little quiet lately, but that’s just because we’ve been working hard on a whole bunch of new stuff that makes it easier to discover, manage and visualize data.

New features from Fusion Tables include:

  • Faceted search
  • Multiple tabs
  • Line charts
  • Graph visualizations
  • New API that returns JSON
  • and more features on the way!

The ability of tools to ease users into data mining, visualization and exploration continues to increase.
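Since the new API returns JSON, even a few lines of script can pull a table into an analysis. Here is a minimal sketch in Python, assuming the SQL-style query endpoint Fusion Tables exposed at the time; the table id and API key below are placeholders, not working values:

import requests

TABLE_ID = "1aBcD_example"   # placeholder table id
API_KEY = "YOUR_API_KEY"     # placeholder key

# The query endpoint accepted a SQL-like string and returned JSON.
resp = requests.get(
    "https://www.googleapis.com/fusiontables/v1/query",
    params={"sql": "SELECT * FROM %s LIMIT 10" % TABLE_ID, "key": API_KEY},
)
resp.raise_for_status()
data = resp.json()

# The JSON pairs a list of column names with a list of rows.
for row in data.get("rows", []):
    print(dict(zip(data["columns"], row)))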

Question: How do you counter mis-application of a tool with a sophisticated looking result?

Constraint-Based XML Query Rewriting for Data Integration

Constraint-Based XML Query Rewriting for Data Integration by Cong Yu and Lucian Popa.

Abstract:

We study the problem of answering queries through a target schema, given a set of mappings between one or more source schemas and this target schema, and given that the data is at the sources. The schemas can be any combination of relational or XML schemas, and can be independently designed. In addition to the source-to-target mappings, we consider as part of the mapping scenario a set of target constraints specifying additional properties on the target schema. This becomes particularly important when integrating data from multiple data sources with overlapping data and when such constraints can express data merging rules at the target. We define the semantics of query answering in such an integration scenario, and design two novel algorithms, basic query rewrite and query resolution, to implement the semantics. The basic query rewrite algorithm reformulates target queries in terms of the source schemas, based on the mappings. The query resolution algorithm generates additional rewritings that merge related information from multiple sources and assemble a coherent view of the data, by incorporating target constraints. The algorithms are implemented and then evaluated using a comprehensive set of experiments based on both synthetic and real-life data integration scenarios.

Who does this sound like?:

Data merging is notoriously hard for data integration and often not dealt with. Integration of scientific data, however, offers many complex scenarios where data merging is required. For example, proteins (each with a unique protein id) are often stored in multiple biological databases, each of which independently maintains different aspects of the protein data (e.g., structures, biological functions, etc.). When querying on a given protein through a target schema, it is important to merge all its relevant data (e.g., structures from one source, functions from another) given the constraint that protein id identifies all components of the protein.

When target constraints are present, it is not enough to consider only the mappings for query answering. The target instance that a query should “observe” must be defined by the interaction between all the mappings from the sources and all the target constraints. This interaction can be quite complex when schemas and mappings are nested and when the data merging rules can enable each other, possibly, in a recursive way. Hence, one of the first problems that we study in this paper is what it means, in a precise sense, to answer the target queries in the “best” way, given that the target instance is specified, indirectly, via the mappings and the target constraints. The rest of the paper will then address how to compute the correct answers without materializing the full target instance, via two novel algorithms that rewrite the target query into a set of corresponding source queries.

Wrong! 😉

The ACM reports sixty-seven (67) citations of this paper as of today. (Paper published in 2004.) Summaries of any of the citing literature welcome!

The question of data integration persists to this day. I take that to indicate that whatever the merits of this approach, data integration issues remain unsolved.

What are the merits/demerits of this approach?
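To see why the target constraint matters, here is a toy sketch (mine, not the authors’ algorithms) of the protein example: two sources each hold a different aspect of a protein, and the constraint that protein id identifies all components is what licenses assembling one coherent target record. Field names and values are invented:

# Source A: structures keyed by protein id.
structures = {"P12345": {"structure": "alpha-helix bundle"}}

# Source B: biological functions keyed by the same protein id.
functions = {"P12345": {"function": "oxygen transport"}}

def merge_on_key(*sources):
    """Assemble one target record per key, folding in fields from every source."""
    target = {}
    for source in sources:
        for key, fields in source.items():
            target.setdefault(key, {"protein_id": key}).update(fields)
    return target

print(merge_on_key(structures, functions))
# {'P12345': {'protein_id': 'P12345', 'structure': 'alpha-helix bundle',
#             'function': 'oxygen transport'}}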

Mule Summits

Filed under: Data Integration,Mule — Patrick Durusau @ 12:42 pm

Mule Summits

From the webpage:

Mule Summit brings together the core Mule development team and Mule users for the premier event of the year for anyone involved in integration. It offers a great opportunity to learn about key product developments, influence the product roadmap, and interact with other Mule users who share their best practices.

Locations announced in Europe and the U.S.

Cheaper than a conference + the core development team. What more could you want? (OK, Dallas isn’t Amsterdam but you chose to live there.) 😉

Mule Webinars

Filed under: Data Integration,Mule — Patrick Durusau @ 12:41 pm

Mule Webinars

I was looking for a web listing of Mule Summits and ran across this archive of prior Mule webinars.

Tried to pick my/your favorites and finally just put in the entire list. Enjoy!

February 1, 2012

Pentaho open sources ‘big data’ integration tools under Apache 2.0

Filed under: BigData,Data Integration,Kettle — Patrick Durusau @ 4:35 pm

Pentaho open sources ‘big data’ integration tools under Apache 2.0

Chris Kanaracus writes:

Business intelligence vendor Pentaho is releasing as open source a number of tools related to “big data” in the 4.3 release of its Kettle data-integration platform and has moved the project overall to the Apache 2.0 license, the company announced Monday.

While Kettle had always been available in a community edition at no charge, the tools being open sourced were previously only available in the company’s commercialized edition. They include integrations for Hadoop’s file system and MapReduce as well as connectors to NoSQL databases such as Cassandra and MongoDB.

Those technologies are some of the most popular tools associated with the analysis of “big data,” an industry buzzword referring to the ever-larger amounts of unstructured information being generated by websites, sensors and other sources, along with transactional data from enterprise applications.

The big data components will still be offered as part of a commercial package, Pentaho Business Analytics Enterprise Edition, which bundles in tech support, maintenance and additional functionality, said Doug Moran, company co-founder and big data product manager.

Who would have thought as recently as two years ago that big data analysis would face an embarrassment of open source riches?

Even though “open source,” production use of any of the “open source” tools in a big data environment requires a substantial investment of human and technical resources.

I see the usual promotional webinars, but for unstructured data I wonder why we don’t see the usual suspects in competitions like TREC.

Ranking in such an event should not be the only consideration but at least would be a public test of the various software offerings.

January 7, 2012

First Look — Talend

Filed under: Data Integration,Data Management,MDM,Talend — Patrick Durusau @ 4:03 pm

First Look — Talend

From the post:

Talend has been around for about 6 years and the original focus was on “democratizing” data integration – making it cheaper, easier, quicker and less maintenance-heavy. They originally wanted to build an open source alternative for data integration. In particular they wanted to make sure that there was a product that worked for smaller companies and smaller projects, not just for large data warehouse efforts.

Talend has 400 employees in 8 countries and 2,500 paying customers for their Enterprise product. Talend uses an “open core” philosophy where the core product is open source and the enterprise version wraps around this as a paid product. They have expanded from pure data integration into a broader platform with data quality and MDM and a year ago they acquired an open source ESB vendor and earlier this year released a Talend branded version of this ESB.

I have the Talend software but need to spend some time working through the tutorials, etc.

The review will be from the perspective of subject identity and re-use of subject identifications.

It may help me to simply start posting as I work through the software rather than waiting to create an edited review of the whole, which I could always fashion from the pieces if it looked useful.

Watch for the start of my review of Talend this next week.

January 6, 2012

Querying Semi-Structured Data

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure
and in Section 4, the issue of the query language.

A bit dated, 1996, but still worth reading. Updating the paper would make a nice semester-sized project.
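To get a feel for the “lightweight” labelled-graph idea, it helps to write down two records that share no rigid schema. A minimal sketch (my own illustration, not Abiteboul’s notation):

# Semi-structured data as an edge-labelled graph: each node is a set of
# labelled edges, and different nodes need not carry the same labels.
people = {
    "person1": {"name": "Alice", "email": "alice@example.org",
                "papers": ["querying-ssd"]},
    "person2": {"name": "Bob", "phone": "+1-555-0100"},   # no email, extra label
}

def labels(node):
    """The 'schema' is whatever labels a node happens to have."""
    return set(node)

print(labels(people["person1"]) - labels(people["person2"]))   # {'email', 'papers'}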

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people downloading it in Jan/Feb of 2011 were looking for. Perhaps they are working on an update of the paper, or would like to collaborate on updating it.

Seems like a small “feature” that would allow researchers to contact others without disclosure of email addresses (other than for the sender of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)

January 5, 2012

Interoperability Driven Integration of Biomedical Data Sources

Interoperability Driven Integration of Biomedical Data Sources by Douglas Teodoro, Rémy Choquet, Daniel Schober, Giovanni Mels, Emilie Pasche, Patrick Ruch, and Christian Lovis.

Abstract:

In this paper, we introduce a data integration methodology that promotes technical, syntactic and semantic interoperability for operational healthcare data sources. ETL processes provide access to different operational databases at the technical level. Furthermore, data instances have their syntax aligned according to biomedical terminologies using natural language processing. Finally, semantic web technologies are used to ensure common meaning and to provide ubiquitous access to the data. The system’s performance and solvability assessments were carried out using clinical questions against seven healthcare institutions distributed across Europe. The architecture managed to provide interoperability within the limited heterogeneous grid of hospitals. Preliminary scalability result tests are provided.

Appears in:

Studies in Health Technology and Informatics
Volume 169, 2011
User Centred Networked Health Care – Proceedings of MIE 2011
Edited by Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen
ISBN 978-1-60750-805-2

I have been unable to find a copy online, well, other than the publisher’s copy, at $20 for four pages. I have written to one of the authors requesting a personal use copy as I would like to report back on what it proposes.

December 26, 2011

Mondeca helps to bring Electronic Patient Record to reality

Filed under: Biomedical,Data Integration,Health care,Medical Informatics — Patrick Durusau @ 8:13 pm

Mondeca helps to bring Electronic Patient Record to reality

This has been out for a while but I just saw it today.

From the post:

Data interoperability is one of the key issues in assembling unified Electronic Patient Records, both within and across healthcare providers. ASIP Santé, the French national healthcare agency responsible for implementing nation-wide healthcare management systems, has been charged to ensure such interoperability for the French national healthcare.

The task is a daunting one since most healthcare providers use their own custom terminologies and medical codes. This is due to a number of issues with standard terminologies: 1) standard terminologies take too long to be updated with the latest terms; 2) significant internal data, systems, and expertise rely on the usage of legacy custom terminologies; and 3) a part of the business domain is not covered by a standard terminology.

The only way forward was to align the local custom terminologies and codes with the standard ones. This way local data can be automatically converted into the standard representation, which will in turn allow it to be integrated with the data coming from other healthcare providers.

I assume the alignment of local custom terminologies is an ongoing process so as the local terminologies change, re-alignment occurs as well?
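The alignment itself can be as unglamorous as a maintained lookup from local codes to the standard terminology, re-run whenever either side changes. A toy sketch with invented codes (real mappings would target standards such as ICD or LOINC):

# Invented local lab codes mapped to invented "standard" codes.
LOCAL_TO_STANDARD = {
    "GLU-BLD": "STD:0001",   # blood glucose
    "HBA1C":   "STD:0002",   # hemoglobin A1c
}

def align(record):
    """Rewrite a local record into the standard representation, flagging gaps."""
    standard = LOCAL_TO_STANDARD.get(record["local_code"])
    if standard is None:
        # Unmapped codes are exactly the re-alignment work that never ends.
        return dict(record, standard_code=None, needs_review=True)
    return dict(record, standard_code=standard)

print(align({"local_code": "GLU-BLD", "value": 5.4, "unit": "mmol/L"}))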

Kudos to Mondeca for they played an active role in the early days of XTM and I suspect that experience has influenced (for the good), their approach to this project.

December 7, 2011

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Filed under: Conferences,Data Integration,Data Mining — Patrick Durusau @ 8:12 pm

DIM 2012 : IEEE International Workshop on Data Integration and Mining

Important Dates:

When: Aug 8, 2012 – Aug 10, 2012
Where: Las Vegas, Nevada, USA
Submission Deadline: Mar 31, 2012
Notification Due: Apr 30, 2012
Final Version Due: May 14, 2012

From the website:

Given the emerging global Information-centric IT landscape that has tremendous social and economic implications, effectively processing and integrating humungous volumes of information from diverse sources to enable effective decision making and knowledge generation have become one of the most significant challenges of current times. Information Reuse and Integration (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. IRI plays a pivotal role in the capture, representation, maintenance, integration, validation, and extrapolation of information; and applies both information and knowledge for enhancing decision-making in various application domains.

This conference explores three major tracks: information reuse, information integration, and reusable systems. Information reuse explores theory and practice of optimizing representation; information integration focuses on innovative strategies and algorithms for applying integration approaches in novel domains; and reusable systems focus on developing and deploying models and corresponding processes that enable Information Reuse and Integration to play a pivotal role in enhancing decision-making processes in various application domains.

The IEEE IRI conference serves as a forum for researchers and practitioners from academia, industry, and government to present, discuss, and exchange ideas that address real-world problems with real-world solutions. Theoretical and applied papers are both included. The conference program will include special sessions, open forum workshops, panels and keynote speeches.

Note the emphasis on integration. In topic maps we would call that merging.

I think that bodes well for the future of topic maps. Provided that we “steal a march,” so to speak.

We have spent years, decades for some of us thinking about data integration issues. Let’s not hide our bright lights under a basket.

December 5, 2011

Talend 5

Filed under: Data Governance,Data Integration — Patrick Durusau @ 7:48 pm

Talend 5

Talend 5 consists of:

  • Talend Open Studio for Data Integration (formerly Talend Open Studio), the most widely used open source data integration/ETL tool in the world.
  • Talend Open Studio for Data Quality (formerly Talend Open Profiler), the only open source enterprise data profiling tool.
  • Talend Open Studio for MDM (formerly Talend MDM Community Edition), the first – and only – open source MDM solution.
  • Talend Open Studio for ESB (formerly Talend ESB Studio Standard Edition), the easy to use open source ESB based on leading Apache Software Foundation integration projects.

From BusinessWire article.

Massive file downloads running now.

Are you using Talend? Thoughts/suggestions on testing/comparisons?

November 16, 2011

expressor – Data Integration Platform

Filed under: Data Integration,Software — Patrick Durusau @ 8:18 pm

expressor – Data Integration Platform

I ran across expressor while reading a blog entry that works through Facebook and Twitter data with it as the integration software.

It has a community edition but apparently only runs on Windows (XP and Windows 7, there’s a smart move).

Before I download/install, any comments? Suggestions for other integration tasks?

Thanks!

Oh, the post that got me started on this: expressor: Enterprise Application Integration with Social Networking Applications. Realize that expressor is an ETL tool but sometimes that is what a job requires.

Data Integration Remains a Major IT Headache

Filed under: Data Integration,Marketing — Patrick Durusau @ 2:13 pm

Data Integration Remains a Major IT Headache

From the webpage:

Click through for results from a survey on data integration, conducted by BeyeNetwork on behalf of Syncsort.

…. (with regard to data integration tools)

In particular, the survey makes it clear that not only is data integration still costly, a lot of manual coding is required. The end result is that the fundamentals of data integration are still a big enough issue in most IT organizations to thwart the achievement of strategic business goals.

Complete with bar and pie charts! 😉

If data integration is a problem in the insular data enclaves of today, do you think data integration will get easier when foreign big data comes on the scene?

That’s what I think too.

I will ask BeyeNetwork if they asked this question:

How much of the manually coded data had been manually coded before?

Or perhaps better:

Where did coders get the information for repeated manual coding of the data? (with follow up questions based on the responses to refine that further)

My reasoning is that how we maintain information about data (read: metadata) influences the cost of manual coding, i.e., the discovery of what the data means (or is thought to mean).

It isn’t possible to escape manual coding, at least if we want useful data integration. We can, however, explore how to make manual coding less burdensome.

I say we can’t escape manual coding because, unless by happenstance two data sets share the same semantics, I am not really sure how they would be integrated sight unseen with any expectation of a meaningful result.

Or to put it differently, meaningful data integration efforts, like lunches, are not free.

PS: And you thought I was going to say topic maps were the answer to data integration headaches. 😉 Maybe, maybe, depends on your requirements.

You should never buy technology or software because of its name, everyone else is using it, your boss saw it during a Super Bowl half-time show, or similar reasons. I am as confident that topic maps will prove to be the viable solution in some cases as I am that other solutions are more appropriate in others. Topic mappers should not be afraid to say so.

November 15, 2011

Hadoop and Data Quality, Data Integration, Data Analysis

Filed under: Data Analysis,Data Integration,Hadoop — Patrick Durusau @ 7:58 pm

Hadoop and Data Quality, Data Integration, Data Analysis by David Loshin.

From the post:

If you have been following my recent thread, you will of course be anticipating this note, in which we examine the degree to which our favorite data-oriented activities are suited to the elastic yet scalable massive parallelism promised by Hadoop. Let me first summarize the characteristics of problems or tasks that are amenable to the programming model:

  1. Two-Phased (2-φ) – one or more iterations of “computation” followed by “reduction.”
  2. Big data – massive data volumes preclude using traditional platforms
  3. Data parallel (Data-||) – little or no data dependence
  4. Task parallel (Task-||) – task dependence collapsible within phase-switch from Map to Reduce
  5. Unstructured data – No limit on requiring data to be structured
  6. Communication “light” – requires limited or no inter-process communication except what is required for phase-switch from Map to Reduce

OK, so I happen to agree with David’s conclusions. (see his post for the table) That isn’t the only reason I posted this note.

Rather I think this sort of careful analysis lends itself to test cases, which we can post and share with specification of the tasks performed.

Much cleaner and more enjoyable than the debates measured by who can sink the lowest fastest.

Test cases to suggest anyone?
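Here is one candidate to get the ball rolling: about the smallest task that fits David’s profile (two-phased, data-parallel, communication-light) is a word count. A sketch in plain Python so the phase switch from map to reduce is visible without any Hadoop machinery:

from collections import defaultdict

documents = ["hadoop makes counting easy", "counting words is data parallel"]

# Map phase: each document is processed independently (data parallel).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key, the only communication required.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key is reduced independently (task parallel).
counts = {word: sum(values) for word, values in groups.items()}
print(counts)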

October 16, 2011

Hadoop User Group UK: Data Integration

Filed under: Data Integration,Flume,Hadoop,MapReduce,Pig,Talend — Patrick Durusau @ 4:12 pm

Hadoop User Group UK: Data Integration

Three presentations captured as podcasts from the Hadoop User Group UK:

LEVERAGING UNSTRUCTURED DATA STORED IN HADOOP

FLUME FOR DATA LOADING INTO HDFS / HIVE (SONGKICK)

LEVERAGING MAPREDUCE WITH TALEND: HADOOP, HIVE, PIG, AND TALEND FILESCALE

Fresh as of 13 October 2011.

Thanks to Skills Matter for making the podcasts available!

October 7, 2011

LDIF – Linked Data Integration Framework Version 0.3.

Filed under: Data Integration,Linked Data,LOD — Patrick Durusau @ 6:17 pm

LDIF – Linked Data Integration Framework Version 0.3

From the email announcement:

The LDIF – Linked Data Integration Framework can be used within Linked Data applications to translate heterogeneous data from the Web of Linked Data into a clean local target representation while keeping track of data provenance. LDIF provides an expressive mapping language for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary. LDIF includes an identity resolution component which discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics. For provenance tracking, the LDIF framework employs the Named Graphs data model.

Compared to the previous release 0.2, the new LDIF release provides:

  • data access modules for gathering data from the Web via file download, crawling and accessing SPARQL endpoints. Web data is cached locally for further processing.
  • a scheduler for launching data import and integration jobs as well as for regularly updating the local cache with data from remote sources.
  • a second use case that shows how LDIF is used to gather and integrate data from several music-related Web data sources.

More information about LDIF, concrete usage examples and performance details are available at http://www4.wiwiss.fu-berlin.de/bizer/ldif/

Over the next months, we plan to extend LDIF along the following lines:

  1. Implement a Hadoop Version of the Runtime Environment in order to be able to scale to really large amounts of input data. Processes and data will be distributed over a cluster of machines.
  2. Add a Data Quality Evaluation and Data Fusion Module which allows Web data to be filtered according to different data quality assessment policies and provides for fusing Web data according to different conflict resolution methods.

Uses SILK (SILK – Link Discovery Framework Version 2.5) identity resolution semantics.
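The identity resolution step comes down to rewriting URI aliases to a single target URI before integration. A bare-bones sketch of that rewrite (my illustration of the idea, not LDIF’s actual machinery; all URIs are placeholders):

# Aliases discovered by matching heuristics, mapped to the chosen target URI.
CANONICAL = {
    "http://source-a.example/artist/42": "http://target.example/artist/radiohead",
    "http://source-b.example/band/rh":   "http://target.example/artist/radiohead",
}

def resolve(uri):
    return CANONICAL.get(uri, uri)

triples = [
    ("http://source-a.example/artist/42", "name", "Radiohead"),
    ("http://source-b.example/band/rh", "founded", "1985"),
]

# After the rewrite, both facts hang off one target URI.
print([(resolve(s), p, o) for s, p, o in triples])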

September 30, 2011

Four Levels of Data Integration (Charteris White Paper)

Filed under: Data Integration — Patrick Durusau @ 7:06 pm

Four Levels of Data Integration (Charteris White Paper)

From the post:

Application Integration is the biggest cost driver of corporate IT. While it has been popular to emphasise the business process integration aspects of EAI, it remains true that data integration is a huge part of the problem, responsible for much of the cost of EAI. You cannot begin to do process integration without some data integration.

Data integration is an N-squared problem. If you have N different systems or sources of data to integrate, you may need to build as many as N(N-1) different data exchange interfaces between them – near enough to N². For large companies, where N may run into the hundreds, and N² may be more than 100,000, this looks an impossible problem.

In practice, the figures are not quite that huge. In our experience, a typical system may interface to between 5 and 30 other systems – so the total number of interfaces is between 5N and 30N. Even this makes a prohibitive number of data interfaces to build and maintain. Many IT managers quietly admit that they just cannot maintain the necessary number of data interfaces, because the cost would be prohibitive. Then business users are forced to live with un-integrated, inconsistent data and fragmented processes, at great cost to the business.

The bad news is that N just got bigger. New commercial imperatives, the rise of e-commerce, XML and web services require companies of all sizes to integrate data and processes with their business partners’ data and processes. If you make an unsolved problem bigger, it generally remains unsolved.

I was searching for N-squared references when I encountered this paper. You can see what I think is the topic map answer to the N-squared problem at: Semantic Integration: N-Squared to N+1 (and decentralized).

Semantic Integration: N-Squared to N+1 (and decentralized)

Filed under: Data Integration,Mapping,Marketing,Semantics,TMDM,Topic Maps — Patrick Durusau @ 7:02 pm

Data Integration: The Relational Logic Approach pays homage to what is called the N-squared problem. The premise of N-squared for data integration is that every distinct identification must be mapped to every other distinct identification. Here is a graphic of the N-squared problem.

Two usual responses, depending upon the proposed solution.

First, get thee to a master schema (probably the most common). That is, map every distinct data source to a common schema and have all clients interact with that one schema. Case closed. Except data sources come and go, as do clients, so there is maintenance overhead. And agreeing on updates to the master schema can take time.

Second, no system integrates every other possible source of data, so the fear of N-squared is greatly exaggerated. Not unlike the sudden rush for “big data” solutions whether the client has “big data” or not. Who would want to admit to having “medium” or even “small” data?

The third response is that of topic maps. The assumption that every identification must map to every other identification means things get ugly in a hurry. But topic maps question that very premise of the N-squared problem: that every identification must map to every other identification.

Here is an illustration of how five separate topic maps, with five different identifications of a popular comic book character (Superman), can be combined and yet avoid the N-Squared problem. In fact, topic maps offer an N+1 solution to the problem.

Each snippet, written in Compact Topic Map (CTM) syntax represents a separate topic map.


en-superman
http://en.wikipedia.org/wiki/Super_man ;
- "Superman" ;
- altname: "Clark Kent" .

***


de-superman
http://de.wikipedia.org/wiki/Superman ;
- "Superman" ;
- birthname: "Kal-El" .

***


fr-superman
http://fr.wikipedia.org/wiki/Superman ;
- "Superman" ;
birthplace: "Krypton" .

***


it-superman
http://it.wikipedia.org/wiki/Superman ;
- "Superman" ;
- altname: "Man of Steel" .

***


eo-superman
http://eo.wikipedia.org/wiki/Superman ;
- "Superman" ;
- altname: "Clark Joseph Kent" .

Copied into a common file, superman-N-squared.ctm, nothing happens. That’s because they all have different subject identifiers. What if I add the following topic to the file/topic map:


superman
http://en.wikipedia.org/wiki/Super_man ;
http://de.wikipedia.org/wiki/Superman ;
http://fr.wikipedia.org/wiki/Superman ;
http://it.wikipedia.org/wiki/Superman ;
http://eo.wikipedia.org/wiki/Superman .

Results in the file, superman-N-squared-solution.ctm.

Ooooh.

Or an author need only know one other identifier. So long as any group of authors uses at least one common identifier between any two maps, their separate topic maps merge. (Ordering of the merges may be an issue.)

Another way to say that is that the trigger for merging of identifications is decentralized.

Which gives you a lot more eyes on the data, potential subjects and relationships between subjects.
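The merging rule is mechanical enough to sketch: treat each topic as a set of subject identifiers and merge any two topics that share one. A small sketch of the Superman example (an illustration of the merge rule, not a CTM processor):

def merge_topics(topics):
    """Merge topics transitively whenever they share a subject identifier."""
    merged = []
    for identifiers in topics:
        identifiers = set(identifiers)
        keep = []
        for group in merged:
            if group & identifiers:
                identifiers |= group    # shared identifier: groups collapse
            else:
                keep.append(group)
        merged = keep + [identifiers]
    return merged

five_maps = [
    {"http://en.wikipedia.org/wiki/Super_man"},
    {"http://de.wikipedia.org/wiki/Superman"},
    {"http://fr.wikipedia.org/wiki/Superman"},
    {"http://it.wikipedia.org/wiki/Superman"},
    {"http://eo.wikipedia.org/wiki/Superman"},
]

print(len(merge_topics(five_maps)))            # 5 -- nothing merges
hub = set().union(*five_maps)                  # the one extra (+1) topic
print(len(merge_topics(five_maps + [hub])))    # 1 -- all five collapse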

PS: Did you know that the English and German versions give Superman’s cover name as “Clark Kent,” while the French, Italian and Esperanto versions give his cover name as “Clark Joseph Kent”?

PPS: The files are both here, Superman-Semantics-01.zip.

September 29, 2011

Data Integration: The Relational Logic Approach

Filed under: Data Integration,Logic — Patrick Durusau @ 6:34 pm

Data Integration: The Relational Logic Approach by Michael Genesereth of Stanford University.

Abstract:

Data integration is a critical problem in our increasingly interconnected but inevitably heterogeneous world. There are numerous data sources available in organizational databases and on public information systems like the World Wide Web. Not surprisingly, the sources often use different vocabularies and different data structures, being created, as they are, by different people, at different times, for different purposes.

The goal of data integration is to provide programmatic and human users with integrated access to multiple, heterogeneous data sources, giving each user the illusion of a single, homogeneous database designed for his or her specific need. The good news is that, in many cases, the data integration process can be automated.

This book is an introduction to the problem of data integration and a rigorous account of one of the leading approaches to solving this problem, viz., the relational logic approach. Relational logic provides a theoretical framework for discussing data integration. Moreover, in many important cases, it provides algorithms for solving the problem in a computationally practical way. In many respects, relational logic does for data integration what relational algebra did for database theory several decades ago. A companion web site provides interactive demonstrations of the algorithms.

Interactive edition with working examples: http://logic.stanford.edu/dataintegration/. (As near as I can tell, the entire text. Although referred to as the “companion” website.)

When the author said Datalog, I thought of Lars Marius:

In our examples here and throughout the book, we encode relationships between and among schemas as rules in a language called Datalog. In many cases, the rules are expressed in a simple version of Datalog called Basic Datalog; in other cases, rules are written in more elaborate versions, viz., Functional Datalog and Disjunctive Datalog. In the following paragraphs, we look at Basic Datalog first, then Functional Datalog, and finally Disjunctive Datalog. The presentation here is casual; formal details are given in Chapter 2.

Bottom line is that the author advocates a master schema approach, but you should read the book for yourself. It makes a number of good points about data integration issues and the limitations of various techniques. Plus you may learn some Datalog along the way!
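For readers who have not met Datalog, the flavour of a mapping rule is easy to show: each rule says how tuples in a source relation populate a target relation, e.g. person(N, E) :- customer(N, E, _). A toy evaluation in Python (invented relations, and nothing like the book’s formal treatment):

# Two sources with different shapes for the same kind of fact.
src_customer = [("alice", "alice@example.org", "gold")]      # (name, email, tier)
src_client   = [("bob", "+1-555-0100", "bob@example.org")]   # (name, phone, email)

# Each rule pairs a source relation with the projection its head requires:
#   person(N, E) :- customer(N, E, _).
#   person(N, E) :- client(N, _, E).
rules = [
    (src_customer, lambda n, e, _tier: (n, e)),
    (src_client,   lambda n, _phone, e: (n, e)),
]

# The target relation is the union of every rule applied to its source.
person = {rule(*row) for source, rule in rules for row in source}
print(person)   # {('alice', 'alice@example.org'), ('bob', 'bob@example.org')}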

September 26, 2011

VOGCLUSTERS: an example of DAME web application

Filed under: Astroinformatics,Data Integration,Marketing — Patrick Durusau @ 6:59 pm

VOGCLUSTERS: an example of DAME web application by Marco Castellani, Massimo Brescia, Ettore Mancini, Luca Pellecchia, and Giuseppe Longo.

Abstract:

We present the alpha release of the VOGCLUSTERS web application, specialized for data and text mining on globular clusters. It is one of the web2.0 technology based services of Data Mining & Exploration (DAME) Program, devoted to mine and explore heterogeneous information related to globular clusters data.

VOGCLUSTERS (The alpha website.)

From the webpage:

This page is the entry point to the VOGCLUSTERS Web Application (alpha release) specialized for data and text mining on globular clusters. It is a toolset of DAME Program to manage and explore GC data in various formats.

In this page the users can obtain news, documentation and technical support about the web application.

The goal of the project VOGCLUSTERS is the design and development of a web application specialized in the data and text mining activities for astronomical archives related to globular clusters. Main services are employed for the simple and quick navigation in the archives (uniformed under VO standards and constraints) and their manipulation to correlate and integrate internal scientific information. The project has not to be intended as a straightforward website for the globular clusters, but as a web application. A website usually refers to the front-end interface through which the public interact with your information online. Websites are typically informational in nature with a limited amount of advanced functionality. Simple websites consist primarily of static content where the data displayed is the same for every visitor and content changes are infrequent. More advanced websites may have management and interactive content. A web application, or equivalently Rich Internet Application (RIA) usually includes a website component but features additional advanced functionality to replace or enhance existing processes. The interface design objective behind a web application is to simulate the intuitive, immediate interaction a user experiences with a desktop application.

Note the use of DAME as a foundation to “…manage and explore GC data in various formats.”

Just in case you are unaware, astronomy/radio astronomy, along with High Energy Physics (HEP), were the original big data.

If you have an interest in astronomy, this would be a good project to follow and perhaps to suggest topic map techniques.

Effective marketing of topic maps requires more than writing papers and hoping that someone reads them. Invest your time and effort into a project, then suggest (appropriately) the use of topic maps. You and your proposal will have more credibility that way.

September 19, 2011

Future Integration Needs: Embracing Complex Data

Filed under: Data Integration — Patrick Durusau @ 7:53 pm

Future Integration Needs: Embracing Complex Data is a report from the Aberdeen Group that I found posted at the Informatica website.

It is the sort of white paper that you can leave with executives so they can evaluate the costs of not integrating their data streams.

Two points I offer for your amusement:

First, data integration isn’t a new topic nor did someone wake up last week and realize that data integration could lead to all the benefits that are extolled in this white paper. I suspect the advantages of integrated data systems have been touted to businesses for as long as data systems, manual or otherwise, have existed.

The question the white paper does not answer (or even raise) is why do data integration issues persist? Just in the digital age, decades have been spent pointing the problem out and proposing solutions. A white paper that answered that question might help find solutions.

As it is, the white paper says “if you had a solution to this problem, for which we don’t know the cause, you would be better off.” No doubt, but not very comforting.

BTW, in case you didn’t notice, the “n = 122” you keep seeing in the article means the sweeping claims are made on the basis of 122 respondents to a survey. It doesn’t say if it was one of those call-you-during-dinner phone surveys or not.

The second point to notice is that the conclusion of the paper is that you need a single product to use for data integration. Gee, I wonder where you would find software like that! 😉

I am sure the Informatica software is quite capable but my concern remains one of how do we transition from one software/format to another? Legacy formats and even code have proven to be more persistent than anyone imagined. Software/formats don’t so much migrate as expand to fill the increasing amount of digital data.

Now that would be an interesting metric to ask the “digital universe is expanding” crowd. How many formats are coming online to represent the expanding amount of data? And where are we going to get the maps to move from one to another?

September 12, 2011

Apache Camel

Filed under: Data Analysis,Data Engine,Data Integration — Patrick Durusau @ 8:25 pm

Apache Camel

New release as of 25 July 2011.

The Apache Camel site self describes as:

Apache Camel is a powerful open source integration framework based on known Enterprise Integration Patterns with powerful Bean Integration.

Camel lets you create the Enterprise Integration Patterns to implement routing and mediation rules in either a Java based Domain Specific Language (or Fluent API), via Spring based Xml Configuration files or via the Scala DSL. This means you get smart completion of routing rules in your IDE whether in your Java, Scala or XML editor.

Apache Camel uses URIs so that it can easily work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF Bus API together with working with pluggable Data Format options. Apache Camel is a small library which has minimal dependencies for easy embedding in any Java application. Apache Camel lets you work with the same API regardless of which kind of Transport is used, so learn the API once and you will be able to interact with all the Components that are provided out-of-the-box.

Apache Camel has powerful Bean Binding and integrates seamlessly with popular frameworks such as Spring and Guice.

Apache Camel has extensive Testing support allowing you to easily unit test your routes.


….

So don’t get the hump, try Camel today! 🙂

Comments/suggestions?

I am going to be working through some of the tutorials and other documentation. Anything I should be looking for?

September 10, 2011

GTD – Global Terrorism Database

Filed under: Authoring Topic Maps,Data,Data Integration,Data Mining,Dataset — Patrick Durusau @ 6:08 pm

GTD – Global Terrorism Database

From the homepage:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2010 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 98,000 cases.

While chasing down a paper that didn’t make the cut I ran across this data source.

Lacking an agreed upon definition of terrorism (see Chomsky for example), you may or may not find what you consider to be incidents of terrorism in this dataset.

Nevertheless, it is a dataset of events of popular interest and can be used to attract funding for your data integration project using topic maps.

September 6, 2011

Improving Entity Resolution with Global Constraints

Filed under: Data Integration,Data Mining,Entity Resolution — Patrick Durusau @ 7:00 pm

Improving Entity Resolution with Global Constraints by Jim Gemmell, Benjamin I. P. Rubinstein, and Ashok K. Chandra.

Abstract:

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.

Relies on entity resolution that has been performed in another context. I rather like that, as opposed to starting at ground zero.

I was amused that “adult titles” were excluded from the data set. I don’t have the numbers right offhand but “adult titles” account for a large percentage of movie income. Not unlike using stock market data but excluding all finance industry stocks. Seems incomplete.
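The “no gratuitous duplicates” property is what turns resolution into one-to-one matching rather than independent pairwise scoring. A small sketch of max-weight matching between two invented title lists, with string similarity standing in for the paper’s learned scores:

from difflib import SequenceMatcher
import numpy as np
from scipy.optimize import linear_sum_assignment

site_a = ["Superman Returns (2006)", "Man of Steel", "Superman II"]
site_b = ["Man of Steel (2013)", "Superman II (1980)", "Superman Returns"]

def score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

similarity = np.array([[score(a, b) for b in site_b] for a in site_a])

# Each site avoids duplicates, so enforce a one-to-one assignment that
# maximizes total similarity (negated because the solver minimizes cost).
rows, cols = linear_sum_assignment(-similarity)
for i, j in zip(rows, cols):
    print(site_a[i], "<->", site_b[j], round(similarity[i, j], 2))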

September 1, 2011

Spatio Temporal data Integration and Retrieval

Filed under: Conferences,Data Integration,Information Retrieval,Spatial Index — Patrick Durusau @ 6:06 pm

STIR 2012 : ICDE 2012 Workshop on Spatio Temporal data Integration and Retrieval

Dates:

When: Apr 1, 2012 – Apr 1, 2012
Where: Washington DC, USA
Submission Deadline: Oct 21, 2011

From the notice:

International Workshop on Spatio Temporal data Integration and Retrieval (STIR2012) in conjunction with ICDE 2012

April 1, 2012, Washington DC, USA

http://research.ihost.com/stir12/index.html

As the world’s population increases and it puts increasing demands on the planet’s limited resources due to shifting life-styles, we not only need to monitor how we consume resources but also optimize resource usage. Some examples of the planet’s limited resources are water, energy, land, food and air. Today, significant challenges exist for reducing usage of these resources, while maintaining quality of life. The challenges range from understanding regionally varied impacts of global environmental change, through tracking diffusion of avian flu and responding to natural disasters, to adapting business practice to dynamically changing resources, markets and geopolitical situations. For these and many other challenges reference to location – and time – is the glue that connects disparate data sources. Furthermore, most of the systems and solutions that will be built to solve the above challenges are going to be heavily dependent on structured data (generated by sensors and sensor based applications) which will be streaming in real-time, come in large volumes and will have spatial and temporal aspects to them.

This workshop is focused on making the research in information integration and retrieval more relevant to the challenges in systems with significant spatial and temporal components.

Sounds like they are playing our song!

August 28, 2011

10 Weeks to Lean Integration

Filed under: Data Integration — Patrick Durusau @ 7:56 pm

10 Weeks to Lean Integration by John Schmidt.

From the post:

Lean Integration is a management system that emphasizes focusing on the customer, driving continuous improvements, and the elimination of waste in end-to-end data integration and application integration activities.

Lean practices are well-established in other disciplines such as manufacturing, supply-chain management, and software development to name just a few, but the application of Lean to the integration discipline is new.

Based on my research, no-one has tackled this topic directly in the form of a paper or book. But the world is a big place, so if some of you readers have come across prior works, please let me know. In the meantime, you heard it here first!

The complete list of posts:

Week 1: Introduction of Lean Integration (this posting)
Week 2: Eliminating waste
Week 3: Sustaining knowledge
Week 4: Planning for change
Week 5: Delivering fast
Week 6: Empowering the team
Week 7: Building in quality
Week 8: Optimizing the whole
Week 9: Deming’s 14 Points
Week 10: Practical Implementation Considerations

I don’t necessarily disagree with the notion of reducing variation in an enterprise. I do think integration solutions need to be flexible enough to adapt to variation encountered in the “wild” as it were.

I do appreciate John’s approach to integration, which treats it as more than a technical problem. Integration (like other projects) is an organizational issue as much as it is a technical one.

August 27, 2011

What is a Customer?

Filed under: Data Analysis,Data Integration — Patrick Durusau @ 9:11 pm

I ran across a series of posts where David Loshin explores the question: “What is a Customer?” or as he puts it in The Most Dangerous Question to Ask Data Professionals:

Q: What is the most dangerous question to ask data professionals?

A: “What is the definition of customer?”

And he includes some examples:

  • “customer” = person who gave us money for some of our stuff
  • “customer” = the person using our stuff
  • “customer” = the guy who approved the payment for our stuff
  • “customer account manager” = salesperson
  • “customer service” = complaints office
  • “customer representative” = gadfly

and explores the semantic confusion about how we use “customer.”

In Single Views Of The Customer, David explores the hazards and dangers of a single definition of customer.

When Is A Customer Not A Customer? starts to stray into more familiar territory when he says:

Here are the two pieces of our current puzzle: we have multiple meanings for the term “customer” but we want a single view of whatever that term means. To address this Zen-like conundrum we have to go beyond our context and think differently about the end, not the means. Here are two ideas to drill into: entity vs. role and semantic differentiation.

and after some interesting discussion (which you should read) he concludes:

What we can start to see is that a “customer” is not really a data type, nor is it really a customer. Rather, a “customer” is a role played by some entity (in this case, either an individual or an organization) within some functional context at different points of particular business processes. In the next post let’s decide how we can use this to rethink the single view of the customer.
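David’s “role played by an entity within a context” framing drops straight into a data model. A minimal sketch (my own names, not David’s):

from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    entity_id: str
    kind: str            # "individual" or "organization"

@dataclass(frozen=True)
class Role:
    entity: Entity       # who plays the role
    role_type: str       # "purchaser", "user", "approver", ...
    context: str         # the business process where the role applies

jane = Entity("E-1", "individual")
roles = [
    Role(jane, "purchaser", "order-to-cash"),
    Role(jane, "user", "product-support"),
]

# One entity, several "customers", depending on the process you ask about.
print({r.context: r.role_type for r in roles})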

I will be posting an update when the next post appears.

August 18, 2011

Building data startups: Fast, big, and focused

Filed under: Analytics,BigData,Data,Data Analysis,Data Integration — Patrick Durusau @ 6:54 pm

Building data startups: Fast, big, and focused (O’Reilly original)

Republished by Forbes as:
Data powers a new breed of startup

Based on the talk Building data startups: Fast, Big, and Focused

by Michael E. Driscoll

From the post:

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Describes the emerging big data stack and says:

The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed; offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lies the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

The future isn’t going to be about getting users to develop topic maps, but about your use of topic maps (and other tools) to create data products of interest to users.

Think of it as being the difference between selling oil change equipment versus being the local Jiffy Lube. (For non-U.S. residents: Jiffy Lube is a chain of oil change and other auto services, with some 2,000 locations in North America.) I dare say that Jiffy Lube and its competitors do more auto services than users of oil change equipment.

Integration Imperatives Around Complex Big Data

Filed under: BigData,Data as Service (DaaS),Data Integration,Marketing — Patrick Durusau @ 6:52 pm

Integration Imperatives Around Complex Big Data

  • Informatica Corporation (NASDAQ: INFA), the world’s number one independent provider of data integration software, today announced the availability of a new research report from the Aberdeen Group that shows how organizations can get the most from their data integration assets in the face of rapidly growing data volumes and increasing data complexity.
  • Entitled: Future Integration Needs: Embracing Complex Data, the Aberdeen report reveals that:
    • Big Data is the new reality – In 2010, organizations experienced a staggering average data volume growth of 40 percent.
    • XML adoption has increased dramatically – XML is the most common semi-structured data source that organizations integrate. 74 percent of organizations are integrating XML from external sources. 66 percent of organizations are integrating XML from internal sources.
    • Data complexity is skyrocketing – In the next 12 months enterprises plan to introduce more complex unstructured data sources – including office productivity documents, email, web content and social media data – than any other data type.
    • External data sources are proliferating – On average, organizations are integrating 14 external data sources, up from 11 a year ago.
    • Integration costs are rising – As integration of external data rises, it continues to be a labor- and cost-intensive task, with organizations integrating external sources spending 25 percent of their total integration budget in this area.
  • For example, according to Aberdeen, organizations that have effectively integrated complex data are able to:
    • Use up to 50 percent larger data sets for business intelligence and analytics.
    • Integrate twice as successfully external unstructured data into business processes (40 percent vs. 19 percent).
    • Deliver critical information in the required time window 2.5 times more often via automated data refresh.
    • Slash the incidence of errors in their data almost in half compared to organizations relying on manual intervention when performing data updates and refreshes.
    • Spend an average of 43 percent less on integration software (based on 2010 spend).
    • Develop integration competence more quickly with significantly lower services and support expenditures, resulting in less costly business results.

I like the 25% of data integration budgets being spent on integrating external data. Imagine making that easier for enterprises with a topic map based service.

Maybe “Data as service (DaaS)” will evolve from simply being data delivery to dynamic integration of data from multiple sources, where currency, reliability, composition, and other features of the data are on a sliding scale of value.
