Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

July 24, 2011

KNIME Version 2.4.0 released

Filed under: Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 6:45 pm

KNIME Version 2.4.0 released

From the release notice:

We have just released KNIME v2.4, a feature release with a lot of new functionality and some bug fixes. The highlights of this release are:

  • Enhancements around meta node handling (collapse/expand & custom dialogs)
  • Usability improvements (e.g. auto-layout, fast node insertion by double-click)
  • Polished loop execution (e.g. parallel loop execution available from labs)
  • Better PMML processing (added PMML preprocessing, which will also be presented at this year's KDD conference)
  • Many new nodes, including a whole suite of XML processing nodes, cross-tabulation and nodes for data preprocessing and data mining, including ensemble learning methods.

In case you aren’t familiar with KNIME, it is self-described as:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6,000 professionals all over the world, in both industry and academia.

What would you do the same/differently for a topic map interface?

July 21, 2011

ELN Integration: Avoiding the Spaghetti Bowl

Filed under: Data Integration,ELN Integration — Patrick Durusau @ 6:11 pm

ELN Integration: Avoiding the Spaghetti Bowl by Michael H. Elliott. (Scientific Computing, May 2011)

Michael writes:

…over 20 percent of the average scientist’s time is spent on non-value-added data aggregation, transcription, formatting and manual documentation. [p.19]

…in a recent survey of over 400 scientists, “integrating data from multiple systems” was cited as the number one laboratory data management challenge. [p. 19]

The multiple terminologies various groups use can also impact integration. For example, what a “lot” or “batch” means can vary by who you ask: the medicinal chemist, formulator, or biologics process development scientist. A common vocabulary can be one of the biggest stumbling blocks, as it involves gaining consensus, defining semantic relationships and/or data transformations. [p.21]

A good article that highlights the ongoing difficulty scientists face with ELN (Electronic Lab Notebook) solutions.

It was refreshing to hear someone mention organizational and operational issues being “…more difficult to address than writing code.”

Technical solutions cannot address personnel, organizational or semantic issues.

However tempting it may be to “wait and see,” the personnel, organizational and semantic issues you had before an integration solution will still be there after one is deployed. That’s a promise.
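To make the vocabulary problem concrete, here is a toy Python sketch of mapping group-specific terms such as “lot” and “batch” onto a shared subject. The groups, terms and subject names are my own invented examples, not anything from the article:

```python
# Hypothetical mapping of group-specific vocabulary to a shared subject.
# Groups, terms and subject names are invented for illustration.
VOCABULARY = {
    ("medicinal_chemistry", "lot"): "production-unit",
    ("formulation", "batch"): "production-unit",
    ("process_development", "lot"): "cell-culture-run",
}

def resolve(group: str, term: str) -> str:
    """Return the shared subject for a group's term, or the term unchanged."""
    return VOCABULARY.get((group, term), term)

print(resolve("medicinal_chemistry", "lot"))  # production-unit
print(resolve("process_development", "lot"))  # cell-culture-run
```

The lookup is trivial; as Michael points out, gaining consensus on the table’s contents is the hard part.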

June 20, 2011

MAD Skills: New Analysis Practices for Big Data

Filed under: Analytics,BigData,Data Integration,SQL — Patrick Durusau @ 3:33 pm

MAD Skills: New Analysis Practices for Big Data by Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton.

Abstract:

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

I found this passage very telling:

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

There is no point in bringing data … into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment [13]

Unfortunately, the challenge of perfectly integrating a new data source into an “architected” warehouse is often substantial, and can hold up access to data for months – or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Marketing question for topic maps: Do you want a shallow, incomplete data warehouse?

Admittedly there is more to it: topic maps enable the integration of both the data structures and the data itself. Both are subjects in the view of topic maps. Not to mention capturing the reasons why certain structures or data were mapped to other structures or data. I think the name for that is an audit trail.

Perhaps we should ask: Does your data integration methodology offer an audit trail?

(See MADLib for the source code growing out of this effort.)
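What might such an audit trail record? A minimal sketch; the field names and example values are my own assumptions, not anything from the paper:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MappingRecord:
    """One entry in a data-integration audit trail (illustrative fields)."""
    source: str    # e.g. "warehouse_a.customer.cust_id"
    target: str    # e.g. "warehouse_b.client.client_no"
    reason: str    # why the mapping was judged correct
    author: str
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

trail = [
    MappingRecord(
        source="warehouse_a.customer.cust_id",
        target="warehouse_b.client.client_no",
        reason="Both columns hold the same externally issued account number.",
        author="analyst@example.com",
    ),
]
for entry in trail:
    print(entry.source, "->", entry.target, ":", entry.reason)
```

However it is stored, the point is the same: the why of a mapping survives alongside the mapping itself.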

June 12, 2011

U.S. DoD Is Buying. Are You Selling?

Filed under: BigData,Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 4:14 pm

CTOVision.com reports: Big Data is Critical to the DoD Science and Technology Investment Agenda

Of the seven reported priorities:

(1) Data to Decisions – science and applications to reduce the cycle time and manpower requirements for analysis and use of large data sets.

(2) Engineered Resilient Systems – engineering concepts, science, and design tools to protect against malicious compromise of weapon systems and to develop agile manufacturing for trusted and assured defense systems.

(3) Cyber Science and Technology – science and technology for efficient, effective cyber capabilities across the spectrum of joint operations.

(4) Electronic Warfare / Electronic Protection – new concepts and technology to protect systems and extend capabilities across the electro-magnetic spectrum.

(5) Counter Weapons of Mass Destruction (WMD) – advances in DoD’s ability to locate, secure, monitor, tag, track, interdict, eliminate and attribute WMD weapons and materials.

(6) Autonomy – science and technology to achieve autonomous systems that reliably and safely accomplish complex tasks, in all environments.

(7) Human Systems – science and technology to enhance human-machine interfaces to increase productivity and effectiveness across a broad range of missions.

I don’t see one where topic maps would be out of place.

Do you?

May 7, 2011

Structuring data integration models and data integration architecture

Filed under: Data Integration — Patrick Durusau @ 5:51 pm

Structuring data integration models and data integration architecture

By Anthony David Giordano

From the post:

In this excerpt from Data Integration Blueprint and Modeling, readers will learn how to build a business case for a new data integration design process and how to improve the development process for data integration modeling. Readers will also get tips on leveraging process modeling for data integration and designing data integration architecture models, plus definitions for three data integration modeling types – physical, logical and conceptual.

Interesting enough that I bought a copy of the book.

Mostly to see where in data integration design it would make the most sense to pitch topic maps.

It may also offer clues about where topic maps would fit best in data integration tools.

If you are familiar with this book, please comment.

April 20, 2011

5 Reasons Why Product Data Integration is Like Chasing Roadrunners

Filed under: Data Integration,Marketing — Patrick Durusau @ 2:16 pm

5 Reasons Why Product Data Integration is Like Chasing Roadrunners

Abstract:

Integrating product data carries a tremendous value, but cleanly integrating that data across multiple applications, data stores, countries and businesses can be as elusive a goal as catching the famed Looney Tunes character.

So why do it?

As a report from the Automotive Aftermarket Industry Association pointed out, assuming $100 billion in transactions between suppliers and direct customers in the aftermarket each year, the shared savings potential tops $1.7 billion annually by eliminating product data errors in the supply chain. That’s just potential savings in one industry, in one year.

Note to self: The 1.7% savings on transaction errors requires a flexible and accurate mapping from one party’s information system to another. Something topic maps excel at.

You know what they say, a few $billion here, a few $billion there, and pretty soon you are talking about real money.
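For a toy illustration of the mapping involved, here is a sketch of normalizing part numbers before comparing records across two systems. The supplier records and field names are invented:

```python
# Illustrative only: reconcile hypothetical supplier part records that
# differ in field names, case and punctuation.
def normalize_part_number(raw: str) -> str:
    """Strip punctuation and case differences from a part number."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())

supplier_a = {"part_no": "AB-1234", "desc": "Oil filter"}
supplier_b = {"PartNumber": "ab1234", "description": "OIL FILTER"}

same_part = (normalize_part_number(supplier_a["part_no"])
             == normalize_part_number(supplier_b["PartNumber"]))
print(same_part)  # True
```

Multiply that kind of reconciliation across applications, data stores, countries and businesses and the roadrunner comparison starts to look fair.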

April 13, 2011

One Mashboard to Rule Them All

Filed under: BI,Data Integration,Mashups — Patrick Durusau @ 1:26 pm

One Mashboard to Rule Them All

From the announcement:

Webinar Overview: We’ll be showcasing real-world examples of Jaspersoft dashboards, adding to your already extensive technical knowledge. Dashboards, with their instant answers for executives and business users, and mashboards, ideal for integrating multiple data sources for improved organizational decision-making, are among the most frequently requested BI deliverables. Join us for everything you wanted to know about Jaspersoft Platforms.

April 20, 2011 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)
April 20, 2011 10:00 am, Pacific Daylight Time (San Francisco, GMT-07:00)
April 20, 2011 6:00 pm, Western European Summer Time (London, GMT+01:00)

There is an open source side to Jaspersoft, Jasperforge.org.

Stats from the JasperForge.org site:

206,224 members
163 today
1,707 last 7 days
6,643 last 30 days
13,176,296 downloads
255 public projects
182 private projects
85,193 forum entries

A community where I would like to pose the question: “How do you re-use a mashup created by someone else?”

And given that it has an open source side, a place to pose topic maps as an answer.

March 16, 2011

Data Integration: Moving Beyond ETL

Filed under: Data Governance,Data Integration,Marketing — Patrick Durusau @ 3:16 pm

Data Integration: Moving Beyond ETL

A sponsored white-paper by DataFlux, www.dataflux.com.

Where ETL = Extract Transform Load
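For anyone new to the acronym, a deliberately minimal pipeline in that shape might look like the sketch below; the CSV data and field names are invented:

```python
import csv
import io

# A deliberately minimal extract-transform-load pipeline over invented data.
RAW = "name,amount\nacme,100\nglobex,250\n"

def extract(text: str):
    """Extract: parse raw CSV text into row dictionaries."""
    return csv.DictReader(io.StringIO(text))

def transform(rows):
    """Transform: clean names and convert amounts to integers."""
    for row in rows:
        yield {"name": row["name"].title(), "amount": int(row["amount"])}

def load(rows, target: list):
    """Load: append the transformed rows to the target store."""
    target.extend(rows)

warehouse: list = []
load(transform(extract(RAW)), warehouse)
print(warehouse)
```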

Many of the arguments made in this paper fit quite easily with topic map solutions.

DataFlux appears to be selling data governance based solutions, although it appears to take an evolutionary approach to implementing such solutions.

It occurs to me that topic maps could be one stage in the documentation and evolution of data governance solutions.

High marks for a white paper that doesn’t claim IT salvation from a particular approach.

February 2, 2011

Data Governance, Data Architecture and Metadata Essentials – Webinar

Filed under: Data Governance,Data Integration,Marketing — Patrick Durusau @ 9:20 am

Data Governance, Data Architecture and Metadata Essentials

Date: February 24, 2011 Time: 9:00AM PT

Speaker: David Loshin

From the website:

The absence of data governance standards is a critical failure point for enterprise data repurposing. As the rate of data volume growth increases, you want to make sure you are employing the correct practices and standards to make the most of this volume of information. Data can be your company’s best or worst asset. Join David Loshin, industry expert on data governance, for this informative webcast.

I suppose it goes without saying that an absence of data governance means that a topic map effort to use outside data is going to be even more expensive. Or perhaps not.

People have been urging documentation of data practices since before the advent of the digital computer. That is still the starting point for any data governance.

What you don’t know about, you can’t govern. It’s just that simple. (Can’t merge it with outside data either. But if your internal systems are toast, topic maps aren’t going to save you.)

January 28, 2011

Next Generation Data Integration – Webinar

Filed under: Data Integration,Marketing — Patrick Durusau @ 9:41 am

Next Generation Data Integration

Date: April 12, 2011 Time: 9:00AM PT

Speaker: Philip Russom

From the website:

Data integration (DI) has undergone an impressive evolution in recent years. Today, DI is a rich set of powerful techniques, including ETL (extract, transform, and load), data federation, replication, synchronization, change data capture, natural language processing, business-to-business data exchange, and more. Furthermore, vendor products for DI have achieved maturity, users have grown their DI teams to epic proportions, competency centers regularly staff DI work, new best practices continue to arise (like collaborative DI and agile DI), and DI as a discipline has earned its autonomy from related practices like data warehousing and database administration.

Given these and the many other generational changes data integration has gone through recently, it’s natural that many people aren’t quite up-to-date with the full potential of modern data integration. Based on a recent TDWI Best Practices report, this webinar seeks to cure that malady by redefining data integration in modern terms, plus showing where it’s going with its next generation. This information will help user organizations make more enlightened decisions, as they upgrade, modernize, and expand existing data integration solutions, plus plan infrastructure for next generation data integration.

Every group (tribe as Jack Park would call them) has its own terminology when it comes to data and managing data.

As you can tell from the description of the webinar, data integration is concerned with many of the same issues as topic maps. Albeit under different names.

Regard this as an opportunity to visit another tribe and learn some new terminology.

And some new ideas you can use with topic maps.

January 7, 2011

Apache OODT – Top Level Project

Filed under: Data Integration,Data Mining,Data Models,OODT,Software — Patrick Durusau @ 6:02 am

Apache OODT is the first ASF Top Level Project status for NASA developed software.

From the website:

Just what is Apache™ OODT?

It’s metadata for middleware (and vice versa):

  • Transparent access to distributed resources
  • Data discovery and query optimization
  • Distributed processing and virtual archives

But it’s not just for science! It’s also a software architecture:

  • Models for information representation
  • Solutions to knowledge capture problems
  • Unification of technology, data, and metadata

Looks like a project that could benefit from having topic maps as part of its tool kit.

Check out the 0.1 OODT release and see what you think.

December 27, 2010

Data Management Slam Dunk – SPAM Warning

Filed under: Data Integration,Knowledge Management,Software — Patrick Durusau @ 2:19 pm

The Data Management Slam Dunk: A Unified Integration Platform is a spam message that landed in my inbox today.

I have heard good things about Talend software but gibberish like:

There will never be a silver bullet for marshalling the increasing volumes of data, but at least there is one irrefutable truth: a unified data management platform can solve most of the problems that information managers encounter. In fact, by creating a centralized repository for data definitions, lineage, transformations and movements, companies can avoid many troubles before they occur.

makes me wonder if any of it is true.

Did you notice that the irrefutable truth is a sort of magic incantation?

If everything is dumped in one place, troubles just melt away.

It isn’t that simple.

The “presentation” never gives a clue as to how anyone would achieve these benefits in practice. It just keeps repeating the benefits and oh, that Talend is the way to get them.

Not quite as annoying as one of those belly-buster infomercials but almost.

I have been planning on reviewing the Talend software from a topic map perspective.

Suggestions of issues, concerns or particularly elegant parts that I should be aware of are most welcome.

November 27, 2010

Successful Data Integration Projects Require A Diverse Approach

Filed under: Data Integration,Marketing,Topic Maps — Patrick Durusau @ 9:57 pm

Successful Data Integration Projects Require A Diverse Approach (may require registration)

Apologies but even though you may have to register (free), I thought this story was worth mentioning.

If only for the observation that ETL (extract, transform, load) is “…a lot like throwing a bomb when all that’s needed is a bullet.” I have a less generous explanation, but perhaps another time.

My point here is that data integration is a hot topic and topic maps can be part of the solution set.

No, I am not going to do one of those “…window of opportunity is closing…” routines because:

1) The MDM (master data management) folks have been trying to crack this nut since the 1980s.

2) The Semantic Web effort, with a decade of hard work, has managed to re-invent the vocabulary problem in URIs. (I still think we should send the W3C a fruit basket.)

3) Every solution is itself an opportunity for subject identity integration with other solutions. (It is a self-perpetuating business opportunity. Next to having an addictive product, the best kind.)

Making topic maps relevant to data integration is going to require that we move away from the file format = topic maps approach.

Customers should understand that topic maps put them in charge of managing their data, with their identifications. (With the potential to benefit from other identifications of the same subjects.)

That is the real diversity in data integration.

October 17, 2010

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

It is being organized by Phillip C-Y Sheu (UC Irvine), psheu@uci.edu, Phone: +1 949 824 2660. Volunteers are needed for both the organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

October 4, 2010

A multimodal dialogue mashup for medical image semantics

Filed under: Data Integration,Interface Research/Design — Patrick Durusau @ 4:26 am

A multimodal dialogue mashup for medical image semantics

Authors: Daniel Sonntag and Manuel Möller
Keywords: collaborative environments, design, touchscreen interface

Abstract:

This paper presents a multimodal dialogue mashup where different users are involved in the use of different user interfaces for the annotation and retrieval of medical images. Our solution is a mashup that integrates a multimodal interface for speech-based annotation of medical images and dialogue-based image retrieval with a semantic image annotation tool for manual annotations on a desktop computer. A remote RDF repository connects the annotation and querying task into a common framework and serves as the semantic backend system for the advanced multimodal dialogue a radiologist can use.

With regard to the semantics of the interface the authors say:

In a complex interaction system, a common ground of terms and structures is absolutely necessary. A shared representation and a common knowledge base ease the dataflow within the system and avoid costly and error-prone transformation processes.

I disagree with both statements, but concede that for particular use cases the dataflow cost question will be resolved differently.

I like the article as an example of interface design.

September 19, 2010

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Destined to be a deeply influential resource.

Read the paper, use the Chem2Bio2RDF application for a week, then answer these questions:

  1. Choose three (3) subjects that are identified in this framework.
  2. For each subject, how is it identified in this framework?
  3. For each subject, have you seen it in another framework or system?
  4. For each subject seen in another framework/system, how was it identified there?

Extra credit: What one thing would you change about any of the identifications in this system? Why?

September 9, 2010

High-Performance Dynamic Pattern Matching over Disordered Streams

Filed under: Data Integration,Data Mining,Pattern Recognition,Subject Identity,Topic Maps — Patrick Durusau @ 4:12 pm

High-Performance Dynamic Pattern Matching over Disordered Streams by Badrish Chandramouli, Jonathan Goldstein, and David Maier came to me by way of Jack Park.

From the abstract:

Current pattern-detection proposals for streaming data recognize the need to move beyond a simple regular-expression model over strictly ordered input. We continue in this direction, relaxing restrictions present in some models, removing the requirement for ordered input, and permitting stream revisions (modification of prior events). Further, recognizing that patterns of interest in modern applications may change frequently over the lifetime of a query, we support updating of a pattern specification without blocking input or restarting the operator.

In case you missed it, this is related to: Experience in Extending Query Engine for Continuous Analytics.

The algorithmic trading use case in this article made me think of Nikita Ogievetsky. For those of you who do not know Nikita, he is an XSLT/topic map maven, currently working in the finance industry.

Do trading interfaces allow user definition of subjects to be identified in data streams? And/or merged with subjects identified in other data streams? Or is that an upgrade from the basic service?
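As a toy simplification of the disordered-stream setting (my own, not the authors’ operator), the sketch below buffers out-of-order events in a heap and runs a pattern match only over events that an assumed watermark has sealed off:

```python
import heapq

# Buffer out-of-order events by timestamp; emit in order once a watermark
# guarantees no earlier event can still arrive. Events/pattern are invented.
events = [(3, "B"), (1, "A"), (2, "A"), (5, "B"), (4, "A")]
watermark = 5  # assumption: no event with timestamp <= 5 is still in flight

heap = []
for ts, symbol in events:
    heapq.heappush(heap, (ts, symbol))

ordered = []
while heap and heap[0][0] <= watermark:
    ordered.append(heapq.heappop(heap))

# Match the pattern "A immediately followed by B" in timestamp order.
matches = [(a, b) for a, b in zip(ordered, ordered[1:])
           if a[1] == "A" and b[1] == "B"]
print(matches)  # [((2, 'A'), (3, 'B')), ((4, 'A'), (5, 'B'))]
```

The paper goes much further, handling revisions of prior events and pattern updates without restarting the operator.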

September 5, 2010

Experience in Extending Query Engine for Continuous Analytics

Filed under: Data Integration,Data Mining,SQL,TMQL,Uncategorized — Patrick Durusau @ 4:37 pm

Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:

Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….

Moving from load-first analyze-later has implications for topic maps over data warehouses. Particularly when events that are subjects may only have a transient existence in a data stream.
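A toy contrast with load-first analyze-later, using an invented event stream: the aggregate is maintained as events arrive, so even transient subjects leave a trace without ever being loaded into a warehouse:

```python
from collections import Counter

# Maintain a running aggregate as events arrive, rather than loading
# first and analyzing later. The event stream is invented.
def stream():
    for event in ["click", "view", "click", "purchase", "click"]:
        yield event

counts = Counter()
for event in stream():
    counts[event] += 1   # analysis happens on arrival, not after a load
    print(dict(counts))  # the result is continuously available
```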

This is on my reading list to prepare to discuss TMQL in Leipzig.

PS: Only five days left to register for TMRA 2010. It is a don’t-miss event.

“Linguistic terms do not hold exact meaning….”

Filed under: Data Integration,Fuzzy Sets,Information Retrieval,Subject Identity — Patrick Durusau @ 10:36 am

In some background research I ran across:

One of the most important applications of fuzzy set theory is the concept of linguistic variables. A linguistic variable is a variable whose values are not numbers, but words or sentences in a natural or artificial language. The value of a linguistic variable is defined as an element of its term set: a predefined set of appropriate linguistic terms. Linguistic terms are essentially subjective categories for a linguistic variable.

Linguistic terms do not hold exact meaning, however, and may be understood differently by different people. The boundaries of a given term are rather subjective, and may also depend on the situation. Linguistic terms therefore cannot be expressed by ordinary set theory; rather, each linguistic term is associated with a fuzzy set. (“Soft sets and soft groups,” by Hacı Aktaş and Naim Çağman, Information Sciences, Volume 177, Issue 13, 1 July 2007, Pages 2726-2735)

Fuzzy sets are yet another useful approach that has recognized linguistic uncertainty as an issue and developed mechanisms to address it.

What is “linguistic uncertainty” if it isn’t a question of “subject identity?”

Fuzzy sets have developed another way to answer questions about subject identity.

As topic maps mature I want to see the development of equivalences between approaches to subject identity.

Imagine a topic map system consisting of a medical scanning system that identifies “subjects” in cultures using rough sets, with equivalences to “subjects” identified in published literature using fuzzy sets, refined by “subjects” from user contributions and interactions using PSIs or other mechanisms (past, present or future).
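For readers new to fuzzy sets, a minimal sketch of a linguistic term modeled as a membership function; the term and its breakpoints are invented for illustration:

```python
# A linguistic term ("tall") as a fuzzy set: membership is a degree in
# [0, 1], not a yes/no. The breakpoints below are invented.
def tall(height_cm: float) -> float:
    """Degree of membership in the fuzzy set 'tall'."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30  # linear ramp between breakpoints

for h in (155, 175, 195):
    print(h, round(tall(h), 2))  # 0.0, 0.5, 1.0
```

Two people (or two systems) can disagree about the breakpoints, which is exactly the subject identity question in another vocabulary.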

September 4, 2010

Master Data Management – Successes?

Filed under: Data Integration,Semantics — Patrick Durusau @ 8:23 pm

With email and articles on master data management running neck and neck with Nigerian widows entrusting me with millions of US dollars on deposit in some third country, I decided to take another (brief) look:

While MDM vendors will probably tell you that the high success rate is due to their superior technology, Baseline’s Jill Dyche, who analyzed the survey results, has come to a different conclusion.

Most current MDM projects have focused on just “low-hanging fruit,” Dyche said. They often tackle jobs like reconciling names and addresses, leaving the more challenging work — sorting out product specifications and other data from numerous internal and external sources, for example — for phase 2 and beyond. (As MDM project deployments grow more complex, ‘drama’ could follow, by Jeff Kelly, News Editor)

Drama? Well, HR versus Accounting, or Sales versus Production over the unified master record sounds like drama to me. Not to mention the entrenched interests in particular systems.

Topic maps: managing diverse semantics with less drama. What’s not to like?

August 29, 2010

Journal of Artificial Intelligence Research – Journal

Filed under: Data Integration,Merging,Subject Identity — Patrick Durusau @ 7:23 pm

Journal of Artificial Intelligence Research is one of the oldest electronic journals on the Internet, not to mention that it offers free access to all its contents.

While some of the articles have titles like “The Strategy-Proofness Landscape of Merging”, P. Everaere, S. Konieczny and P. Marquis (2007), Volume 28, pages 49-105, they raise issues that sophisticated topic mappers will need to be able to discuss intelligently with data analysts.

Information Fusion – Journal

Filed under: Data Integration,Merging,Subject Identity — Patrick Durusau @ 6:59 pm

Information Fusion covers a number of areas of direct interest to topic map researchers and developers. An incomplete list includes:

  • Fusion Learning In Imperfect, Imprecise And Incomplete Environments
  • Intelligent Techniques For Fusion Processing
  • Fusion System Design And Algorithmic Issues
  • Fusion System Computational Resources and Demands Optimization
  • Special Purpose Hardware Dedicated To Fusion Applications

If you are considering this as a publication venue, look into their “open access” policy (the quotes are theirs) before making that choice.

August 22, 2010

Domain Bridging Associations Support Creativity

Filed under: Data Integration,Heterogeneous Data,Mapping,Semantics — Patrick Durusau @ 10:21 am

Domain Bridging Associations Support Creativity by Tobias Kötter, Kilian Thiel, and Michael R. Berthold, offers the following abstract:

This paper proposes a new approach to support creativity through assisting the discovery of unexpected associations across different domains. This is achieved by integrating information from heterogeneous domains into a single network, enabling the interactive discovery of links across the corresponding information resources. We discuss three different patterns of domain crossing associations in this context.

Does that sound familiar to anyone?

Part of the continuing irony that semantic integration research suffers from a lack of semantic integration.

I am just at the tip of this particular iceberg of research so please chime in with pointers to conferences, proceedings, articles, books, etc.

See the Universität Konstanz, Nycomed Chair for Bioinformatics and Data Mining, Publications page, where I found this paper and a number of other resources.

July 28, 2010

Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources (1997)

Filed under: Data Integration,Database,Semantic Diversity,Software — Patrick Durusau @ 7:54 pm

Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources (1997) by Mary Tork Roth isn’t the latest word on wrappers but is well written. (Longer version: A Wrapper Architecture for Legacy Data Sources (1997).)

The wrapper idea is a good one, although Roth uses it in the context of a unified schema, which is then queried. With a topic map, you could query on the basis of any of the underlying schemas and get the data from all the underlying data sources.

That result is possible because a topic map has one representative for a subject and can have any number of sources for information about that single subject.

I haven’t done a user survey but suspect most users would prefer to search for/access data using familiar schemas rather than new “unified” schemas.
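A toy sketch of that idea, with invented schemas and field names: a query posed in one group’s familiar schema is resolved through a field-to-subject mapping and answered from every source that carries the same subject:

```python
# Invented field-to-subject mapping across two hypothetical schemas.
FIELD_MAP = {
    "hr.employee_name": "person-name",
    "sales.rep": "person-name",
}

SOURCES = [
    {"schema": "hr", "rows": [{"employee_name": "Ada"}]},
    {"schema": "sales", "rows": [{"rep": "Ada"}, {"rep": "Grace"}]},
]

def query(field: str):
    """Answer a query written against one schema using all sources."""
    subject = FIELD_MAP[field]
    for source in SOURCES:
        for local, subj in FIELD_MAP.items():
            schema, column = local.split(".")
            if subj == subject and source["schema"] == schema:
                for row in source["rows"]:
                    yield row[column]

print(list(query("hr.employee_name")))  # ['Ada', 'Ada', 'Grace']
print(list(query("sales.rep")))         # same answer, different schema
```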

July 13, 2010

The FLAMINGO Project on Data Cleaning – Site

The FLAMINGO Project on Data Cleaning is the other project that has influenced the set-similarity work with MapReduce.

From the project description:

Supporting fuzzy queries is becoming increasingly more important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure to support online query processing. Fuzzy join queries of data sets are more time consuming due to the computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.

See the project webpage to learn more about their work on “us[ing] limited programming primitives in the cloud to implement index structures and search algorithms.”

The relationship between “dirty” data and the increase in data overall is at least linear, but probably worse. Far worse. Whether data is “dirty” depends on your perspective. The more data that appears in “***” format (fill in the one you like the least), the dirtier the universe of data has become. “Dirty” data will be with you always.
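To give a single-machine flavor of the gram-based inverted lists FLAMINGO builds at scale with Hadoop, here is a toy sketch; the words and threshold are invented, and real systems use principled thresholds and edit-distance filters:

```python
from collections import defaultdict

# Toy gram-based inverted index for fuzzy lookup (single machine only).
def grams(s: str, n: int = 2):
    """The set of n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

words = ["integration", "integrating", "interrogation"]
index = defaultdict(set)
for w in words:
    for g in grams(w):
        index[g].add(w)

def fuzzy_candidates(query: str, min_shared: int = 6):
    """Indexed words sharing at least min_shared bigrams with the query."""
    hits = defaultdict(int)
    for g in grams(query):
        for w in index[g]:
            hits[w] += 1
    return sorted(w for w, c in hits.items() if c >= min_shared)

print(fuzzy_candidates("integraton"))  # the typo still finds close words
```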

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis – SITE

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis is one of the projects behind the set-similarity and MapReduce posting.

From the project page:

The ASTERIX project is developing new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to vast quantities of semi-structured information. The project is combining ideas from three distinct areas – semi-structured data, parallel databases, and data-intensive computing – to create a next-generation, open source software platform that scales by running on large, shared-nothing computing clusters.

Home of Hyrax: Demonstrating a New Foundation for Data-Parallel Computation, with “out-of-the-box support for common distributed communication patterns and set-oriented data operators.” (Need I say more?)

July 11, 2010

Efficient Parallel Set-Similarity Joins Using MapReduce

Efficient Parallel Set-Similarity Joins Using MapReduce by Rares Vernica, Michael J. Carey, and Chen Li, Department of Computer Science, University of California, Irvine, used Citeseer (1.3M publications) and DBLP (1.2M publications) and “…increased their sizes as needed.”

The contributions of this paper are:

  • “We describe efficient ways to partition a large dataset across nodes in order to balance the workload and minimize the need for replication. Compared to the equi-join case, the set-similarity joins case requires “partitioning” the data based on set contents.
  • We describe efficient solutions that exploit the MapReduce framework. We show how to efficiently deal with problems such as partitioning, replication, and multiple inputs by manipulating the keys used to route the data in the framework.
  • We present methods for controlling the amount of data kept in memory during a join by exploiting the properties of the data that needs to be joined.
  • We provide algorithms for answering set-similarity self-join queries end-to-end, where we start from records containing more than just the join attribute and end with actual pairs of joined records.
  • We show how our set-similarity self-join algorithms can be extended to answer set-similarity R-S join queries.
  • We present strategies for exceptional situations where, even if we use the finest-granularity partitioning method, the data that needs to be held in the main memory of one node is too large to fit.”

There are a number of lessons and insights relevant to topic maps in this paper.

It makes me think of domain-specific (as well as possibly one or more “general”) set-similarity join interchange languages! What are you thinking of?
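For readers who want the join semantics without the MapReduce machinery, here is a minimal single-machine sketch of a set-similarity self-join; the records and threshold are invented, and this is not the paper’s partitioned algorithm:

```python
from itertools import combinations

# Report all record pairs whose token sets have Jaccard similarity above
# a threshold. Records and threshold are invented.
records = {
    1: "efficient parallel set similarity joins",
    2: "parallel set similarity joins using mapreduce",
    3: "a wrapper architecture for legacy data",
}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

tokens = {rid: set(text.split()) for rid, text in records.items()}
pairs = [(r, s) for r, s in combinations(tokens, 2)
         if jaccard(tokens[r], tokens[s]) >= 0.5]
print(pairs)  # [(1, 2)]
```

The paper’s contribution is doing this at scale: partitioning by set contents so most pairs never meet, which the quadratic loop above happily ignores.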

June 9, 2010

Motivations For Data Integration

Filed under: Data Integration,Marketing — Patrick Durusau @ 8:40 am

Talend Reference Library offers collections of case studies and white papers to make the case for data integration.

I can’t say that I care for some of the solutions that are proffered but I am aware that having a hammer (topic maps) doesn’t mean everything I see is a nail. 😉

You do have to submit contact information to download the papers.

The papers are useful as guides on making the case for data integration (read: topic maps) to management-level personnel. Not too much on the technical side, always keeping the focus on issues of concern to them: costs, customer satisfaction, missed opportunities, etc.

Save the “cool” stuff for when you meet with the geeks in the IT department, after you have the contract.

June 8, 2010

Semantic Overlay Networks

GridVine: Building Internet-Scale Semantic Overlay Networks sounds like they are dealing with topic map-like issues to me. You be the judge:

This paper addresses the problem of building scalable semantic overlay networks. Our approach follows the principle of data independence by separating a logical layer, the semantic overlay for managing and mapping data and metadata schemas, from a physical layer consisting of a structured peer-to-peer overlay network for efficient routing of messages. The physical layer is used to implement various functions at the logical layer, including attribute-based search, schema management and schema mapping management. The separation of a physical from a logical layer allows us to process logical operations in the semantic overlay using different physical execution strategies. In particular we identify iterative and recursive strategies for the traversal of semantic overlay networks as two important alternatives. At the logical layer we support semantic interoperability through schema inheritance and semantic gossiping. Thus our system provides a complete solution to the implementation of semantic overlay networks supporting both scalability and interoperability.

The concept of “semantic gossiping” enables semantic similarity to be established through the combination of local mappings, that is, by adding the mappings together. (Similar to the set behavior of subject identifiers/locators in the TMDM. That is to say, if you merge two topic maps, any additional subject identifiers, previously unknown to the first topic map, will enable those topics to merge with topics in later merges where previously they may not have.)
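A toy sketch of that merging behavior, with invented identifiers: topics merge whenever their identifier sets share a member, and a newly arrived identifier can trigger earlier merges transitively:

```python
# Merge topics whose identifier sets intersect; repeat until stable so a
# late-arriving identifier can trigger earlier merges. Identifiers invented.
def merge_topics(topics):
    merged = [set(t) for t in topics]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if merged[i] and merged[j] and merged[i] & merged[j]:
                    merged[i] |= merged[j]  # union the identifier sets
                    merged[j] = set()
                    changed = True
    return [m for m in merged if m]

topics = [
    {"http://ex.org/a"},                     # topic from map 1
    {"http://ex.org/b"},                     # topic from map 1
    {"http://ex.org/a", "http://ex.org/b"},  # topic arriving with map 2
]
print(merge_topics(topics))  # one topic carrying both identifiers
```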

Open Question: If everyone concedes that:

  • we live in a heterogeneous world
  • we have stored vast amounts of heterogeneous data
  • we are going to continue to create/store even vaster amounts of heterogeneous data
  • we keep maintaining and creating more heterogeneous data structures to store our heterogeneous data

If every starting point is heterogeneous, shouldn’t heterogeneous solutions be the goal?

Such as supporting heterogeneous mapping technologies? (Granting there will also be a limit to those supported at any one time but it should be possible to extend to embrace others.)

Author Bibliographies:

Karl Aberer

Philippe Cudré-Mauroux

Manfred Hauswirth

Tim Van Pelt
