Data Mining « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 9, 2010

High-Performance Dynamic Pattern Matching over Disordered Streams

Filed under: Data Integration,Data Mining,Pattern Recognition,Subject Identity,Topic Maps — Patrick Durusau @ 4:12 pm

High-Performance Dynamic Pattern Matching over Disordered Streams by Badrish Chandramouli, Jonathan Goldstein, and David Maier came to me by way of Jack Park.

From the abstract:

Current pattern-detection proposals for streaming data recognize the need to move beyond a simple regular-expression model over strictly ordered input. We continue in this direction, relaxing restrictions present in some models, removing the requirement for ordered input, and permitting stream revisions (modification of prior events). Further, recognizing that patterns of interest in modern applications may change frequently over the lifetime of a query, we support updating of a pattern specification without blocking input or restarting the operator.

In case you missed it, this is related to: Experience in Extending Query Engine for Continuous Analytics.

The algorithmic trading use case in this article made me think of Nikita Ogievetsky. For those of you who do not know Nikita, he is an XSLT/topic map maven, currently working in the finance industry.

Do trading interfaces allow user definition of subjects to be identified in data streams? And/or merged with subjects identified in other data streams? Or is that an upgrade from the basic service?

September 5, 2010

Experience in Extending Query Engine for Continuous Analytics

Filed under: Data Integration,Data Mining,SQL,TMQL,Uncategorized — Patrick Durusau @ 4:37 pm

Experience in Extending Query Engine for Continuous Analytics by Qiming Chen and Meichun Hsu has this problem statement:

Streaming analytics is a data-intensive computation chain from event streams to analysis results. In response to the rapidly growing data volume and the increasing need for lower latency, Data Stream Management Systems (DSMSs) provide a paradigm shift from the load-first analyze-later mode of data warehousing….

Moving away from load-first, analyze-later has implications for topic maps over data warehouses, particularly when events that are subjects may have only a transient existence in a data stream.

This is on my reading list to prepare to discuss TMQL in Leipzig.

PS: Only five days left to register for TMRA 2010. It is a don’t-miss event.

July 13, 2010

The FLAMINGO Project on Data Cleaning – Site

Filed under: Data Integration,Data Mining,Heterogeneous Data,Information Retrieval,MapReduce,Semantic Diversity,Software — Patrick Durusau @ 5:28 am

The FLAMINGO Project on Data Cleaning is the other project that has influenced the self-similarity work with MapReduce.

From the project description:

Supporting fuzzy queries is becoming increasingly more important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure to support online query processing. Fuzzy join queries of data sets are more time consuming due to the computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.

See the project webpage to learn more about their work on “us[ing] limited programming primitives in the cloud to implement index structures and search algorithms.”
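To make “inverted lists for fuzzy search” concrete, here is a minimal single-machine sketch. It is not the FLAMINGO code and does not use Hadoop; the function names, the padding characters, the choice of 2-grams, and the Jaccard threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of overlapping q-character substrings of `text`.

    The text is lowercased and padded with '#' and '$' (an illustrative
    choice) so that the first and last characters also produce q-grams.
    """
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(strings, q=2):
    """Build an inverted list: q-gram -> ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(sid)
    return index


def fuzzy_search(query, strings, index, q=2, threshold=0.5):
    """Return (string, Jaccard similarity) pairs at or above `threshold`, best first."""
    query_grams = qgrams(query, q)
    # Only strings that share at least one q-gram with the query are candidates.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    results = []
    for sid in candidates:
        grams = qgrams(strings[sid], q)
        score = len(query_grams & grams) / len(query_grams | grams)
        if score >= threshold:
            results.append((strings[sid], score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data.
names = ["topic maps", "topic map", "data mining"]
index = build_index(names)
matches = fuzzy_search("topic maps", names, index)
```

The index lets the search skip any string that shares no q-gram with the query, which is the point of building the lists offline before queries arrive. A fuzzy join could, in principle, run the same lookup once for every string in a second data set.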

The relationship between “dirty” data and the increase in data overall is at least linear, but probably worse. Far worse. Whether data is “dirty” depends on your perspective. The more data that appears in “***” format (fill in the one you like the least), the dirtier the universe of data becomes. “Dirty” data will be with you always.

July 8, 2010

Taking Your Tool Kit to the Next Level

Filed under: Data Mining,Information Retrieval,Search Engines — Patrick Durusau @ 7:53 pm

Online Mathematics Textbooks is a good stop if you want to take your tool kit to the next level.

Plug-n-play indexing and search engines will do a lot out of the box but aren’t going to distinguish you from the competition.

Understanding the underlying algorithms will help make the data mining you do to populate your topic map qualitatively different.

Here’s your chance to brush up on your math skills without monetary investment.

***
PS: At some point, maybe at TMRA, a group of us need to draft an outline for a topic maps curriculum. Would have to include topic maps, obviously, but would also need to include courses in Information Retrieval, User Interfaces, Natural Language Processing, Classification, Math, what else? Would need to have “minors” in some particular subject area.

June 24, 2010

27th International Conference on Machine Learning (ICML 2010) – Proceedings

Filed under: Conferences,Data Mining — Patrick Durusau @ 7:09 pm

Proceedings of the 27th International Conference on Machine Learning (ICML 2010) are available.

If you are interested in the next generation of assistive tools for authoring topic maps or using them before your competition does, it would be hard to find a better starting place.

One of my interests is in text archives, so interactive construction of a topic map with an application that searches an archive for subjects or relationships for subjects would be cool. Perhaps that “learns” your preferences as you accept or reject its suggestions. And that “knows” what others have found building topic maps for the same archive. You can follow or not follow their paths into the archive.

June 22, 2010

Unstructured Data or Unmapped Data?

Filed under: Data Mining,Marketing — Patrick Durusau @ 10:55 am

The Wikipedia article on unstructured data makes it clear that such data may have a structure, but that “unstructured data” means data whose structure is not readily recognizable to a computer.

The term unstructured data bothers me because any text has a structure. If it didn’t, we would not be able to read it. It would just be a jumble of symbols. Oh, sorry. Apologies to any AI agents “reading” this post. But that is how traditional computers see a text, just a jumble of symbols.

When people view a text, they see structure, recognize subjects, etc. Moreover, different people can look at the same text and see different structures and/or subjects.

There are topic maps that are written to enforce a “correct” view of a body of data and those are certainly useful in many cases. Topic maps also support users identifying the structures and subjects they see in a text, alongside identifications made by others.

To the extent that users view texts and leave trails, as it were, of the structures and subjects they identified in a text (or body of texts), those trails form maps that can be useful to others.

Think of it as tagging but with explicit subject identity. The relationships to a particular text, its author, and a variety of other details could be extracted automatically and with a minimum of effort on the part of the user. A topic map application could even suggest subjects or associations for a user to confirm based on their reading.

Suggest: unmapped data.

It captures the sense of exploration and allows for multiple mappings.

Thoughts?

June 20, 2010

“What Is I.B.M.’s Watson?” – Review

Filed under: Data Mining,Semantic Diversity,Subject Identity — Patrick Durusau @ 7:34 pm

What Is I.B.M.’s Watson? appears in the New York Times Magazine on 20 June 2010. IBM, or more precisely David Ferrucci and his team at IBM, has made serious progress towards a useful question-answering machine. (On Ferrucci, see: Ferrucci – DBLP, Ferrucci – Scientific Commons)

It won’t spoil the article to say that raw computing horsepower (BlueGene servers) plays a role in the success of the Watson project. But, there is another aspect of the project that makes it relevant to topic maps.

Rather than relying on a few algorithms to analyze questions, Watson uses more than a hundred and, as summarized by the article:

Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one. In essence, Watson thinks in probabilities. It produces not one single “right” answer, but an enormous number of possibilities, then ranks them by assessing how likely each one is to answer the question.

Transpose that into a topic maps setting and imagine that you are using probabilistic merging algorithms that are applied interactively by a user in real time.

Suddenly we are not talking about a technology for hand-curated information resources but an assistive technology that would enable human users to go deep knowledge diving into the sea of information resources, while generating buoys and markers for others to follow.

Our ability to do that will depend on processing power, creative use and development of “probabilistic merging” algorithms and a Topic Maps Query Language that supports querying of non-topic map data and creation of content based on the results of those queries.
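As a rough illustration of ranking by agreement, the sketch below sums the confidence that several independent scorers assign to each candidate merge target and normalizes the totals. It is not Watson’s algorithm and not part of any topic maps standard; the scorer names, topic identifiers, and confidence values are hypothetical.

```python
from collections import defaultdict


def rank_candidates(votes):
    """Rank candidates by the summed confidence of the scorers proposing them.

    `votes` is a list of (scorer_name, candidate, confidence) tuples with
    confidence in [0, 1]. Returns (candidate, normalized_score) pairs,
    best first; the normalized scores sum to 1.
    """
    totals = defaultdict(float)
    for _scorer, candidate, confidence in votes:
        totals[candidate] += confidence
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(candidate, score / grand_total) for candidate, score in ranked]


# Hypothetical votes: which existing topic should a new mention of
# "Watson" be merged with?
votes = [
    ("name_match", "topic:watson-ibm", 0.5),
    ("context_match", "topic:watson-ibm", 0.75),
    ("name_match", "topic:watson-sherlock", 0.5),
    ("date_match", "topic:watson-sherlock", 0.25),
]
ranking = rank_candidates(votes)
```

An interactive application could show the user the ranked list and let them accept or reject the top suggestion, rather than merging automatically.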

****

PS: For more information on the Watson project, see: What Is Watson?, part of IBM’s DeepQA project.

May 29, 2010

Association Rules

Filed under: Authoring Topic Maps,Data Mining — Patrick Durusau @ 6:28 am

Apologies for using “association rules,” a term of art that might be confusing to topic map advocates, in Private Mining of Association Rules without defining it.

When we buy an item online, most retailers suggest that other buyers also purchased … some list of items. The “association” of those items together can be represented by a Boolean vector, composed of values for the presence or absence of an item. To form an association rule, such a vector is accompanied by support and confidence values.

The support value indicates the percentage of a data set where the association occurs. That is, the items in question appear together.

The confidence value indicates the percentage of records containing one item (or set of items) that also contain the other.

Minimums of these values are known as the minimal support threshold and the minimal confidence threshold, and typically appear together.
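A small worked example may help. The sketch below computes support and confidence over a toy set of purchases and checks a rule against both thresholds. The transactions, function names, and threshold values are made up for illustration and are not taken from Han and Kamber.

```python
# Hypothetical toy data: each transaction is the set of items in one purchase.
transactions = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp"},
    {"book", "lamp"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions in which the items appear together."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / base


def is_strong_rule(antecedent, consequent, transactions,
                   min_support=0.5, min_confidence=0.7):
    """True if the rule meets both (illustrative) minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


# Rule "book => lamp": the two items appear together in 3 of 5 purchases,
# and 3 of the 4 purchases containing a book also contain a lamp.
book_lamp_support = support({"book", "lamp"}, transactions)
book_lamp_confidence = confidence({"book"}, {"lamp"}, transactions)
```

Note that “pen => book” has perfect confidence in this toy data but fails the support threshold, which is why the two thresholds are used together.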

For more information on “association rules,” see Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, at page 229. (The publication date for the second edition in WorldCat (the link on the title) is wrong. Should be 2006.)

Supplemental Materials for Data Mining. I am checking on the status of the apparent 3rd edition so you might want to wait on buying a copy. Would make a great text for an advanced topic maps course that focused on populating a topic map.

May 28, 2010

Private Mining of Association Rules

Filed under: Data Mining,Searching — Patrick Durusau @ 12:50 pm

Private Mining of Association Rules (2005) examines how parties can share association rules for data mining, without sharing data.

The authors develop a secure collaborative association rule mining protocol based upon a homomorphic encryption scheme.

Developing a similar approach for topic maps would be a nice doctoral project. Association rule data mining and the associated privacy concerns are well known. Combining those in a topic map context would be an interesting piece of work.

Author Bibliographies:

Justin Z. Zhan

Stan Matwin

LiWu Chang
