Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 12, 2013

Aggregation Module – Phase 1 – Functional Design (ElasticSearch #3300)

Filed under: Aggregation,ElasticSearch,Merging,Search Engines,Topic Maps — Patrick Durusau @ 2:47 pm

Aggregation Module – Phase 1 – Functional Design (ElasticSearch Issue #3300)

From the post:

The new aggregations module is due in the elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as “faceting”. Facets currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from the ground up to support complex aggregations and is thus limited. The main problem with the current implementation stems from the fact that facets are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facets, and the user is effectively bound to the top level aggregations that we defined and nothing more.

The goal with the new aggregations module is to break the barriers the current facet implementation puts in place. The new name (“Aggregations”) also indicates the intention here – a generic yet extremely powerful framework for defining aggregations – any type of aggregation. The idea here is to have each aggregation defined as a “standalone” aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we’ve gained over the years working with facets and apply it when building the new framework.

(…)

If you have been following the discussion about “what would we do differently with topic maps” in the XTM group at LinkedIn, this will be of interest.

What is an aggregation if it is not a selection of items matching some criteria, which you can then “merge” together for presentation to a user?

Or “merge” together for further querying?

That is inconsistent with the imperative programming model of the TMDM, but it has the potential to open up distributed and parallel processing of topic maps.

Same paradigm but with greater capabilities.
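
To make the composability concrete, here is a rough sketch of what a nested aggregation request might look like in the new module. The index and field names are hypothetical; the point is that one aggregation nests inside another, which facets could not do.

# Hypothetical field names; the nesting is what facets could not express.
request_body = {
    "query": {"match": {"status": "published"}},
    "aggs": {
        "by_author": {                       # bucket documents per author
            "terms": {"field": "author"},
            "aggs": {                        # ...then aggregate again inside each bucket
                "posts_per_month": {
                    "date_histogram": {"field": "post_date", "interval": "month"}
                }
            }
        }
    }
}
# POSTed as JSON to an index's _search endpoint, every author bucket comes
# back carrying its own monthly histogram.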

June 24, 2013

Succinct data structures for representing equivalence classes

Filed under: Computer Science,Data Structures,Equivalence Class,Merging — Patrick Durusau @ 1:59 pm

Succinct data structures for representing equivalence classes by Moshe Lewenstein, J. Ian Munro, and Venkatesh Raman.

Abstract:

Given a partition of an n element set into equivalence classes, we consider time-space tradeoffs for representing it to support the query that asks whether two given elements are in the same equivalence class. This has various applications including for testing whether two vertices are in the same component in an undirected graph or in the same strongly connected component in a directed graph.

We consider the problem in several models.

— Concerning labeling schemes where we assign labels to elements and the query is to be answered just by examining the labels of the queried elements (without any extra space): if each vertex is required to have a unique label, then we show that a label space of (\sum_{i=1}^n \lfloor {n \over i} \rfloor) is necessary and sufficient. In other words, \lg n + \lg \lg n + O(1) bits of space are necessary and sufficient for representing each of the labels. This slightly strengthens the known lower bound and is in contrast to the known necessary and sufficient bound of \lceil \lg n \rceil for the label length, if each vertex need not get a unique label.

–Concerning succinct data structures for the problem when the n elements are to be uniquely assigned labels from label set {1, 2, …n}, we first show that \Theta(\sqrt n) bits are necessary and sufficient to represent the equivalence class information. This space includes the space for implicitly encoding the vertex labels. We can support the query in such a structure in O(\lg n) time in the standard word RAM model. We then develop structures resulting in one where the queries can be supported in constant time using O({\sqrt n} \lg n) bits of space. We also develop space efficient structures where union operation along with the equivalence query can be answered fast.

On the down side, this technique would not support merging based on an arbitrary choice of properties.

On the up side, it does support merging based on properties determined in advance.

The latter being the more common case, I commend this article to you for a close read.
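
If you want to experiment with the basic query the paper supports, a plain union-find (nothing like the succinct structures in the paper, just the textbook baseline) takes a few lines of Python:

from __future__ import annotations

class DisjointSet:
    """Textbook union-find: fast 'same class?' queries plus union operations."""

    def __init__(self, elements):
        self.parent = {e: e for e in elements}
        self.rank = {e: 0 for e in elements}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

    def same_class(self, x, y):
        return self.find(x) == self.find(y)

ds = DisjointSet(["a", "b", "c", "d"])
ds.union("a", "b")
ds.union("b", "c")
print(ds.same_class("a", "c"))  # True
print(ds.same_class("a", "d"))  # False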

June 4, 2013

Excision [Forgetting But Remembering You Forgot (Datomic)]

Filed under: Datomic,Merging — Patrick Durusau @ 10:13 am

Excision

From the post:

It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it. When you add a fact:

conn.transact(list(":db/add", 42, ":firstName", "John"));

Datomic does more than merely record that 42‘s first name is “John“. Each datom is also associated with a transaction entity, which records the moment (:db/txInstant) the datom was recorded.

(…)

Given this information model, it is easy to see that Datomic can support queries that tell you:

  • what you know now
  • what you knew at some point in the past
  • how and when you came to know any particular datom

So far so good, but there is a fly in the ointment. In certain situations you may be forced to excise data, pulling it out root and branch and forgetting that you ever knew it. This may happen if you store data that must comply with privacy or IP laws, or you may have a regulatory requirement to keep records for seven years and then “shred” them. For these scenarios, Datomic provides excision.

One approach to the unanswered question of what it means to delete something from a topic map.

Especially interesting because you can play with the answer that Datomic provides.

Doesn’t address the issue of what it means to delete a topic that has caused other topics to merge.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

April 20, 2013

Fast algorithm for successively merging k-overlapping sets?

Filed under: Merging,Overlapping Sets,Sets — Patrick Durusau @ 12:56 pm

Fast algorithm for successively merging k-overlapping sets?

As posted:

Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.

Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?

Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.

Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.

I saw this on Theoretical Computer Science Stack Exchange earlier today.

Reminded me of the overlapping set member tests in the TMDM: [subject identifiers], [item identifiers], [subject locators], a [subject identifier] of one topic matching an [item identifier] of the other, and the [reified] property. (Well, reified is a simple match, not a set.)

I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.

Working up a list of papers.
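
In the meantime, here is a rough sketch of the clustering described in the question: near-linear for k = 1 via an element-to-set index, and a naive pairwise fallback for k ≥ 2 that is nowhere near the bound the question asks for.

from collections import defaultdict

def merge_overlapping(sets, k=1):
    """Cluster sets, merging any two that share at least k elements."""
    sets = [set(s) for s in sets]
    if k == 1:
        # Element -> indices of sets containing it, then connected components.
        index = defaultdict(list)
        for i, s in enumerate(sets):
            for e in s:
                index[e].append(i)
        seen, clusters = set(), []
        for start in range(len(sets)):
            if start in seen:
                continue
            seen.add(start)
            stack, merged = [start], set()
            while stack:
                i = stack.pop()
                merged |= sets[i]
                for e in sets[i]:
                    for j in index[e]:
                        if j not in seen:
                            seen.add(j)
                            stack.append(j)
            clusters.append(merged)
        return clusters

    # Naive k >= 2: repeatedly merge any pair with a large enough overlap.
    clusters = sets
    merged_something = True
    while merged_something:
        merged_something = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i] & clusters[j]) >= k:
                    clusters[i] |= clusters.pop(j)
                    merged_something = True
                    break
            if merged_something:
                break
    return clusters

print(merge_overlapping([{1, 2, 3}, {3, 4, 5}, {5, 6, 7}], k=1))  # one cluster
print(merge_overlapping([{1, 2, 3}, {2, 4, 5}, {1, 4, 5}], k=2))  # one cluster, via S2 and S3 first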

March 10, 2013

Why Data Lineage is Your Secret … Weapon [Auditing Topic Maps]

Filed under: Data Quality,Merging,Provenance — Patrick Durusau @ 8:42 pm

Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.

From the post:

Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?

It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.

I’m constantly amazed that so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.

But I digress. Why is data lineage your secret data quality weapon?

The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:

  • Lack of ownership
  • Lack of formal information chain design

Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?

And if you can’t trace, how can you audit the merging of topics?

And if you can’t audit, how do you determine the reliability of your topic map?

That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.

Same questions apply to all data aggregation systems.

Or as Mr. Weasley tells Ginny:

“Never trust anything that can think for itself if you can’t see where it keeps its brain.”


Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter. 😉

March 6, 2013

VIAF: The Virtual International Authority File

Filed under: Authority Record,Library,Library Associations,Merging,Subject Identity — Patrick Durusau @ 11:19 am

VIAF: The Virtual International Authority File

From the webpage:

VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

The “about” link at the bottom of the page is broken (in the English version). A working “about” link for VIAF reports:

At a glance

  • A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
  • All authority data for a given entity is linked together into a “super” authority record
  • A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

If you were to substitute the term topic for “super” authority record, you would be part of the way towards a topic map.

Topics gather information about a given entity into a single location.

Topics differ from the authority records you find at VIAF in two very important ways:

  1. First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
  2. Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.

There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.

January 16, 2013

Tombstones in Topic Map Future?

Watching the What’s New in Cassandra 1.2 (Notes) webcast and encountered an unfamiliar term: “tombstones.”

If you are already familiar with the concept, skip to another post.

If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by having nodes replicate their content. That works if all nodes are available, but what if you delete data while a node is unavailable? When it comes back, it sees the other nodes as “missing” data that (from its point of view) needs to be replicated back to them.

From the description at the Cassandra wiki, DistributedDeletes, not an easy problem to solve.

So, Cassandra turns it into a solvable problem.

Deletes are implemented with a special value known as a tombstone. The tombstone is propagated to nodes that missed the initial delete.

Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
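
A toy sketch of the mechanism (nothing like Cassandra's implementation, just the idea of deletes as replicated writes plus a grace period):

import time

TOMBSTONE = object()                 # sentinel value standing in for "deleted"
GRACE_SECONDS = 10 * 24 * 3600       # a bit longer than it takes to replace a dead node

class Replica:
    """Toy last-writer-wins store where a delete is just a write of a tombstone."""

    def __init__(self):
        self.store = {}              # key -> (timestamp, value-or-TOMBSTONE)

    def put(self, key, value, ts=None):
        ts = time.time() if ts is None else ts
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def delete(self, key, ts=None):
        self.put(key, TOMBSTONE, ts)  # the delete itself gets replicated

    def get(self, key):
        entry = self.store.get(key)
        return None if entry is None or entry[1] is TOMBSTONE else entry[1]

    def replicate_from(self, other):
        """Anti-entropy: take the newer entry for every key, tombstones included."""
        for key, (ts, value) in other.store.items():
            self.put(key, value, ts)

    def compact(self, now=None):
        """Drop tombstones older than the grace period."""
        now = time.time() if now is None else now
        self.store = {k: (ts, v) for k, (ts, v) in self.store.items()
                      if not (v is TOMBSTONE and now - ts > GRACE_SECONDS)}

a, b = Replica(), Replica()
a.put("topic:42", {"name": "John"}, ts=1)
b.replicate_from(a)
a.delete("topic:42", ts=2)           # b was "down" and missed the delete...
b.replicate_from(a)                  # ...but the tombstone catches it up
print(b.get("topic:42"))             # None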

Distributed topic maps will face the same issue.

Complicated by imperative programming models of merging, which make changes to the properties that drive merging difficult to manage.

Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.

January 14, 2013

On Graph Computing [Shared Vertices/Merging]

Filed under: Graphs,Merging,Networks — Patrick Durusau @ 8:35 pm

On Graph Computing by Marko A. Rodriguez.

Marko writes elegantly about graphs and I was about to put this down as another graph advocacy post. Interesting but graph followers have heard this story before.

But I do read the materials I cite, and Marko proceeds to define three separate graphs: software, discussion and concept. Each has some vertices in common with one or both of the others.

Then he has this section:

A Multi-Domain Graph

The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Shared vertices sounds a lot like merging in the topic map sense to me.

It isn’t clear from the post what requirements may or may not exist for vertices to be “shared.”

Or how you would state the requirements for sharing vertices?

Or how to treat edges that become duplicates once the separate vertices they connected are identified as the same vertex?

If “shared vertices” support what we call merging in topic maps, perhaps there are graph installations waiting to wake up as topic maps.
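
For what it's worth, here is a rough sketch of id-based vertex sharing, using the entities from Marko's example treated as undirected edges. Duplicate edges collapse because adjacency is kept as a set, which is one possible answer to the question above.

def merge_graphs(*graphs):
    """Union graphs given as {vertex_id: set(neighbour_ids)}.
    Vertices with the same id become one vertex; duplicate edges collapse
    because each adjacency list is a set."""
    merged = {}
    for g in graphs:
        for v, neighbours in g.items():
            merged.setdefault(v, set()).update(neighbours)
            for n in neighbours:
                merged.setdefault(n, set()).add(v)   # keep it symmetric
    return merged

software = {"Gremlin": {"Titan"}, "Blueprints": {"Titan"}, "Matthias": {"Titan"}}
discussion = {"Matthias": {"Aurelius mailing list"}}
concept = {"Titan": {"graph"}}

universal = merge_graphs(software, discussion, concept)
print(universal["Matthias"])  # edges from both the software and discussion graphs
print(universal["Titan"])     # the shared vertex linking all three domains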

January 8, 2013

MapReduce: Detecting Cycles in Network Graph [Merging Duplicate Identifiers]

Filed under: Giraph,MapReduce,Merging — Patrick Durusau @ 11:47 am

MapReduce: Detecting Cycles in Network Graph by Ricky Ho.

From the post:

I recently received an email from a reader of my blog on Map/Reduce algorithm design regarding how to detect whether a graph is acyclic using Map/Reduce. I think this is an interesting problem and can imagine there can be a wide range of applications for it.

Although I haven’t solved this exact problem in the past, I’d like to sketch out my thoughts on a straightforward approach, which may not be highly optimized. My goal is to invite other readers who have solved this problem to share their tricks.

To define the problem: Given a simple directed graph, we want to tell whether it contains any cycles.

Relevant to processing of identifiers in topic maps which may occur on more than one topic (prior to merging).

What is your solution in a mapreduce context?
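
Here is a rough sketch of the identifier half of that problem, in plain Python rather than a real MapReduce framework: map each topic to (identifier, topic id) pairs, group by identifier, and emit identifiers that appear on more than one topic as merge candidates.

from collections import defaultdict

def map_phase(topic):
    """Emit (identifier, topic_id) for every identifier attached to a topic."""
    for ident in topic["identifiers"]:
        yield ident, topic["id"]

def reduce_phase(ident, topic_ids):
    """An identifier seen on more than one topic marks those topics as merge candidates."""
    topic_ids = sorted(set(topic_ids))
    if len(topic_ids) > 1:
        yield ident, topic_ids

def find_merge_candidates(topics):
    grouped = defaultdict(list)          # stand-in for the shuffle/sort step
    for topic in topics:
        for ident, tid in map_phase(topic):
            grouped[ident].append(tid)
    for ident, tids in sorted(grouped.items()):
        yield from reduce_phase(ident, tids)

topics = [
    {"id": "t1", "identifiers": {"http://example.com/puccini"}},
    {"id": "t2", "identifiers": {"http://example.com/puccini", "http://example.com/composer/42"}},
]
print(list(find_merge_candidates(topics)))  # puccini identifier flags t1 and t2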

January 3, 2013

Merging Data Virtualization?

Filed under: Data Virtualization,Merging — Patrick Durusau @ 1:53 pm

I saw some ad-copy from a company that “wrote the book” on data virtualization (well, “a” book on data virtualization anyway).

Searched a bit in their documentation and elsewhere, but could not find an answer to my questions (below).

Assume departments 1 and 2, each with a data virtualization layer between their apps and the same backend resources:

Data Virtualization, Two Separate Layers

Requirement: Don’t maintain two separate data virtualization layers for the same resources.

Desired result is:

Data Virtualization, One Layer

Questions: Must I return to the data resources to discover their semantics? To merge the two data virtualization layers?

Some may object there should only be one data virtualization layer.

OK, so we have Department 1 – circa 2013 and Department 1 – circa 2015, different data virtualization requirements:

Data Virtualization, Future Layer

Desired result:

Data Virtualization, Future One Layer

Same Questions:

Question: Must I return to the data resources to discover their semantics? To merge the existing and proposed data virtualization layers?

The semantics of each item in the data sources (one hopes) was determined for the original data virtualization layer.

It’s wasteful to re-discover the same semantics for changes in data virtualization layers.

Curious, how rediscovery of semantics is avoided in data virtualization software?

Or for that matter, how do you interchange data virtualization layer mappings?

December 28, 2012

Computing Strongly Connected Components [Distributed Merging?]

Filed under: Graphs,Merging,Topic Maps — Patrick Durusau @ 7:19 pm

Computing Strongly Connected Components by Mark C. Chu-Carroll.

From the post:

As promised, today I’m going to talk about how to compute the strongly connected components of a directed graph. I’m going to go through one method, called Kosaraju’s algorithm, which is the easiest to understand. It’s possible to do better than Kosaraju’s by a factor of 2, using an algorithm called Tarjan’s algorithm, but Tarjan’s is really just a variation on the theme of Kosaraju’s.

Kosaraju’s algorithm is amazingly simple. It uses a very clever trick based on the fact that if you reverse all of the edges in a graph, the resulting graph has the same strongly connected components as the original. So using that, we can get the SCCs by doing a forward traversal to find an ordering of vertices, then doing a traversal of the reverse of the graph in the order generated by the first traversal.

That may sound a bit mysterious, but it’s really very simple. Take the graph G, and do a recursive depth-first traversal of the graph, along with an initially empty stack of vertices. As the recursion started at a vertex V finishes, push V onto the stack. At the end of the traversal, you’ll have all of the vertices of G in the stack. The order of the reverse traversal will be starting with the vertices on the top of that stack.

So you reverse all of the edges in the graph, creating the reverse graph, G’. Start with the vertex on top of the stack, and do a traversal from that vertex. All of the nodes reachable from that vertex form one strongly connected component. Remove everything in that SCC from the stack, and then repeat the process with the new top of the stack. When the stack is empty, you’ll have accumulated all of the SCCs.
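
An iterative Python sketch of Kosaraju's algorithm as just described (the graph is a dict from each vertex to its successors, with every vertex appearing as a key):

def kosaraju_scc(graph):
    """Strongly connected components of a directed graph {vertex: [successors]}."""
    # Pass 1: DFS on G, pushing each vertex as its recursion would finish.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            v, successors = stack[-1]
            advanced = False
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(graph[w])))
                    advanced = True
                    break
            if not advanced:
                order.append(v)
                stack.pop()

    # Pass 2: traverse the reversed graph in reverse finishing order.
    reverse = {v: [] for v in graph}
    for v, ws in graph.items():
        for w in ws:
            reverse[w].append(v)

    assigned, components = set(), []
    for v in reversed(order):
        if v in assigned:
            continue
        assigned.add(v)
        component, stack = [], [v]
        while stack:
            x = stack.pop()
            component.append(x)
            for y in reverse[x]:
                if y not in assigned:
                    assigned.add(y)
                    stack.append(y)
        components.append(component)
    return components

print(kosaraju_scc({"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}))
# one 3-cycle {a, b, c} plus the lone vertex d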

Is graph decomposition suggestive of a way to manage distributed merging?

Instead of using “strongly connected” as a criterion for decomposition, one or more subject characteristics are used to decompose the topic map.

Natural language is one that comes immediately to mind, based on the content most likely to be viewed by a particular audience.

What subject characteristics would you use to decompose a topic map?

Graph Contraction and Minors [Merging by Another Name?]

Filed under: Graphs,Merging — Patrick Durusau @ 7:00 pm

Graph Contraction and Minors by Mark C. Chu-Carroll.

From the post:

Another useful concept in simple graph theory is *contraction* and its result, *minors* of graphs. The idea is that there are several ways of simplifying a graph in order to study its properties: cutting edges, removing vertices, and decomposing a graph are all methods we’ve seen before. Contraction is a different technique that works by *merging* vertices, rather than removing them.

Here’s how contraction works. Suppose you have a graph G. Pick a pair of vertices v and w which are adjacent in G. You can create a graph G’ which is a contraction of G by replacing v and w with a *single* vertex, and taking any edge in G which is incident on either v or w, and making it incident on the newly contracted node.

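Here is a quick sketch of contraction on an undirected graph stored as a dict of adjacency sets; any parallel edges collapse automatically because the adjacency is a set:

def contract(graph, v, w):
    """Contract the edge (v, w) of an undirected graph stored as
    {vertex: set(neighbours)}, producing the graph of a minor."""
    if w not in graph[v]:
        raise ValueError("contraction is defined on an existing edge")
    merged = f"{v}+{w}"
    graph[merged] = (graph[v] | graph[w]) - {v, w}
    for n in graph[merged]:
        graph[n].discard(v)
        graph[n].discard(w)
        graph[n].add(merged)          # re-point incident edges at the new vertex
    del graph[v], graph[w]
    return graph

g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(contract(g, "a", "b"))
# c is now adjacent to the merged vertex "a+b" and to d;
# the former a-c and b-c edges have collapsed into one.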

Not quite merging in the topic map sense, because edges between vertices are the basis for “merging” here, rather than the properties of the subjects the vertices represent.

Still, deeply intriguing work.

This post was first published in July of 2007, so recent it’s not.

More resources you would suggest?

I first saw this in a tweet by Marko A. Rodriguez.

November 25, 2012

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

Filed under: Bioinformatics,Biomedical,Graphs,Mapping,Merging,Visualization — Patrick Durusau @ 2:05 pm

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

From the webpage:

HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for

Handy Integration and Visualisation of multimodal Experimental Data

and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.

You need to see the demo video to appreciate this application!

It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.

Did I mention it also has 3-D image techniques as well?

PS: Yes, it is another example of “Who moved my acronym?”

November 2, 2012

Predicting what topics will trend on Twitter [Predicting Merging?]

Filed under: Merging,Prediction,Time Series,Tweets — Patrick Durusau @ 1:40 pm

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks, which has an exciting final program, try Stanislav Nikolov’s thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.

Abstract:

In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.

This will be interesting in many classification contexts.

Particularly predicting what topics a user will say represent the same subject.

October 15, 2012

Transaction Searching: Unifying Field Names

Filed under: Merging,Splunk — Patrick Durusau @ 2:14 pm

Transaction Searching: Unifying Field Names posted by David Carraso.

From the post:

Problem

You need to build transactions from multiple data sources that use different field names for the same identifier.

Solution

Typically, you can join transactions with common fields like:

... | transaction username

But when the username identifier is called different names (login, name, user, owner, and so on) in different data sources, you need to normalize the field names.

If sourcetype A only contains field_A and sourcetype B only contains field_B, create a new field called field_Z which is either field_A or field_B, depending on which is present in an event. You can then build the transaction based on the value of field_Z.

sourcetype=A OR sourcetype=B
| eval field_Z = coalesce(field_A, field_B)
| transaction field_Z

Looks a lot like a topic map merging operation doesn’t it?

But “looks a lot like” doesn’t mean it is “the same as” a topic map merging operation.

How would you say it is different?

While the outcome may be the same as a merging operation (which is defined by a legend), I would say that I don’t know how we got from A or B to Z.

That is, next month or six months from now, or even two years down the road, I have C and I want to modify this transaction.

Question: Can I safely modify this transaction to add C?

I suspect the answer is:

“We don’t know. Have to go back to confirm what A and B (as well as C) mean and get back to you on that question.”

For a toy example that seems like overkill, but what if you have thousands of columns spread over hundreds of instances of active data systems?

Still feel confident about offering an answer without researching it?

Topic map based merging could give you that confidence.

Even if, like Scotty, you say two weeks and have the answer later that day, letting it age a bit before delivering it ahead of schedule. 😉
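
As a sketch of the kind of record that would let you answer the “can I safely add C?” question without re-research, imagine a small field registry that documents why each source field maps to the unified field. Everything here is hypothetical: a sketch, not Splunk configuration.

# Hypothetical registry -- the mapping plus the reason it was made.
FIELD_MAP = {
    "A": {"field": "field_A", "basis": "login name, per app team review 2012-09"},
    "B": {"field": "field_B", "basis": "owner column, per DBA review 2012-10"},
    # Adding sourcetype C later is a documented one-line change:
    # "C": {"field": "field_C", "basis": "..."},
}

def unified_username(event):
    """What `eval field_Z = coalesce(field_A, field_B)` does, but returning
    the provenance of the mapping along with the value."""
    entry = FIELD_MAP.get(event["sourcetype"])
    if entry and entry["field"] in event:
        return event[entry["field"]], entry["basis"]
    return None, None

print(unified_username({"sourcetype": "A", "field_A": "jsmith"}))
# ('jsmith', 'login name, per app team review 2012-09')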

September 28, 2012

Merging Data Sets Based on Partially Matched Data Elements

Filed under: Merging,R,Record Linkage — Patrick Durusau @ 4:40 am

Merging Data Sets Based on Partially Matched Data Elements by Tony Hirst.

From the post:

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes.

Reminds me of the techniques used in record linkage (epidemiology). There, unlike topic maps, the records from diverse sources were mapped to a target record layout and then analyzed.

Quite powerful but lossy with regard to the containers of the original data.
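
A rough Python stand-in for the idea (not Tony Hirst's R recipe), using difflib for the partial matching:

import difflib

iso_codes = {"United States": "US", "Russian Federation": "RU", "South Korea": "KR"}

def match_country(name, cutoff=0.6):
    """Fuzzy-match a raw country name against the keys of the ISO table."""
    hit = difflib.get_close_matches(name, list(iso_codes), n=1, cutoff=cutoff)
    return (hit[0], iso_codes[hit[0]]) if hit else (None, None)

for raw in ["United State", "Russian Fed.", "Korea, South"]:
    print(raw, "->", match_country(raw))
# Close variants match; reordered names like "Korea, South" tend to fall
# below the cutoff and still need a hand-maintained mapping.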

August 4, 2012

Genetic algorithms: a simple R example

Filed under: Genetic Algorithms,Merging,Subject Identity — Patrick Durusau @ 6:49 pm

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

A genetic algorithm is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.

As the size of data stores increases, the cost of personal judgement on each subject identity test will as well. Genetic algorithms may be one way of creating subject identity tests in such situations.

In any event, it won’t harm anyone to be aware of the basic contours of the technique.
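
For the basic contours, here is a minimal GA in Python (not Bart Smeets’ R example) on a toy fitness function that just counts the 1s in a bit string; selection, crossover, mutation and elitism are all there, even if the objective is trivial.

import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.02

def fitness(bits):                       # toy objective: count the 1s
    return sum(bits)

def crossover(a, b):                     # single-point crossover
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(bits):
    return [b ^ 1 if random.random() < MUTATION_RATE else b for b in bits]

def evolve():
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:2]                              # elitism: keep the best as-is
        parents = pop[:POP_SIZE // 2]                # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))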

I first saw this at R-Bloggers.

August 3, 2012

Column Statistics in Hive

Filed under: Cloudera,Hive,Merging,Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan. As an example, consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, and table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B + the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in an RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
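
The join-order arithmetic from the quoted example, spelled out:

rows_a, rows_b, rows_c = 500, 6000, 1000
rows_a_join_b, rows_a_join_c = 2000, 50

user_order = rows_a * rows_b + rows_a_join_b * rows_c   # (A join B) then join C
cbo_order = rows_a * rows_c + rows_a_join_c * rows_b    # (A join C) then join B

print(user_order)  # 5000000
print(cbo_order)   # 800000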

Some days I wonder where improvements to algorithms and data structures are going to lead?

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?

July 21, 2012

The Amazing Mean Shift Algorithm

Filed under: Algorithms,Merging — Patrick Durusau @ 7:47 pm

The Amazing Mean Shift Algorithm by Larry Wasserman.

From the post:

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

If you are puzzling over why I cited this post, it might help if you read “(3)” as:

(3) merge data points associated with one mode.

The notion that topics can only be merged on the basis of URLs, or indeed discrete values of any sort, is one way to think about merging. Your data may or may not admit of robust processing on that basis.

Those are all very good ways to merge topics, if and only if that works for your data.

If not, then you need to find ways that work with your data.
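
To make step (3) concrete, here is a toy one-dimensional mean shift with a Gaussian kernel (a sketch only, not the full Fukunaga and Hostetler treatment): points that climb to the same mode end up “merged” into one cluster.

import math

def mean_shift_1d(points, bandwidth=1.0, iters=50, tol=1e-3):
    """Shift every point towards the weighted mean of its neighbours until it
    settles on a mode, then group points that reached the same mode."""
    def shift(x):
        weights = [math.exp(-((x - p) / bandwidth) ** 2 / 2) for p in points]
        return sum(w * p for w, p in zip(weights, points)) / sum(weights)

    modes = []
    for x in points:
        for _ in range(iters):
            nx = shift(x)
            if abs(nx - x) < tol:
                break
            x = nx
        modes.append(round(x, 1))       # crude rounding groups points per mode

    clusters = {}
    for p, m in zip(points, modes):
        clusters.setdefault(m, []).append(p)
    return clusters

print(mean_shift_1d([1.0, 1.2, 0.9, 5.0, 5.1, 4.8]))
# two clusters, one per mode: the points near 1 and the points near 5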

Efficient Core Maintenance in Large Dynamic Graphs

Filed under: Graphs,Merging,Topic Map Software — Patrick Durusau @ 7:26 pm

Efficient Core Maintenance in Large Dynamic Graphs by Rong-Hua Li and Jeffrey Xu Yu.

Abstract:

The k-core decomposition in a graph is a fundamental problem for social network analysis. The problem of k-core decomposition is to calculate the core number for every node in a graph. Previous studies mainly focus on k-core decomposition in a static graph. There exists a linear time algorithm for k-core decomposition in a static graph. However, in many real-world applications such as online social networks and the Internet, the graph typically evolves over time. Under such applications, a key issue is to maintain the core number of nodes given the graph changes over time. A simple implementation is to perform the linear time algorithm to recompute the core number for every node after the graph is updated. Such simple implementation is expensive when the graph is very large. In this paper, we propose a new efficient algorithm to maintain the core number for every node in a dynamic graph. Our main result is that only certain nodes need to update their core number given the graph is changed by inserting/deleting an edge. We devise an efficient algorithm to identify and recompute the core number of such nodes. The complexity of our algorithm is independent of the graph size. In addition, to further accelerate the algorithm, we develop two pruning strategies by exploiting the lower and upper bounds of the core number. Finally, we conduct extensive experiments over both real-world and synthetic datasets, and the results demonstrate the efficiency of the proposed algorithm.
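
For readers new to core numbers, here is a sketch of the static peeling algorithm that the paper improves on; the paper's contribution is avoiding this full recomputation when single edges are inserted or deleted.

def core_numbers(graph):
    """Core number of every vertex of an undirected graph {vertex: set(neighbours)}.
    Simple peeling; the linear-time version replaces min() with bucket sort."""
    degree = {v: len(ns) for v, ns in graph.items()}
    remaining = set(graph)
    core = {}
    while remaining:
        v = min(remaining, key=degree.get)      # peel the lowest-degree vertex
        core[v] = degree[v]
        remaining.remove(v)
        for u in graph[v]:
            if u in remaining and degree[u] > degree[v]:
                degree[u] -= 1
    return core

triangle_plus_tail = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(core_numbers(triangle_plus_tail))
# core numbers: d -> 1; a, b, c -> 2 (the triangle is the 2-core)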

Maintenance of topic maps in the face of incoming information is an important issue.

I am intrigued by the idea of only certain nodes requiring updating based on the addition/deletion of edges to a graph. That is certainly true with topic maps, and my question is whether this work can be adapted to work outside of core number updates.

Or perhaps more clearly, can it be adapted to work as a basis for merging topic maps? Or should core numbers be adapted for processing topic maps?

Questions I would be exploring if I had a topic maps lab. Maybe I should work up a proposal for one at an investment site.

July 11, 2012

Compressive Genomics [Compression as Merging]

Filed under: Bioinformatics,Compression,Genome,Merging,Scalability — Patrick Durusau @ 2:27 pm

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read: subject) that occurs multiple times across similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
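
A toy sketch of that single-representative idea (nothing like the compressive BLAST/BLAT machinery, just the bookkeeping):

# One stored copy of a shared sequence, many pointers back to where it occurs.
reference_store = {}                       # sequence -> list of (genome, offset)

def add_occurrence(sequence, genome, offset):
    reference_store.setdefault(sequence, []).append((genome, offset))

def search(query):
    """One comparison against the single representative answers for every
    genome that contains it -- the sense in which compression acts like merging."""
    return reference_store.get(query, [])

add_occurrence("GATTACA", "genome_1", 1200)
add_occurrence("GATTACA", "genome_2", 8100)
print(search("GATTACA"))                   # [('genome_1', 1200), ('genome_2', 8100)]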

What makes merging computationally tractable here, when topic map systems, at least some of them, are reported to have scalability issues (see Scalability of Topic Map Systems by Marcel Hoyer)?

What other examples of computationally tractable merging would you suggest, including different merging approaches/algorithms? I’m thinking it might be a useful paper/study to work from scalable merging examples towards less scalable ones, perhaps to discover what choices have an impact on scalability.

July 1, 2012

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn’t Enough]

Filed under: Astroinformatics,Bayesian Data Analysis,Dynamic Mapping,Identity,Merging,SQL — Patrick Durusau @ 4:41 pm

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

I commend the paper to you as an example of dynamic merging practice.

June 30, 2012

Station Maps: Browser-Based 3D Maps of the London Underground

Filed under: Mapping,Maps,Merging,Topic Maps — Patrick Durusau @ 6:47 pm

Station Maps: Browser-Based 3D Maps of the London Underground

From Information Aesthetics:

Station Maps [aeracode.org] by programmer Andrew Godwin contains a large collection of browser-based (HTML5) 3D maps depicting different London Underground/DLR stations.

Most of the stations are modelled from memory in combination with a few diagrams found online. This means that the models are not totally accurate, but they should represent the right layout, shape and layering of the stations.

Every map has some underlying structure/ontology onto which other information is added.

Real time merging of train, security camera, security forces, event, etc., information onto such maps is one aspect of merging based on location/interest. Not all information is equally useful to all parties.

June 22, 2012

Merging Data Presents Practical Challenges

Filed under: BigData,Merging — Patrick Durusau @ 2:40 pm

Merging Data Presents Practical Challenges

From the post:

Merging structured and unstructured data remains a great challenge for many enterprise users and enterprise search capabilities are lagging behind the trend to collect as much data as possible–with the intent of eventually finding the infrastructure to address it.

The independent, non-profit research group, the Association for Information and Image Management (AIIM), with backing from sponsor Actuate (BIRT BI vendor), hit on some of these challenges with the release of a new survey-based report.

Blah, blah, blah.

Which brings me to the reason you should read, and encourage others to read, this blog:

http://www.xenos.com/xe/info/aiim-big-data-report – Link to where you can register and get a copy of the report.

I hunted through various press releases, etc. and included a link to the source of the materials quoted.

Some commercial sites apparently either don’t want you to see the source materials or don’t know how to author hyperlinks. Hard to say for sure.

I want you to:

  1. Know where material came from, and
  2. Have the opportunity (whether you take it or not is your call) to agree/disagree with my comments on the material. You can’t do that, or at least not easily, if you have to hunt it down.

Once found by any of us, information should remain found for all of us. (I rather like that, may have to make further use of it.)

May 22, 2012

Happy Go Lucky Identification/Merging?

Filed under: Identity,Merging — Patrick Durusau @ 3:32 pm

MIT News: New mathematical framework formalizes oddball programming techniques

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, identifications in semantic terminology.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different than sorting emails in a discovery process.

If you believe in hyperlinks:

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs, Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), Beijing, China, June 2012.

From Martin Rinard’s publication page.

Has other interesting reading.

March 9, 2012

Functional thinking: Functional design patterns, Part 1

Filed under: Design,Functional Programming,Merging,Topic Map Software — Patrick Durusau @ 8:45 pm

Functional thinking: Functional design patterns, Part 1 – How patterns manifest in the functional world

Summary

Contrary to popular belief, design patterns exist in functional programming — but they sometimes differ from their object-oriented counterparts in appearance and behavior. In this installment of Functional thinking, Neal Ford looks at ways in which patterns manifest in the functional paradigm, illustrating how the solutions differ.

From the article:

Some contingents in the functional world claim that the concept of the design pattern is flawed and isn’t needed in functional programming. A case can be made for that view under a narrow definition of pattern — but that’s an argument more about semantics than use. The concept of a design pattern — a named, cataloged solution to a common problem — is alive and well. However, patterns sometimes take different guises under different paradigms. Because the building blocks and approaches to problems are different in the functional world, some of the traditional Gang of Four patterns (see Resources) disappear, while others preserve the problem but solve it radically differently. This installment and the next investigate some traditional design patterns and rethink them in a functional way.

A functional approach to topic maps lends a certain elegance to the management of merging questions. 😉

Comments?

December 4, 2011

Unified Graph Views with Multigraph

Filed under: Graphs,Merging — Patrick Durusau @ 8:15 pm

Unified Graph Views with Multigraph by Joshua Shinavier.

From the post:

Next in line in this parade of new Blueprints utilities: a Graph implementation called MultiGraph which has just been pushed to Tinkubator:

https://github.com/tinkerpop/tinkubator

MultiGraph wraps multiple, lower-level Graph implementations and provides a combined graph view, unifying vertices and edges by id. So, for example, if you have a vertex with an id of “Arthur” in graph #1 and another vertex with an id of “Arthur” in graph #2, and you put those graphs into a MultiGraph, the unified vertex with the id “Arthur” will have all of the properties of either vertex, as well as all of the edges to or from either vertex. Any vertices and edges which exist in some graphs but not in others will also exist in the MultiGraph view.

Using ids to trigger merging and precedence to cope with conflicting values for edges. I don’t think “precedence” is going to be very robust in the long run but every project has to start somewhere. Preserving provenance after merging is likely to be a requirement in many applications.
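
A rough sketch of the unified-view idea, with a precedence list settling conflicting property values (the “Arthur” id is from the quote; everything else is made up):

def unified_vertex(vertex_id, graphs, precedence):
    """graphs: {graph_name: {vertex_id: {property: value}}};
    precedence: graph names, highest priority first."""
    merged = {}
    for name in reversed(precedence):        # apply lower priority first...
        props = graphs.get(name, {}).get(vertex_id, {})
        merged.update(props)                 # ...so higher priority overwrites
    return merged

graphs = {
    "graph1": {"Arthur": {"lang": "java", "since": 2010}},
    "graph2": {"Arthur": {"lang": "groovy"}},
}
print(unified_vertex("Arthur", graphs, precedence=["graph2", "graph1"]))
# {'lang': 'groovy', 'since': 2010} -- but nothing records *why* graph2 won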

This and similar discussions happen at the Gremlin-Users group.

November 9, 2011

How Common Is Merging?

Filed under: Dataset,Merging,Topic Map Software,Topic Maps — Patrick Durusau @ 7:44 pm

I started wondering about how common merging is in topic maps because I discovered a lack I have not seen before. There aren’t any large test collections of topic maps for CS types to break their clusters against. The sort of thing that challenges their algorithms and hardware.

But test collections should have some resemblance to actual data sets, at least if that is known with any degree of certainty. Or at least be one of the available data sets.

As a first step towards exploring this issue, I grepped for topics in the Opera and CIA Fact Book and got:

  • Opera topic map: 29,738
  • CIA Fact Book: 111,154

for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements. So I count that as merging 14,688 topic elements.

Approximately 10% of the topics in the two sets.

A very crude way to go about this but I was looking for rough numbers that may provoke some discussion and more refined measurements.

I mention that because one thought I had was to simply “cat” the various topic maps at the topicmapslab.de in CTM format together into one file and to “cat” that file until I have 1 million, 10 million and 100 million topic sets (approximately). Just a starter set to see what works/doesn’t work before scaling up the data sets.

Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue and perhaps better that it be that way in order to stress algorithms, etc. It would have the advantage that we could merge the original set and then project the number of merges that should be found in the various sets.

Suggestions/comments?

October 28, 2011

Dedupe, Merge, and Purge: the Art of Normalization

Filed under: Deduplication,Merging,Purge — Patrick Durusau @ 3:14 pm

Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.

From the description:

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

I like the Big Noise line. That may have some traction. It will certainly be the case that when users start having contact with unfiltered big data, they will be as annoyed as they are with web searching.

The answer to their questions is likely “out there” but it lies just beyond their grasp. Failing to find it, they won’t blame the Big Noise or their own lack of skill, but the inoffensive (and ineffective) tools at hand. Guess it is a good thing search engines are free, save for advertising.

PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.

September 29, 2011

Subject Normalization

Filed under: Merging,TMDM,TMRM,Topic Maps — Patrick Durusau @ 6:35 pm

Another way to explain topic maps is in terms of Database normalization, except that I would call it subject normalization. That is, every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects recast to point to this single representative and all properties of the subject gathered in that one place.

One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as entered by the other. And although they may gather different information about a customer, that information can be (doesn’t have to be) available to both of them.

Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).

And unlike owl:sameAs, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is, I can insert an identifier that I know has been used for a subject, without knowing whether it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).

Subject normalization, in the terms of the TMDM, reduces the redundancy of information items. That is true enough, but it is not the primary way users experience subject normalization. How many copies of a subject representative (information items) a system has is of little concern to an end-user.

What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.

Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.
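
A minimal sketch of identifier-based merging (not a full TMDM implementation; a single pass that does not cascade merges when a new topic bridges two existing representatives):

def normalize(topics):
    """topics: dicts with an 'identifiers' set and a 'properties' dict."""
    merged = []
    for topic in topics:
        target = next((m for m in merged
                       if m["identifiers"] & topic["identifiers"]), None)
        if target is None:
            merged.append({"identifiers": set(topic["identifiers"]),
                           "properties": dict(topic["properties"])})
        else:
            target["identifiers"] |= topic["identifiers"]
            target["properties"].update(topic["properties"])
    return merged

shipping = {"identifiers": {"http://example.com/customer/42"},
            "properties": {"ship_to": "PO Box 7"}}
accounting = {"identifiers": {"http://example.com/customer/42", "acct:0042"},
              "properties": {"balance": 125.00}}
print(normalize([shipping, accounting]))   # one topic, both departments' data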

