## Archive for the ‘Merging’ Category

### Excision [Forgetting But Remembering You Forgot (Datomic)]

Tuesday, June 4th, 2013

Excision

From the post:

It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it. When you add a fact:

conn.transact(list(":db/add", 42, ":firstName", "John"));

Datomic does more than merely record that 42’s first name is “John”. Each datom is also associated with a transaction entity, which records the moment (:db/txInstant) the datom was recorded.

(…)

Given this information model, it is easy to see that Datomic can support queries that tell you:

• what you know now
• what you knew at some point in the past
• how and when you came to know any particular datom

So far so good, but there is a fly in the ointment. In certain situations you may be forced to excise data, pulling it out root and branch and forgetting that you ever knew it. This may happen if you store data that must comply with privacy or IP laws, or you may have a regulatory requirement to keep records for seven years and then “shred” them. For these scenarios, Datomic provides excision.

One approach to the unanswered question of what it means to delete something from a topic map.

Especially interesting because you can play with the answer that Datomic provides.

Doesn’t address the issue of what it means to delete a topic that has caused other topics to merge.

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

### Fast algorithm for successively merging k-overlapping sets?

Saturday, April 20th, 2013

Fast algorithm for successively merging k-overlapping sets?

As posted:

Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.

Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?

Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.

Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.
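For the warmup (k = 1) case, the connected-components answer can be sketched with a union-find structure. This is my own minimal Python sketch, not code from the question; names are mine:

```python
def merge_overlapping(sets):
    """Merge sets that share at least one element (the k = 1 case),
    via union-find over the elements: near-linear in m = sum |Si|."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for s in sets:
        members = list(s)
        for other in members[1:]:
            union(members[0], other)  # all members of one set share a root

    groups = {}
    for s in sets:
        for x in s:
            groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Running it on the example above, `merge_overlapping([{1, 2, 3}, {3, 4, 5}, {5, 6, 7}])` collapses everything into the single set {1,…,7}. The harder k ≥ 2 case does not reduce to union-find this way, which is exactly the point of the question.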

I saw this on Theoretical Computer Science Stack Exchange earlier today.

Reminded me of the overlapping set member tests in topic maps: [subject identifiers], [item identifiers], [subject locators], a [subject identifier] of one topic matching an [item identifier] of the other, and [reified] properties. (Well, reified is a simple match, not a set.)

I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.

Working up a list of papers.

### Why Data Lineage is Your Secret … Weapon [Auditing Topic Maps]

Sunday, March 10th, 2013

Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.

From the post:

Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?

It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.

I’m constantly amazed that so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.

But I digress. Why is data lineage your secret data quality weapon?

The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:

• Lack of ownership
• Lack of formal information chain design

Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?

And if you can’t trace, how can you audit the merging of topics?

And if you can’t audit, how do you determine the reliability of your topic map?

That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.

Same questions apply to all data aggregation systems.

Or as Mrs. Weasley tells Ginny:

“Never trust anything that can think for itself if you can’t see where it keeps its brain.”

Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter.

### VIAF: The Virtual International Authority File

Wednesday, March 6th, 2013

VIAF: The Virtual International Authority File

From the webpage:

VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

At a glance

• A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
• All authority data for a given entity is linked together into a “super” authority record
• A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

If you were to substitute the term topic for “super” authority record, you would be part of the way towards a topic map.

Topics gather information about a given entity into a single location.

Topics differ from the authority records you find at VIAF in two very important ways:

1. First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
2. Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.

There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.

### Tombstones in Topic Map Future?

Wednesday, January 16th, 2013

Watching the What’s New in Cassandra 1.2 (Notes) webcast and encountered an unfamiliar term: “tombstones.”

If you’re not familiar with the term, the concept is used in distributed systems that maintain “eventual” consistency by having nodes replicate their content. That works if all nodes are available, but what if you delete data while a node is unavailable? When it comes back, the other nodes are “missing” data that needs to be replicated.

From the description at the Cassandra wiki, DistributedDeletes, not an easy problem to solve.

So, Cassandra turns it into a solvable problem.

Deletes are implemented with a special value known as a tombstone. The tombstone is propagated to nodes that missed the initial delete.

Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
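A toy sketch of the idea (mine, not Cassandra’s actual implementation): a delete is written as a timestamped tombstone, replica reconciliation is last-write-wins with tombstones included, and tombstones older than the grace period are purged:

```python
import time

TOMBSTONE = object()  # sentinel marking a deleted key

class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value-or-TOMBSTONE)

    def put(self, key, value, ts=None):
        self.store[key] = (ts if ts is not None else time.time(), value)

    def delete(self, key, ts=None):
        # A delete is just a write of a tombstone, so it replicates
        # to nodes that were down when the delete happened.
        self.put(key, TOMBSTONE, ts)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] is TOMBSTONE:
            return None
        return entry[1]

    def merge_from(self, other):
        # Last-write-wins reconciliation, tombstones included.
        for key, (ts, val) in other.store.items():
            if key not in self.store or self.store[key][0] < ts:
                self.store[key] = (ts, val)

    def purge_tombstones(self, grace_seconds, now=None):
        # After the grace period the tombstone itself can be dropped.
        now = now if now is not None else time.time()
        self.store = {k: (ts, v) for k, (ts, v) in self.store.items()
                      if not (v is TOMBSTONE and now - ts > grace_seconds)}
```

If a replica misses the delete, the tombstone arrives with the next `merge_from` and overrides the stale value, which is the whole trick: deletion becomes just another fact to replicate.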

Distributed topic maps will face the same issue.

Complicated by imperative programming models of merging, which make changes to properties that alter merging difficult to manage.

Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.

### On Graph Computing [Shared Vertices/Merging]

Monday, January 14th, 2013

On Graph Computing by Marko A. Rodriguez.

Marko writes elegantly about graphs and I was about to put this down as another graph advocacy post. Interesting but graph followers have heard this story before.

But I do read the materials I cite and Marko proceeds to define three separate graphs: software, discussion and concept, each of which has some vertices in common with one or both of the others.

Then he has this section:

A Multi-Domain Graph

The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Shared vertices sounds a lot like merging in the topic map sense to me.

It isn’t clear from the post what requirements may or may not exist for vertices to be “shared.”

Or how would you state the requirements for sharing vertices?

Or how to treat edges that become duplicates when the separate vertices they connect become the same vertex?

If “shared vertices” support what we call merging in topic maps, perhaps there are graph installations waiting to wake up as topic maps.

### MapReduce: Detecting Cycles in Network Graph [Merging Duplicate Identifiers]

Tuesday, January 8th, 2013

MapReduce: Detecting Cycles in Network Graph by Ricky Ho.

From the post:

I recently received an email from a reader of my blog on Map/Reduce algorithm design regarding how to detect whether a graph is acyclic using Map/Reduce. I think this is an interesting problem and can imagine there can be a wide range of applications for it.

Although I haven’t solved this exact problem in the past, I’d like to sketch out my thoughts on a straightforward approach, which may not be highly optimized. My goal is to invite other readers who have solved this problem to share their tricks.

To define the problem: Given a simple directed graph, we want to tell whether it contains any cycles.

Relevant to processing of identifiers in topic maps which may occur on more than one topic (prior to merging).

What is your solution in a mapreduce context?
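This is not Ricky Ho’s solution, but one simple approach that decomposes into whole-graph passes of the kind map/reduce handles well: repeatedly delete nodes with no outgoing edges; a cycle exists if and only if anything survives. A single-machine Python sketch of the idea:

```python
def has_cycle(edges):
    """Detect whether a directed graph contains a cycle by repeatedly
    removing 'sink' nodes (nodes with no outgoing edges). Each peeling
    pass is a whole-graph step that maps naturally onto a map/reduce
    round; here it is just a loop."""
    out = {}
    nodes = set()
    for u, v in edges:
        out.setdefault(u, set()).add(v)
        nodes.update((u, v))

    changed = True
    while changed:
        changed = False
        sinks = {n for n in nodes if not out.get(n)}
        if sinks:
            changed = True
            nodes -= sinks
            for n in nodes:
                if out.get(n):
                    out[n] -= sinks
    return bool(nodes)  # only nodes leading into a cycle survive peeling
```

In an acyclic graph every node eventually becomes a sink once its successors are removed, so the peeling empties the graph; nodes on a cycle can never become sinks.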

### Merging Data Virtualization?

Thursday, January 3rd, 2013

I saw some ad-copy from a company that “wrote the book” on data virtualization (well, “a” book on data virtualization anyway).

Searched a bit in their documentation and elsewhere, but could not find an answer to my questions (below).

Assume departments 1 and 2, each with a data virtualization layer between their apps and the same backend resources:

Requirement: Don’t maintain two separate data virtualization layers for the same resources.

Desired result is:

Questions: Must I return to the data resources to discover their semantics? To merge the two data virtualization layers?

Some may object there should only be one data virtualization layer.

OK, so we have Department 1 – circa 2013 and Department 1 – circa 2015, different data virtualization requirements:

Desired result:

Same questions: Must I return to the data resources to discover their semantics? To merge the existing and proposed data virtualization layers?

The semantics of each item in the data sources (one hopes) was determined for the original data virtualization layer.

It’s wasteful to re-discover the same semantics for changes in data virtualization layers.

Curious: how is rediscovery of semantics avoided in data virtualization software?

Or for that matter, how do you interchange data virtualization layer mappings?

### Computing Strongly Connected Components [Distributed Merging?]

Friday, December 28th, 2012

Computing Strongly Connected Components by Mark C. Chu-Carroll.

From the post:

As promised, today I’m going to talk about how to compute the strongly connected components of a directed graph. I’m going to go through one method, called Kosaraju’s algorithm, which is the easiest to understand. It’s possible to do better than Kosaraju’s by a factor of 2, using an algorithm called Tarjan’s algorithm, but Tarjan’s is really just a variation on the theme of Kosaraju’s.

Kosaraju’s algorithm is amazingly simple. It uses a very clever trick based on the fact that if you reverse all of the edges in a graph, the resulting graph has the same strongly connected components as the original. So using that, we can get the SCCs by doing a forward traversal to find an ordering of vertices, then doing a traversal of the reverse of the graph in the order generated by the first traversal.

That may sound a bit mysterious, but it’s really very simple. Take the graph G, and do a recursive depth-first traversal of the graph, along with an initially empty stack of vertices. As the recursion started at a vertex V finishes, push V onto the stack. At the end of the traversal, you’ll have all of the vertices of G in the stack. The order of the reverse traversal will be starting with the vertices on the top of that stack.

So you reverse all of the edges in the graph, creating the reverse graph, G’. Start with the vertex on top of the stack, and do a traversal from that vertex. All of the nodes reachable from that vertex form one strongly connected component. Remove everything in that SCC from the stack, and then repeat the process with the new top of the stack. When the stack is empty, you’ll have accumulated all of the SCCs.
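Chu-Carroll’s two passes can be sketched in Python. This is my own iterative version (his description is recursive; an explicit stack avoids recursion limits on large graphs):

```python
def kosaraju_scc(graph):
    """Strongly connected components via Kosaraju's algorithm.
    graph: dict mapping vertex -> iterable of successor vertices."""
    # Pass 1: DFS on G, pushing each vertex as its traversal finishes.
    order, visited = [], set()
    for root in graph:
        if root in visited:
            continue
        stack = [(root, iter(graph.get(root, ())))]
        visited.add(root)
        while stack:
            v, children = stack[-1]
            advanced = False
            for w in children:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(graph.get(w, ()))))
                    advanced = True
                    break
            if not advanced:
                order.append(v)  # v's "recursion" finished: push it
                stack.pop()

    # Pass 2: DFS on the reversed graph, in reverse finish order.
    rev = {}
    for v, ws in graph.items():
        for w in ws:
            rev.setdefault(w, []).append(v)
    sccs, assigned = [], set()
    for v in reversed(order):
        if v in assigned:
            continue
        component, stack = set(), [v]
        assigned.add(v)
        while stack:
            u = stack.pop()
            component.add(u)
            for w in rev.get(u, ()):
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        sccs.append(component)
    return sccs
```

For the graph 1 → 2 → 3 → 1 with a stray edge 3 → 4, this yields two components, {1, 2, 3} and {4}.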

Is graph decomposition suggestive of a way to manage distributed merging?

Instead of using “strongly connected” as a criterion for decomposition, one or more subject characteristics are used to decompose the topic map.

Natural language is one that comes immediately to mind, based on the content most likely to be viewed by a particular audience.

What subject characteristics would you use to decompose a topic map?

### Graph Contraction and Minors [Merging by Another Name?]

Friday, December 28th, 2012

Graph Contraction and Minors by Mark C. Chu-Carroll.

From the post:

Another useful concept in simple graph theory is *contraction* and its result, *minors* of graphs. The idea is that there are several ways of simplifying a graph in order to study its properties: cutting edges, removing vertices, and decomposing a graph are all methods we’ve seen before. Contraction is a different technique that works by *merging* vertices, rather than removing them.

Here’s how contraction works. Suppose you have a graph G. Pick a pair of vertices v and w which are adjacent in G. You can create a graph G’ which is a contraction of G by replacing v and w with a *single* vertex, and taking any edge in G which is incident on either v or w, and making it incident on the newly contracted node.
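The contraction step is only a few lines of Python (my own representation choices: a simple undirected graph with edges as frozensets). Note that parallel edges produced by contraction collapse automatically in a set representation, which is one answer to the duplicate-edge question:

```python
def contract(edges, v, w):
    """Contract adjacent vertices v and w of a simple undirected graph
    into a single new vertex. edges: set of frozenset pairs."""
    assert frozenset((v, w)) in edges, "v and w must be adjacent"
    merged = ("contracted", v, w)  # the new single vertex
    out = set()
    for e in edges:
        if e == frozenset((v, w)):
            continue  # the contracted edge itself disappears
        # Re-point any edge incident on v or w at the merged vertex.
        e2 = frozenset(merged if x in (v, w) else x for x in e)
        if len(e2) == 2:  # drop self-loops: this is a simple graph
            out.add(e2)
    return out
```

Contracting the triangle {1,2}, {2,3}, {1,3} plus a pendant edge {3,4} at vertices 1 and 2 leaves just two edges: the duplicated edges to 3 merge into one, and {3,4} is untouched.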

Not quite merging in the topic map sense, because edges between vertices are the basis for “merging,” rather than the properties of the subjects the vertices represent.

Still, deeply intriguing work.

This post was first published in July of 2007, so recent it’s not.

More resources you would suggest?

I first saw this in a tweet by Marko A. Rodriguez.

### HIVE: Handy Integration and Visualisation of multimodal Experimental Data

Sunday, November 25th, 2012

HIVE: Handy Integration and Visualisation of multimodal Experimental Data

From the webpage:

HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for

Handy Integration and Visualisation of multimodal Experimental Data

and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.

You need to see the demo video to appreciate this application!

It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.

Did I mention it also has 3-D image techniques as well?

PS: Yes, it is another example of “Who moved my acronym?”

### Predicting what topics will trend on Twitter [Predicting Merging?]

Friday, November 2nd, 2012

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks, which has an exciting final program, try Stanislav Nikolov’s thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.

Abstract:

In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.

This will be interesting in many classification contexts.

Particularly predicting what topics a user will say represent the same subject.

### Transaction Searching: Unifying Field Names

Monday, October 15th, 2012

Transaction Searching: Unifying Field Names posted by David Carasso.

From the post:

Problem

You need to build transactions from multiple data sources that use different field names for the same identifier.

Solution

Typically, you can join transactions with common fields like:

... | transaction username

But when the username identifier is called different names (login, name, user, owner, and so on) in different data sources, you need to normalize the field names.

If sourcetype A only contains field_A and sourcetype B only contains field_B, create a new field called field_Z which is either field_A or field_B, depending on which is present in an event. You can then build the transaction based on the value of field_Z.

sourcetype=A OR sourcetype=B | eval field_Z = coalesce(field_A, field_B) | transaction field_Z
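In Python terms (my own analogy, not Splunk syntax), the coalesce step is just “first non-null field wins” before grouping events into transactions:

```python
def coalesce(*values):
    """Return the first value that is not None (Splunk's coalesce)."""
    for v in values:
        if v is not None:
            return v
    return None

def build_transactions(events):
    """Group events into 'transactions' on a unified field_Z, where
    field_Z is whichever of field_A / field_B an event carries."""
    transactions = {}
    for event in events:
        key = coalesce(event.get("field_A"), event.get("field_B"))
        if key is not None:
            transactions.setdefault(key, []).append(event)
    return transactions
```

Events from sourcetype A carrying `field_A = "alice"` and events from sourcetype B carrying `field_B = "alice"` end up in the same transaction, exactly as the SPL above arranges.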

Looks a lot like a topic map merging operation doesn’t it?

But “looks a lot like” doesn’t mean it is “the same as” a topic map merging operation.

How would you say it is different?

While the outcome may be the same as a merging operation (which is legend defined), I would say that I don’t know how we got from A or B to Z.

That is, next month or six months from now, or even two years down the road, I have C and I want to modify this transaction.

Question: Can I safely modify this transaction to add C?

“We don’t know. Have to go back to confirm what A and B (as well as C) mean and get back to you on that question.”

For a toy example that seems like overkill, but what if you have thousands of columns spread over hundreds of instances of active data systems?

Topic map based merging could give you that confidence.

Even if, like Scotty, you say two weeks and have the answer later that day, letting it age a bit before delivering it ahead of schedule.

### Merging Data Sets Based on Partially Matched Data Elements

Friday, September 28th, 2012

From the post:

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes.

Reminds me of the techniques used in record linkage (epidemiology). There, unlike topic maps, the records from diverse sources were mapped to a target record layout and then analyzed.

Quite powerful but lossy with regard to the containers of the original data.

### Genetic algorithms: a simple R example

Saturday, August 4th, 2012

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

A genetic algorithm (GA) is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.
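A stripped-down GA along the lines Smeets describes (my own toy, not his R code: it maximizes the count of 1-bits in a bitstring, with elitism; all names and parameter values are mine):

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30,
                      generations=60, mutation_rate=0.02, seed=42):
    """Minimal GA over bitstrings: selection, crossover, mutation,
    with elitism (the best individual always survives)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]

    def crossover(a, b):
        cut = rng.randrange(1, length)      # one-point crossover
        return a[:cut] + b[cut:]

    def mutate(ind):
        return [1 - g if rng.random() < mutation_rate else g for g in ind]

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[0]                      # elitism: keep the best as-is
        parents = pop[: pop_size // 2]      # truncation selection
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents)))
                    for _ in range(pop_size - 1)]
        pop = [elite] + children
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum)  # fitness = count of 1-bits
```

Swapping `sum` for a subject identity scoring function is the kind of substitution the paragraph below has in mind.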

As the size of data stores increases, the cost of personal judgement on each subject identity test will as well. Genetic algorithms may be one way of creating subject identity tests in such situations.

In any event, it won’t harm anyone to be aware of the basic contours of the technique.

I first saw this at R-Bloggers.

### Column Statistics in Hive

Friday, August 3rd, 2012

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan. As an example consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, and table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B + the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in a RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
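The arithmetic in the quoted join-ordering example is easy to check:

```python
# Cardinalities from the example: |A| = 500, |B| = 6000, |C| = 1000,
# |A join B| = 2000 rows, |A join C| = 50 rows.
user_order_cost = (500 * 6000) + (2000 * 1000)  # (A Join B) Join C
cbo_order_cost = (500 * 1000) + (50 * 6000)     # (A Join C) Join B
# user_order_cost = 5000000, cbo_order_cost = 800000:
# the CBO's order is over six times cheaper.
```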

Some days I wonder where improvements to algorithms and data structures are going to lead?

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging anyone?

### The Amazing Mean Shift Algorithm

Saturday, July 21st, 2012

The Amazing Mean Shift Algorithm by Larry Wasserman.

From the post:

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

If you are puzzling over why I cited this post, it might help if you read “(3)” as:

(3) merge data points associated with one mode.

The notion that topics can only be merged on the basis of URLs, or indeed discrete values of any sort, is one way to think about merging. Your data may or may not admit of robust processing on that basis.

Those are all very good ways to merge topics, if and only if they work for your data.

If not, then you need to find ways that work with your data.
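Wasserman’s three steps can be sketched in one dimension (a toy flat-kernel version of my own, nothing like a production implementation):

```python
def mean_shift_1d(points, bandwidth, iterations=50):
    """Toy 1-D mean shift: move each point to the mean of its
    neighbors within `bandwidth` (a flat kernel) until it settles
    on a mode; points that settle on the same mode form one cluster."""
    modes = []
    for x in points:
        for _ in range(iterations):
            neighbors = [p for p in points if abs(p - x) <= bandwidth]
            x = sum(neighbors) / len(neighbors)
        modes.append(round(x, 6))  # round so equal modes compare equal

    # Step (3): merge data points associated with one mode.
    clusters = {}
    for point, mode in zip(points, modes):
        clusters.setdefault(mode, []).append(point)
    return clusters
```

Two well-separated clumps of points each converge to their own mode, so `mean_shift_1d([1.0, 1.1, 1.2, 9.0, 9.1], bandwidth=1.0)` yields two clusters. The final grouping step is the “merge” the post is about.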

### Efficient Core Maintenance in Large Dynamic Graphs

Saturday, July 21st, 2012

Efficient Core Maintenance in Large Dynamic Graphs by Rong-Hua Li and Jeffrey Xu Yu.

Abstract:

The $k$-core decomposition in a graph is a fundamental problem for social network analysis. The problem of $k$-core decomposition is to calculate the core number for every node in a graph. Previous studies mainly focus on $k$-core decomposition in a static graph. There exists a linear time algorithm for $k$-core decomposition in a static graph. However, in many real-world applications such as online social networks and the Internet, the graph typically evolves over time. Under such applications, a key issue is to maintain the core number of nodes given the graph changes over time. A simple implementation is to perform the linear time algorithm to recompute the core number for every node after the graph is updated. Such simple implementation is expensive when the graph is very large. In this paper, we propose a new efficient algorithm to maintain the core number for every node in a dynamic graph. Our main result is that only certain nodes need to update their core number given the graph is changed by inserting/deleting an edge. We devise an efficient algorithm to identify and recompute the core number of such nodes. The complexity of our algorithm is independent of the graph size. In addition, to further accelerate the algorithm, we develop two pruning strategies by exploiting the lower and upper bounds of the core number. Finally, we conduct extensive experiments over both real-world and synthetic datasets, and the results demonstrate the efficiency of the proposed algorithm.
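For reference, the linear-time static decomposition the paper improves on can be sketched by peeling minimum-degree nodes (my own sketch, using a heap rather than the bucket trick of the classic algorithm, so it is O(E log V) instead of linear):

```python
import heapq

def core_numbers(graph):
    """Core number of every node in an undirected graph
    (dict: node -> set of neighbors), by repeatedly removing
    the node of minimum remaining degree."""
    degree = {v: len(ns) for v, ns in graph.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core, removed = {}, set()
    current = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != degree[v]:
            continue  # stale heap entry, skip
        current = max(current, d)  # core number never decreases
        core[v] = current
        removed.add(v)
        for w in graph[v]:
            if w not in removed:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
    return core
```

On a triangle {1, 2, 3} with a pendant node 4 attached to 3, the pendant gets core number 1 and the triangle nodes get 2. The paper’s contribution is avoiding this full recomputation when a single edge is inserted or deleted.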

Maintenance of topic maps in the face of incoming information is an important issue.

I am intrigued by the idea that only certain nodes require updating based on the addition/deletion of edges in a graph. Certainly true with topic maps, and my question is whether this work can be adapted to uses beyond core number updates.

Or perhaps more clearly, can it be adapted to work with a basis for merging topic maps? Or should core numbers be adapted for processing topic maps?

Questions I would be exploring if I had a topic maps lab. Maybe I should work up a proposal for one at an investment site.

### Compressive Genomics [Compression as Merging]

Wednesday, July 11th, 2012

Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)

From the introduction:

In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.

Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.

Software available at: Compression-accelerated BLAST and BLAT.

A new line of attack on searching “big data.”

Making “big data” into “smaller data,” and enabling analysis of it while it is still “smaller data.”

Enabling the searching of highly similar genomes by compression is a form of merging, isn’t it? That is, a sequence (read: subject) that occurs multiple times across similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
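The flavor of the idea, though not the authors’ BLAST/BLAT implementation, can be sketched with k-mers: a subsequence shared across genomes is stored once, with links back to every genome containing it.

```python
from collections import defaultdict

def build_index(genomes, k=4):
    """Map each distinct k-mer to the set of genomes containing it.
    A k-mer shared by many genomes is 'merged' into one entry."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

genomes = {
    "g1": "ACGTACGTTT",
    "g2": "ACGTACGTAA",  # highly similar to g1; shared k-mers stored once
}
index = build_index(genomes)

# Search once against the merged representatives, then expand the
# hit back to the individual genome instances.
hits = index.get("ACGT", set())
print(sorted(hits))  # ['g1', 'g2']
```

The redundancy across similar genomes is exactly what makes the merged index small relative to the raw data.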

What makes merging computationally tractable here, when topic map systems, or at least some of them, are reported to have scalability issues (see Scalability of Topic Map Systems by Marcel Hoyer)?

What other examples of computationally tractable merging would you suggest? Include different merging approaches/algorithms. It might make a useful paper/study to work from scalable merging examples toward less scalable ones, perhaps to discover which choices have an impact on scalability.

### SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn't Enough]

Sunday, July 1st, 2012

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

I commend the paper to you as an example of dynamic merging in practice.
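The core operation is easy to state even if scaling it is hard. A naive sketch of pairwise cross-identification by angular separation, with a fixed matching radius, none of the paper's Bayesian machinery, and illustrative coordinates:

```python
import math

def ang_sep(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions
    (RA/Dec in degrees), via the spherical law of cosines."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = (math.sin(d1) * math.sin(d2)
         + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    return math.degrees(math.acos(min(1.0, max(-1.0, c))))

def cross_match(cat_a, cat_b, radius_deg):
    """Naive O(n*m) pairwise match; real systems prune with
    spatial indexing before comparing candidates."""
    return [(ida, idb)
            for ida, (ra_a, de_a) in cat_a.items()
            for idb, (ra_b, de_b) in cat_b.items()
            if ang_sep(ra_a, de_a, ra_b, de_b) <= radius_deg]

cat_a = {"A1": (10.684, 41.269)}                         # made-up detection
cat_b = {"B1": (10.6841, 41.269), "B2": (120.0, -30.0)}  # one near, one far
print(cross_match(cat_a, cat_b, 1 / 3600))  # match within 1 arcsecond
```

The paper's point is that this pairwise, one-time matching is not enough: N-way, on-demand identification with varying positional errors needs the probabilistic treatment.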

### Station Maps: Browser-Based 3D Maps of the London Underground

Saturday, June 30th, 2012

Station Maps: Browser-Based 3D Maps of the London Underground

Station Maps [aeracode.org] by programmer Andrew Godwin contains a large collection of browser-based (HTML5) 3D maps depicting different London Underground/DLR stations.

Most of the stations are modelled from memory in combination with a few diagrams found online. This means that the models are not totally accurate, but they should represent the right layout, shape and layering of the stations.

Every map has some underlying structure/ontology onto which other information is added.

Real-time merging of train, security camera, security force, event, and other information onto such maps is one aspect of merging based on location/interest. Not all information is equally useful to all parties.

### Merging Data Presents Practical Challenges

Friday, June 22nd, 2012

Merging Data Presents Practical Challenges

From the post:

Merging structured and unstructured data remains a great challenge for many enterprise users and enterprise search capabilities are lagging behind the trend to collect as much data as possible–with the intent of eventually finding the infrastructure to address it.

The independent, non-profit research group the Association for Information and Image Management (AIIM), with backing from sponsor Actuate (a BIRT BI vendor), hit on some of these challenges with the release of a new survey-based report.

Blah, blah, blah.

Which brings to the reason you should read and encourage others to read this blog:

http://www.xenos.com/xe/info/aiim-big-data-report – Link to where you can register and get a copy of the report.

I hunted through various press releases, etc. and included a link to the source of the materials quoted.

Some commercial sites apparently either don’t want you to see the source materials or don’t know how to author hyperlinks. Hard to say for sure.

I want you to:

1. Know where material came from, and
2. Have the opportunity (whether you take it or not is your issue) to agree/disagree with my comments on the material. You can’t do that, or at least not easily, if you have to hunt it down.

Once found by any of us, information should remain found for all of us. (I rather like that, may have to make further use of it.)

### Happy Go Lucky Identification/Merging?

Tuesday, May 22nd, 2012

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”
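Loop perforation itself is easy to sketch. A toy example of mine, not from the paper: estimate a mean from every fourth element and accept a small, quantifiable error.

```python
import random

def mean_exact(xs):
    """The full loop: every element is visited."""
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=4):
    """Loop perforation: execute only every stride-th iteration,
    trading a little accuracy for roughly stride-fold less work."""
    sample = xs[::stride]
    return sum(sample) / len(sample)

random.seed(0)
data = [random.random() for _ in range(100_000)]
# The perforated mean lands very close to the exact one
# while touching only a quarter of the data.
print(mean_exact(data), mean_perforated(data))
```

The same bargain applies to identification: how much of the evidence must we examine before declaring two things the same?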

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, identifications in semantic terminology.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different than sorting emails in a discovery process.

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs, Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), Beijing, China, June 2012.

### Functional thinking: Functional design patterns, Part 1

Friday, March 9th, 2012

Functional thinking: Functional design patterns, Part 1 – How patterns manifest in the functional world

Summary

Contrary to popular belief, design patterns exist in functional programming — but they sometimes differ from their object-oriented counterparts in appearance and behavior. In this installment of Functional thinking, Neal Ford looks at ways in which patterns manifest in the functional paradigm, illustrating how the solutions differ.

From the article:

Some contingents in the functional world claim that the concept of the design pattern is flawed and isn’t needed in functional programming. A case can be made for that view under a narrow definition of pattern — but that’s an argument more about semantics than use. The concept of a design pattern — a named, cataloged solution to a common problem — is alive and well. However, patterns sometimes take different guises under different paradigms. Because the building blocks and approaches to problems are different in the functional world, some of the traditional Gang of Four patterns (see Resources) disappear, while others preserve the problem but solve it radically differently. This installment and the next investigate some traditional design patterns and rethink them in a functional way.
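One concrete instance of the article’s point, sketched in Python rather than the article’s JVM languages: the Gang of Four Strategy pattern collapses to passing a function.

```python
# As an object hierarchy, Strategy needs an interface plus one class
# per algorithm. As a function, the "strategy" is just a parameter.
def total_price(amounts, discount):
    """Apply a pluggable discount strategy to each amount."""
    return sum(discount(a) for a in amounts)

no_discount = lambda a: a
half_off = lambda a: a * 0.5

print(total_price([10, 20], no_discount))  # 30
print(total_price([10, 20], half_off))     # 15.0
```

The problem (interchangeable algorithms) survives; the ceremony does not.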

A functional approach to topic maps lends a certain elegance to the management of merging questions.

### Unified Graph Views with Multigraph

Sunday, December 4th, 2011

Unified Graph Views with Multigraph by Joshua Shinavier.

From the post:

Next in line in this parade of new Blueprints utilities: a Graph implementation called MultiGraph which has just been pushed to Tinkubator:

MultiGraph wraps multiple, lower-level Graph implementations and provides a combined graph view, unifying vertices and edges by id. So, for example, if you have a vertex with an id of “Arthur” in graph #1 and another vertex with an id of “Arthur” in graph #2, and you put those graphs into a MultiGraph, the unified vertex with the id “Arthur” will have all of the properties of either vertex, as well as all of the edges to or from either vertex. Any vertices and edges which exist in some graphs but not in others will also exist in the MultiGraph view.

Using ids to trigger merging and precedence to cope with conflicting values for edges. I don’t think “precedence” is going to be very robust in the long run but every project has to start somewhere. Preserving provenance after merging is likely to be a requirement in many applications.
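A minimal sketch of the id-based unification MultiGraph performs, in plain Python rather than Blueprints’ Java, with last-writer-wins standing in for MultiGraph’s precedence rules:

```python
def merge_graphs(*graphs):
    """Unify vertices by id across graphs. Later graphs take
    precedence when property values conflict (last writer wins);
    edges from all graphs are kept."""
    merged_vertices = {}
    merged_edges = set()
    for g in graphs:
        for vid, props in g["vertices"].items():
            merged_vertices.setdefault(vid, {}).update(props)
        merged_edges.update(g["edges"])
    return {"vertices": merged_vertices, "edges": merged_edges}

g1 = {"vertices": {"Arthur": {"species": "human"}},
      "edges": {("Arthur", "knows", "Ford")}}
g2 = {"vertices": {"Arthur": {"home": "Earth"}, "Ford": {}},
      "edges": {("Ford", "knows", "Zaphod")}}

m = merge_graphs(g1, g2)
# The unified "Arthur" carries properties from both graphs.
print(m["vertices"]["Arthur"])  # {'species': 'human', 'home': 'Earth'}
```

Note that nothing here records which source graph contributed which property, which is exactly the provenance gap mentioned above.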

This and similar discussions happen at the Gremlin-Users group.

### How Common Is Merging?

Wednesday, November 9th, 2011

I started wondering about how common merging is in topic maps because I discovered a lack I have not seen before. There aren’t any large test collections of topic maps for CS types to break their clusters against. The sort of thing that challenges their algorithms and hardware.

But test collections should bear some resemblance to actual data sets, at least insofar as those are known with any degree of certainty, or at least be drawn from available data sets.

As a first step towards exploring this issue, I grepped for topic elements in the Opera topic map and the CIA Fact Book topic map and got:

• Opera topic map: 29,738
• CIA Fact Book: 111,154

for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements. So I count that as merging 14,688 topic elements.

Approximately 10% of the topics in the two sets.

A very crude way to go about this but I was looking for rough numbers that may provoke some discussion and more refined measurements.

I mention that because one thought I had was to simply “cat” the various topic maps at the topicmapslab.de in CTM format together into one file and to “cat” that file until I have 1 million, 10 million and 100 million topic sets (approximately). Just a starter set to see what works/doesn’t work before scaling up the data sets.

Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue and perhaps better that it be that way in order to stress algorithms, etc. It would have the advantage that we could merge the original set and then project the number of merges that should be found in the various sets.
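The projection is simple arithmetic. A sketch using the numbers above (the factor of 8 is my round figure for reaching roughly one million topic elements):

```python
# Topic element counts from the experiment above.
opera, cia = 29_738, 111_154
before = opera + cia        # 140,892 topic elements before merging
after = 126_204             # topic elements after merging the two maps
merged = before - after
print(merged, merged / before)  # 14,688 merges, roughly 10.4%

# Concatenating k copies of the merged map gives k * after topic
# elements but still only `after` distinct topics, so the number of
# merges a correct processor should find is predictable in advance.
k = 8                       # about one million topic elements
print(k * after, (k - 1) * after)  # total elements, projected merges
```

That predictability is what makes such a synthetic collection useful for testing: a processor that finds a different merge count is wrong, not just slow.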

### Dedupe, Merge, and Purge: the Art of Normalization

Friday, October 28th, 2011

Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.

From the description:

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

I like the Big Noise line. That may get some traction. It will certainly be the case that when users start having contact with unfiltered big data, they are likely to be as annoyed as they are with web searching.

The answer to their questions is likely “out there,” but it lies just beyond their grasp. Failing, they won’t blame the Big Noise or their own lack of skill, but the inoffensive (and ineffective) tools at hand. I guess it is a good thing search engines are free, save for advertising.

PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.

### Subject Normalization

Thursday, September 29th, 2011

Another way to explain topic maps is in terms of Database normalization, except that I would call it subject normalization. That is, every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects recast to point to this single representative and all properties of the subject gathered in that one place.

One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as entered by the other. And although they may gather different information about a customer, that information can be (doesn’t have to be) available to both of them.

Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).

And unlike owl:sameAs, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is, I can insert an identifier that I know has been used for a subject, without knowing whether it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).

Subject normalization, in the terms of the TMDM, reduces the redundancy of information items. That is true enough, but it is not how users primarily experience subject normalization. How many copies of a subject representative (information items) a system holds is of little concern to an end-user.

What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.

Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.
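To make the mechanism concrete, a minimal sketch of identifier-based merging, in plain Python and not a conforming TMDM engine: topics sharing any identifier collapse into one representative holding the union of their properties.

```python
def normalize(topics):
    """topics: list of (identifiers, properties) pairs.
    Any two topics sharing an identifier merge into one subject
    representative; merging can cascade when a new topic bridges
    several previously separate ones."""
    owner = {}   # identifier -> index of its merged representative
    merged = []  # list of [identifier_set, property_dict] (or None)
    for ids, props in topics:
        hits = {owner[i] for i in ids if i in owner}
        if hits:
            target = min(hits)
            # A bridging topic folds all other matches into the target.
            for other in sorted(hits - {target}, reverse=True):
                merged[target][0] |= merged[other][0]
                merged[target][1].update(merged[other][1])
                merged[other] = None
        else:
            target = len(merged)
            merged.append([set(), {}])
        merged[target][0] |= set(ids)
        merged[target][1].update(props)
        for i in merged[target][0]:
            owner[i] = target
    return [m for m in merged if m]

# Shipping and accounting each know the customer by "cust:42";
# their separate records normalize to a single subject.
shipping = ({"cust:42"}, {"address": "10 Main St"})
accounting = ({"cust:42", "tax:99"}, {"terms": "net 30"})
print(normalize([shipping, accounting]))
```

Inserting the shared identifier is all the user did; the gathering of properties followed automatically.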

### Linked Data Semantic Issues (same for topic maps?)

Tuesday, September 27th, 2011

Sebastian Schaffert posted a message on the pub-lod@w3c.org list that raised several issues about Linked Data. Issues that sound relevant to topic maps. See what you think.

From the post:

We are working together with many IT companies (with excellent software developers) and trying to convince them that Semantic Web technologies are superior for information integration. They are already overwhelmed when they have to understand that a database ID for an object is not enough. If they have to start distinguishing between the data object and the real world entity the object might be representing, they will be lost completely.

I guess being told that a “real world entity” may have different ways to be identified must seem to be the road to perdition.

Curious because the “real world” is a messy place. Or is that the problem? That the world of developers is artificially “clean,” at least as far as identification and reference.

Perhaps CS programs need to train developers for encounter with the messy “real world.”

From the post:

> When you dereference the URL for a person (such as …/561666514#), you get back RDF. Our _expectation_, of course, is that that RDF will include some remarks about that person (…/561666514#), but there can be no guarantee of this, and no guarantee that it won’t include more information than you asked for. All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful. You add this to your knowledge of the world, and move on.

There I have my main problem. If I ask for “A”, I am not really interested in “B”. What our client implementation therefore does is to throw away everything that is about B and only keeps data about A. Which is – in case of the FB data – nothing. The reason why we do this is that often you will get back a large amount of irrelevant (to us) data even if you only requested information about a specific resource. I am not interested in the 999 other resources the service might also want to offer information about, I am only interested in the data I asked for. Also, you need to have some kind of “handle” on how to start working with the data you get back, like:
1. I ask for information about A, and the server gives me back what it knows about A (there, my expectation again …)
2. From the data I get, I specifically ask for some common properties, like A foaf:name ?N and do something with the bindings of N. Now how would I know how to even formulate the query if I ask for A but get back B?

Ouch! That one cuts a little close.

What about the folks who are “…not really interested in ‘B’.” ?

How do topic maps serve their interests?

Or is that a matter of topic map design? What information to include?

That “merging” and what gets “merged” is a user/client decision?

That is how it works in practice simply due to time, resources, and other constraints.
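The client-side filtering Schaffert describes is easy to sketch (illustrative triples, with the second subject invented for the example):

```python
def triples_about(triples, subject):
    """Keep only statements whose subject is the resource we asked
    for, discarding the extra 'B' data the server also sent."""
    return [t for t in triples if t[0] == subject]

# A dereference response mixing the requested resource with another.
response = [
    ("…/561666514#", "foaf:name", "John"),
    ("…/999#", "foaf:name", "Somebody Else"),  # unrequested "B" data
]
print(triples_about(response, "…/561666514#"))
# → [('…/561666514#', 'foaf:name', 'John')]
```

As Schaffert notes, when the response contains nothing about A, this filter returns nothing at all, which is his complaint in a nutshell.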

Marketing questions:

How to discover data users would like to have appear with other data, prior to having a contract to do so?

Can we re-purpose search logs for that?

### User Generated Content and the Social Graph (thoughts on merging)

Sunday, July 24th, 2011

User Generated Content and the Social Graph by Chris Chandler.

Uses Twitter as a case study. Covers Gizzard and FlockDB, both of which were written in Scala.

Wants to coin the term “MoSQL! (More than SQL).”

A couple of points of interest to topic mappers.

Relationships maintained as forward and backward edges. That is:

“A follows B” and

“B is followed by A”

Twitter design decision: Never delete edges!

I am curious whether any of the current topic map implementations follow that strategy with regard to merging. I am thinking that the act of creating a new item either references existing items (in which case new versions of those items are created) or it is an entirely new item.

In either case, a subsequent call returns the set of matching items and, if there is more than one, takes the most recent by timestamp.
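A sketch of that “never delete, resolve by timestamp” strategy (my illustration, not Twitter’s FlockDB): writes append versions, reads take the latest.

```python
store = []  # append-only log of (item_id, timestamp, payload)

def put(item_id, payload, ts):
    """Writes never overwrite; every version is kept."""
    store.append((item_id, ts, payload))

def get(item_id):
    """All versions survive; a read resolves to the most recent
    matching item by timestamp."""
    matches = [rec for rec in store if rec[0] == item_id]
    return max(matches, key=lambda rec: rec[1])[2] if matches else None

put("arthur", {"follows": ["ford"]}, ts=1)
put("arthur", {"follows": ["ford", "trillian"]}, ts=2)  # old version kept
print(get("arthur"))  # {'follows': ['ford', 'trillian']}
```

For merging, the appeal is the same as for edges: earlier states of a merged item remain recoverable instead of being destroyed.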

As Chris says, disk space is cheap.