Archive for the ‘Merging’ Category
Tuesday, June 4th, 2013
Excision
From the post:
It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it. When you add a fact:
conn.transact(list(":db/add", 42, ":firstName", "John"));
Datomic does more than merely record that 42‘s first name is “John“. Each datom is also associated with a transaction entity, which records the moment (:db/txInstant) the datom was recorded.
(…)
Given this information model, it is easy to see that Datomic can support queries that tell you:
- what you know now
- what you knew at some point in the past
- how and when you came to know any particular datom
So far so good, but there is a fly in the ointment. In certain situations you may be forced to excise data, pulling it out root and branch and forgetting that you ever knew it. This may happen if you store data that must comply with privacy or IP laws, or you may have a regulatory requirement to keep records for seven years and then “shred” them. For these scenarios, Datomic provides excision.
One approach to the unanswered question of what does it means to delete something from a topic map?
Especially interesting because you can play with the answer that Datomic provides.
Doesn’t address the issue of what it means to delete a topic that has caused other topics to merge.
I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.
Posted in Datomic, Merging | No Comments »
Saturday, April 20th, 2013
Fast algorithm for successively merging k-overlapping sets?
As posted:
Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.
Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?
Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.
Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.
I saw this on Theoretical Computer Science Stack Exchange earlier today.
Reminded me of overlapping set members test for [subject identifiers], [item identifiers], [subject locators], [subject identifiers] and [item identifiers], property of the other, [reified] properties. (Well, reified is a simple match, not a set.)
I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.
Working up a list of papers.
Posted in Merging, Overlapping Sets, Sets | No Comments »
Sunday, March 10th, 2013
Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.
From the post:
Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?
It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.
I’m constantly amazed by why so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.
But I digress. Why is data lineage your secret data quality weapon?
The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:
- Lack of ownership
- Lack of formal information chain design
Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?
And if you can’t trace, how can you audit the merging of topics?
And if you can’t audit, how do you determine the reliability of your topic map?
That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.
Same questions apply to all data aggregation systems.
Or as Mrs. Weasley tells Ginny:
“Never trust anything that can think for itself if you can’t see where it keeps its brain.”
Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter.
Posted in Data Quality, Merging, Provenance | No Comments »
Wednesday, March 6th, 2013
VIAF: The Virtual International Authority File
From the webpage:
VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.
The “about” link at the bottom of the page is broken (in the English version). A working “about” link for VIAF reports:
At a glance
- A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
- All authority data for a given entity is linked together into a “super” authority record
- A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities
The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.
Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.
VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.
If you were to substitute for ‘”super” authority record,” the term topic, you would be part of the way towards a topic map.
Topics gather information about a given entity into a single location.
Topics differ from the authority records you find at VIAF in two very important ways:
- First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
- Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.
There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.
Posted in Authority Record, Library, Library Associations, Merging, Subject Identity | No Comments »
Wednesday, January 16th, 2013
Watching the What’s New in Cassandra 1.2 (Notes) webcast and encountered an unfamiliar term: “tombstones.”
If you are already familiar with the concept, skip to another post.
If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by the nodes replicating their content. Which works if all nodes are available but what if you delete data and a node is unavailable? When it comes back, the other nodes are “missing” data that needs to be replicated.
From the description at the Cassandra wiki, DistributedDeletes, not an easy problem to solve.
So, Cassandra turns it into a solvable problem.
Deletes are implemented with a special value known as a tombstone. The tombstone is propogated to nodes that missed the initial delete.
Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
Distributed topic maps will face the same issue.
Complicated by imperative programming models of merging that make changes in properties that alter merging difficult to manage.
Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.
Posted in Cassandra, Distributed Consistency, Distributed Systems, Functional Programming, Merging | No Comments »
Monday, January 14th, 2013
On Graph Computing by Marko A. Rodriguez.
Marko writes elegantly about graphs and I was about to put this down as another graph advocacy post. Interesting but graph followers have heard this story before.
But I do read the materials I cite and Marko proceeds to define three separate graphs, software, discussion and concept. Each of which has some vertexes in common with one or both of the others.
Then he has this section:
A Multi-Domain Graph
The three previous scenarios (software, discussion, and concept) are representations of real-world systems (e.g. GitHub, Google Groups, and Wikipedia). These seemingly disparate models can be seamlessly integrated into a single atomic graph structure by means of shared vertices. For instance, in the associated diagram, Gremlin is a Titan dependency, Titan is developed by Matthias, and Matthias writes messages on Aurelius’ mailing list (software merges with discussion). Next, Blueprints is a Titan dependency and Titan is tagged graph (software merges with concept). The dotted lines identify other such cross-domain linkages that demonstrate how a universal model is created when vertices are shared across domains. The integrated, universal model can be subjected to processes that provide richer (perhaps, more intelligent) services than what any individual model could provide alone.

Shared vertices sounds a lot like merging in the topic map sense to me.
It isn’t clear from the post what requirements may or may not exist for vertices to be “shared.”
Or how you would state the requirements for sharing vertices?
Or how to treat edges that become duplicates when separate vertices they connect now become the same vertices?
If “shared vertices” support what we call merging in topic maps, perhaps there are graph installations waiting to wake up as topic maps.
Posted in Graphs, Merging, Networks | No Comments »
Tuesday, January 8th, 2013
MapReduce: Detecting Cycles in Network Graph by Ricky Ho.
From the post:
I recently received an email from an audience of my blog on Map/Reduce algorithm design regarding how to detect whether a graph is acyclic using Map/Reduce. I think this is an interesting problem and can imagine there can be wide range of application to it.
Although I haven’t solved this exact problem in the past, I’d like to sketch out my thoughts on a straightforward approach, which may not be highly optimized. My goal is to invite other audience who has solved this problem to share their tricks.
To define the problem: Given a simple directed graph, we want to tell whether it contains any cycles.
Relevant to processing of identifiers in topic maps which may occur on more than one topic (prior to merging).
What is your solution in a mapreduce context?
Posted in Giraph, MapReduce, Merging | 1 Comment »
Thursday, January 3rd, 2013
I saw some ad-copy from a company that “wrote the book” on data virtualization (well, “a” book on data virtualization anyway).
Searched a bit in their documentation and elsewhere, but could not find an answer to my questions (below).
Assume departments 1 and 2, each with a data virtualization layer between their apps and the same backend resources:

Requirement: Don’t maintain two separate data virtualization layers for the same resources.
Desired result is:

Questions: Must I return to the data resources to discover their semantices? To merge the two data virtualization layers?
Some may object there should only be one data virtualization layer.
OK, so we have Department 1 – circa 2013 and Department 1 – circa 2015, different data virtualization requirements:

Desired result:

Same Questions:
Question: Must I return to the data resources to discover their semantics? To merge the existing and proposed data virtualizaton layers?
The semantics of each item in the data sources (one hopes) was determined for the original data virtualization layer.
It’s wasteful to re-discover the same semantics for changes in data virtualization layers.
Curious, how rediscovery of semantics is avoided in data virtualization software?
Or for that matter, how do you interchange data virtualization layer mappings?
Posted in Data Virtualization, Merging | No Comments »
Friday, December 28th, 2012
Computing Strongly Connected Components by Mark C. Chu-Carroll.
From the post:
As promised, today I’m going to talk about how to compute the strongly connected components of a directed graph. I’m going to go through one method, called Kosaraju’s algorithm, which is the easiest to understand. It’s possible to do better that Kosaraju’s by a factor of 2, using an algorithm called Tarjan’s algorithm, but Tarjan’s is really just a variation on the theme of Kosaraju’s.
Kosaraju’s algorithm is amazingly simple. It uses a very clever trick based on the fact that if you reverse all of the edges in a graph, the resulting graph has the same strongly connected components as the original. So using that, we can get the SCCs by doing a forward traversal to find an ordering of vertices, then doing a traversal of the reverse of the graph in the order generated by the first traversal.
That may sound a bit mysterious, but it’s really very simple. Take the graph G, and do a recursive depth-first traversal of the graph, along with an initially empty stack of vertices. As the recursion started at a vertex V finishes, push V onto the stack. At the end of the traversal, you’ll have all of the vertices of G in the stack. The order of the reverse traversal will be starting with the vertices on the top of that stack.
So you reverse all of the edges in the graph, creating the reverse graph, G’. Start with the vertex on top of the stack, and to a traversal from that vertex. All of the nodes reachable from that vertex form one strongly connected component. Remove everything in that SCC from the stack, and then repeat the process with the new top of the stack. When the stack is empty, you’ll have accumulated all of the SCCs.
Is graph decomposition suggestive of way to manage distributed merging?
Instead of using “strongly connected” as a criteria for decomposition, one or more subject characteristics are used to decompose the topic map.
Natural language is one that comes immediately to mind. Based on the content most likely to be viewed by a particular audience.
What subject characteristics would you use to decompose a topic map?
Posted in Graphs, Merging, Topic Maps | No Comments »
Friday, December 28th, 2012
Graph Contraction and Minors by Mark C. Chu-Carroll.
From the post:
Another useful concept in simple graph theory is *contraction* and its result, *minors* of graphs. The idea is that there are several ways of simplifying a graph in order to study its properties: cutting edges, removing vertices, and decomposing a graph are all methods we’ve seen before. Contraction is a different technique that works by *merging* vertices, rather than removing them.
Here’s how contraction works. Suppose you have a graph G. Pick a pair of vertices v and w which are adjacent in G. You can create a graph G’ which is a contraction of G by replacing v and w with a *single* vertex, and taking any edge in G which is incident on either v or w, and making it incident on the newly contracted node.

Not quite merging in the topic map sense because edges between vertexes are the basis for “merging.”
Instead of on the basis of properties of subjects the vertexes represent.
Still, deeply intriguing work.
This post was first published in July of 2007, so recent it’s not.
More resources you would suggest?
I first saw this in a tweet by Marko A. Rodriguez.
Posted in Graphs, Merging | No Comments »
Sunday, November 25th, 2012
HIVE: Handy Integration and Visualisation of multimodal Experimental Data
From the webpage:
HIVE is an Add-on for the VANTED system. VANTED is a graph editor extended for the visualisation and analysis of biological experimental data in context of pathways/networks. HIVE stands for
Handy Integration and Visualisation of multimodal Experimental Data
and extends the functionality of VANTED by adding the handling of volumes and images, together with a workspace approach, allowing one to integrate data of different biological data domains.
You need to see the demo video to appreciate this application!
It offers import of data, mapping rules to merge data from different data sets, easy visualization as a graph and other features.
Did I mention it also has 3-D image techniques as well?
PS: Yes, it is another example of “Who moved my acronym?”
Posted in Bioinformatics, Biomedical, Graphs, Mapping, Merging, Visualization | No Comments »
Monday, October 15th, 2012
Transaction Searching: Unifying Field Names posted by David Carraso.
From the post:
Problem
You need to build transactions from multiple data sources that use different field names for the same identifier.
Solution
Typically, you can join transactions with common fields like:
... | transaction username
But when the username identifier is called different names (login, name, user, owner, and so on) in different data sources, you need to normalize the field names.
If sourcetype A only contains field_A and sourcetype B only contains field_B, create a new field called field_Z which is either field_A or field_B, depending on which is present in an event. You can then build the transaction based on the value of field_Z.
sourcetype=A OR sourcetype=B
| eval field_Z = coalesce(field_A, field_B)
| transaction field_Z
Looks a lot like a topic map merging operation doesn’t it?
But “looks a lot like” doesn’t mean it is “the same as” a topic map merging operation.
How would you say it is different?
While the outcome may be the same as a merging operation (which are legend defined), I would say that I don’t know how we got from A or B to Z?
That is next month or six months from now, or even two years down the road, I have C and I want to modify this transaction.
Question: Can I safely modify this transaction to add C?
I suspect the answer is:
“We don’t know. Have to go back to confirm what A and B (as well as C) mean and get back to you on that question.”
For a toy example that seems like overkill, but what if you have thousands of columns spread over hundreds of instances of active data systems.
Still feel confident about offering an answer without researching it?
Topic map based merging could give you that confidence.
Even if like Scotty, you say two weeks and have the answer later that day, to age a bit before delivering it ahead of schedule.
Posted in Merging, Splunk | 1 Comment »
Friday, September 28th, 2012
Merging Data Sets Based on Partially Matched Data Elements by Tony Hirst.
From the post:
A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes.
Reminds me of the techniques used in record linkage (epidemiology). There, unlike topic maps, the records from diverse sources were mapped to a target record layout and then analyzed.
Quite powerful but lossy with regard to the containers of the original data.
Posted in Merging, R, Record Linkage | No Comments »
Saturday, August 4th, 2012
Genetic algorithms: a simple R example by Bart Smeets.
From the post:
Genetic algorithm is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.
GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).
The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.
As the size of data stores increase, the cost of personal judgement on each subject identity test will as well. Genetic algorithms may be one way of creating subject identity tests in such situations.
In any event, it won’t harm anyone to be aware of the basic contours of the technique.
I first saw this at R-Bloggers.
Posted in Genetic Algorithms, Merging, Subject Identity | No Comments »
Friday, August 3rd, 2012
Column Statistics in Hive by Shreepadma Venugopalan.
From the post:
Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.
Motivation
While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumbs to generate an efficient execution plan for a query. While such rules of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.
In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan. As an example consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records.Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B + cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in a RDBMS would pick the more optimal alternate order [(A Join C) Join B] thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.
Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.
In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
Some days I wonder where improvements to algorithms and data structures are going to lead?
Other days, I just enjoy the news.
Today is one of the latter.
PS: What a cost based optimizer (CBO) would look like for merging operations? Or perhaps better, merge cost estimator (MCE)? Metered merging anyone?
Posted in Cloudera, Hive, Merging, Statistics | No Comments »
Saturday, July 21st, 2012
The Amazing Mean Shift Algorithm by Larry Wasserman.
From the post:
The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.
The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.
If you are puzzling over why I cited this post, it might help if you read “(3)” as:
(3) merge data points associated with one mode.
The notion that topics can only be merged on the basis of URLs, actually discrete values of any sort, is one way to think about merging. Your data may or may not admit to robust processing on that basis.
Those are all very good ways to merge topics, if and only if that works for your data.
If not, then you need to find ways that work with your data.
Posted in Algorithms, Merging | No Comments »
Saturday, July 21st, 2012
Efficient Core Maintenance in Large Dynamic Graphs by Rong-Hua Li and Jeffrey Xu Yu.
Abstract:
The $k$-core decomposition in a graph is a fundamental problem for social network analysis. The problem of $k$-core decomposition is to calculate the core number for every node in a graph. Previous studies mainly focus on $k$-core decomposition in a static graph. There exists a linear time algorithm for $k$-core decomposition in a static graph. However, in many real-world applications such as online social networks and the Internet, the graph typically evolves over time. Under such applications, a key issue is to maintain the core number of nodes given the graph changes over time. A simple implementation is to perform the linear time algorithm to recompute the core number for every node after the graph is updated. Such simple implementation is expensive when the graph is very large. In this paper, we propose a new efficient algorithm to maintain the core number for every node in a dynamic graph. Our main result is that only certain nodes need to update their core number given the graph is changed by inserting/deleting an edge. We devise an efficient algorithm to identify and recompute the core number of such nodes. The complexity of our algorithm is independent of the graph size. In addition, to further accelerate the algorithm, we develop two pruning strategies by exploiting the lower and upper bounds of the core number. Finally, we conduct extensive experiments over both real-world and synthetic datasets, and the results demonstrate the efficiency of the proposed algorithm.
Maintenance of topic maps in the face of incoming information is an important issue.
I am intrigued by the idea of only certain nodes requiring updating based on addition/deletion of edges to a graph. Certainly true with topic maps and my question is whether this work can be adapted to work outside of core number updates?
Or perhaps more clearly, can it be adapted to work with a basis for merging topic maps? Or should core numbers be adapted for processing topic maps.
Questions I would be exploring if I had a topic maps lab. Maybe I should work up a proposal for one at an investment site.
Posted in Graphs, Merging, Topic Map Software | No Comments »
Wednesday, July 11th, 2012
Compressive genomics by Po-Ru Loh, Michael Baym, and Bonnie Berger (Nature Biotechnology 30, 627–630 (2012) doi:10.1038/nbt.2241)
From the introduction:
In the past two decades, genomic sequencing capabilities have increased exponentially[cites omitted] outstripping advances in computing power[cites omitted]. Extracting new insights from the data sets currently being generated will require not only faster computers, but also smarter algorithms. However, most genomes currently sequenced are highly similar to ones already collected[cite omitted]; thus, the amount of new sequence information is growing much more slowly.
Here we show that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data using methods we term ‘compressive’ algorithms. This approach reduces the task of computing on many similar genomes to only slightly more than that of operating on just one. Moreover, its relative advantage over existing algorithms will grow with the accumulation of genomic data. We demonstrate this approach by implementing compressive versions of both the Basic Local Alignment Search Tool (BLAST)[cite omitted] and the BLAST-Like Alignment Tool (BLAT)[cite omitted], and we emphasize how compressive genomics will enable biologists to keep pace with current data.
Software available at: Compression-accelerated BLAST and BLAT.
A new line of attack on searching “big data.”
Making “big data” into “smaller data” and enabling analysis of it while still “smaller data.”
Enabling the searching of highly similar genomes by compression is a form of merging isn’t it? That is a sequence (read subject) that occurs multiple times over similar genomes is given a single representative, while preserving its relationship to all the individual genome instances.
What makes merger computationally tractable here and yet topic may systems, at least some of them, are reported to have scalability issues: Scalability of Topic Map Systems by Marcel Hoyer?
What other examples of computationally tractable merging would you suggest? Including different merging approaches/algorithms. Thinking it might be a useful paper/study to work from scalable merging examples towards less scalable ones. Perhaps to discover what choices have an impact on scalability.
Posted in Bioinformatics, Compression, Genome, Merging, Scalability | 1 Comment »
Sunday, July 1st, 2012
SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.
Abstract:
Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.
Astronomy is a cool area to study and has data out the wazoo, but I was struck by:
One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.
Is identity with sharp edges, susceptible to pair-wise mapping, the common case?
Or do we just see some identity issues that way?
Commend the paper to you as an example of dynamic merging practice.
Posted in Astroinformatics, Bayesian Data Analysis, Dynamic Mapping, Identity, Merging, SQL | No Comments »
Saturday, June 30th, 2012
Station Maps: Browser-Based 3D Maps of the London Underground
From Information Asthetics:
Station Maps [aeracode.org] by programmer Andrew Godwin contains a large collection of browser-based (HTML5) 3D maps depicting different London Underground/DLR stations.
Most of the stations are modelled from memory in combination with a few diagrams found online. This means that the models are not totally accurate, but they should represent the right layout, shape and layering of the stations.
Every map has some underlying structure/ontology onto which other information is added.
Real time merging of train, security camera, security forces, event, etc., information onto such maps is one aspect of merging based on location/interest. Not all information is equally useful to all parties.
Posted in Mapping, Maps, Merging, Topic Maps | No Comments »
Friday, June 22nd, 2012
Merging Data Presents Practical Challenges
From the post:
Merging structured and unstructured data remains a great challenge for many enterprise users and enterprise search capabilities are lagging behind the trend to collect as much data as possible–with the intent of eventually finding the infrastructure to address it.
The independent, non-profit research group, the Association for Information and Image Management, (AIIM) with backing from sponsor Actuate (BIRT BI vendor) hit on some of this challenges with the release of a new survey-based report.
Blah, blah, blah.
Which brings to the reason you should read and encourage others to read this blog:
http://www.xenos.com/xe/info/aiim-big-data-report – Link to where you can register and get a copy of the report.
I hunted through various press releases, etc. and included a link to the source of the materials quoted.
Some commercial sites apparently either don’t want you to see the source materials or don’t know how to author hyperlinks. Hard to say for sure.
I want you to:
- Know where material came from, and
- Have the opportunity (whether you take it or no is your issue) to agree/disagree with my comments on material. Can’t do that or at least not easily if you have to hunt it down.
Once found by any of us, information should remain found for all of us. (I rather like that, may have to make further use of it.)
Posted in BigData, Merging | No Comments »
Tuesday, May 22nd, 2012
MIT News: New mathematical framework formalizes oddball programming techniques
From the post:
Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”
I like that for making topic maps scale, “…skip a bunch of steps….”
Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.
We are often mistaken about facts, faces, identifications in semantic terminology.
Why think we can build programs or machines that can do better?
Let’s stop rolling the identification stone up the hill.
Ask “how accurate does the identification/merging need to be?”
The answer for aiming a missile is probably different than sorting emails in a discovery process.
If you believe in hyperlinks:
Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012) Beijing, China June 2012
From Martin Rinard’s publication page.
Has other interesting reading.
Posted in Identity, Merging | No Comments »
Friday, March 9th, 2012
Functional thinking: Functional design patterns, Part 1 – How patterns manifest in the functional world
Summary
Contrary to popular belief, design patterns exist in functional programming — but they sometimes differ from their object-oriented counterparts in appearance and behavior. In this installment of Functional thinking, Neal Ford looks at ways in which patterns manifest in the functional paradigm, illustrating how the solutions differ.
From the article:
Some contingents in the functional world claim that the concept of the design pattern is flawed and isn’t needed in functional programming. A case can be made for that view under a narrow definition of pattern — but that’s an argument more about semantics than use. The concept of a design pattern — a named, cataloged solution to a common problem — is alive and well. However, patterns sometimes take different guises under different paradigms. Because the building blocks and approaches to problems are different in the functional world, some of the traditional Gang of Four patterns (see Resources) disappear, while others preserve the problem but solve it radically differently. This installment and the next investigate some traditional design patterns and rethink them in a functional way.
A functional approach to topic maps lends a certain elegance to the management of merging questions.
Comments?
Posted in Design, Functional Programming, Merging, Topic Map Software | No Comments »
Sunday, December 4th, 2011
Unified Graph Views with Multigraph by Joshua Shinavier.
From the post:
Next in line in this parade of new Blueprints utilities: a Graph implementation called MultiGraph which has just been pushed to Tinkubator:
https://github.com/tinkerpop/tinkubator
MultiGraph wraps multiple, lower-level Graph implementations and provides a combined graph view, unifying vertices and edges by id. So, for example, if you have a vertex with an id of “Arthur” in graph #1 and another vertex with an id of “Arthur” in graph #2, and you put those graphs into a MultiGraph, the unified vertex with the id “Arthur” will have all of the properties of either vertex, as well as all of the edges to or from either vertex. Any vertices and edges which exist in some graphs but not in others will also exist in the MultiGraph view.
Using ids to trigger merging and precedence to cope with conflicting values for edges. I don’t think “precedence” is going to be very robust in the long run but every project has to start somewhere. Preserving provenance after merging is likely to be a requirement in many applications.
This and similar discussions happen at the Gremlin-Users group.
Posted in Graphs, Merging | No Comments »
Wednesday, November 9th, 2011
I started wondering about how common merging is in topic maps because I discovered a lack I have not seen before. There aren’t any large test collections of topic maps for CS types to break their clusters against. The sort of thing that challenges their algorithms and hardware.
But test collections should have some resemblance to actual data sets, at least if that is known with any degree of certainty. Or at least be one of the available data sets.
As a first step towards exploring this issue, I grepped for topics in the Opera and CIA Fact Book and got:
- Opera topic map: 29,738
- CIA Fact Book: 111,154
for a total of 140,892 topic elements. After merging the two maps, there were 126,204 topic elements. So I count that as merging 14,688 topic elements.
Approximately 10% of the topics in the two sets.
A very crude way to go about this but I was looking for rough numbers that may provoke some discussion and more refined measurements.
I mention that because one thought I had was to simply “cat” the various topic maps at the topicmapslab.de in CTM format together into one file and to “cat” that file until I have 1 million, 10 million and 100 million topic sets (approximately). Just a starter set to see what works/doesn’t work before scaling up the data sets.
Creating the files in this manner is going to result in a “merge heavy” topic map due to the duplication of content. That may not be a serious issue and perhaps better that it be that way in order to stress algorithms, etc. It would have the advantage that we could merge the original set and then project the number of merges that should be found in the various sets.
Suggestions/comments?
Posted in Dataset, Merging, Topic Map Software, Topic Maps | No Comments »
Friday, October 28th, 2011
Dedupe, Merge, and Purge: the Art of Normalization by Tyler Bell and Leo Polovets.
From the description:
Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.
I like the Big Noise line. That may have some traction. Certainly will be the case that when users started having contact with unfiltered big data, they are likely to be as annoyed as they are with web searching.
The answer to their questions is likely “out there” but it lies just beyond their grasp. Failing they won’t blame the Big Noise or their own lack of skill but the inoffensive (and ineffective) tools at hand. Guess it is a good thing search engines are free save for advertising.
PS: The slides have a number of blanks. I have written to the authors, well, to their company, asking for corrected slides to be posted.
Posted in Deduplication, Merging, Purge | No Comments »
Thursday, September 29th, 2011
Another way to explain topic maps is in terms of Database normalization, except that I would call it subject normalization. That is every subject that is explicitly represented in the topic map appears once and only once, with relations to other subjects being recast to point to this single representative and all properties of the subject gathered to that one place.
One obvious advantage is that the shipping and accounting departments, for example, both have access to updated information for a customer as soon as entered by the other. And although they may gather different information about a customer, that information can be (doesn’t have to be) available to both of them.
Unlike database normalization, subject normalization in topic maps does not require rewriting of database tables, which can cause data access problems. Subject normalization (merging) occurs automatically, based on the presence of properties defined by the Topic Maps Data Model (TMDM).
And unlike OWL same:As, subject normalization in topic maps does not require knowledge of the “other” subject representative. That is I can insert an identifier that I know has been used for a subject, without knowledge it has been used in this topic map, and topics representing that subject will automatically merge (or be normalized).
Subject normalization in the terms of the TMDM, reduces the redundancy of information items. Which is true enough but not the primary experience of users with subject normalization. How many copies of a subject representative (information items) a system has is of little concern for an end-user.
What does concern end-users is getting the most complete and up-to-date information on a subject, however that is accomplished.
Topic maps accomplish that goal by empowering users to add identifiers to subject representatives that result in subject normalization. It doesn’t get any easier than that.
Posted in Merging, TMDM, TMRM, Topic Maps | No Comments »
Tuesday, September 27th, 2011
Sebastian Schaffert posted a message on the pub-lod@w3c.org list that raised several issues about Linked Data. Issues that sound relevant to topic maps. See what you think.
From the post:
We are working together with many IT companies (with excellent software developers) and trying to convince them that Semantic Web technologies are superior for information integration. They are already overwhelmed when they have to understand that a database ID for an object is not enough. If they have to start distinguishing between the data object and the real world entity the object might be representing, they will be lost completely.
I guess being told that a “real world entity” may have different ways to be identified must seem to be the road to perdition.
Curious because the “real world” is a messy place. Or is that the problem? That the world of developers is artificially “clean,” at least as far as identification and reference.
Perhaps CS programs need to train developers for encounter with the messy “real world.”
From the post:
> When you dereference the URL for a person (such as …/561666514#), you get back RDF. Our _expectation_, of course, is that that RDF will include some remarks about that person (…/561666514#), but there can be no guarantee of this, and no guarantee that it won’t include more information than you asked for. All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful. You add this to your knowledge of the world, and move on.
There I have my main problem. If I ask for “A”, I am not really interested in “B”. What our client implementation therefore does is to throw away everything that is about B and only keeps data about A. Which is – in case of the FB data – nothing. The reason why we do this is that often you will get back a large amount of irrelevant (to us) data even if you only requested information about a specific resource. I am not interested in the 999 other resources the service might also want to offer information about, I am only interested in the data I asked for. Also, you need to have some kind of “handle” on how to start working with the data you get back, like:
1. I ask for information about A, and the server gives me back what it knows about A (there, my expectation again …)
2. From the data I get, I specifically ask for some common properties, like A foaf:name ?N and do something with the bindings of N. Now how would I know how to even formulate the query if I ask for A but get back B?
Ouch! That one cuts a little close.
What about the folks who are “…not really interested in ‘B’.” ?
How do topic maps serve their interests?
Or have we decided for them that more information about a subject is better?
Or is that a matter of topic map design? What information to include?
That “merging” and what gets “merged” is a user/client decision?
That is how it works in practice simply due to time, resources, and other constraints.
Marketing questions:
How to discover data users would like to have appear with other data, prior to having a contract to do so?
Can we re-purpose search logs for that?
Posted in LOD, Linked Data, Marketing, Merging, Topic Maps | No Comments »
Sunday, July 24th, 2011
User Generated Content and the Social Graph by Chris Chandler.
Uses Twitter as a case study. Covers Gizzard and FlockDB, both of which were written in Scala.
Wants to coin the term “MoSQL! (More than SQL).”
A couple of points of interest to topic mappers.
Relationships maintained as forward and backward edges. That is:
“A follows B” and
“B is followed by A”
Twitter design decision: Never delete edges!
Curious if any of the current topic map implementations follow that strategy with regard to merging? Thinking that the act of creating a new item either references other existing item (in which case create new version of those items) or it is an entirely new item.
In either case, a subsequent call returns a set of matching items and if more than one, take the most recent one by timestamp.
As Chris says, disk space is cheap.
Works for Twitter.
Posted in FlockDB, Gizzard, Merging, Social Graphs | No Comments »