## Archive for the ‘Merging’ Category

### We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Tuesday, February 14th, 2017

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, or not relevant for queries
2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
3. Train a model that can accurately map features to a relevance score
4. Deploy the model to your search infrastructure, using it to rank search results in production
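The four steps above can be sketched end-to-end. Below is a minimal pointwise example with invented judgments and two hypothetical features (a TF*IDF-style match score and a recency score); a hand-rolled linear model stands in for whatever model a real deployment would train and upload to the plugin:

```python
# Minimal pointwise learning-to-rank sketch (data and feature values invented).

# Step 1: judgment list -- (query, doc_id) graded 2=exact, 1=moderate, 0=not relevant
judgments = {
    ("jacket", "d1"): 2, ("jacket", "d2"): 1, ("jacket", "d3"): 0,
    ("boots",  "d4"): 2, ("boots",  "d5"): 0,
}

# Step 2: hypothesized features per (query, doc): [tfidf_match, recency]
features = {
    ("jacket", "d1"): [0.9, 0.8], ("jacket", "d2"): [0.6, 0.2],
    ("jacket", "d3"): [0.1, 0.9], ("boots",  "d4"): [0.8, 0.5],
    ("boots",  "d5"): [0.2, 0.1],
}

# Step 3: train a linear model mapping features to a relevance score
# (stochastic gradient descent on squared error)
w = [0.0, 0.0]
for _ in range(2000):
    for key, grade in judgments.items():
        x = features[key]
        err = sum(wi * xi for wi, xi in zip(w, x)) - grade
        w = [wi - 0.05 * err * xi for wi, xi in zip(w, x)]

# Step 4: "deploy" -- rank candidate documents for a query by model score
def rank(query, doc_ids):
    return sorted(doc_ids,
                  key=lambda d: sum(wi * xi for wi, xi in zip(w, features[(query, d)])),
                  reverse=True)

print(rank("jacket", ["d3", "d2", "d1"]))
```

The trained weights end up rewarding the match feature and discounting recency, so the ranking recovers the judgment grades; a production system would use the plugin's model formats rather than this toy.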

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)

A great post as always but of particular interest for topic map fans is this passage:

Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

### Data Provenance: A Short Bibliography

Tuesday, September 6th, 2016

The video Provenance for Database Transformations by Val Tannen ends with a short bibliography.

Links and abstracts for the items in Val’s bibliography:

Provenance Semirings by Todd J. Green, Grigoris Karvounarakis, Val Tannen. (2007)

We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.
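The semiring idea is concrete enough to sketch. Below, tuples carry symbolic provenance annotations (the source-tuple names p, q, r are illustrative): a natural join multiplies annotations (joint use) and alternative derivations of the same output tuple add:

```python
# Sketch of semiring-annotated relations (provenance polynomials).
# Annotations are symbolic strings: join = "*" (joint use), union = "+" (alternative use).

# R(a, b) and S(b, c), each tuple annotated with a source-tuple name.
R = {("a1", "b1"): "p", ("a2", "b1"): "q"}
S = {("b1", "c1"): "r"}

def join(r, s):
    """Natural join on the shared middle attribute; annotations multiply."""
    out = {}
    for (a, b1), ann1 in r.items():
        for (b2, c), ann2 in s.items():
            if b1 == b2:
                key = (a, b1, c)
                prov = f"{ann1}*{ann2}"
                # Alternative derivations of the same tuple add up.
                out[key] = f"{out[key]} + {prov}" if key in out else prov
    return out

result = join(R, S)
print(result)   # each answer tuple records which source tuples produced it
```

Instantiating the symbolic polynomial in a concrete semiring (booleans, counts, probabilities) then recovers set semantics, bag semantics, probabilistic databases, and so on, which is the paper's point.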

Update Exchange with Mappings and Provenance by Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, Val Tannen. (2007)

We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions — expressing what data and sources a peer judges to be authoritative — which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in [20].

In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the ORCHESTRA prototype system.

Annotated XML: Queries and Provenance by J. Nathan Foster, Todd J. Green, Val Tannen. (2008)

We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications.

Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.

Containment of Conjunctive Queries on Annotated Relations by Todd J. Green. (2009)

We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.

Collaborative Data Sharing with Mappings and Provenance by Todd J. Green, dissertation. (2009)

A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings.

We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of its implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.

Provenance in Collaborative Data Sharing by Grigoris Karvounarakis, dissertation. (2009)

This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange — which publishes a participant’s updates and then translates others’ updates to the participant’s local schema and imports them — while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases.

To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the ORCHESTRA prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS and indexing techniques to speed up provenance querying, and evaluate experimentally the performance of provenance querying and the benefits of our indexing techniques.

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, Val Tannen. (2011)

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

Circuits for Datalog Provenance by Daniel Deutch, Tova Milo, Sudeepa Roy, Val Tannen. (2014)

The annotation of the results of database queries with provenance information has many applications. This paper studies provenance for datalog queries. We start by considering provenance representation by (positive) Boolean expressions, as pioneered in the theories of incomplete and probabilistic databases. We show that even for linear datalog programs the representation of provenance using Boolean expressions incurs a super-polynomial size blowup in data complexity. We address this with an approach that is novel in provenance studies, showing that we can construct in PTIME poly-size (data complexity) provenance representations as Boolean circuits. Then we present optimization techniques that embed the construction of circuits into seminaive datalog evaluation, and further reduce the size of the circuits. We also illustrate the usefulness of our approach in multiple application domains such as query evaluation in probabilistic databases, and in deletion propagation. Next, we study the possibility of extending the circuit approach to the more general framework of semiring annotations introduced in earlier work. We show that for a large and useful class of provenance semirings, we can construct in PTIME poly-size circuits that capture the provenance.

Incomplete but a substantial starting point exploring data provenance and its relationship/use with topic map merging.

To get a feel for “data provenance” just prior to the earliest reference here (2007), consider A Survey of Data Provenance Techniques by Yogesh L. Simmhan, Beth Plale, Dennis Gannon, published in 2005.

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources.

The provenance of data products generated by complex transformations such as workflows is of considerable value to scientists. From it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated re-enactment of derivations to update a data, and provide attribution of data sources. Provenance is also essential to the business domain where it can be used to drill down to the source of data in a data warehouse, track the creation of intellectual property, and provide an audit trail for regulatory purposes.

In this paper we create a taxonomy of data provenance techniques, and apply the classification to current research efforts in the field. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. Our synthesis can help those building scientific and business metadata-management systems to understand existing provenance system designs. The survey culminates with an identification of open research problems in the field.

Another rich source of reading material!

### Merge 5 Proxies, Take Away 1 Proxy = ? [Data Provenance]

Monday, September 5th, 2016

Provenance for Database Transformations by Val Tannen. (video)

Description:

Database transformations (queries, views, mappings) take apart, filter, and recombine source data in order to populate warehouses, materialize views, and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations’ output. This relationship is what we call database provenance.

This talk presents an approach to database provenance that relies on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example to uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation, in a precise and practically useful sense. Second, the propagation of annotation through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively. This leads to annotations forming a specific algebraic structure, a commutative semiring.

The semiring approach works for annotating tuples, field values and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Finally, when properly extended to semimodules it works for queries with aggregates. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control.

What happens when you subtract from a merge? (Referenced here as an “aggregation.”)

Although possible to paw through logs to puzzle out a result, Val suggests there are more robust methods at our disposal.
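One way to read the suggestion: if each merged result carries a provenance polynomial over its source proxies, subtracting a source is just evaluating that polynomial with the deleted source set to zero, with no log-pawing required. A toy sketch (the proxy names are invented):

```python
# Toy deletion propagation via provenance polynomials.
# A merged fact's annotation lists its alternative derivations (a sum of products);
# each derivation is the set of source proxies that were jointly used.

merged_fact_provenance = [{"p1", "p2"}, {"p3"}]   # i.e., p1*p2 + p3

def survives(provenance, deleted):
    """A fact survives if at least one derivation uses no deleted source.

    Setting a deleted source's annotation to 0 kills every product term
    containing it; the fact remains only if some term is left standing.
    """
    return any(not (term & deleted) for term in provenance)

print(survives(merged_fact_provenance, {"p1"}))        # True: p3 still derives it
print(survives(merged_fact_provenance, {"p1", "p3"}))  # False: every derivation is hit
```

So "merge 5 proxies, take away 1 proxy" has a definite answer as long as the merge recorded which proxies contributed to which results.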

I watched this over the weekend and, be forewarned, it is heavy sledding!

This is an active area of research and I have only begun to scratch the surface for references.

I may discover differently, but the “aggregation” I have seen thus far relies on opaque strings.

Not that all uses of opaque strings are inappropriate, but imagine the power of treating a token as an opaque string for one use case and exploding that same token into key/value pairs for another.
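That dual treatment is easy to sketch: the same token compared as an opaque string in one use case and exploded into key/value pairs in another (the token format below is invented for illustration):

```python
# One token, two readings: opaque string vs. exploded key/value pairs.

token = "country=DE;lang=de;id=4711"

# Use case 1: opaque -- identity is plain string comparison.
def same_opaque(t1, t2):
    return t1 == t2

# Use case 2: exploded -- compare on selected keys only.
def explode(tok):
    return dict(pair.split("=", 1) for pair in tok.split(";"))

def same_by_keys(t1, t2, keys):
    d1, d2 = explode(t1), explode(t2)
    return all(d1.get(k) == d2.get(k) for k in keys)

other = "country=DE;lang=de;id=9999"
print(same_opaque(token, other))                         # False: different strings
print(same_by_keys(token, other, ["country", "lang"]))   # True: same on these keys
```

The point is that "same or different" is a decision relative to a use case, not a property of the string itself.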

Enjoy!

### How Entity-Resolved Data Dramatically Improves Analytics

Wednesday, June 10th, 2015

From the post:

In my last two blog posts, I’ve written about how Novetta Entity Analytics resolves entity data from multiple sources and formats, and why its speed and scalability are so important when analyzing large volumes of data. Today I’m going to discuss how analysts can achieve much better results than ever before by utilizing entity-resolved data in analytics applications.

When data from all available sources is combined and entities are resolved, individual records about a real-world entity’s transactions, actions, behaviors, etc. are aggregated and assigned to that person, organization, location, automobile, ship or any other entity type. When an application performs analytics on this entity-resolved data, the results offer much greater context than analytics on the unlinked, unresolved data most applications use today.

Analytics that present a complete view of all actions of an individual entity are difficult to deliver today as they can require many time-consuming and expensive manual processes. With entity-resolved data, complete information about each entity’s specific actions and behaviors is automatically linked so applications can perform analytics quickly and easily. Below are some examples of how applications, such as enterprise search, data warehouse and link analysis visualization, can employ entity-resolved data from Novetta Entity Analytics to provide more powerful analytics.

Marc isn’t long on specifics of how Novetta Entity Analytics works in his prior posts but I think we can all agree on his recitation of the benefits of entity resolution in this post.

Once we know the resolution of an entity (or subject identity, as we would say in topic maps), the payoffs are immediate and worthwhile. Search results are more relevant, aggregated (merged) data speeds up queries, and multiple links are simplified as they are merged.
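The payoff of aggregating records under a resolved entity can be sketched with a stand-in resolver (the records and the resolution rule, matching on email, are invented; Novetta's actual resolution logic is not public in the post):

```python
# Sketch: aggregating records once entities are resolved (data is invented).

records = [
    {"id": 1, "name": "A. Smith",    "email": "as@example.com", "amount": 10},
    {"id": 2, "name": "Alice Smith", "email": "as@example.com", "amount": 25},
    {"id": 3, "name": "B. Jones",    "email": "bj@example.com", "amount": 5},
]

# A stand-in resolver: records sharing an email are the same entity.
def resolve(record):
    return record["email"]

# Aggregate all records and actions under the resolved entity.
entities = {}
for rec in records:
    key = resolve(rec)
    ent = entities.setdefault(key, {"names": set(), "total": 0, "records": []})
    ent["names"].add(rec["name"])
    ent["total"] += rec["amount"]
    ent["records"].append(rec["id"])

print(entities["as@example.com"]["total"])   # 35: both Smith records are linked
```

Analytics run against `entities` now see the complete picture per entity, rather than three unlinked records.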

How we would get there varies but Marc does a good job of describing the benefits!

### KDE and The Semantic Desktop

Saturday, March 14th, 2015

From the post:

During the KDE4 years the Semantic Desktop was one of the main pillars of KDE. Nepomuk was massive, all-encompassing, and integrated with many different parts of KDE. However, few people know what The Semantic Desktop was all about, and where KDE is heading.

History

The Semantic Desktop as it was originally envisioned comprised both the technology and the philosophy behind The Semantic Web.

The Semantic Web is built on top of RDF and Graphs. This is a special way of storing data which focuses more on understanding what the data represents. This was primarily done by carefully annotating what everything means, starting with the definition of a resource, a property, a class, a thing, etc.

This process of all data being stored as RDF, having a central store, with applications respecting the store and following the ontologies was central to the idea of the Semantic Desktop.

The Semantic Desktop cannot exist without RDF. It is, for all intents and purposes, what the term “semantic” implies.

A brief post-mortem on the KDE Semantic Desktop which relied upon NEPOMUK (Networked Environment for Personal, Ontology-based Management of Unified Knowledge) for RDF-based features. (NEPOMUK was an EU project.)

The post mentions complexity more than once. A friend recently observed that RDF was all about supporting AI and not capturing arbitrary statements by a user.

Such as providing alternative identifiers for subjects. With enough alternative identifications (including context, which “scope” partially captures in topic maps), I suspect a deep learning application could do pretty well at subject recognition, including appropriate relationships (associations).

But that would not be by trying to guess or formulate formal rules (a la RDF/OWL) but by capturing the activities of users as they provide alternative identifications of and relationships for subjects.

Hmmm, merging then would be a learned behavior by our applications. Will have to give that some serious thought!

I first saw this in a tweet by Stefano Bertolo.

### Machine Learning: The High-Interest Credit Card of Technical Debt (and Merging)

Sunday, December 14th, 2014

Machine Learning: The High-Interest Credit Card of Technical Debt by D. Sculley, et al.

Abstract:

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Under “entanglement” (referring to inputs) the authors announce the CACE principle:

Changing Anything Changes Everything

The net result of such changes is that prediction behavior may alter, either subtly or dramatically, on various slices of the distribution. The same principle applies to hyper-parameters. Changes in regularization strength, learning settings, sampling methods in training, convergence thresholds, and essentially every other possible tweak can have similarly wide ranging effects.

Entanglement is a native condition in topic maps as a result of the merging process. Yet, I don’t recall there being much discussion of how to evaluate the potential for unwanted entanglement or how to avoid entanglement (if desired).

You may have topics in a topic map where merging with later additions to the topic map is to be avoided. Perhaps to avoid the merging of spam topics that would otherwise overwhelm your content.

One way to avoid that and yet allow users to use links reported as subjectIdentifiers and subjectLocators under the TMDM would be to not report those properties for some set of topics to the topic map engine. The only property they could merge on would be their topicID, which hopefully you have concealed from public users.

Not unlike the Unix tradition where ports below 1024 are unavailable to users other than root. Topics with IDs below N are skipped by the topic map engine for merging purposes, unless the merging is invoked by the equivalent of root.

No change in current syntax or modeling required, although a filter on topic IDs would need to be implemented to add this to current topic map applications.
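A minimal sketch of that filter, assuming a threshold below which topics withhold their identifiers from the merge step (the threshold, topic structure, and "root" flag are all invented for illustration; a real TMDM engine's API would differ):

```python
# Sketch: withhold subjectIdentifiers/subjectLocators from merging
# for topics with IDs below a protected threshold.

PROTECTED_BELOW = 1000

topics = [
    {"id": 42,   "subject_identifiers": {"http://example.com/core"}},
    {"id": 4242, "subject_identifiers": {"http://example.com/core"}},
    {"id": 4243, "subject_identifiers": {"http://example.com/spam"}},
]

def mergeable_identifiers(topic, as_root=False):
    """Report identifiers to the merge step; protected topics report none."""
    if topic["id"] < PROTECTED_BELOW and not as_root:
        return set()
    return topic["subject_identifiers"]

def merge_groups(topics, as_root=False):
    """Group topic IDs that share a reported subject identifier."""
    groups = {}
    for t in topics:
        for ident in mergeable_identifiers(t, as_root):
            groups.setdefault(ident, []).append(t["id"])
    return groups

print(merge_groups(topics))                 # topic 42 never joins a merge group
print(merge_groups(topics, as_root=True))   # 42 and 4242 merge on the same IRI
```

Public users see topic 42 as unmergeable; only the root-equivalent invocation lets its identifiers participate.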

I am sure there are other ways to prevent merging of some topics but this seems like a simple way to achieve that end.

Unfortunately it does not address the larger question of the “technical debt” incurred to maintain a topic map of any degree of sophistication.

Thoughts?

I first saw this in a tweet by Elias Ponvert.

### VSEARCH

Saturday, November 29th, 2014

VSEARCH: Open and free 64-bit multithreaded tool for processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling

From the webpage:

The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:

• have open source code with an appropriate open source license
• be free of charge, gratis
• have a 64-bit design that handles very large databases and much more than 4GB of memory
• be as accurate or more accurate than usearch
• be as fast or faster than usearch

We have implemented a tool called VSEARCH which supports searching, clustering, chimera detection, dereplication, sorting and masking (commands --usearch_global, --cluster_smallmem, --cluster_fast, --uchime_ref, --uchime_denovo, --derep_fulllength, --sortbysize, --sortbylength and --maskfasta, as well as almost all their options).

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

The same option names as in USEARCH version 7 have been used in order to make VSEARCH an almost drop-in replacement.

The only way merging in topic maps differs from the clustering found in bioinformatics programs like VSEARCH is the reconciliation of differing characteristics. The results are a cluster of items deemed “similar” on some basis and, with topic maps, subject to further processing.
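The cluster-then-process pattern can be sketched with a toy greedy clustering, loosely in the spirit of `--cluster_fast` (the position-matching similarity measure below is a crude stand-in for the optimal Needleman-Wunsch alignment VSEARCH actually performs):

```python
# Toy greedy clustering: assign each sequence to the first centroid it is
# "similar enough" to, otherwise start a new cluster with it as centroid.

def similarity(a, b):
    """Fraction of matching positions over the longer sequence (crude stand-in)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def cluster(seqs, threshold=0.75):
    clusters = []   # list of (centroid, members)
    for s in seqs:
        for centroid, members in clusters:
            if similarity(s, centroid) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters

seqs = ["ACGTACGT", "ACGTACGA", "TTTTAAAA", "TTTTAAAC"]
result = cluster(seqs)
print([(centroid, len(members)) for centroid, members in result])
```

A topic map engine would add the reconciliation step: after clustering, the members of each cluster would have their differing characteristics merged into a single proxy.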

Scaling isn’t easy in bioinformatics but it hasn’t been found daunting either.

There is much to be learned from projects such as VSEARCH to inform the processing of topic maps.

I first saw this in a tweet by Torbjørn Rognes.

### Visual Classification Simplified

Sunday, November 23rd, 2014

Visual Classification Simplified

From the post:

Virtually all information governance initiatives depend on being able to accurately and consistently classify the electronic files and scanned documents being managed. Visual classification is the only technology that classifies both types of documents regardless of the amount or quality of text associated with them.

From the user perspective, visual classification is extremely easy to understand and work with. Once documents are collected, visual classification clusters or groups documents based on their appearance. This normalizes documents regardless of the types of files holding the content. The Word document that was saved to PDF will be grouped with that PDF and with the TIF that was made from scanning a paper copy of either document.

The clustering is automatic, there are no rules to write up front, no exemplars to select, no seed sets to try to tune. This is what a collection of documents might look like before visual classification is applied – no order and no way to classify the documents:

When the initial results of visual classification are presented to the client, the clusters are arranged according to the number of documents in each cluster. Reviewing the first clusters impacts the most documents. Based on reviewing one or two documents per cluster, the reviewer is able to determine (a) should the documents in the cluster be retained, and (b) if they should be retained, what document-type label to associate with the cluster.

By easily eliminating clusters that have no business or regulatory value, content collections can be dramatically reduced. Clusters that remain can have granular retention policies applied, be kept under appropriate access restrictions, and can be assigned business unit owners. Plus of course, the document-type labels can greatly assist users trying to find specific documents. (emphasis in original)
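The workflow just described, cluster, sort by size, then label or discard per cluster, can be sketched as follows (the "appearance signature" is a stand-in for whatever visual fingerprint the product actually computes; documents and labels are invented):

```python
# Sketch: cluster documents by an appearance signature, review clusters
# largest-first, and keep only clusters the reviewer assigns a label.

docs = [
    {"name": "inv-001.pdf",  "signature": "table-grid"},
    {"name": "inv-002.tif",  "signature": "table-grid"},
    {"name": "inv-003.docx", "signature": "table-grid"},
    {"name": "memo-01.pdf",  "signature": "letterhead"},
    {"name": "junk-01.pdf",  "signature": "noise"},
]

# Cluster by signature, then order clusters so reviewing the first
# clusters impacts the most documents.
clusters = {}
for d in docs:
    clusters.setdefault(d["signature"], []).append(d["name"])
ordered = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

# Reviewer decisions per cluster: a document-type label to retain, or None to discard.
decisions = {"table-grid": "invoice", "letterhead": "memo", "noise": None}

retained = {sig: (decisions[sig], names)
            for sig, names in ordered if decisions[sig] is not None}
print(retained)
```

Note that the Word, TIF, and PDF versions land in the same cluster because clustering is on appearance, not file type, which is the post's central claim.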

I suspect that BeyondRecognition, the host of this post, really means classification at the document level. A granularity that has plagued information retrieval for decades. Better than no retrieval at all but only just.

However, the graphics of visualization were just too good to pass up! Imagine that you are selecting merging criteria for a set of topics that represent subjects at a far lower granularity than document level.

With the results of those selections being returned to you as part of an interactive process.

If most topic map authoring is for aggregation (that is, you author so that topics will merge), this would be aggregation by selection.

Hard to say for sure but I suspect that aggregation (merging) by selection would be far easier than authoring for aggregation.

Suggestions on how to test that premise?

### Topic Maps By Another Name

Tuesday, November 18th, 2014

Data Integration up to 90% Faster and Cheaper by Marty Loughlin.

From the post:

A powerful new approach to addressing this challenge involves using semantic web technology as the “data glue” to guide integration and dramatically simplify the process. There are several key components to this approach:

• Using semantic models to describe data in standard business terms (e.g., FIBO, CDISC, existing enterprise model etc.)
• Mapping source and target data to the semantic model instead of directly from source to target
• Combining these maps as needed to create end-to-end semantic descriptions of ETL jobs
• Automatically generating ETL code from the semantic descriptions for leading ETL tools (e.g., Informatica and Pentaho)

There are significant benefits to this approach:

• Data integration can be done by business analysts with minimal IT involvement
• Adding a new source or target only requires an expert in that system to map to the common model as all maps are reusable
• The time and cost of an integration project can be reduced by up to 90%
• Projects can be repurposed to a new ETL tool with the click of a mouse
• The semantic model that describes the data, sources, maps and transformations is always up-to-date and can be queried for data meaning and lineage

The mapping of the source and target data to a semantic model is one use for a topic map. The topic map itself is then a data store to be queried using the source or target data models.
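The hub-and-spoke mapping the post describes, mapping each source and target to a common model and then composing maps end-to-end, can be sketched as follows (the field names and model terms are invented; a real system would use FIBO or another standard vocabulary):

```python
# Sketch: map source and target schemas to a shared semantic model,
# then compose the two maps into a source-to-target transformation.

# Each system maps its own field names to common model terms.
source_to_model = {"cust_nm": "Customer.name", "acct_no": "Account.number"}
target_to_model = {"CustomerName": "Customer.name", "AccountId": "Account.number"}

# Invert the target map and compose: source field -> target field.
model_to_target = {term: field for field, term in target_to_model.items()}
source_to_target = {src: model_to_target[term]
                    for src, term in source_to_model.items()}

def transform(row):
    """Rename a source row's fields into the target schema."""
    return {source_to_target[k]: v for k, v in row.items()}

print(transform({"cust_nm": "Ada", "acct_no": "42"}))
```

The claimed savings come from the shape of the problem: with N sources and M targets, hub mapping needs N + M maps instead of N × M direct ones, and adding a new source touches only its own map.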

The primary differences (there are others) between topic maps and “data glue” are that topic maps don’t necessarily use MS Excel spreadsheets and aren’t called “data glue.”

I do appreciate Cambridge Semantics determining that a topic map-like mapping approach can save 90% on data integration projects.

That sounds a bit optimistic but marketing literature is always optimistic.

### Sir Tim Berners-Lee speaks out on data ownership

Thursday, October 9th, 2014

Sir Tim Berners-Lee speaks out on data ownership by Alex Hern.

From the post:

The data we create about ourselves should be owned by each of us, not by the large companies that harvest it, Sir Tim Berners-Lee, the inventor of the world wide web, said today.

Berners-Lee told the IPExpo Europe in London’s Excel Centre that the potential of big data will be wasted as its current owners use it to serve ever more “queasy” targeted advertising.

Berners-Lee, who wrote the first memo detailing the idea of the world wide web 25 years ago this year, while working for physics lab Cern in Switzerland, told the conference that the value of “merging” data was under-appreciated in many areas.

Speaking to public data providers, he said: “I’m not interested in your data; I’m interested in merging your data with other data. Your data will never be as exciting as what I can merge it with.”

No disagreement with: …the value of “merging” data was under-appreciated in many areas. 😉

Considerable disagreement on how best to accomplish that merging, but it will be an empirical question once people wake up to the value of “merging” data.

Berners-Lee may be right about who “should” own data about ourselves, but that isn’t in fact who owns it now. Changing property laws means taking rights away from those with them under the current regime and creating new rights for others in a new system. Property laws have changed before but it requires more than slogans and wishful thinking to make it so.

### Pussy Stalking [Geo-Location as merge point]

Friday, July 25th, 2014

From the post:

Ever posted a picture of your cat online?

Unless your privacy settings avoid making APIs publicly available on sites like Flickr, Twitpic, Instagram or the like, there’s a cat stalker who knows where your liddl’ puddin’ lives, and he’s totally pwned your pussy by geolocating it.

That’s right, fat-faced grey one from Al Khobar in Saudi Arabia, Owen Mundy knows you live on Tabari Street.

Mundy, a data analyst, artist, and Associate Professor in the Department of Art at Florida State University, has been working on the data visualisation project, which is called I Know Where Your Cat Lives.
….

See Lisa’s post for the details about the “I Know Where Your Cat Lives” project.

The same data leakage is found in other types of photographs as well. Such as photographs by military personnel.

An enterprising collector could use geolocation as a merge point to collect all the photos made at a particular location. Or using geolocation ask “who?” for some location X.

Or perhaps a city map using geolocated images to ask “who?” Not everyone may know your name, but with a large enough base of users, someone will.
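As a rough sketch of geolocation as a merge point, rounding coordinates to a grid cell gives a merge key; the coordinates and owners below are invented:

```python
# Toy illustration: photos whose coordinates round to the same grid
# cell collect under one key, so "who was at location X?" becomes a
# dictionary lookup. All data here is made up for the example.
from collections import defaultdict

def geo_key(lat, lon, precision=3):
    """Round coordinates to a grid cell usable as a merge key."""
    return (round(lat, precision), round(lon, precision))

photos = [
    {"owner": "alice", "lat": 26.2891, "lon": 50.2041},
    {"owner": "bob",   "lat": 26.2894, "lon": 50.2043},  # same cell as alice
    {"owner": "carol", "lat": 48.8583, "lon": 2.2944},
]

by_location = defaultdict(list)
for p in photos:
    by_location[geo_key(p["lat"], p["lon"])].append(p["owner"])
# by_location now merges alice and bob under one location key
```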

PS: There is at least one app for facial recognition, NameTag. I don’t have a cellphone so you will have to comment on how well it works. I still like the idea of a “who?” site. Perhaps because I prefer human intell over data vacuuming.

### Readings in conflict-free replicated data types

Tuesday, July 22nd, 2014

Readings in conflict-free replicated data types by Christopher Meiklejohn.

From the post:

This is a work in progress post outlining research topics related to conflict-free replicated data types, or CRDTs.

Yesterday, Basho announced the release of Riak 2.0.0 RC1, which contains a comprehensive set of “data types” that can be used for building more robust distributed applications. For an overview of how to use these data types in Riak to avoid custom, and error prone, merge functions, see the Basho documentation site.

You’re probably more familiar with another name for these data types: conflict-free replicated data types (CRDTs). Simply put, CRDTs are data structures which capture some aspect of causality, along with providing interfaces for safely operating over the value and correctly merging state with diverged and concurrently edited structures.

This provides a very useful property when combined with an eventual consistency, or AP-focused, data store: Strong Eventual Consistency (SEC). Strong Eventual Consistency is an even stronger convergence property than eventual consistency: given that all updates are delivered to all replicas, there is no need for conflict resolution, given the conflict-free merge properties of the data structure. Simply put, correct replicas which have received all updates have the same state.

Here’s a great overview by one of the inventors of CRDTs, Marc Shapiro, where he discusses conflict-free replicated data types and their relation to strong eventual consistency.

In this Hacker News thread, there was an interesting discussion about why one might want to implement these on the server, why implementing them is non-trivial, and what the most recent research related to them consists of.

This post serves as a reading guide on the various areas of conflict-free replicated data types. Papers are broken down into various areas and sorted in reverse chronological order.

Relevant to me because the new change tracking in ODF is likely to be informed by CRDTs and because eventually consistent merging is important for distributed topic maps.

Confusion would result if the order in which topics are merged produced different topic maps.

CRDTs are an approach to avoid that unhappy outcome.
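A minimal sketch of why CRDTs avoid that outcome, using a state-based G-Counter (per-replica count vectors merged by element-wise max). Merge is commutative, associative, and idempotent, so any merge order converges to the same state:

```python
# State-based G-Counter sketch: each replica tracks its own count and
# merge takes the element-wise max, which makes merge order irrelevant.

def merge(a, b):
    """Element-wise max of two per-replica count vectors."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

r1 = {"n1": 5, "n2": 0}
r2 = {"n1": 3, "n2": 7}
r3 = {"n3": 2}

left  = merge(merge(r1, r2), r3)
right = merge(r1, merge(r2, r3))
assert left == right                    # associative
assert merge(r1, r2) == merge(r2, r1)   # commutative
assert merge(left, left) == left        # idempotent
```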

Enjoy!

PS: Remember to grab a copy of Riak 2.0.0 RC1.

### Fun with CRDTs

Tuesday, May 20th, 2014

Fun with CRDTs by Richard Dallaway.

From the post:

At the end of last year I had some fun implementing a CRDT. These are data structures designed to combine together when you have no control over order of changes, timing of changes, or the number of participants in the data structure. The example I looked at was a sequential datatype, namely the WOOT CRDT for collaborative text editing.

Doesn’t:

combine together when you have no control over order of changes, timing of changes, or the number of participants in the data structure.

sound familiar? 😉

Richard points to:

He also recommends that you watch: Reconciling Eventually-Consistent Data with CRDTs by Noel Welsh, before viewing his video.

Great stuff!

### Parallel Graph Partitioning for Complex Networks

Monday, April 21st, 2014

Parallel Graph Partitioning for Complex Networks by Henning Meyerhenke, Peter Sanders, and, Christian Schulz.

Abstract:

Processing large complex networks like social networks or web graphs has recently attracted considerable interest. In order to do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners originally developed for more regular mesh-like networks do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsened graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very big. For example, our algorithm can partition a web graph with 3.3 billion edges in less than sixteen seconds using 512 cores of a high performance cluster while producing a high quality partition — none of the competing systems can handle this graph on our system.

Clustering in this article is defined by a node’s “neighborhood,” I am curious if defining a “neighborhood” based on multi-part (hierarchical?) identifiers might enable parallel processing of merging conditions?
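One way the premise could be tested: treat an identifier prefix as the “neighborhood” and compare merge candidates only within a partition, so partitions can be processed in parallel. The hierarchical identifiers below are invented:

```python
# Hedged sketch: hierarchical identifiers partition topics into
# "neighborhoods" by prefix, so merge comparisons only happen inside a
# partition and partitions can be handled in parallel.
from collections import defaultdict

topics = ["org/acme/hq", "org/acme/ticker", "org/globex/hq", "person/smith"]

def partition_by_prefix(ids, depth=1):
    """Group identifiers by their first `depth` path segments."""
    parts = defaultdict(list)
    for i in ids:
        prefix = "/".join(i.split("/")[:depth])
        parts[prefix].append(i)
    return dict(parts)

parts = partition_by_prefix(topics, depth=2)
# {'org/acme': ['org/acme/hq', 'org/acme/ticker'],
#  'org/globex': ['org/globex/hq'], 'person/smith': ['person/smith']}
```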

While looking for resources on graph contraction, I encountered a series of lectures by Kanat Tangwongsan from: Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) (link to the course schedule with numerous resources):

Lecture 17 — Graph Contraction I: Tree Contraction

Lecture 18 — Graph Contraction II: Connectivity and MSTs

Lecture 19 — Graph Contraction III: Parallel MST and MIS

Enjoy!

### Future Values of Merges

Thursday, April 17th, 2014

Multilisp: A Language for Concurrent Symbolic Computation by Robert H. Halstead. (ACM Transactions on Programming Languages and Systems, Vol. 7, No. 4, October 1985, Pages 501-538.)

Abstract:

Multilisp is a version of the Lisp dialect Scheme extended with constructs for parallel execution. Like Scheme, Multilisp is oriented toward symbolic computation. Unlike some parallel programming languages, Multilisp incorporates constructs for causing side effects and for explicitly introducing parallelism. The potential complexity of dealing with side effects in a parallel context is mitigated by the nature of the parallelism constructs and by support for abstract data types: a recommended Multilisp programming style is presented which, if followed, should lead to highly parallel, easily understandable programs. Multilisp is being implemented on the 32-processor Concert multiprocessor; however, it is ultimately intended for use on larger multiprocessors. The current implementation, called Concert Multilisp, is complete enough to run the Multilisp compiler itself and has been run on Concert prototypes including up to eight processors. Concert Multilisp uses novel techniques for task scheduling and garbage collection. The task scheduler helps control excessive resource utilization by means of an unfair scheduling policy; the garbage collector uses a multiprocessor algorithm based on the incremental garbage collector of Baker.

Of particular interest:

Multilisp’s principal construct for both creating tasks and synchronizing among them is the future. The construct ( future X ) immediately returns a future for the value of the expression X and concurrently begins evaluating X. When the evaluation of X yields a value, that value replaces the future. The future is said to be initially undetermined; it becomes determined when its value has been computed. An operation (such as addition) that needs to know the value of an undetermined future will be suspended until the future becomes determined, but many operations, such as assignment and parameter passing, do not need to know anything about the values of their operands and may be performed quite comfortably on undetermined futures. The use of futures often exposes surprisingly large amounts of parallelism in a program, as illustrated by a Quicksort program given in Figure 1.

The use of a future value for merge operations, which could avoid re-processing a topic map for each cycle of merging, sounds promising.
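The future construct described above can be sketched with Python’s standard `concurrent.futures` module. Each submitted task starts evaluating immediately, like `(future X)`, and a consumer suspends only when it actually needs the determined value. The “merge” here is a stand-in set union, not a real topic map merge:

```python
# Futures sketch: both merges begin evaluating concurrently; .result()
# suspends until the future is determined, mirroring Multilisp's
# undetermined/determined distinction.
from concurrent.futures import ThreadPoolExecutor

def merge_topics(a, b):
    return a | b  # placeholder merge: set union of properties

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(merge_topics, {"name: ACME"}, {"ticker: ACM"})
    f2 = pool.submit(merge_topics, {"hq: NYC"}, {"founded: 1947"})
    # Passing futures around needs no value; only this final merge
    # must wait for the futures to become determined.
    combined = merge_topics(f1.result(), f2.result())
```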

Deferral of the results isn’t just an old Lisp idea as you will find in: Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner, and Sebastian Stolzenberg.

Abstract:

Counting problems, determining the number of possible states of a large system under certain constraints, play an important role in many areas of science. They naturally arise for complex disordered systems in physics and chemistry, in mathematical graph theory, and in computer science. Counting problems, however, are among the hardest problems to access computationally. Here, we suggest a novel method to access a benchmark counting problem, finding chromatic polynomials of graphs. We develop a vertex-oriented symbolic pattern matching algorithm that exploits the equivalence between the chromatic polynomial and the zero-temperature partition function of the Potts antiferromagnet on the same graph. Implementing this bottom-up algorithm using appropriate computer algebra, the new method outperforms standard top-down methods by several orders of magnitude, already for moderately sized graphs. As a first application, we compute chromatic polynomials of samples of the simple cubic lattice, for the first time computationally accessing three-dimensional lattices of physical relevance. The method offers straightforward generalizations to several other counting problems.

In layperson’s terms, the work by Timme and company visits each node in a graph and records an expression that includes unknowns (futures?), that is, the values at other nodes in the graph. Using pattern matching techniques, the algorithm then solves all of the unknowns and replaces them with appropriate values. How effective is this?

“The existing algorithm copies the whole network for each stage of the calculation and only changes one aspect of it each time,” explains Frank van Bussel of the Max Planck Institute for Dynamics and Self-Organization (MPIDS). Increasing the number of nodes dramatically increases the calculation time. For a square lattice the size of a chess board, this is estimated to be many billions of years. The new algorithm developed by the Göttingen-based scientists is significantly faster. “Our calculation for the chess board lattice only takes seven seconds,” explains Denny Fliegner from MPIDS. (A New Kind of Counting)

Hmmm, “many billions of years” versus “seven seconds.”

For further details on the pattern matching work see the project page at: Complex Disordered Systems: Statistical Physics and Symbolic Computation

Deferral of results looks like a fruitful area for research for topic maps in general and parallel processing of topic maps in particular.

I first saw the Multilisp paper in a tweet by Mom of Finland.

### Want to make a great puzzle game?…

Tuesday, April 1st, 2014

From the post:

Two years ago, Erik Demaine and three other researchers published a fun paper to the arXiv proving that most incarnations of classic nintendo games are NP-hard. This includes almost every Super Mario Brothers, Donkey Kong, and Pokemon title. Back then I wrote a blog post summarizing the technical aspects of their work, and even gave a talk on it to a room full of curious undergraduate math majors.

But while bad tech-writers tend to interpret NP-hard as “really really hard,” the truth is more complicated. It’s really a statement about computational complexity, which has a precise mathematical formulation. Sparing the reader any technical details, here’s what NP-hard implies for practical purposes:

You should abandon hope of designing an algorithm that can solve any instance of your NP-hard problem, but many NP-hard problems have efficient practical “good-enough” solutions.

The very definition of NP-hard means that NP-hard problems need only be hard in the worst case. For illustration, the fact that Pokemon is NP-hard boils down to whether you can navigate a vastly complicated maze of trainers, some of whom are guaranteed to defeat you. It has little to do with the difficulty of the game Pokemon itself, and everything to do with whether you can stretch some subset of the game’s rules to create a really bad worst-case scenario.

So NP-hardness has very little to do with human playability, and it turns out that in practice there are plenty of good algorithms for winning at Super Mario Brothers. They work really well at beating levels designed for humans to play, but we are highly confident that they would fail to win in the worst-case levels we can cook up. Why don’t we know it for a fact? Well that’s the $P \ne NP$ conjecture.

Can we say the same about combinatorial explosion problems? That under some set of assumptions such problems are intractable but that practical solutions do exist?

I mention this because one of the more shallow arguments in favor of a master ontology is that mapping between multiple ontologies (drum roll please) leads to a combinatorial explosion.

True, if you are dumb enough to insist on mapping from every known ontology to every other known ontology in an area, that’s very likely true.

However, if I only map from my ontology to the next ontology known to me, leaving other one-to-one mappings to others who want them, I am no closer to a combinatorial explosion than if I had created a mapping to a master ontology.
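The arithmetic behind both claims is easy to check: all-pairs mapping needs n(n-1)/2 maps, while the master-ontology approach, and equally a chain of neighbor-to-neighbor maps, needs only n-1:

```python
# Counting mappings: the "combinatorial explosion" only appears if you
# insist on mapping every ontology to every other one.

def all_pairs(n):
    """Mappings needed for every-to-every pairwise mapping."""
    return n * (n - 1) // 2

def to_master(n):
    """Mappings needed to a master ontology (or along a chain)."""
    return n - 1

# 100 ontologies: 4950 pairwise maps, versus 99 maps to a master,
# but a neighbor-to-neighbor chain is also just 99 maps.
assert all_pairs(100) == 4950
assert to_master(100) == 99
```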

I must admit my bias in this case because I have no master ontology to sell you.

I prefer that you use a classification, taxonomy, ontology you already know. If it’s all the same to you.

Mapping your classification/taxonomy/ontology to another is something I would be happy to assist with.

PS: Do read Jeremy’s post in full. You will learn a lot and possibly pick up a new hobby.

### 150 Million Topics

Monday, March 31st, 2014

150 Million More Reasons to Love Bing Everyday by Richard Qian.

From the post:

At Bing, we understand that search is more than simply finding information and browsing a collection of blue links pointing to pages around the web. We’ve talked about doing instead of searching and how Bing continues to expand its approach to understand the actual world around us.

Today, you’ll see this come to life on Bing.com in a feature called Snapshot. Snapshot brings together information that you need at a glance, with rich connections to deeper information on the people, places, and things you care about made possible by our deep understanding of the real world. To accomplish this, Bing now tracks billions of entities and perhaps more importantly, the billions of relationships between them, all to get you the right data instantly while you search.

New Entities: Introducing Doctors, Lawyers, Dentists and Real Estate Properties
….

In case you are interested, the “Snapshot” is what ISO/IEC 13250 (Dec., 1999) defined as a topic.

topic: An aggregate of topic characteristics, including zero or more names, occurrences, and roles played in association with other topics, whose organizing principle is a single subject.

Unlike the topic maps effort, Bing conceals all the ugliness that underlies merging of information and delivers to users an immediately useful and consumable information product.

But also unlike the topic maps effort, Bing is about as useful as a wedgie when you are looking for information internal to your enterprise.

Why?

Mostly because the subjects internal to your organization don’t get mapped by Bing.

Can you guess who is going to have to do that work? Got a mirror handy?

Use Bing’s Snapshots to see what collated information can look like. Decide if that’s a look you want for all or some of your information.

Hard to turn down free advertising/marketing from MS. It happens so rarely.

Thanks MS!

### Wikidata in 2014 [stable identifiers]

Wednesday, January 22nd, 2014

Wikidata in 2014

From the development plans for Wikidata in 2014, it looks like a busy year.

There are a number of interesting work items but one in particular caught my attention:

Merges and redirects

When two different items about the same topic are created they can be merged. Labels, descriptions, aliases, sitelinks and statements are merged if they do not conflict. The item that is left empty can then be turned into a redirect to the other. This way, Wikidata IDs can be regarded as stable identifiers by 3rd-parties.

As more data sets come online, preserving “stable identifiers” from each data set is going to be important. You can’t know in advance which data set a particular researcher may have used as a source of identifiers.
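The merge-and-redirect scheme in the quote can be sketched as redirect-following; the item IDs and label below are invented, not real Wikidata items:

```python
# Sketch of stable identifiers via redirects: after a merge, the
# emptied item becomes a redirect, so the old ID still resolves.
items = {"Q1": {"label": "some topic"}}
redirects = {"Q2": "Q1"}  # Q2 was merged into Q1; IDs are invented

def resolve(item_id):
    """Follow redirects to the canonical item ID."""
    seen = set()
    while item_id in redirects:
        if item_id in seen:            # guard against redirect cycles
            raise ValueError("redirect cycle at " + item_id)
        seen.add(item_id)
        item_id = redirects[item_id]
    return item_id

assert resolve("Q2") == "Q1"   # the old ID remains a stable reference
assert resolve("Q1") == "Q1"
```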

Here of course they are talking about “stable identifiers” inside of Wikidata.

In principle though, I don’t see any reason we can’t treat “foreign” identifiers as stable.

You?

### Filtering: Seven Principles

Tuesday, January 7th, 2014

Filtering: Seven Principles by JP Rangaswami.

When you read “filters” in the seven rules, think merging rules.

From the post:

1. Filters should be built such that they are selectable by subscriber, not publisher.
2. Filters should intrinsically be dynamic, not static.
3. Filters should have inbuilt “serendipity” functionality.
4. Filters should be interchangeable, exchangeable, even tradeable.
5. The principal filters should be by choosing a variable and a value (or range of values) to include or exclude.
6. Secondary filters should then be about routing.
7. Network-based filters, “collaborative filtering” should then complete the set.

Nat Torkington comments on this list:

I think the basic is: 0: Customers should be able to run their own filters across the information you’re showing them.

+1!

And it should be simpler than hunting for .config/google-chrome/Default/User Stylesheets/Custom.css (for Chrome on Ubuntu).

Ideally a select (from a webpage) and choose an action.

The ability to dynamically select properties for merging would greatly enhance a user’s ability to explore and mine a topic map.
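Principle 5 above (choose a variable and a value to include or exclude) can be read as a subscriber-selected merging criterion, sketched here with invented topics:

```python
# Sketch: the subscriber, not the publisher, picks which properties
# drive merging, and different selections yield different merges.
topics = [
    {"id": 1, "subject": "jaguar", "domain": "animal"},
    {"id": 2, "subject": "jaguar", "domain": "car"},
    {"id": 3, "subject": "jaguar", "domain": "animal"},
]

def merge_groups(topics, on):
    """Group (merge) topics by the user-selected property tuple."""
    groups = {}
    for t in topics:
        key = tuple(t[p] for p in on)
        groups.setdefault(key, []).append(t["id"])
    return groups

merge_groups(topics, on=("subject",))            # all three merge
merge_groups(topics, on=("subject", "domain"))   # animals apart from car
```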

I first saw this in Nat Torkington’s Four short links: 6 January 2014.

### Idempotence Is Not a Medical Condition

Saturday, January 4th, 2014

Idempotence Is Not a Medical Condition by Pat Helland.

From the post:

The definition of distributed computing can be confusing. Sometimes, it refers to a tightly coupled cluster of computers working together to look like one larger computer. More often, however, it refers to a bunch of loosely related applications chattering together without a lot of system-level support.

This lack of support in distributed computing environments makes it difficult to write applications that work together. Messages sent between systems do not have crisp guarantees for delivery. They can get lost, and so, after a timeout, they are retried. The application on the other side of the communication may see multiple messages arrive where one was intended. These messages may be reordered and interleaved with different messages. Ensuring that the application behaves as intended can be very hard to design and implement. It is even harder to test.

In a world full of retried messages, idempotence is an essential property for reliable systems. Idempotence is a mathematical term meaning that performing an operation multiple times will have the same effect as performing it exactly one time. The challenges occur when messages are related to each other and may have ordering constraints. How are messages associated? What can go wrong? How can an application developer build a correctly functioning app without losing his or her mojo?

A very good discussion of idempotence in the context of distributed (message passing) systems. You may recall that the TMRM defines merging operators to be idempotent. (Section 8)

Pat’s examples on idempotence include:

1. Sweeping the floor is idempotent. If you sweep it multiple times, you still get a clean floor.
2. Baking a cake is not idempotent.
3. Baking a cake starting from a shopping list (if you don’t care about money) is idempotent.

As an aside, #2 is not idempotent because “a cake” means a particular cake. It can only be baked once, at least if you want to have an edible result. In #3, the act of baking from a shopping list (I prefer a recipe), and not the cake, is idempotent.
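The TMRM’s idempotence requirement for merging operators can be stated as an executable check; property-set union is one simple merge operator with that property:

```python
# Idempotence check: replaying an already-applied merge (e.g. from a
# retried message) must change nothing. Topic properties are invented.

def merge(a, b):
    return a | b  # stand-in merge: union of property sets

topic_a = frozenset({("name", "ACME"), ("hq", "NYC")})
topic_b = frozenset({("name", "ACME"), ("ticker", "ACM")})

once  = merge(topic_a, topic_b)
again = merge(once, topic_b)   # a retried/duplicated merge message
assert once == again           # safe to replay: merging is idempotent
```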

The post is quite good, particularly if you are interested in a reliable messaging based system.

I first saw this in Stuff The Internet Says On Scalability For January 3rd, 2014, which had the following note:

Pat Helland with a classically great article on Idempotence. Fortunately the article is not idempotent. Everytime you read it your brain updates with something new.

### Search …Business Critical in 2014

Friday, December 20th, 2013

Search Continues to Be Business Critical in 2014 by Martin White.

From the post:

I offer two topics that I see becoming increasingly important in 2014. One of these is cross-device search, where a search is initially conducted on a desktop and is continued on a smartphone, and vice-versa. There is a very good paper from Microsoft that sets out some of the issues. The second topic is continuous information seeking, where search tasks are carried out by more than one “searcher,” often in support of collaborative working. The book on this topic by Chirag Shah, a member of staff of Rutgers University, is a very good place to start.

Editor’s Note: Read more of Martin’s thoughts on search in Why All Search Projects Fail.

Gee, let me see, what would more than one searcher need to make their collaborative search results usable by the entire team?

Can you say merging? 😉

Martin has other, equally useful insights in the search space so don’t miss the rest of his post.

But also catch his “Why All Search Projects Fail.” Good reading before you sign a contract with a client.

### The Distributed Complexity of Large-scale Graph Processing

Tuesday, November 26th, 2013

The Distributed Complexity of Large-scale Graph Processing by Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, Peter Robinson.

Abstract:

Motivated by the increasing need for fast distributed processing of large-scale graphs such as the Web graph and various social networks, we study a message-passing distributed computing model for graph processing and present lower bounds and algorithms for several graph problems. This work is inspired by recent large-scale graph processing systems (e.g., Pregel and Giraph) which are designed based on the message-passing model of distributed computing.

Our model consists of a point-to-point communication network of $k$ machines interconnected by bandwidth-restricted links. Communicating data between the machines is the costly operation (as opposed to local computation). The network is used to process an arbitrary $n$-node input graph (typically $n \gg k > 1$) that is randomly partitioned among the $k$ machines (a common implementation in many real world systems). Our goal is to study fundamental complexity bounds for solving graph problems in this model.

We present techniques for obtaining lower bounds on the distributed time complexity. Our lower bounds develop and use new bounds in random-partition communication complexity. We first show a lower bound of $\Omega(n/k)$ rounds for computing a spanning tree (ST) of the input graph. This result also implies the same bound for other fundamental problems such as computing a minimum spanning tree (MST). We also show an $\Omega(n/k^2)$ lower bound for connectivity, ST verification and other related problems.

We give algorithms for various fundamental graph problems in our model. We show that problems such as PageRank, MST, connectivity, and graph covering can be solved in $\tilde{O}(n/k)$ time, whereas for shortest paths, we present algorithms that run in $\tilde{O}(n/\sqrt{k})$ time (for $(1+\epsilon)$-factor approx.) and in $\tilde{O}(n/k)$ time (for $O(\log n)$-factor approx.) respectively.

The author’s state their main goal is:

…is to investigate the distributed time complexity, i.e., the number of distributed “rounds”, for solving various fundamental graph problems. The time complexity not only captures the (potential) speed up possible for a problem, but it also implicitly captures the communication cost of the algorithm as well, since links can transmit only a limited amount of bits per round; equivalently, we can view our model where instead of links, machines can send/receive only a limited amount of bits per round (cf. Section 1.1).

How would you investigate the number of “rounds” required to perform merging in a message-passing topic map system?

With no one order of merging to reach a particular state, would you measure it statistically for some merging criteria N?
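One toy way to begin testing that empirically: run the same set of pairwise merges in many random orders and check that the final partition never varies. This union-find simulation is an illustration, not a measurement of a real message-passing system:

```python
# Simulate merge operations applied in random orders and confirm the
# final merged state is order-independent. All data is invented.
import random

def final_partition(n, merge_pairs, order):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for i in order:
        a, b = merge_pairs[i]
        parent[find(a)] = find(b)
    return frozenset(frozenset(j for j in range(n) if find(j) == r)
                     for r in {find(i) for i in range(n)})

pairs = [(0, 1), (1, 2), (3, 4)]
orders = [random.sample(range(len(pairs)), len(pairs)) for _ in range(20)]
results = {final_partition(5, pairs, o) for o in orders}
assert len(results) == 1   # merge outcome is order-independent
```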

I first saw this in a tweet by Stefano Bertolo.

### A names backbone:…

Friday, November 22nd, 2013

A names backbone: a graph of taxonomy by Nicky Nicolson.

At first glance a taxonomy paper but as you look deeper:

Slide 34: Concepts layer: taxonomy as a graph

• Names are nodes
• Typed, directed relationships represent synonymy and taxonomic placement
• Evidence for taxonomic assertions provided as references
• …and again, standards bases import / export using TCS

Slide 35 shows a synonym_of relationship between two name nodes.

Slide 36 shows evidence attached to placement at one node and for the synonym_of link.

Slide 37 shows reuse of nodes to support “different taxonomic opinions.”

Slide 39 Persistent identification of concepts

We can re-create a sub-graph representing a concept at a particular point in time using:

1. Name ID
2. Classification
3. State

Users can link to a stable state of a concept

We can provide a feed of what has changed since

I mention this item in part because Peter Neubauer (Neo4j) suggested in an email that rather than “merging” nodes that subject sameness (my term, not his) could be represented as a relationship between nodes.

Much in the same way that synonym_of was represented in these slides.

And I like the documentation of the reason for synonymy.

The internal data format of Neo4j makes “merging” in the sense of creating one node to replace two or more other nodes impractical.

Perhaps replacing nodes with other nodes has practical limits?

Is “virtual merging” in your topic map future?
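“Virtual merging” can be sketched as computing equivalence classes over same_as relationships at query time, leaving every node in place; the names and edges below are invented:

```python
# Sketch: record sameness as same_as edges and derive the merged view
# as connected components. No node is ever rewritten; the "merge"
# exists only in the view.
from collections import defaultdict

same_as = [("acme_inc", "acme_corporation"), ("acme_corporation", "ACME")]

def merged_view(edges):
    """Collapse same_as edges into equivalence classes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, classes = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            cur = stack.pop()
            if cur in component:
                continue
            component.add(cur)
            stack.extend(adj[cur])
        seen |= component
        classes.append(frozenset(component))
    return classes

# One class: {'acme_inc', 'acme_corporation', 'ACME'}
```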

### [A]ggregation is really hard

Tuesday, November 5th, 2013

The reason that big data hasn’t taken economics by storm is that aggregation is really hard.

As near as I can tell, the quote originates from Klint Finley.

Let’s assume aggregation is “really hard,” which of necessity means it is also expensive.

Might be nice, useful, interesting, etc., to have “all” the available data for a subject organized together, what is the financial (or other) incentive to make it so?

And that incentive has to outweigh the expense of reaching that goal, by some stated margin.

Suggestions on how to develop such a metric?

### Busting the King’s Gambit [Scaling Merging]

Thursday, October 17th, 2013

Rajlich: Busting the King’s Gambit, this time for sure by Frederic Friedel.

From the post (an interview with Vasik Rajlich):

Fifty years ago Bobby Fischer published a famous article, “A Bust to the King’s Gambit”, in which he claimed to have refuted this formerly popular opening. Now chess programmer IM Vasik Rajlich has actually done it, with technical means. 3000 processor cores, running for over four months, exhaustively analysed all lines that follow after 1.e4 e5 2.f4 exf4 and came to some extraordinary conclusions.

Vasik Rajlich: Okay, here’s what we have been doing. You know that Lukas Cimiotti has set up a cluster of computers, currently around 300 cores, which has been used by World Champions and World Champion candidates to prepare for their matches. It is arguably the most powerful entity to play chess, ever, anywhere. Well, that was until we hooked it up to a massively parallel cluster of IBM POWER 7 Servers provided by David Slate, senior manager of IBM’s Semantic Analysis and Integration department – 2,880 cores at 4.25 GHz, 16 terabytes of RAM, very similar to the hardware used by IBM’s Watson in winning the TV show “Jeopardy”. The IBM servers ran a port of the latest version of Rybka, and computation was split across the two clusters, with the Cimiotti cluster distributing the search to the IBM hardware.

We developed an algorithm which attempts to classify chess positions into wins, draws and losses. Using this algorithm, we have just finished classifying the King’s Gambit. In other words, the King’s Gambit is now solved.

Whoa, that’s quite a lot to digest. First of all what exactly do you mean when you say that the King’s Gambit is “solved”?

It’s solved in the sense that we know the outcome, just as we know the outcome for most five and six piece endings. Except that here we are dealing with a single starting position…

… which is?

1.e4 e5 2.f4 exf4. We now know the exact outcome of this position, assuming perfect play, of course. I know your next question, so I am going to pre-empt it: there is only one move that draws for White, and that is, somewhat surprisingly, 3.Be2. Every other move loses by force.

And what does that have to do with topic maps? Read a bit further:

Actually much more than “gazillions” – something in the order of 10^100, which is vastly more than the number of elementary particles in the universe. Obviously we could not go through all of them – nobody and nothing will ever be able to do that. But: you do not have to check every continuation. It’s similar to Alpha-Beta, which looks at a very small subset of possible moves but delivers a result that is identical to what you would get if you looked at every single move, down to the specified depth.

But Alpha-Beta reduces the search to about the square root of the total number of moves. The square root of 10^100, however…

Yes, I know. But think about it: you do not need to search every variation to mate. We only need to search a tiny fraction of the overall space. Whenever Rybka evaluates a position with a score of +/– 5.12 we don’t need to search any further, we have our proof that in the continuation there is going to be a win or loss, and there is a forced mate somewhere deep down in the tree. We tested a random sampling of positions of varying levels of difficulty that were evaluated at above 5.12, and we never saw a solution fail. So it is safe to use this assumption generally in the search.

I read that as meaning that we don’t have to search the entire merging space for any set of proxies/topics.

I suspect it will be domain specific but once certain properties or values are encountered, no merging is going to occur, ever.

They can be safely ignored for all future iterations of merging.

Which will quickly narrow the merging space that has to be examined.

Another way to scale is to reduce the problem size. Yes?
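By analogy with the ±5.12 cutoff above, here is a minimal sketch of pruning a merge candidate space: once a pair of proxies exhibits a disqualifying property, that pair can be excluded from every future merging pass. All names and the disqualifying test are hypothetical, not drawn from any topic map implementation.

```python
# Sketch: prune pairs of proxies that can never merge, so later
# passes examine only the surviving candidate space.
# All names here are hypothetical, not from any topic map library.

def can_ever_merge(a, b):
    # Hypothetical disqualifying property: proxies from disjoint
    # namespaces never merge, no matter what evidence arrives later.
    return a["namespace"] == b["namespace"]

def prune_merge_space(proxies):
    """Return only the candidate pairs worth re-examining."""
    candidates = []
    for i, a in enumerate(proxies):
        for b in proxies[i + 1:]:
            if can_ever_merge(a, b):
                candidates.append((a["id"], b["id"]))
    return candidates

proxies = [
    {"id": "t1", "namespace": "geo"},
    {"id": "t2", "namespace": "geo"},
    {"id": "t3", "namespace": "bio"},
]
print(prune_merge_space(proxies))  # [('t1', 't2')]
```

Of the three possible pairs, only the one that could still merge survives; the other two are dropped from all future iterations.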

### ExpressionBlast:… [Value of Mapping and Interchanging Mappings]

Wednesday, October 2nd, 2013

ExpressionBlast: mining large, unstructured expression databases by Guy E Zinman, Shoshana Naiman, Yariv Kanfi, Haim Cohen and Ziv Bar-Joseph. (Nature Methods 10, 925–926 (2013))

From a letter to the editor:

To the Editor: The amount of gene expression data deposited in public repositories has grown exponentially over the last decade (Supplementary Fig. 1). Specifically, Gene Expression Omnibus (GEO)1 is one of largest expression-data repositories (Supplementary Table 1), containing hundreds of thousands of microarray and RNA-seq experiment results grouped into tens of thousands of series. Although accessible, data deposited in GEO are not well organized. Even among data sets for a single species there are many different platforms with different probe identifiers, different value scales and very limited annotations of the condition profiled by each array. Current methods for using GEO data to study signaling and other cellular networks either do not scale or cannot fully use the available information (Supplementary Table 2 and Supplementary Results).

To enable queries of such large expression databases, we developed ExpressionBlast (http://www.expression.cs.cmu.edu/): a computational method that uses automated text analysis to identify and merge replicates and determine the type of each array in the series (treatment or control; Fig. 1a and Supplementary Methods). Using this information, ExpressionBlast uniformly processes expression data sets in GEO across all experiments, species and platforms. This is achieved by standardizing the data in terms of gene identifiers, the meaning of the expression values (log ratios) and the distribution of these values (Fig. 1b and Supplementary Methods). Our processing steps achieved a high accuracy in identifying replicates and treatment control cases (Supplementary Results and Supplementary Table 3). We applied these processing steps to arrays from more than 900,000 individual samples collected from >40,000 studies in GEO (new series are updated on a weekly basis), which allowed us to create, to our knowledge, the largest collection of computationally annotated expression data currently available (Supplementary Results and Supplementary Table 4) (emphasis in original).

Now there is a letter to the editor!

How did the team create:

to our knowledge, the largest collection of computationally annotated expression data currently available….?

Hint: It wasn’t by creating a new naming system and then convincing the authors of >40,000 studies to adopt it.

They achieved that result by:

This is achieved by standardizing the data in terms of gene identifiers, the meaning of the expression values (log ratios) and the distribution of these values (Fig. 1b and Supplementary Methods).
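The standardization step can be sketched roughly as follows, assuming a probe-to-gene mapping table and treatment/control intensities (the table, field names, and numbers are invented for illustration; this is not ExpressionBlast's code):

```python
import math

# Sketch of the standardization described above: map
# platform-specific probe IDs to shared gene identifiers and
# express treatment values as log ratios against a control.
# The mapping table and numbers are invented for illustration.

PROBE_TO_GENE = {"1007_s_at": "DDR1", "1053_at": "RFC2"}

def standardize(treatment, control):
    """Return {gene: log2(treatment / control)} for mapped probes."""
    out = {}
    for probe, value in treatment.items():
        gene = PROBE_TO_GENE.get(probe)
        if gene is not None and probe in control:
            out[gene] = math.log2(value / control[probe])
    return out

print(standardize({"1007_s_at": 8.0, "1053_at": 2.0},
                  {"1007_s_at": 2.0, "1053_at": 2.0}))
# {'DDR1': 2.0, 'RFC2': 0.0}
```

Once every platform's values are expressed against common identifiers and a common scale, experiments from different series become directly comparable.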

The benefit from this work begins where “merging” in the topic map sense ends.

One point of curiosity, among many: are their rule-based pattern expressions for merging replicates interchangeable?

Even if the pattern expression language left execution up to the user, reliably exchanging mappings would be quite useful.

Perhaps a profile of an existing pattern expression language?

To avoid having to write one from scratch?
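A sketch of what such an exchange might look like: express replicate-matching patterns as plain data (here JSON carrying regular expressions), so two systems could trade the mapping without sharing code. The rule format, field names, and sample titles are all invented for illustration.

```python
import json
import re

# Sketch of exchangeable merge rules: replicate-matching patterns
# serialized as JSON, applied by whatever engine the receiver has.
# The rule format and sample data are invented for illustration.

rules_json = json.dumps([
    {"field": "title", "pattern": r"(.*)\s+rep(?:licate)?\s*\d+$"}
])

def replicate_key(sample, rules):
    """Return a grouping key if any rule matches, else the title."""
    for rule in rules:
        m = re.match(rule["pattern"], sample[rule["field"]], re.I)
        if m:
            return m.group(1)
    return sample["title"]

rules = json.loads(rules_json)
print(replicate_key({"title": "liver 6h rep 2"}, rules))        # liver 6h
print(replicate_key({"title": "liver 6h replicate 1"}, rules))  # liver 6h
```

Both samples map to the same key, so they would be merged as replicates by any consumer applying the same serialized rule.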

### Majority voting and information theory: What am I missing?

Monday, July 29th, 2013

Majority voting and information theory: What am I missing? by Panos Ipeirotis.

From the post:

In crowdsourcing, redundancy is a common approach to ensure quality. One of the questions that arises in this setting is the question of equivalence. Let’s assume that a worker has a known probability q of giving a correct answer, when presented with a choice of n possible answers. If I want to simulate one high-quality worker of quality q, how many workers of lower quality q′ < q do we need?

If you step away from match / no match type merging tests for topics, the question that Panos poses comes into play.

There has been prior work in the area where the question was the impact of quality (q) being less than or greater than 0.5. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers by Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis.

Panos’ question is why he can’t achieve a theoretical quality of 1.0 using two workers with q = 0.85.

I agree that using high quality workers in series can improve overall results. However, as I responded on his blog post, probabilities are not additive.

They remain probabilities. Two 0.85 workers in series could, on occasion, transmit an answer perfectly. But that is only one outcome out of the full set of possible outcomes.
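A quick check, under the standard assumption of independent workers answering a binary question by majority vote, shows why combining q = 0.85 workers improves quality without ever reaching 1.0:

```python
from math import comb

# Majority vote of k independent workers, each with accuracy q,
# on a binary question. Probabilities combine multiplicatively,
# not additively: quality rises with k but never reaches 1.0.

def majority_accuracy(q, k):
    """P(majority of k workers is correct), k odd, binary answers."""
    need = k // 2 + 1
    return sum(comb(k, i) * q**i * (1 - q)**(k - i)
               for i in range(need, k + 1))

for k in (1, 3, 5):
    print(k, majority_accuracy(0.85, k))
# accuracy climbs (0.85, ~0.939, ~0.973) but stays strictly below 1.0
```

Each additional worker shrinks the error probability but only by a factor, never to zero, which is the multiplicative behavior the response above points to.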

### Looking Ahead: Query Preview in Exploratory Search

Sunday, July 14th, 2013

From the post:

It is reasonably well-known that people who examine search results often don’t go past the first few hits, perhaps stopping at the “fold” or at the end of the first page. It’s a habit we’ve acquired due to high-quality results to precision-oriented information needs. Google has trained us well.

But this habit may not always be useful when confronted with uncommon, recall-oriented, information needs. That is, when doing research. Looking only at the top few documents places too much trust in the ranking algorithm. In our SIGIR 2013 paper, we investigated what happens when a light-weight preview mechanism gives searchers a glimpse at the distribution of documents — new, re-retrieved but not seen, and seen — in the query they are about to execute.

The preview divides the top 100 documents retrieved by a query into 10 bins, and builds a stacked bar chart that represents the three categories of documents. Each category is represented by a color. New documents are shown in teal, re-retrieved ones in the light blue shade, and documents the searcher has already seen in dark blue. The figures below show some examples:

(…)
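The binning the authors describe can be sketched like this (the data-structure and field names are mine, not from the paper's code):

```python
# Sketch of the preview computation described above: split the top
# results of the pending query into rank bins and count, per bin,
# how many documents are new, re-retrieved-but-unseen, or seen.
# Names are illustrative, not from the paper's implementation.

def preview_bins(results, retrieved_before, seen, bins=10):
    """results: ranked doc ids; returns per-bin (new, re, seen) counts."""
    size = max(1, len(results) // bins)
    counts = []
    for b in range(0, len(results), size):
        chunk = results[b:b + size]
        n_seen = sum(1 for d in chunk if d in seen)
        n_re = sum(1 for d in chunk
                   if d in retrieved_before and d not in seen)
        counts.append((len(chunk) - n_seen - n_re, n_re, n_seen))
    return counts

print(preview_bins(["d1", "d2", "d3", "d4"],
                   retrieved_before={"d2", "d3"}, seen={"d3"}, bins=2))
# [(1, 1, 0), (1, 0, 1)]
```

Each tuple feeds one stacked bar: teal for new, light blue for re-retrieved, dark blue for seen.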

The blog post is great but you really need to read the SIGIR paper in full.

Speaking of exploratory searching, is anyone working on exploratory merging?

That is, where a query containing a statement of synonymy or polysemy from a searcher results in exploratory merging of topics.

I am assuming that experts in a particular domain will see merging opportunities that eluded automatic processes.

Seems like a shame to waste their expertise, which could be captured to improve a topic map for future users.

The SIGIR paper:

Looking Ahead: Query Preview in Exploratory Search

Abstract:

Exploratory search is a complex, iterative information seeking activity that involves running multiple queries, finding and examining many documents. We introduced a query preview interface that visualizes the distribution of newly-retrieved and re-retrieved documents prior to showing the detailed query results. When evaluating the preview control with a control condition, we found effects on both people’s information seeking behavior and improved retrieval performance. People spent more time formulating a query and were more likely to explore search results more deeply, retrieved a more diverse set of documents, and found more different relevant documents when using the preview. With more time spent on query formulation, higher quality queries were produced and as consequence the retrieval results improved; both average residual precision and recall was higher with the query preview present.

### Aggregation Module – Phase 1 – Functional Design (ElasticSearch #3300)

Friday, July 12th, 2013

Aggregation Module – Phase 1 – Functional Design (ElasticSearch Issue #3300)

From the post:

The new aggregations module is due to elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as “faceting”. Facets, currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from ground up to support complex aggregations and thus limited. The main problem with the current implementation stem in the fact that they are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facet and the user is effectively bound to the top level aggregations that we defined and nothing more than that.

The goal with the new aggregations module is to break the barriers the current facet implementation put in place. The new name (“Aggregations”) also indicate the intention here – a generic yet extremely powerful framework for defining aggregations – any type of aggregation. The idea here is to have each aggregation defined as a “standalone” aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we’ve gained over the years working with facets and apply it when building the new framework.

(…)
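The composability the design describes can be sketched as aggregations-as-data, where any aggregation can embed others that compute within its scope. The helper functions and field names below are illustrative, not part of the Elasticsearch API:

```python
import json

# Sketch of composable aggregations: each aggregation is plain data
# that can nest inside another, which scopes its children's
# computation. Helper names and fields are illustrative only.

def terms_agg(name, field, **sub_aggs):
    agg = {name: {"terms": {"field": field}}}
    if sub_aggs:
        agg[name]["aggs"] = {}
        for sub in sub_aggs.values():
            agg[name]["aggs"].update(sub)
    return agg

def avg_agg(name, field):
    return {name: {"avg": {"field": field}}}

# A terms aggregation embedding an avg aggregation -- a mix the
# old one-level facets could not express:
body = {"aggs": terms_agg("by_author", "author",
                          sub=avg_agg("avg_score", "score"))}
print(json.dumps(body, indent=2))
```

Because aggregations are just nested data, any aggregation can serve as a top-level aggregation or as a child narrowing another's scope, which is the "mix and match at query time" the design aims for.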

If you have been following the discussion about “what would we do differently with topic maps” in the XTM group at LinkedIn, this will be of interest.

What is an aggregation if it is not a selection of items matching some criteria, which you can then “merge” together for presentation to a user?

Or “merge” together for further querying?

That is inconsistent with the imperative programming model of the TMDM, but it has the potential to open up distributed and parallel processing of topic maps.

Same paradigm but with greater capabilities.

### Succinct data structures for representing equivalence classes

Monday, June 24th, 2013

Succinct data structures for representing equivalence classes by Moshe Lewenstein, J. Ian Munro, and Venkatesh Raman.

Abstract:

Given a partition of an n element set into equivalence classes, we consider time-space tradeoffs for representing it to support the query that asks whether two given elements are in the same equivalence class. This has various applications including for testing whether two vertices are in the same component in an undirected graph or in the same strongly connected component in a directed graph.

We consider the problem in several models.

— Concerning labeling schemes where we assign labels to elements and the query is to be answered just by examining the labels of the queried elements (without any extra space): if each vertex is required to have a unique label, then we show that a label space of (\sum_{i=1}^n \lfloor {n \over i} \rfloor) is necessary and sufficient. In other words, \lg n + \lg \lg n + O(1) bits of space are necessary and sufficient for representing each of the labels. This slightly strengthens the known lower bound and is in contrast to the known necessary and sufficient bound of \lceil \lg n \rceil for the label length, if each vertex need not get a unique label.

–Concerning succinct data structures for the problem when the n elements are to be uniquely assigned labels from label set {1, 2, …n}, we first show that \Theta(\sqrt n) bits are necessary and sufficient to represent the equivalence class information. This space includes the space for implicitly encoding the vertex labels. We can support the query in such a structure in O(\lg n) time in the standard word RAM model. We then develop structures resulting in one where the queries can be supported in constant time using O({\sqrt n} \lg n) bits of space. We also develop space efficient structures where union operation along with the equivalence query can be answered fast.

On the down side, this technique would not support merging based on arbitrary choice of properties.

On the up side, this technique does support merging based on pre-determined properties for merging.

The latter being the more common case, I commend this article to you for a close read.
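The query the paper supports, whether two elements fall in the same equivalence class, is the one a classic union-find structure answers, albeit without the paper's succinct space bounds. A minimal sketch:

```python
# Classic union-find answering the paper's query -- "are x and y in
# the same equivalence class?" -- with path compression. It uses
# O(n log n) bits rather than the paper's succinct bounds, but the
# query interface is the same.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # compress path
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

    def same_class(self, x, y):
        return self.find(x) == self.find(y)

uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.same_class(0, 1), uf.same_class(1, 3))  # True False
```

For merging on pre-determined properties, element indices would be assigned once from those properties, after which equivalence queries and unions run in near-constant time.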