Archive for the ‘Set Intersection’ Category

UpSet: Visualization of Intersecting Sets [Authoring Topic Maps – Waterfall or Agile?]

Tuesday, November 3rd, 2015

UpSet: Visualization of Intersecting Sets by Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, Hanspeter Pfister.

From the post:

Understanding relationships between sets is an important analysis task that has received widespread attention in the visualization community. The major challenge in this context is the combinatorial explosion of the number of set intersections if the number of sets exceeds a trivial threshold. To address this, we introduce UpSet, a novel visualization technique for the quantitative analysis of sets, their intersections, and aggregates of intersections.

UpSet is focused on creating task-driven aggregates, communicating the size and properties of aggregates and intersections, and a duality between the visualization of the elements in a dataset and their set membership. UpSet visualizes set intersections in a matrix layout and introduces aggregates based on groupings and queries. The matrix layout enables the effective representation of associated data, such as the number of elements in the aggregates and intersections, as well as additional summary statistics derived from subset or element attributes.

Sorting according to various measures enables a task-driven analysis of relevant intersections and aggregates. The elements represented in the sets and their associated attributes are visualized in a separate view. Queries based on containment in specific intersections, aggregates or driven by attribute filters are propagated between both views. UpSet also introduces several advanced visual encodings and interaction methods to overcome the problems of varying scales and to address scalability.
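The rows of an UpSet matrix are just the exclusive intersections of the input sets with their sizes, i.e. the elements that belong to exactly one combination of sets. A minimal Python sketch of that counting (the set names and data are hypothetical, and this is only the bookkeeping behind the matrix, not the UpSet software):

```python
from itertools import combinations

# Hypothetical example: three sets of item ids.
sets = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 5, 6, 7},
}

def exclusive_intersections(sets):
    """Size of each exclusive intersection: elements belonging to
    exactly this combination of sets and to no other set.
    These are the rows of an UpSet-style matrix."""
    names = list(sets)
    rows = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            inside = set.intersection(*(sets[n] for n in combo))
            others = [sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            rows[combo] = len(inside - outside)
    return rows
```

Because the intersections are exclusive, every element is counted exactly once, which is what keeps the matrix rows from double-counting as the number of sets grows.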

Definitely paper and software to have on hand while you read and explore AggreSet, which I mentioned yesterday in: Exploring and Visualizing Pre-Topic Map Data.

Interested to hear your thoughts comparing the two.

Something to keep in mind is that topic map authoring can follow a waterfall model, where ontological decisions, merging criteria, etc. are worked out in advance, or an agile methodology that explores and iterates over the data, allowing the topic map to grow and evolve.

An evolutionary topic map could well miss places the waterfall method would catch, but if no one goes there, or goes there rarely, is that a real issue?

I must admit I am less than fond of “agile” methodologies, but that comes from a bad experience: an inappropriate person was put in charge of a project and thought a one-paragraph description was sufficient for a new CMS system built on Subversion. Sufficient because the project was “agile.” Fortunately that project was tanked after a long struggle with management.

Perhaps I should think about the potential use of “agile” methodologies in authoring and evolving topic maps.


Efficient comparison of sets of intervals with NC-lists

Thursday, April 11th, 2013

Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)


Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing performed only once, requires time O(#R log(#R) + #Q log(#Q)), where Q and R are the sets of query and reference intervals. The search phase requires constant space and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

Availability: The algorithm has been included in S–MART, a versatile toolbox for RNA-Seq analysis, freely available at The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily usable for the analysis of next-generation sequencing data.

Before you search for “NC-lists,” be aware that you will get this article as the first “hit” today in some popular search engines. Followed by a variety of lists for North Carolina.

A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.

The expansion helps: Nested Containment List (NCList).

Familiar if you are working in bioinformatics.

More generally, consider the need to compare complex sequences of values for merging purposes.

Not a magic bullet but a technique you should keep in mind.

Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)

Linux cheat sheets [Unix Sets Anyone?]

Saturday, September 15th, 2012

Linux cheat sheets

John D. Cook points to three new Linux cheat sheets from Peteris Krumins:

While investigating, I ran across:

Set Operations in the Unix Shell Simplified

From that post:

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it, and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanation of each operation, go to the original article.
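For comparison, the same core operations are one-liners on Python's built-in set type. The data below is hypothetical, and the shell commands in the comments are only approximate equivalents of the cheat sheet's recipes (they assume sorted, de-duplicated line-per-item files):

```python
# Two hypothetical sets, standing in for line-per-item text files.
a = {"apple", "banana", "cherry"}
b = {"banana", "date"}

union = a | b            # roughly: sort A B | uniq
intersection = a & b     # roughly: sort A B | uniq -d
difference = a - b       # roughly: comm -23 A B
symmetric = a ^ b        # roughly: comm -3 A B, columns merged
is_subset = b <= a       # roughly: test whether B - A is empty
```

The shell versions shine when the “sets” are large files you would rather not load into memory; for anything already in a program, the operators above are simpler and harder to get wrong.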

Fast Set Intersection in Memory [Foul! They Peeked!]

Monday, August 20th, 2012

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.


Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with n elements in total, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.
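The w in the bound is the machine word size: packing membership into words lets a single AND compare w candidate elements at once. A toy illustration using Python integers as arbitrarily wide bitsets (this shows only the word-parallel idea, not the paper's data structure, which hashes elements into small groups packed per word):

```python
def to_bitset(elements):
    """Pack a set of small non-negative integers into one int,
    using bit i to mark membership of element i."""
    bits = 0
    for e in elements:
        bits |= 1 << e
    return bits

def bitset_intersection(bits_a, bits_b):
    """Word-parallel intersection: one AND tests every packed
    element simultaneously, then set bits are read back out."""
    common = bits_a & bits_b
    out = []
    while common:
        low = common & -common          # isolate the lowest set bit
        out.append(low.bit_length() - 1)
        common ^= low                   # clear that bit
    return out
```

The catch, and the reason the paper needs hashing, is that a direct bitmap is only compact when the universe of elements is small relative to the sets.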

Important not only for the algorithm but how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. 😉

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?

Fast Secure Computation of Set Intersection

Tuesday, October 19th, 2010

Fast Secure Computation of Set Intersection by Stanisław Jarecki and Xiaomin Liu.


Secure Protocol for Computing Set Intersection and Extensions. Secure computation of set intersection (or secure evaluation of a set intersection function) is a protocol which allows two parties, sender S and receiver R, to interact on their respective input sets X and Y in such a way that R learns X ∩ Y and S learns nothing. Secure computation of set intersection has numerous useful applications: For example, medical institutions could find common patients without learning any information about patients that are not in the intersection, different security agencies could search for common items in their databases without revealing any other information, the U.S. Department of Homeland Security can quickly find if there is a match between a passenger manifest and its terrorist watch list, etc.
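A deliberately insecure strawman makes the protocol's goal concrete. Naively exchanging hashes of items leaks information, because low-entropy inputs (names, passport numbers) can be brute-forced from their digests; avoiding exactly that leak is what the paper's protocol is for. A Python sketch with hypothetical data, illustrative only:

```python
import hashlib

def digest(item):
    """Hash one item; deterministic, so anyone can brute-force it."""
    return hashlib.sha256(item.encode()).hexdigest()

# Sender S and receiver R hold private sets.
sender_items = {"alice", "bob", "carol"}
receiver_items = {"bob", "dave"}

# S sends only digests; R matches its own items against them.
sender_digests = {digest(x) for x in sender_items}
intersection = {x for x in receiver_items if digest(x) in sender_digests}

# WARNING: NOT secure. R (or an eavesdropper) can enumerate likely
# items and hash them to recover S's set. Real PSI protocols, like
# the one in this paper, prevent exactly this kind of offline attack.
```

The interface is the right one, though: R ends up with X ∩ Y and nothing is exchanged in the clear. The hard part is making the digests useless to anyone who did not already hold the matching item.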

Imagine partial sharing of a topic map in a secure environment.

The article has a useful review of work in this area.

Curious if this really prevents learning of additional information.

If the source is treated as a black box and subjects are projected on the basis of responses to different receivers, with mapping between those,…, well, that had better wait for a future post. (Or a contract from someone interested in breaching a secure system. 😉 )