Archive for the ‘Sets’ Category

Fast algorithm for successively merging k-overlapping sets?

Saturday, April 20th, 2013

Fast algorithm for successively merging k-overlapping sets?

As posted:

Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.

Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?

Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.

Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.

I saw this on Theoretical Computer Science Stack Exchange earlier today.

Reminded me of overlapping set members test for [subject identifiers], [item identifiers], [subject locators], [subject identifiers] and [item identifiers], property of the other, [reified] properties. (Well, reified is a simple match, not a set.)

I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.

Working up a list of papers.

Efficient comparison of sets of intervals with NC-lists

Thursday, April 11th, 2013

Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)

Abstract:

Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.

Before you search for “NC-lists,” be aware that you will get this article as the first “hit” today in some popular search engines. Followed by a variety of lists for North Carolina.

A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.

The expansion helps: Nested Containment List (NCList).

Familiar if you are working in bioinformatics.

More generally, consider the need to compare complex sequences of values for merging purposes.

Not a magic bullet but a technique you should keep in mind.

Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)

Confluently Persistent Sets and Maps

Wednesday, January 23rd, 2013

Confluently Persistent Sets and Maps by Olle Liljenzin.

Abstract:

Ordered sets and maps play important roles as index structures in relational data models. When a shared index in a multi-user system is modified concurrently, the current state of the index will diverge into multiple versions containing the local modifications performed in each work flow. The confluent persistence problem arises when versions should be melded in commit and refresh operations so that modifications performed by different users become merged.

Confluently Persistent Sets and Maps are functional binary search trees that support efficient set operations both when operands are disjoint and when they are overlapping. Treap properties with hash values as priorities are maintained and with hash-consing of nodes a unique representation is provided. Non-destructive set merge algorithms that skip inspection of equal subtrees and a conflict detecting meld algorithm based on set merges are presented. The meld algorithm is used in commit and refresh operations. With m modifications in one flow and n items in total, the expected cost of the operations is O(m log(n/m)).

Is this an avenue for coordination between distinct topic maps?

Or is consistency of distinct topic maps an application-based requirement?

Count unique items in a text file using Erlang

Wednesday, October 17th, 2012

Count unique items in a text file using Erlang by Paolo D’Incau.

From the post:

Many times during our programming daily routine, we have to deal with log files. Most of the log files I have seen so far are just text files where the useful information are stored line by line.

Let’s say you are implementing a super cool game backend in Erlang, probably you would end up with a bunch of servers implementing several actions (e.g. authentication, chat, store character progress etc etc); well I am pretty sure you would not store the characters info in a text file, but maybe (and I said maybe) you could find useful to store in a text file some of the information that comes from the authentication server.

Unique in the sense you are thinking.

But that happens, even in topic maps.

Linux cheat sheets [Unix Sets Anyone?]

Saturday, September 15th, 2012

Linux cheat sheets

John D. Cook points to three new Linux cheat sheets from Peteris Krumins:

While investigating, I ran across:

Set Operations in the Unix Shell Simplified

From that post:

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanations of each operation, go to the original article.

Fast Set Intersection in Memory [Foul! They Peeked!]

Monday, August 20th, 2012

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.

Abstract:

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition,we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.

Important not only for the algorithm but how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. ;-)

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?

What’s the Difference? Efficient Set Reconciliation without Prior Context

Monday, August 6th, 2012

What’s the Difference? Efficient Set Reconciliation without Prior Context by David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese.

Abstract:

We describe a synopsis structure, the Difference Digest, that allows two nodes to compute the elements belonging to the set difference in a single round with communication overhead proportional to the size of the difference times the logarithm of the keyspace. While set reconciliation can be done efficiently using logs, logs require overhead for every update and scale poorly when multiple users are to be reconciled. By contrast, our abstraction assumes no prior context and is useful in networking and distributed systems applications such as trading blocks in a peer-to-peer network, and synchronizing link-state databases after a partition.

Our basic set-reconciliation method has a similarity with the peeling algorithm used in Tornado codes [6], which is not surprising, as there is an intimate connection between set difference and coding. Beyond set reconciliation, an essential component in our Difference Digest is a new estimator for the size of the set difference that outperforms min-wise sketches [3] for small set differences.

Our experiments show that the Difference Digest is more efficient than prior approaches such as Approximate Reconciliation Trees [5] and Characteristic Polynomial Interpolation [17]. We use Difference Digests to implement a generic KeyDiff service in Linux that runs over TCP and returns the sets of keys that differ between machines.

Distributed topic maps anyone?

GNU C++ hash_set vs STL std::set: my notebook

Tuesday, July 10th, 2012

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares the C++ template set of the C++ Standard Template library to the GNU non-standard hash-based set on a set of random numbers to insert/remove.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.

Operations on soft sets revisited

Tuesday, May 15th, 2012

Operations on soft sets revisited by Ping Zhu and Qiaoyan Wen.

Abstract:

Soft sets, as a mathematical tool for dealing with uncertainty, have recently gained considerable attention, including some successful applications in information processing, decision, demand analysis, and forecasting. To construct new soft sets from given soft sets, some operations on soft sets have been proposed. Unfortunately, such operations cannot keep all classical set-theoretic laws true for soft sets. In this paper, we redefine the intersection, complement, and difference of soft sets and investigate the algebraic properties of these operations along with a known union operation. We find that the new operation system on soft sets inherits all basic properties of operations on classical sets, which justifies our definitions.

An interesting paper will get you interested in soft sets if you aren’t already.

It isn’t easy going, even with the Alice and Bob examples, which I am sure the authors found immediately intuitive.

If you have data where numeric values cannot be assigned, it will be worth your while to explore this paper and the literature on soft sets.

data modelling and FRBR WEMI ontology

Tuesday, February 21st, 2012

data modelling and FRBR WEMI ontology

Jonathan Rochkind writes to defend the FRBR WEMI ontology:

Karen Coyle writes on the RDA listserv:

FRBR claims to be based on a “relational” model, as in “relational database.” That is not tomorrow’s data model; it is yesterday’s, although it is a step toward tomorrow’s model. The difficulty is that FRBR was conceived of in the early 1990′s, and completed in the late 1990′s. That makes it about 15 years old.

I think it would have been just as much a mistake to tie the FRBR model to an RDF model as it would have/was to tie it to a relational database model. Whatever we come up with is going to last us more than 15 years, and things will change again. Now, I’ll admit that I’m heretically still suspicious that an RDF data model will in fact be ‘the future’. But even if it is, there will be another future (or simultaneous futures plural).

And concludes:

I tend to think they should have just gone with ‘set theory’ oriented language, because it is, I think, the most clear, while still being abstract enough to make it harder to think the WEMI ontology is tied to some particular technology like relational databases OR linked data. I think WEMI gets it right regardless of whether you speak in the language of ‘relational’, ‘set theory’, ‘object orientation’ or ‘linked data’/RDF.

Leaving my qualms about RDF to one side, I write to point out that choosing “set theory” is a choice of a particular technology or if you like, tradition.

If that sounds odd, consider how many times you have used set theory in the last week, month, year? Unless you are a logician or introductory mathematics professor, the odds are that the number is zero (0) (or the empty set, {},for any logicians reading this post).

Choosing “set theory” is to choose a methodology that very few people use in practice. The vast majority of people make choices, evaluate outcomes, live complex lives innocent of the use of set theory.

I don’t object to FRBR or other efforts choosing to use “set theory” but recognize it is a minority practice.

One that elevates a minority over the majority of users.

SEISA: set expansion by iterative similarity aggregation

Friday, April 1st, 2011

SEISA: set expansion by iterative similarity aggregation by Yeye He, University of Wisconsin-Madison, Madison, WI, USA, and Dong Xin, Microsoft Research, Redmond, WA, USA.

In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.

To the uses of set expansion mentioned by the authors:

Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13]

I would add:

  • augmented authoring of navigation tools for text corpora
  • discovery of related entities (for associations)

While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could be just as easily applied to commercial texts or used to generate pay-for-view results.

It would have to really be a step up to get people to pay a premium for navigation of free content, but given the noisy nature of most information sites, that is certainly possible.

Who Identified Roger Magoulas?

Monday, January 31st, 2011

Did you know that Roger Magoulas appears 28 times on the O’Reilly website? (as of 01-29-2010)

With the following 5 hyperlink texts:

Can you name the year that Tim O’Reilly used a hyperlink for Roger Magoulas three times but hasn’t since then?

One consistent resolution for Roger Magoulas, reflecting updates and presented without hand-authoring HTML would be nice.

But, that’s just me.

What do you think?

Pseudo-Code: A New Definition

Monday, January 31st, 2011

How to Speed up Machine Learning using a Set-Oriented Approach

The detail article for Need faster machine learning? Take a set-oriented approach, which I mentioned in a separate post.

Well, somewhat more detail.

Gives new meaning to pseudo-code:

The application side becomes:

Computing the model:

Fetch “compute-model over data items”

Classifying new items:

Fetch “classify over data items”

I am reminded of the cartoon with two people at a blackboard and one of them says: I think you should be more explicit in step two., where the text reads: Then a miracle occurs.

How about you?

Need faster machine learning? Take a
set-oriented approach

Saturday, January 29th, 2011

Need faster machine learning? Take a set-oriented approach.

Roger Magoulas, using not small iron reports:

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

or

set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results
  • Seems to me sameness of subject representation is a classification task. Yes?

    Going from days to minutes sounds attractive to me.

    How about you?

    Efficient set intersection for inverted indexing

    Monday, January 10th, 2011

    Efficient set intersection for inverted indexing Authors: J. Shane Culpepper, Alistair Moffat Keywords: Compact data structures, information retrieval, set intersection, set representation, bitvector, byte-code

    Abstract:

    Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.

    The treatment of set intersection caught my attention.

    Unlike document sets, topic maps have restricted sets of properties or property values that will form the basis for set intersection (merging in topic maps lingo).

    Topic maps also differ in that identity bearing properties are never ignored, whereas in searching a reverse index, terms can be included in the index that are ignored in a particular query.

    What impact those characteristics will have on set intersection for topic maps remains a research question.

    Fast Secure Computation of Set Intersection

    Tuesday, October 19th, 2010

    Fast Secure Computation of Set Intersection Authors: Stanis?aw Jarecki and Xiaomin Liu

    Introduction:

    Secure Protocol for Computing Set Intersection and Extensions. Secure computation of set intersection (or secure evaluation of a set intersection function) is a protocol which allows two parties, sender S and receiver R, to interact on their respective input sets X and Y in such a way that R learns X ? Y and S learns nothing. Secure computation of set intersection has numerous useful applications: For example, medical institutions could find common patients without learning any information about patients that are not in the intersection, different security agencies could search for common items in their databases without revealing any other information, the U.S. Department of Homeland Security can quickly find if there is a match between a passenger manifest and its terrorist watch list, etc.

    Imagine partial sharing of a topic map in a secure environment.

    The article has a useful review of work in this area.

    Curious if this really prevents learning of additional information.

    If the source is treated as a black box and subjects are projected on the basis of responses to different receivers, with mapping between those,…, well, that had better wait for a future post. (Or a contract from someone interested in breaching a secure system. ;-) )