Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 3, 2015

UpSet: Visualization of Intersecting Sets [Authoring Topic Maps – Waterfall or Agile?]

Filed under: Set Intersection,Sets,Topic Maps — Patrick Durusau @ 8:43 pm

UpSet: Visualization of Intersecting Sets by Alexander Lex, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, Hanspeter Pfister.

From the post:

Understanding relationships between sets is an important analysis task that has received widespread attention in the visualization community. The major challenge in this context is the combinatorial explosion of the number of set intersections if the number of sets exceeds a trivial threshold. To address this, we introduce UpSet, a novel visualization technique for the quantitative analysis of sets, their intersections, and aggregates of intersections.

UpSet is focused on creating task-driven aggregates, communicating the size and properties of aggregates and intersections, and a duality between the visualization of the elements in a dataset and their set membership. UpSet visualizes set intersections in a matrix layout and introduces aggregates based on groupings and queries. The matrix layout enables the effective representation of associated data, such as the number of elements in the aggregates and intersections, as well as additional summary statistics derived from subset or element attributes.

Sorting according to various measures enables a task-driven analysis of relevant intersections and aggregates. The elements represented in the sets and their associated attributes are visualized in a separate view. Queries based on containment in specific intersections, aggregates or driven by attribute filters are propagated between both views. UpSet also introduces several advanced visual encodings and interaction methods to overcome the problems of varying scales and to address scalability.

Definitely a paper and software to have on hand while you read and explore AggreSet, which I mentioned yesterday in Exploring and Visualizing Pre-Topic Map Data.

Interested to hear your thoughts comparing the two.

Something to keep in mind: topic map authoring can follow a waterfall model, where ontological decisions, merging criteria, etc. are worked out in advance, or an agile methodology that explores the data, iterates over it, and allows the topic map to grow and evolve.

An evolutionary topic map could well miss places the waterfall method would catch, but if no one goes there, or goes there only rarely, is that a real issue?

I must admit I am less than fond of “agile” methodologies, but that comes from a bad experience in which an inappropriate person was put in charge of a project and thought a one-paragraph description was sufficient for a new CMS built on Subversion. Sufficient because the project was “agile.” Fortunately that project was tanked after a long struggle with management.

Perhaps I should think about the potential use of “agile” methodologies in authoring and evolving topic maps.

Suggestions/comments?

November 2, 2015

Exploring and Visualizing Pre-Topic Map Data

Filed under: Aggregation,Data Aggregation,Sets,Topic Maps,Visualization — Patrick Durusau @ 3:06 pm

AggreSet: Rich and Scalable Set Exploration using Visualizations of Element Aggregations by M. Adil Yalçın, Niklas Elmqvist, and Benjamin B. Bederson.

Abstract:

Datasets commonly include multi-value (set-typed) attributes that describe set memberships over elements, such as genres per movie or courses taken per student. Set-typed attributes describe rich relations across elements, sets, and the set intersections. Increasing the number of sets results in a combinatorial growth of relations and creates scalability challenges. Exploratory tasks (e.g. selection, comparison) have commonly been designed in separation for set-typed attributes, which reduces interface consistency. To improve on scalability and to support rich, contextual exploration of set-typed data, we present AggreSet. AggreSet creates aggregations for each data dimension: sets, set-degrees, set-pair intersections, and other attributes. It visualizes the element count per aggregate using a matrix plot for set-pair intersections, and histograms for set lists, set-degrees and other attributes. Its non-overlapping visual design is scalable to numerous and large sets. AggreSet supports selection, filtering, and comparison as core exploratory tasks. It allows analysis of set relations including subsets, disjoint sets and set intersection strength, and also features perceptual set ordering for detecting patterns in set matrices. Its interaction is designed for rich and rapid data exploration. We demonstrate results on a wide range of datasets from different domains with varying characteristics, and report on expert reviews and a case study using student enrollment and degree data with assistant deans at a major public university.

These two videos will give you a better overview of AggreSet than I can. The first one is about 30 seconds and the second one about 5 minutes.

The visualization of characters from Les Misérables (the second video) is a dynamite demonstration of how you could explore pre-topic map data with an eye towards creating roles and associations between characters as well as with the text.

The first use case that comes to mind would be harvesting fan posts on Harry Potter and crossing them with a similar listing of characters from the Harry Potter book series, with author, date, book, character, etc. relationships.

While you are at the GitHub site: https://github.com/adilyalcin/Keshif/tree/master/AggreSet, be sure to bounce up a level to Keshif:

Keshif is a web-based tool that lets you browse and understand datasets easily.

To start using Keshif:

  • Get the source code from github,
  • Explore the existing datasets and their source codes, and
  • Check out the wiki.

Or just go directly to the Keshif site, with 110 datasets (as of today).

For the impatient, see Loading Data.

For the even more impatient:

You can load data to Keshif from:

  • Google Sheets
  • Text File
    • On Google Drive
    • On Dropbox
    • File on your webserver

Text File Types

Keshif can be used with the following data file types:

  • CSV / TSV
  • JSON
  • XML
  • Any other file type that you can load and parse in JavaScript. See Custom Data Loading

Hint: The dataset explorer at the frontpage indexes demos by file type and resource. Filter by data source to find example source code on how to apply a specific file loading approach.

The critical factor, in addition to its obvious usefulness, is that it works in a web browser. You don’t have to install software, set Java paths, download additional libraries, etc.

Are you using the modern web browser as your target for user facing topic map applications?

I first saw this in a tweet by Christophe Lalanne.

December 22, 2014

Rethinking set theory

Filed under: Identifiers,Sets — Patrick Durusau @ 3:21 pm

Rethinking set theory by Tom Leinster.

From the introduction:

Mathematicians manipulate sets with confidence almost every day of their working lives. We do so whenever we work with sets of real or complex numbers, or with vector spaces, topological spaces, groups, or any of the many other set-based structures. These underlying set-theoretic manipulations are so automatic that we seldom give them a thought, and it is rare that we make mistakes in what we do with sets.

However, very few mathematicians could accurately quote what are often referred to as ‘the’ axioms of set theory. We would not dream of working with, say, Lie algebras without first learning the axioms. Yet many of us will go our whole lives without learning ‘the’ axioms for sets, with no harm to the accuracy of our work. This suggests that we all carry around with us, more or less subconsciously, a reliable body of operating principles that we use when manipulating sets.

What if we were to write down some of these principles and adopt them as our axioms for sets? The message of this article is that this can be done, in a simple, practical way. We describe an axiomatization due to F. William Lawvere [3, 4], informally summarized in Fig. 1. The axioms suffice for very nearly everything mathematicians ever do with sets. So we can, if we want, abandon the classical axioms entirely and use these instead.

Don’t try to read this after a second or third piece of pie. 😉

What captured my interest was the following:

The root of the problem is that in the framework of ZFC, the elements of a set are always sets too. Thus, given a set X, it always makes sense in ZFC to ask what the elements of the elements of X are. Now, a typical set in ordinary mathematics is ℝ. But accost a mathematician at random and ask them ‘what are the elements of π?’, and they will probably assume they misheard you, or ask you what you’re talking about, or else tell you that your question makes no sense. If forced to answer, they might reply that real numbers have no elements. But this too is in conflict with ZFC’s usage of ‘set’: if all elements of ℝ are sets, and they all have no elements, then they are all the empty set, from which it follows that all real numbers are equal. (emphasis added)
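
To spell out the step the quote leans on (my gloss, via the Axiom of Extensionality):

\[
\big(\forall x \in \mathbb{R}.\ \neg\exists z.\ z \in x\big)
\;\Longrightarrow\;
\big(\forall x \in \mathbb{R}.\ x = \emptyset\big)
\;\Longrightarrow\;
\big(\forall x, y \in \mathbb{R}.\ x = y\big)
\]

Extensionality says sets with the same elements are equal, so sets with no elements are all equal to the empty set, and hence to each other.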

The author explores the perils of using “set” with two different meanings in ZFC and what it could mean to define “set” as it is used in practice by mathematicians.

For my part, the “…elements of a set are always sets too” resonates with the concept that all identifiers can be resolved into identifiers.

For example: firstName = Patrick.

The token firstName, despite its popularity on customs forms, is not a semantic primitive recognized by all readers. While for some processing purposes, by agents hired to delay, harass and harry tired travelers, firstName is sufficient, it can in fact be resolved into tuples that represent equivalences to firstName or provide additional information about that identifier.

For example:

name = "firstName"

alt = "given name"

alt = "forename"

alt = "Christian name"

Which slightly increases my chances of finding an equivalent, if I am not familiar with firstName. I say “slightly increases” because names of individual people are subject to a rich heritage of variation based on language, culture, custom, all of which have changed over time. The example is just a tiny number of possible alternatives possible in English.
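
Here is a minimal sketch (Python, with a made-up expansion table purely for illustration) of how such an expansion could be used to decide whether an unfamiliar identifier is equivalent to one you already know:

# Hypothetical table: each identifier resolves into further identifiers.
expansions = {
    "firstName": {"given name", "forename", "Christian name"},
    "prenom":    {"given name", "first name"},   # e.g., from a French data source
}

def equivalent(id_a, id_b):
    """Treat two identifiers as equivalent if either names the other
    or their expansions share at least one identifier."""
    a = expansions.get(id_a, set()) | {id_a}
    b = expansions.get(id_b, set()) | {id_b}
    return bool(a & b)

print(equivalent("firstName", "prenom"))   # True, via the shared "given name"

The same caveat applies: one level of expansion only “slightly increases” the chance of a match, and nothing stops you from expanding the expansions in turn.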

My saying “…it can in fact be resolved…” should not be taken to require that every identifier be so resolved, or that the resulting identifiers extend to some particular level of resolution. Note that we could similarly expand forename or alt, and the identifiers we find in their expansions.

The question that a topic maps designer has to answer is “what expansions of identifiers are useful for a particular set of uses?” Do the identifiers need to survive their current user? (Think legacy ETL.) Will the data need to be combined with data using other identifiers? Will queries need to be made across data sets with conflicting identifiers? Is there data that could be merged on a subject by subject basis? Is there any value in a subject by subject merging?

To echo a sentiment that I heard in Leading from the Back: Making Data Science Work at a UX-driven Business, it isn’t the fact you can merge information about a subject that’s important. It is the value-add to a customer that results from that merging that is important.

Value-add for customers before toys for IT.*

I first saw this in a tweet by onepaperperday.

*This is a tough one for me, given my interests in language and theory. But I am trying to do better.

April 20, 2013

Fast algorithm for successively merging k-overlapping sets?

Filed under: Merging,Overlapping Sets,Sets — Patrick Durusau @ 12:56 pm

Fast algorithm for successively merging k-overlapping sets?

As posted:

Consider the following algorithm for clustering sets: Begin with n sets, S1, S2,…,Sn, such that sum_{i = 1}^n |Si| = m, and successively merge sets with at least k elements in common. E.g., if S1 = {1, 2, 3}, S2 = {3, 4, 5}, and S3 = {5, 6, 7}, and k = 1 then S1 can be merged with S2 to create S1′ = {1, 2, 3, 4, 5}, and S1′ can be merged with S3 to create S1” = {1,…,7}.

Warmup question: Can this clustering algorithm be implemented efficiently for k = 1?

Answer to warmup question: If the sets only need to overlap by one element to be merged as in the example above, the clustering can be performed in O(m) time using connected components if you are careful.

Harder question: Suppose the sets must overlap by at least 2 (or k) elements to be merged. Is there an efficient algorithm for this case (i.e., close to linear time)? The challenge here is that you can have cases like S1 = {1, 2, 3}, S2 = {2, 4, 5}, S3 = {1, 4, 5}, with k = 2. Note that in this case S1 can be merged with S2 and S3, but only after S2 and S3 are merged to create S2′ so that S1 and S2′ share both elements 1 and 2.
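
As a concrete reading of the warmup answer above, here is a minimal union-find sketch (Python) for the k = 1 case; the running time is essentially linear in m:

def merge_overlapping(sets):
    """Merge sets that share at least one element (the k = 1 case) via union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent.setdefault(x, x)
        parent.setdefault(y, y)
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    for s in sets:                          # union each set's elements together
        items = list(s)
        if items:
            parent.setdefault(items[0], items[0])
        for item in items[1:]:
            union(items[0], item)

    clusters = {}                           # group elements by their root
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

print(merge_overlapping([{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]))   # [{1, 2, 3, 4, 5, 6, 7}]

The harder k ≥ 2 case is exactly what the question leaves open.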

I saw this on Theoretical Computer Science Stack Exchange earlier today.

Reminded me of the overlapping set member tests for [subject identifiers], [item identifiers], [subject locators], a [subject identifier] of one matching an [item identifier] of the other, and [reified] properties. (Well, reified is a simple match, not a set.)

I have found some early work on the overlapping set member question but also work on other measures of similarity on set members.

Working up a list of papers.

April 11, 2013

Efficient comparison of sets of intervals with NC-lists

Filed under: Bioinformatics,Set Intersection,Sets — Patrick Durusau @ 1:00 pm

Efficient comparison of sets of intervals with NC-lists by Matthias Zytnicki, YuFei Luo and Hadi Quesneville. (Bioinformatics (2013) 29 (7): 933-939. doi: 10.1093/bioinformatics/btt070)

Abstract:

Motivation: High-throughput sequencing produces in a small amount of time a large amount of data, which are usually difficult to analyze. Mapping the reads to the transcripts they originate from, to quantify the expression of the genes, is a simple, yet time demanding, example of analysis. Fast genomic comparison algorithms are thus crucial for the analysis of the ever-expanding number of reads sequenced.

Results: We used NC-lists to implement an algorithm that compares a set of query intervals with a set of reference intervals in two steps. The first step, a pre-processing done once for all, requires time O[#R log(#R) + #Q log(#Q)], where Q and R are the sets of query and reference intervals. The search phase requires constant space, and time O(#R + #Q + #M), where M is the set of overlaps. We showed that our algorithm compares favorably with five other algorithms, especially when several comparisons are performed.

Availability: The algorithm has been included to S–MART, a versatile tool box for RNA-Seq analysis, freely available at http://urgi.versailles.inra.fr/Tools/S-Mart. The algorithm can be used for many kinds of data (sequencing reads, annotations, etc.) in many formats (GFF3, BED, SAM, etc.), on any operating system. It is thus readily useable for the analysis of next-generation sequencing data.

Before you search for “NC-lists,” be aware that you will get this article as the first “hit” today in some popular search engines, followed by a variety of lists for North Carolina.

A more useful search engine would allow me to choose the correct usage of a term and to re-run the query using the distinguished subject.

The expansion helps: Nested Containment List (NCList).

Familiar if you are working in bioinformatics.

More generally, consider the need to compare complex sequences of values for merging purposes.

Not a magic bullet but a technique you should keep in mind.
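
For readers who just want the flavor of the underlying task, here is a minimal two-pointer sketch (Python; my own baseline, not the NC-list algorithm from the paper) that reports overlapping pairs between two interval lists sorted by start position:

def overlapping_pairs(queries, references):
    """Report (query, reference) pairs of half-open intervals that overlap.
    Both inputs are lists of (start, end) tuples sorted by start."""
    out, active, ri = [], [], 0
    for qs, qe in queries:
        while ri < len(references) and references[ri][0] < qe:
            active.append(references[ri])          # reference starts before query ends
            ri += 1
        active = [(rs, re) for rs, re in active if re > qs]   # drop finished references
        out.extend(((qs, qe), (rs, re)) for rs, re in active if rs < qe)
    return out

print(overlapping_pairs([(1, 5), (10, 12)], [(0, 2), (4, 11), (20, 30)]))

The sketch degrades when many intervals stay “active” at once, which is part of what NC-lists are designed to handle.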

Origin: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Alexander V. Alekseyenko and Christopher J. Lee. (Bioinformatics (2007) 23 (11): 1386-1393. doi: 10.1093/bioinformatics/btl647)

January 23, 2013

Confluently Persistent Sets and Maps

Filed under: Functional Programming,Maps,Python,Sets — Patrick Durusau @ 7:42 pm

Confluently Persistent Sets and Maps by Olle Liljenzin.

Abstract:

Ordered sets and maps play important roles as index structures in relational data models. When a shared index in a multi-user system is modified concurrently, the current state of the index will diverge into multiple versions containing the local modifications performed in each work flow. The confluent persistence problem arises when versions should be melded in commit and refresh operations so that modifications performed by different users become merged.

Confluently Persistent Sets and Maps are functional binary search trees that support efficient set operations both when operands are disjoint and when they are overlapping. Treap properties with hash values as priorities are maintained and with hash-consing of nodes a unique representation is provided. Non-destructive set merge algorithms that skip inspection of equal subtrees and a conflict detecting meld algorithm based on set merges are presented. The meld algorithm is used in commit and refresh operations. With m modifications in one flow and n items in total, the expected cost of the operations is O(m log(n/m)).
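
The treap-with-hash-priorities construction is the paper's, but the underlying “functional binary search tree” idea shows up in a tiny path-copying sketch (Python, my own illustration): an insert copies only the nodes on the search path, so every earlier version survives and structure is shared between versions.

from collections import namedtuple

Node = namedtuple("Node", "key left right")

def insert(node, key):
    """Return a new tree containing key; the original tree is untouched."""
    if node is None:
        return Node(key, None, None)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node                                   # already present: share the subtree

def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

v1 = None
for k in (5, 2, 8):
    v1 = insert(v1, k)
v2 = insert(v1, 7)                                # v1 still has exactly three keys
print(contains(v1, 7), contains(v2, 7))           # False True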

Is this an avenue for coordination between distinct topic maps?

Or is consistency of distinct topic maps an application-based requirement?

October 17, 2012

Count unique items in a text file using Erlang

Filed under: Erlang,Sets — Patrick Durusau @ 9:08 am

Count unique items in a text file using Erlang by Paolo D’Incau.

From the post:

Many times during our programming daily routine, we have to deal with log files. Most of the log files I have seen so far are just text files where the useful information are stored line by line.

Let’s say you are implementing a super cool game backend in Erlang, probably you would end up with a bunch of servers implementing several actions (e.g. authentication, chat, store character progress etc etc); well I am pretty sure you would not store the characters info in a text file, but maybe (and I said maybe) you could find useful to store in a text file some of the information that comes from the authentication server.

Unique in the sense you are thinking.

But that happens, even in topic maps.

September 15, 2012

Linux cheat sheets [Unix Sets Anyone?]

Filed under: Linux OS,Set Intersection,Sets — Patrick Durusau @ 3:07 pm

Linux cheat sheets

John D. Cook points to three new Linux cheat sheets from Peteris Krumins.

While investigating, I ran across:

Set Operations in the Unix Shell Simplified

From that post:

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanations of each operation, go to the original article.

August 20, 2012

Fast Set Intersection in Memory [Foul! They Peeked!]

Filed under: Algorithms,Memory,Set Intersection,Sets — Patrick Durusau @ 4:06 pm

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.

Abstract:

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.
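
A toy illustration of where the w in that bound comes from (Python; my reading of the word-level parallelism, not the authors' data structure): pack each set into machine-word-sized bitmaps so that a single AND tests w candidate elements at once.

def to_bitmap(s, universe_size, word_bits=64):
    words = [0] * ((universe_size + word_bits - 1) // word_bits)
    for x in s:
        words[x // word_bits] |= 1 << (x % word_bits)
    return words

def intersect(a_words, b_words, word_bits=64):
    result = []
    for wi, (wa, wb) in enumerate(zip(a_words, b_words)):
        w = wa & wb                        # one AND covers word_bits candidates
        while w:                           # decode the surviving bits
            low = w & -w
            result.append(wi * word_bits + low.bit_length() - 1)
            w ^= low
    return result

a = to_bitmap({3, 64, 100, 200}, 256)
b = to_bitmap({3, 100, 201}, 256)
print(intersect(a, b))                     # [3, 100]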

Important not only for the algorithm but how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. 😉

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?

August 6, 2012

What’s the Difference? Efficient Set Reconciliation without Prior Context

Filed under: Distributed Systems,P2P,Set Reconciliation,Sets,Topic Map Software — Patrick Durusau @ 4:56 pm

What’s the Difference? Efficient Set Reconciliation without Prior Context by David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese.

Abstract:

We describe a synopsis structure, the Difference Digest, that allows two nodes to compute the elements belonging to the set difference in a single round with communication overhead proportional to the size of the difference times the logarithm of the keyspace. While set reconciliation can be done efficiently using logs, logs require overhead for every update and scale poorly when multiple users are to be reconciled. By contrast, our abstraction assumes no prior context and is useful in networking and distributed systems applications such as trading blocks in a peer-to-peer network, and synchronizing link-state databases after a partition.

Our basic set-reconciliation method has a similarity with the peeling algorithm used in Tornado codes [6], which is not surprising, as there is an intimate connection between set difference and coding. Beyond set reconciliation, an essential component in our Difference Digest is a new estimator for the size of the set difference that outperforms min-wise sketches [3] for small set differences.

Our experiments show that the Difference Digest is more efficient than prior approaches such as Approximate Reconciliation Trees [5] and Characteristic Polynomial Interpolation [17]. We use Difference Digests to implement a generic KeyDiff service in Linux that runs over TCP and returns the sets of keys that differ between machines.

Distributed topic maps anyone?

July 10, 2012

GNU C++ hash_set vs STL std::set: my notebook

Filed under: Deduplication,Hashing,Sets — Patrick Durusau @ 7:35 am

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares std::set from the C++ Standard Template Library with the non-standard GNU hash-based set, inserting and removing a set of random numbers.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.

May 15, 2012

Operations on soft sets revisited

Filed under: Sets,Soft Sets — Patrick Durusau @ 3:59 pm

Operations on soft sets revisited by Ping Zhu and Qiaoyan Wen.

Abstract:

Soft sets, as a mathematical tool for dealing with uncertainty, have recently gained considerable attention, including some successful applications in information processing, decision, demand analysis, and forecasting. To construct new soft sets from given soft sets, some operations on soft sets have been proposed. Unfortunately, such operations cannot keep all classical set-theoretic laws true for soft sets. In this paper, we redefine the intersection, complement, and difference of soft sets and investigate the algebraic properties of these operations along with a known union operation. We find that the new operation system on soft sets inherits all basic properties of operations on classical sets, which justifies our definitions.

An interesting paper that will get you interested in soft sets if you aren’t already.

It isn’t easy going, even with the Alice and Bob examples, which I am sure the authors found immediately intuitive.

If you have data where numeric values cannot be assigned, it will be worth your while to explore this paper and the literature on soft sets.
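
For orientation only, here is the usual textbook representation in a short Python sketch (a soft set over a universe U maps each parameter to a subset of U), with the commonly used restricted intersection; the operations the paper redefines differ in detail from this.

# Toy universe of houses described by non-numeric parameters (made-up example).
U = {"h1", "h2", "h3", "h4"}

# A soft set: each parameter maps to the subset of U it approximately describes.
F = {"cheap": {"h1", "h2"}, "wooden": {"h2", "h3"}}
G = {"cheap": {"h2", "h4"}, "modern": {"h1", "h4"}}

def restricted_intersection(F, G):
    """Intersect parameter-wise over the shared parameters only."""
    return {e: F[e] & G[e] for e in F.keys() & G.keys()}

print(restricted_intersection(F, G))   # {'cheap': {'h2'}}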

February 21, 2012

data modelling and FRBR WEMI ontology

Filed under: FRBR,RDF,Sets — Patrick Durusau @ 8:00 pm

data modelling and FRBR WEMI ontology

Jonathan Rochkind writes to defend the FRBR WEMI ontology:

Karen Coyle writes on the RDA listserv:

FRBR claims to be based on a “relational” model, as in “relational database.” That is not tomorrow’s data model; it is yesterday’s, although it is a step toward tomorrow’s model. The difficulty is that FRBR was conceived of in the early 1990′s, and completed in the late 1990′s. That makes it about 15 years old.

I think it would have been just as much a mistake to tie the FRBR model to an RDF model as it would have/was to tie it to a relational database model. Whatever we come up with is going to last us more than 15 years, and things will change again. Now, I’ll admit that I’m heretically still suspicious that an RDF data model will in fact be ‘the future’. But even if it is, there will be another future (or simultaneous futures plural).

And concludes:

I tend to think they should have just gone with ‘set theory’ oriented language, because it is, I think, the most clear, while still being abstract enough to make it harder to think the WEMI ontology is tied to some particular technology like relational databases OR linked data. I think WEMI gets it right regardless of whether you speak in the language of ‘relational’, ‘set theory’, ‘object orientation’ or ‘linked data’/RDF.

Leaving my qualms about RDF to one side, I write to point out that choosing “set theory” is a choice of a particular technology or if you like, tradition.

If that sounds odd, consider how many times you have used set theory in the last week, month, or year. Unless you are a logician or an introductory mathematics professor, the odds are that the number is zero (0) (or the empty set, {}, for any logicians reading this post).

Choosing “set theory” is to choose a methodology that very few people use in practice. The vast majority of people make choices, evaluate outcomes, live complex lives innocent of the use of set theory.

I don’t object to FRBR or other efforts choosing to use “set theory” but recognize it is a minority practice.

One that elevates a minority over the majority of users.

April 1, 2011

SEISA: set expansion by iterative similarity aggregation

Filed under: Aggregation,Sets — Patrick Durusau @ 4:11 pm

SEISA: set expansion by iterative similarity aggregation by Yeye He, University of Wisconsin-Madison, Madison, WI, USA, and Dong Xin, Microsoft Research, Redmond, WA, USA.

In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.

To the uses of set expansion mentioned by the authors:

Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13]

I would add:

  • augmented authoring of navigation tools for text corpora
  • discovery of related entities (for associations)

While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could just as easily be applied to commercial texts or used to generate pay-for-view results.

It would really have to be a step up to get people to pay a premium for navigation of free content, but given the noisy nature of most information sites, that is certainly possible.
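
For comparison, a naive baseline is easy to sketch (Python, my own toy version, nothing from the paper): score each candidate entity by how strongly it co-occurs with the seed entities across the extracted lists.

def expand(seeds, lists, top_k=3):
    """Rank candidate entities by co-occurrence with the seeds across lists."""
    seeds = set(seeds)
    scores = {}
    for items in lists:
        items = set(items)
        overlap = len(items & seeds)
        if overlap == 0:
            continue
        for candidate in items - seeds:
            scores[candidate] = scores.get(candidate, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

lists = [
    ["Canon", "Nikon", "Olympus", "Pentax"],
    ["Canon", "Olympus", "Sony"],
    ["Nikon", "Fuji", "Olympus"],
    ["Ford", "Toyota", "Honda"],
]
print(expand(["Canon", "Nikon"], lists))   # Olympus ranks first

On noisy web lists this kind of simple scoring degrades quickly, which is the gap iterative similarity aggregation is meant to close.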

January 31, 2011

Who Identified Roger Magoulas?

Filed under: Examples,Sets,Topic Maps — Patrick Durusau @ 7:26 am

Did you know that Roger Magoulas appears 28 times on the O’Reilly website? (as of 01-29-2010)

With the following 5 hyperlink texts:

Can you name the year that Tim O’Reilly used a hyperlink for Roger Magoulas three times but hasn’t since then?

One consistent resolution for Roger Magoulas, reflecting updates and presented without hand-authoring HTML, would be nice.

But, that’s just me.

What do you think?

Pseudo-Code: A New Definition

Filed under: Machine Learning,Sets,Subject Identity,Topic Maps — Patrick Durusau @ 7:24 am

How to Speed up Machine Learning using a Set-Oriented Approach

The detailed article behind Need faster machine learning? Take a set-oriented approach, which I mentioned in a separate post.

Well, somewhat more detail.

Gives new meaning to pseudo-code:

The application side becomes:

Computing the model:

Fetch “compute-model over data items”

Classifying new items:

Fetch “classify over data items”

I am reminded of the cartoon with two people at a blackboard, where the board reads “Then a miracle occurs” and one of them says “I think you should be more explicit in step two.”

How about you?
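
To make the complaint concrete, here is one hedged, toy reading (Python; not the article's implementation) of what those two “Fetch” steps would have to spell out for even a simple count-based classifier:

from collections import Counter

data = [                                   # hypothetical labeled items
    ("spam", {"free", "offer"}),
    ("spam", {"free", "money"}),
    ("ham",  {"meeting", "notes"}),
]

# "compute-model over data items": one pass building per-label feature counts.
model = {}
for label, features in data:
    model.setdefault(label, Counter()).update(features)

# "classify over data items": score every new item against every label.
def classify(items):
    return [max(model, key=lambda lbl: sum(model[lbl][f] for f in feats))
            for feats in items]

print(classify([{"free", "offer"}, {"meeting"}]))   # ['spam', 'ham']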

January 29, 2011

Need faster machine learning? Take a set-oriented approach

Filed under: Machine Learning,Sets,Subject Identity — Patrick Durusau @ 5:00 pm

Need faster machine learning? Take a set-oriented approach.

Roger Magoulas, using not-small iron, reports:

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

or

set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results
Seems to me sameness of subject representation is a classification task. Yes?

Going from days to minutes sounds attractive to me.

How about you?

January 10, 2011

Efficient set intersection for inverted indexing

Filed under: Data Structures,Information Retrieval,Sets — Patrick Durusau @ 4:08 pm

Efficient set intersection for inverted indexing by J. Shane Culpepper and Alistair Moffat.

Keywords: Compact data structures, information retrieval, set intersection, set representation, bitvector, byte-code

Abstract:

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
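
To ground the terminology, here is a small textbook-style baseline (Python; not the authors' hybrid method) for intersecting two sorted posting lists, using galloping (exponential) search to skip ahead in the longer list:

from bisect import bisect_left

def gallop(lst, target, lo):
    """First index >= lo whose value is >= target: double the step, then bisect."""
    hi = lo + 1
    while hi < len(lst) and lst[hi] < target:
        lo, hi = hi, hi + (hi - lo) * 2
    return bisect_left(lst, target, lo, min(hi + 1, len(lst)))

def intersect(small, large):
    """Intersect two sorted lists of document identifiers."""
    out, pos = [], 0
    for doc in small:
        pos = gallop(large, doc, pos)
        if pos < len(large) and large[pos] == doc:
            out.append(doc)
            pos += 1
    return out

print(intersect([3, 9, 40], [1, 2, 3, 5, 8, 9, 10, 11, 12, 40, 77]))   # [3, 9, 40]

Whether identity-bearing topic map properties behave more like the short list or the long one is part of the research question raised below.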

The treatment of set intersection caught my attention.

Unlike document sets, topic maps have restricted sets of properties or property values that will form the basis for set intersection (merging in topic maps lingo).

Topic maps also differ in that identity-bearing properties are never ignored, whereas in searching a reverse index, terms can be included in the index that are ignored in a particular query.

What impact those characteristics will have on set intersection for topic maps remains a research question.

October 19, 2010

Fast Secure Computation of Set Intersection

Filed under: Security,Set Intersection,Sets — Patrick Durusau @ 6:21 am

Fast Secure Computation of Set Intersection by Stanisław Jarecki and Xiaomin Liu.

Introduction:

Secure Protocol for Computing Set Intersection and Extensions. Secure computation of set intersection (or secure evaluation of a set intersection function) is a protocol which allows two parties, sender S and receiver R, to interact on their respective input sets X and Y in such a way that R learns X ∩ Y and S learns nothing. Secure computation of set intersection has numerous useful applications: For example, medical institutions could find common patients without learning any information about patients that are not in the intersection, different security agencies could search for common items in their databases without revealing any other information, the U.S. Department of Homeland Security can quickly find if there is a match between a passenger manifest and its terrorist watch list, etc.

Imagine partial sharing of a topic map in a secure environment.

The article has a useful review of work in this area.

I am curious whether this really prevents learning of additional information.

If the source is treated as a black box and subjects are projected on the basis of responses to different receivers, with mapping between those … well, that had better wait for a future post. (Or a contract from someone interested in breaching a secure system. 😉 )
