Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 25, 2011

A Compressed Self-Index for Genomic Databases

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:07 pm

A Compressed Self-Index for Genomic Databases by Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi.

Abstract:

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
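
To make the relative Lempel-Ziv idea concrete, here is a minimal sketch (not the authors’ implementation): it greedily parses a second genome as (position, length) copies from a reference genome, using naive matching where the real scheme would use an FM-index. Class and method names are mine. On two nearly identical strings the parse collapses to a handful of phrases, which is exactly the redundancy RLZ exploits.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of Relative Lempel-Ziv: compress a target string as
 *  (position, length) phrases copied from a reference string. Naive
 *  O(n*m) matching, for illustration only. */
public class RlzSketch {

    /** One phrase: copy 'len' chars from the reference at 'pos', or a single literal char if len == 0. */
    record Phrase(int pos, int len, char literal) {}

    static List<Phrase> parse(String reference, String target) {
        List<Phrase> phrases = new ArrayList<>();
        int i = 0;
        while (i < target.length()) {
            int bestPos = -1, bestLen = 0;
            // Find the longest prefix of target[i..] occurring anywhere in the reference.
            for (int p = 0; p < reference.length(); p++) {
                int len = 0;
                while (p + len < reference.length() && i + len < target.length()
                        && reference.charAt(p + len) == target.charAt(i + len)) {
                    len++;
                }
                if (len > bestLen) { bestLen = len; bestPos = p; }
            }
            if (bestLen == 0) {                       // no copy possible: emit a literal
                phrases.add(new Phrase(-1, 0, target.charAt(i)));
                i++;
            } else {
                phrases.add(new Phrase(bestPos, bestLen, '\0'));
                i += bestLen;
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        String reference = "ACGTACGTGGTTACGT";   // "first genome", stored in full
        String target    = "ACGTACGTGGATACGT";   // nearly identical second genome
        for (Phrase ph : parse(reference, target)) {
            System.out.println(ph.len() == 0
                ? "literal '" + ph.literal() + "'"
                : "copy " + ph.len() + " chars from reference position " + ph.pos());
        }
    }
}
```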

As the authors note, this is an area with a rapidly increasing need for efficient, effective indexing.

It would be a step forward to see a comparison of this method with other indexing techniques on a common genome data set.

I suppose I am presuming a common genome data set for indexing demonstrations.

Questions:

  • Is there a common genome data set for comparison of indexing techniques?
  • Are there other indexing techniques that should be included in a comparison?

Obviously important for topic maps used in genome projects.

But insights about identifying subjects that vary only slightly in one (or more) dimensions will be useful in other contexts as well.

An easy example would be isotopes. Let’s see, ah, celestial or other coordinate systems. Don’t know but would guess that spectra from stars/galaxies are largely common. (Do you know for sure?) What other data sets have subjects that are identified on the basis of small or incremental changes in a largely identical identifier?

A Faster Grammar-Based Self-Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:06 pm

A Faster Grammar-Based Self-Index by Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi.

Abstract:

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can add $\mathcal{O}(z \log \log z)$ words and obtain a compressed self-index for $S$ such that, given a pattern $P[1..m]$, we can list the $\mathrm{occ}$ occurrences of $P$ in $S$ in $\mathcal{O}(m^2 + (m + \mathrm{occ}) \log \log n)$ time. All previous self-indexes are either larger or slower in the worst case.

Updated version of the paper I covered at: A Faster LZ77-Based Index.

In a very real sense, indexing is fundamental to information retrieval: once information is placed in electronic storage, the only way to retrieve it is via an index. The index may be one-to-one with a memory location and hence not terribly efficient, but an index is part of every information retrieval transaction.

December 24, 2011

Cleo: Flexible, partial, out-of-order and real-time typeahead search

Filed under: Forward Index,Indexing,Typeahead Search — Patrick Durusau @ 4:44 pm

Cleo: Flexible, partial, out-of-order and real-time typeahead search

From the webpage:

Cleo is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead and autocomplete services. It is suitable for data sets of various sizes from different domains. The Cleo software library is published under the terms of the Apache Software License version 2.0, a copy of which has been included in the LICENSE file shipped with the Cleo distribution.

Not to be mistaken with query autocomplete, Cleo does not suggest search terms or queries. Cleo is a library for developing applications that can perform real typeahead queries and deliver instantaneous typeahead results/objects/elements as you type.

Cleo is also different from general-purpose search libraries because 1) it does not evaluate search terms but the prefixes of those terms, and 2) it enables search by means of Bloom Filter and forward indexes rather than inverted indexes.

You may be amused by the definition of “forward index” offered by NIST:

An index into a set of texts. This is usually created as the first step to making an inverted index.

Or perhaps more usefully from the Wikipedia entry on Index (Search Engine):

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index

Document   | Words
Document 1 | the,cow,says,moo
Document 2 | the,cat,and,the,hat
Document 3 | the,dish,ran,away,with,the,spoon

The rationale behind developing a forward index is that as documents are parsing, it is better to immediately store the words per document. The delineation enables Asynchronous system processing, which partially circumvents the inverted index update bottleneck.[19] The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
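
A minimal sketch of the forward-to-inverted conversion the Wikipedia entry describes, using the same three example documents; regrouping the (document, word) pairs by word is all there is to it:

```java
import java.util.*;

/** Sketch of the forward-to-inverted index conversion described above:
 *  collect (document, word) pairs, then regroup them by word. */
public class ForwardToInverted {
    public static void main(String[] args) {
        // Forward index: each document lists its words (the Wikipedia example).
        Map<String, List<String>> forward = new LinkedHashMap<>();
        forward.put("Document 1", List.of("the", "cow", "says", "moo"));
        forward.put("Document 2", List.of("the", "cat", "and", "the", "hat"));
        forward.put("Document 3", List.of("the", "dish", "ran", "away", "with", "the", "spoon"));

        // Inverted index: each word lists the documents containing it,
        // i.e. the same pairs "sorted by the words".
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : forward.entrySet()) {
            for (String word : doc.getValue()) {
                inverted.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        inverted.forEach((word, docs) -> System.out.println(word + " -> " + docs));
    }
}
```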

So, was it an indexing performance issue that led to the use of a “forward index” or was it some other capability?

Suggestions on what “typeahead search” would/could mean in a topic map context?

December 23, 2011

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Filed under: Edit Distance,Indexing,Similarity,String Matching — Patrick Durusau @ 4:28 pm

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

Type            | Example
range           | address in customer database
top-k           | results of search engine
all-pairs joins | pairs of proteins or genes
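
Since every query type in the table ultimately bottoms out in edit-distance computations, here is a minimal sketch of the plain Levenshtein dynamic program (the Bed-tree itself is about avoiding most of these computations, not changing the measure):

```java
/** Minimal sketch of the edit (Levenshtein) distance that all of the
 *  query types above are ultimately evaluated against. */
public class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // A "range" query with threshold 2 would accept this pair.
        System.out.println(distance("jonathon smith", "jonathan smyth"));  // prints 2
    }
}
```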

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used; they have to be created as well. It is puzzling that we are left to wonder how long it would take to build a Bed-tree index we could actually use.

Second, there are a couple of fairly serious citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons, but the articles cited, “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008” respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

Similarity Search on Bregman Divergence: Towards NonMetric Indexing

Filed under: Indexing,NonMetric Indexing — Patrick Durusau @ 4:28 pm

Similarity Search on Bregman Divergence: Towards NonMetric Indexing by Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, and Anthony K. H. Tung. (2009)

Abstract:

In this paper, we examine the problem of indexing over non-metric distance functions. In particular, we focus on a general class of distance functions, namely Bregman Divergence [6], to support nearest neighbor and range queries. Distance functions such as KL-divergence and Itakura-Saito distance, are special cases of Bregman divergence, with wide applications in statistics, speech recognition and time series analysis among others. Unlike in metric spaces, key properties such as triangle inequality and distance symmetry do not hold for such distance functions. A direct adaptation of existing indexing infrastructure developed for metric spaces is thus not possible. We devise a novel solution to handle this class of distance measures by expanding and mapping points in the original space to a new extended space. Subsequently, we show how state-of-the-art tree-based indexing methods, for low to moderate dimensional datasets, and vector approximation file (VA-file) methods, for high dimensional datasets, can be adapted on this extended space to answer such queries efficiently. Improved distance bounding techniques and distribution-based index optimization are also introduced to improve the performance of query answering and index construction respectively, which can be applied on both the R-trees and VA files. Extensive experiments are conducted to validate our approach on a variety of datasets and a range of Bregman divergence functions.
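
For readers who have not met Bregman divergences before, here is a small sketch of the two examples named in the abstract; running it shows the asymmetry that rules out metric-space indexing tricks. The vectors are made-up toy data:

```java
/** Sketch of two Bregman divergences from the abstract (KL and
 *  Itakura-Saito), illustrating why they are non-metric: the
 *  divergence is not symmetric, so D(p,q) != D(q,p) in general. */
public class BregmanSketch {
    // Generalized KL divergence: Bregman divergence of F(x) = sum x_i log x_i.
    static double kl(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++) {
            d += p[i] * Math.log(p[i] / q[i]) - p[i] + q[i];
        }
        return d;
    }

    // Itakura-Saito distance: Bregman divergence of F(x) = -sum log x_i.
    static double itakuraSaito(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++) {
            double r = p[i] / q[i];
            d += r - Math.log(r) - 1;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] p = {0.7, 0.2, 0.1};
        double[] q = {0.1, 0.3, 0.6};
        System.out.println("KL(p,q) = " + kl(p, q) + "  KL(q,p) = " + kl(q, p));
        System.out.println("IS(p,q) = " + itakuraSaito(p, q) + "  IS(q,p) = " + itakuraSaito(q, p));
        // The two directions differ, so triangle-inequality-based (metric) pruning cannot be used directly.
    }
}
```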

This paper hasn’t been cited a lot (yet) but I think it will be of particular interest to the topic map community. Mostly because natural languages, the stuff most users use to describe/identify subjects, are inherently nonmetric.

Just as a placeholder for future conversation, it occurs to me that there is a performance continuum for subject-sameness tests. How complex must matching become, with some number X of topics, before performance degrades? Or perhaps better, what are the comparative performance characteristics of different subject-sameness tests?

December 21, 2011

Self-Index based on LZ77 (thesis)

Filed under: Indexing,LZ77 — Patrick Durusau @ 7:23 pm

Self-Index based on LZ77 (thesis) by Sebastian Kreft, Gonzalo Navarro (advisor).

Abstract:

Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection. This fact makes the compressibility of the collection extremely high. For example, a collection with all different versions of a Wikipedia article can be compressed up to the 0.1% of its original space, using the Lempel-Ziv 1977 (LZ77) compression scheme.

Many of these repetitive collections handle huge amounts of text data. For that reason, we require a method to store them efficiently, while providing the ability to operate on them. The most common operations are the extraction of random portions of the collection and the search for all the occurrences of a given pattern inside the whole collection.

A self-index is a data structure that stores a text in compressed form and allows to find the occurrences of a pattern efficiently. On the other hand, self-indexes can extract any substring of the collection, hence they are able to replace the original text. One of the main goals when using these indexes is to store them within main memory.

In this thesis we present a scheme for random text extraction from text compressed with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called LZ-End, that efficiently extracts text using space close to that of LZ77.

The main contribution of this thesis is the first self-index based on LZ77/LZ-End and oriented to repetitive texts, which outperforms the state of the art (the RLCSA self-index) in many aspects. Finally, we present a corpus of repetitive texts, coming from several application domains. We aim at providing a standard set of texts for research and experimentation, hence this corpus is publicly available.
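
As a companion to the RLZ sketch earlier on this page, here is a naive LZ77 parse (longest match into the already-parsed prefix, plus one new character). It is illustration only, nothing like the engineered structures in the thesis; on repetitive input the phrase count stays tiny:

```java
import java.util.ArrayList;
import java.util.List;

/** Naive sketch of an LZ77 parse: each phrase copies the longest possible
 *  match from the text already seen, plus one new character. On repetitive
 *  collections (e.g. many versions of the same article) the number of
 *  phrases z is tiny compared to the text length. */
public class Lz77Sketch {
    record Phrase(int srcPos, int len, char next) {}

    static List<Phrase> parse(String s) {
        List<Phrase> phrases = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int bestPos = -1, bestLen = 0;
            for (int p = 0; p < i; p++) {                       // only look at the prefix already parsed
                int len = 0;
                while (i + len < s.length() - 1 && s.charAt(p + len) == s.charAt(i + len)) {
                    len++;
                }
                if (len > bestLen) { bestLen = len; bestPos = p; }
            }
            phrases.add(new Phrase(bestPos, bestLen, s.charAt(i + bestLen)));
            i += bestLen + 1;
        }
        return phrases;
    }

    public static void main(String[] args) {
        String v1 = "the quick brown fox";
        String v2 = "the quick brown fux";                      // a second, slightly edited "version"
        List<Phrase> parse = parse(v1 + v2);
        System.out.println(parse.size() + " phrases for " + (v1 + v2).length() + " characters");
        parse.forEach(System.out::println);
    }
}
```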

Despite world economic woes and instability in a number of places, something comes along that looks very promising and makes my day! It will take a while to read, but it looks well worth the effort.

December 9, 2011

Database Indexes for the Inquisitive Mind

Filed under: Database,Indexing — Patrick Durusau @ 8:14 pm

Database Indexes for The Inquisitive Mind by Nuno Job

From the post:

I used to be a developer advocate for an awesome database product called MarkLogic, a NoSQL Document Database for the Enterprise. Now it’s pretty frequent that people ask me about database stuff.

In here I’m going to try to explain some fun stuff you can do with indexes. Not going to talk about implementing them but just about what they solve.

The point here is to help you reason about the choices you have when you are implementing stuff to speed up your applications. I’m sure if you think an idea is smart and fun you’ll research what’s the best algorithm to implement it.

If you are curious about MarkLogic you can always check the Inside MarkLogic white-paper.

Very nice introduction to database indexes. There is more to learn, as much if not more than you would care to. 😉

December 5, 2011

Medical Text Indexer (MTI)

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:42 pm

Medical Text Indexer (MTI) (formerly the Indexing Initiative System (IIS))

From the webpage:

The MTI system consists of software for applying alternative methods of discovering MeSH headings for citation titles and abstracts and then combining them into an ordered list of recommended indexing terms. The top portion of the diagram consists of three paths, or methods, for creating a list of recommended indexing terms: MetaMap Indexing, Trigrams and PubMed Related Citations. The first two paths actually compute UMLS Metathesaurus® concepts which are passed to the Restrict to MeSH process. The results from each path are weighted and combined using the Clustering process. The system is highly parameterized not only by path weights but also by several parameters specific to the Restrict to MeSH and Clustering processes.

A prototype MTI system described below had two additional indexing methods which were removed because their results were subsumed by the three remaining methods.
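
As a rough illustration of the “weighted and combined” step, here is a hypothetical sketch: three paths each propose headings with scores, path weights are applied, and a single ranked list comes out. The path names, weights and headings are invented for the example, and the weighted sum is a much simpler stand-in for MTI’s actual clustering process:

```java
import java.util.*;

/** Hypothetical sketch of the "weight and combine" step described above:
 *  each indexing path proposes MeSH headings with scores, the paths are
 *  weighted, and a single ranked recommendation list comes out. The path
 *  names and weights here are made up; MTI's real parameters differ. */
public class CombinePaths {
    public static void main(String[] args) {
        Map<String, Map<String, Double>> paths = new LinkedHashMap<>();
        paths.put("MetaMap", Map.of("Neoplasms", 0.9, "Genes", 0.4));
        paths.put("Trigrams", Map.of("Neoplasms", 0.6, "Mutation", 0.5));
        paths.put("RelatedCitations", Map.of("Genes", 0.7, "Neoplasms", 0.3));

        Map<String, Double> pathWeight = Map.of("MetaMap", 1.0, "Trigrams", 0.5, "RelatedCitations", 0.8);

        // Weighted sum of each heading's scores across paths.
        Map<String, Double> combined = new HashMap<>();
        paths.forEach((path, terms) ->
            terms.forEach((term, score) ->
                combined.merge(term, score * pathWeight.get(path), Double::sum)));

        combined.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .forEach(e -> System.out.printf("%-12s %.2f%n", e.getKey(), e.getValue()));
    }
}
```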

Deeply interesting and relevant work to topic maps.

November 10, 2011

Indexing Sound: Musical Riffs to Gunshots

Filed under: Indexing,Similarity,Sound — Patrick Durusau @ 6:37 pm

Sound, Digested: New Software Tool Provides Unprecedented Searches of Sound, from Musical Riffs to Gunshots

From the post:

Audio engineers have developed a novel artificial intelligence system for understanding and indexing sound, a unique tool for both finding and matching previously un-labeled audio files.

Having concluded beta testing with one of the world’s largest Hollywood sound studios and leading media streaming and hosting services, Imagine Research of San Francisco, Calif., is now releasing MediaMined™ (http://www.imagine-research.com/node/51) for applications ranging from music composition to healthcare.

….

One of the key innovations of the new technology is the ability to perform sound-similarity searches. Now, when a musician wants a track with a matching feel to mix into a song, or an audio engineer wants a slightly different sound effect to work into a film, the process can be as simple as uploading an example file and browsing the detected matches.

“There are many tools to analyze and index sound, but the novel, machine-learning approach of MediaMined™ was one reason we felt the technology could prove important,” says Errol Arkilic, the NSF program director who helped oversee the Imagine Research grants. “The software enables users to go beyond finding unique objects, allowing similarity searches–free of the burden of keywords–that generate previously hidden connections and potentially present entirely new applications.”

Or from the Imagine Research Applications page:

Organize Sound

Automatically index the acoustic content of video, audio, and live streams across a companies web services. Analyze web-crawled data, user-generated content, professional broadcast content, and live streaming events.

Benefits:

  • Millions of minutes of content are now searchable
  • Recommending related content increases audience and viewer consumption
  • Better content discovery, intelligent navigation within media files
  • Search audio content with ease and accuracy
  • Audio content-aware targeted ads – improves ad performance and revenue

Search Sound

Perform sound-similarity searches for sounds and music by using example sounds.
Search for production music that matches a given track.
Perform the rhythmic similarity searches

Benefits:

  • Recommending related content increases audience and viewer consumption
  • Music/Audio licensing portals provide a unique-selling point: find content based on an input seed track.
  • Improved monetization of existing content with similarity-search and recommendations

And if you have a topic map of music, producers, studios, albums, etc., this could supplement your topic map based on similarity measures every time a new recording is released, or music is uploaded to a website or posted to YouTube. So you know who to contact for whatever reason.

A topic map of videos could serve up matches and links with thumbnails for videos with similar sound content based on a submitted sample, a variation as it were on “more of same.”

A topic map of live news feeds could detect repetition of news stories and with timing information could map the copying of content from one network to the next. Or provide indexing of news accounts without the necessity of actually sitting through the broadcasts. That is an advantage not mentioned above.

Sound recognition isn’t my area so if this is old news or there are alternatives to suggest to topic mappers, please sing out! (Sorry!)

November 4, 2011

Near-real-time readers with Lucene’s SearcherManager and NRTManager

Filed under: Indexing,Lucene,Software — Patrick Durusau @ 6:11 pm

Near-real-time readers with Lucene’s SearcherManager and NRTManager

From the post:

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that’s actively changing the index (i.e., it’s in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice “best of both worlds” blend between the traditional immediate and eventual consistency models.

Getting into the hardcore parts of Lucene!

Understanding Lucene (or a similar indexing engine) is critical both to mining data and to delivering topic map based information to users.
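
As a reminder of the pattern the post describes, here is a hedged sketch of the acquire/release discipline. The SearcherManager construction is omitted, and the refresh method name varies by release (3.5.0 used maybeReopen(), later versions maybeRefresh()), so check your version’s javadoc before copying:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

/** Hedged sketch of the acquire/release pattern the post describes. The
 *  construction of the SearcherManager, and whether the refresh call is
 *  maybeRefresh() or (as in 3.5.0) maybeReopen(), vary by Lucene version,
 *  so those details are assumptions; the try/finally discipline is the point. */
public class NrtSearchSketch {

    /** Run one operation against whatever reader is current, without blocking reopens. */
    static int countDocs(SearcherManager manager) throws IOException {
        IndexSearcher searcher = manager.acquire();    // ref-counted, thread-safe handle
        try {
            return searcher.getIndexReader().numDocs();
        } finally {
            manager.release(searcher);                 // never close the searcher yourself
        }
    }

    /** Call periodically from a background thread: visibility of changes is decoupled
     *  from how often you call IndexWriter.commit() for durability. */
    static void refresh(SearcherManager manager) throws IOException {
        manager.maybeRefresh();                        // named maybeReopen() in the 3.5.0 API
    }
}
```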

November 2, 2011

Processing & Twitter

Filed under: Indexing,Processing — Patrick Durusau @ 6:24 pm

Processing & Twitter

From the post:

** Since I first released this tutorial in 2009, it has received thousands of views and has hopefully helped some of you get started with building projects incorporating Twitter with Processing. In late 2010, Twitter changed the way that authorization works, so I’ve updated the tutorial to get it inline with the new Twitter API functionality.

Accessing information from the Twitter API with Processing is (reasonably) easy. A few people have sent me e-mails asking how it all works, so I thought I’d write a very quick tutorial to get everyone up on their feet.

We don’t need to know too much about how the Twitter API functions, because someone has put together a very useful Java library to do all of the dirty work for us. It’s called twitter4j, and you can download it here. We’ll be using this in the first step of the building section of this tutorial.

Visualizing Twitter messages with Processing (graphics language) is good practice for any type of streaming data.

I really don’t understand the attraction of word clouds but know that many people like them. What I think would be cool is something that looks like a traditional index for browsing, where the words darken based on their frequency in the stream of material and perhaps even have see-also entries. Imagine that with a feed from CiteSeer or the ACM.
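
One way to prototype that idea, assuming the tweets have already been collected (the names and shade mapping are mine): count term frequencies and turn each count into a 0-255 darkness value a Processing sketch could use when drawing the index.

```java
import java.util.*;

/** One way to prototype the "index where words darken by frequency" idea:
 *  count terms from a stream of messages and map each count to a darkness
 *  value (255 = darkest) that a Processing sketch could use when drawing. */
public class FrequencyShadedIndex {
    public static void main(String[] args) {
        List<String> messages = List.of(
                "topic maps and merging", "merging subject identifiers",
                "topic maps everywhere", "indexing topic maps");

        Map<String, Integer> counts = new TreeMap<>();          // TreeMap keeps alphabetical, index-like order
        for (String msg : messages) {
            for (String word : msg.toLowerCase().split("\\W+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        int max = Collections.max(counts.values());
        counts.forEach((word, n) -> {
            int shade = (int) Math.round(255.0 * n / max);      // darker = more frequent
            System.out.printf("%-12s count=%d shade=%d%n", word, n, shade);
        });
    }
}
```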

October 30, 2011

Experimenting with Real-Time Search using Sphinx

Filed under: Indexing,Sphinx — Patrick Durusau @ 7:04 pm

Experimenting with Real-Time Search using Sphinx by Jeremy Zawodny.

From the post:

In the last few weeks I’ve been experimenting with using real-time indexes in Sphinx to allow full-text searching of data recently added to craigslist. While this posting is not intended to be a comprehensive guide to what I’ve implemented (that may come later in the form of a few posts and/or a conference talk) or a tutorial for doing it yourself, I have had a few questions about it and learned a few things along the way.

Will help you evaluate Sphinx for your indexing needs.

October 29, 2011

FM-Indexes and Backwards Search

Filed under: FM-Indexes,Indexing,Wavelet Trees — Patrick Durusau @ 7:27 pm

FM-Indexes and Backwards Search

From the post:

Last time (way back in June! I have got to start blogging consistently again) I discussed a gorgeous data structure called the Wavelet Tree. When a Wavelet Tree is stored using RRR sequences, it can answer rank and select operations in O(log A) time, where A is the size of the alphabet. If the size of the alphabet is 2, we could just use RRR by itself, which answers rank and select in O(1) time for binary strings. RRR also compresses the binary strings, and hence compresses a Wavelet Tree which is stored using RRR.

So far so good, but I suspect rank and select queries seem to be of limited use right now (although once you are familiar with the available structures, applications show up often). One of the neatest uses of rank that I’ve seen is in substring search, which is certainly a wide reaching problem (for a very recent application to genome assembly, see Jared Simpson’s paper from 2010 called Efficient construction of an assembly string graph using the FM-index)

Also, my apologies but I am using 1-basing instead of 0-basing, because that is how I did my diagrams a year ago 🙂 (and bear with my lack of nicely typeset math, I am migrating to GitHub pages where I will be allowed to use Mathjax soon)

If you are interested in stretching your indexing muscles a bit, this post will get them limbered up!
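
As a warm-up, here is a naive rank sketch using the 1-based convention from the post: precomputed per-block counts plus a short in-block scan. RRR achieves the same interface in O(1) time with compression, but the queries look the same from the outside:

```java
/** Warm-up sketch of rank on a bit string: rank1(i) = number of 1s in
 *  positions 1..i (1-based, matching the post). Precomputing counts per
 *  fixed-size block gives O(block) queries; RRR gets this to O(1) with
 *  compression, but the interface is the same. */
public class RankSketch {
    final boolean[] bits;
    final int block = 4;
    final int[] blockRank;           // number of 1s strictly before each block

    RankSketch(String bitString) {
        bits = new boolean[bitString.length() + 1];             // index 0 unused (1-based)
        for (int i = 1; i <= bitString.length(); i++) bits[i] = bitString.charAt(i - 1) == '1';
        blockRank = new int[bits.length / block + 2];
        int ones = 0;
        for (int i = 1; i < bits.length; i++) {
            if ((i - 1) % block == 0) blockRank[(i - 1) / block] = ones;
            if (bits[i]) ones++;
        }
    }

    int rank1(int i) {
        int b = (i - 1) / block;
        int ones = blockRank[b];                                // jump over whole blocks
        for (int j = b * block + 1; j <= i; j++) if (bits[j]) ones++;  // scan inside the block
        return ones;
    }

    public static void main(String[] args) {
        RankSketch r = new RankSketch("1011001110");
        System.out.println(r.rank1(6));   // prints 3: the 1s in positions 1..6
    }
}
```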

October 22, 2011

How to clone Wikipedia and index it with Solr

Filed under: Indexing,Solr — Patrick Durusau @ 3:17 pm

How to clone Wikipedia and index it with Solr

Looks like it is going to be a Wikipedia sorta day! 😉 Seriously, Wikipedia is increasing in importance with every new page or edit. Not to mention that the cited posts will give you experience with a variety of approaches and tools to dealing with Wikipedia as a data set.

From the post:

A major milestone for ZimZaz: I have (finally) successfully cloned Wikipedia and indexed it with Solr. It took about six weeks in calendar time and felt like a lot more. If I had made no mistakes and had to learn nothing, I could have done it in less than a business week. In the spirit of documenting my work and helping others, here are the key steps along the way.

Kudos to the author for documenting what went wrong and what went right!

October 21, 2011

Using MongoDB in Anger

Filed under: Database,Indexing,MongoDB,NoSQL — Patrick Durusau @ 7:26 pm

Using MongoDB in Anger

Tips on building high performance applications with MongoDB.

Covers four topics:

  • Schema design
  • Indexing
  • Concurrency
  • Durability

Excellent presentation!

One of the first presentations I have seen that recommends a book about another product. Well, High Performance MySQL and MongoDB in Action.

October 16, 2011

CENDI Agency Indexing System Descriptions: A Baseline Report

Filed under: Government Data,Indexing,Thesaurus — Patrick Durusau @ 4:08 pm

CENDI Agency Indexing System Descriptions: A Baseline Report (1998)

In some ways a bit dated but also a snap-shot in time of the indexing practices of the:

  • National Technical Information Service (NTIS),
  • Department of Energy, Office of Scientific and Technical Information (DOE OSTI),
  • US Geological Survey/Biological Resources Division (USGS/BRD),
  • National Aeronautics and Space Administration, STI Program (NASA),
  • National Library of Medicine/National Institutes of Health (NLM),
  • National Air Intelligence Center (NAIC),
  • Defense Technical Information Center (DTIC).

The summary reads:

Software/technology identification for automatic support to indexing. As the resources for providing human indexing become more precious, agencies are looking for technology support. DTIC, NASA, and NAIC already have systems in place to supply candidate terms. New systems are under development and are being tested at NAIC and NLM. The aim of these systems is to decrease the burden of work borne by indexers.

Training and personnel issues related to combining cataloging and indexing functions. DTIC and NASA have combined the indexing and cataloging functions. This reduces the paper handling and the number of “stations” in the workflow. The need for a separate cataloging function decreases with the advent of EDMS systems and the scanning of documents with some automatic generation of cataloging information based on this scanning. However, the merger of these two diverse functions has been a challenge, particularly given the difference in skill level of the incumbents.

Thesaurus maintenance software. Thesaurus management software is key to the successful development and maintenance of controlled vocabularies. NASA has rewritten its system internally for a client/server environment. DTIC has replaced its systems with a commercial-off-the-shelf product. NTIS and USGS/BRD are interested in obtaining software that would support development of more structured vocabularies.

Linked or multi-domain thesauri. Both NTIS and USGS/BRD are interested in this approach. NTIS has been using separate thesauri for the main topics of the document. USGS/BRD is developing a controlled vocabulary to support metadata creation and searching but does not want to develop a vocabulary from scratch. In both cases, there is concern about the resources for development and maintenance of an agency-specific thesaurus. Being able to link to multiple thesauri that are maintained by their individual “owners” would reduce the investment and development time.

Full-text search engines and human indexing requirements. It is clear that the explosion of information on the web (both relevant web sites and web-published documents) cannot be indexed in the old way. There are not enough resources; yet, the chaos of the web begs for more subject organization. The view of current full-text search engines is that the users often miss relevant documents and retrieve a lot of “noise”. The future of web searching is unclear and the demands or requirements that it might place on indexing are unknown.

Quality Control in a production environment. As resources decrease and timeliness becomes more important, there are fewer resources available for quality control of the records. The aim is to build the quality in at the beginning, when the documents are being indexed, rather than add review cycles. However, it is difficult to maintain quality in this environment.

Training time. The agencies face indexer turnover and the need to produce at ever-increasing rates. Training time has been shortened over the years. There is a need to determine how to make shorter training periods more effective.

Indexing systems designed for new environments, especially distributed indexing. An alternative to centralized indexers is a more distributed environment that can take advantage of cottage labor and contract employees. However, this puts increasing demands on the indexing system. It must be remotely accessible, yet secure. It must provide equivalent levels of validation and up-front quality control.

Major project: Update this report, focusing on the issues listed in the summary.

Large Data Sets and Ever Changing Terminology

Filed under: BigData,Indexing — Patrick Durusau @ 4:08 pm

Large Data Sets and Ever Changing Terminology from TaxoDiary.

From the post:

Indexing enables accurate, consistent retrieval to the full depth and breadth of the collection. This does not mean that the statistics-based systems the government loves so much will go away, but they are learning to embrace the addition of taxonomy terms as indexing.

To answer your question, relevant metadata, tagging, normalization of entity references and similar indexing functions just make it easier to allow a person to locate what’s needed. Easy to say and very hard to do.

Search is like having to stand in a long line waiting to order a cold drink on a hot day. So there will always be dissatisfaction because “search” stands between you and what you want. You want the drink but hate the line. That said, I think the reason controlled indexing (taxonomy or thesaurus) is so popular compared to the free ranging keywords is that they have control. They make moving through the line efficient. You know how long the wait is and what terms you need to achieve the result.

I like the “…cold drink on a hot day” comparison. Goes on to point out the problems created by “ever changing terminology.” Which isn’t something that is going to stop happening. People have been inventing new terminologies probably as long as we have been able to communicate. However poorly we do that.

The post does advocate use of MAI (machine assisted indexing) from the author’s company but the advantages of indexing ring true whatever you use to achieve that result.

I do think the author should get kudos for pointing out that indexing is a hard problem. No magic cures, no syntax to save everyone, and no static solutions. As domains change, so do indexes. It is just as simple as that.

October 5, 2011

The concept of “aboutness” in subject indexing

Filed under: Indexing,Information Overload — Patrick Durusau @ 6:49 pm

I just finished reading a delightful paper by W. J. Hutchins, ‘The concept of “aboutness” in subject indexing,’ which was presented at a Colloquium on Aboutness held by the Co-ordinate Indexing Group, 18 April 1977, and was reprinted in Readings in Information Retrieval, edited by Karen Sparck Jones and Peter Willett, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1997.

I discovered the paper in a hard copy of Readings in Information Retrieval, but it is also online, The concept of “aboutness” in subject indexing.

Hutchins writes in his abstract:

The common view of the ‘aboutness’ of documents is that the index entries (or classifications) assigned to documents represent or indicate in some way the total contents of documents; indexing and classifying are seen as processes involving the ‘summarization’ of the texts of documents. In this paper an alternative concept of ‘aboutness’ is proposed based on an analysis of the linguistic organization of texts, which is felt to be more appropriate in many indexing environments (particularly in non-specialized libraries and information services) and which has implications for the evaluation of the effectiveness of indexing systems.

You can read the details of how he suggests discovering the “aboutness” of documents but I was struck by his observation that the ‘summarization’ practice furthers the end of exhaustive search. Under Objectives of indexing, Hutchins says:

In the context of the special library and similarly specialized information services, the ‘summarization’ approach to subject indexing is most appropriate. Indexers are generally able to define clearly the interests and levels of knowledge of the readers they are serving; they are thus able to produce ‘summaries’ biased in the most helpful directions for their readers. More importantly, indexers can normally assume that most users are already very knowledgeable on most of the topics they look for in the indexes provided. They can assume that the usual search is for references to all documents treating a particular topic, since any one may have something ‘new’ to say about it that the reader did not know before. The fact that some references will lead users to texts which tell them nothing they did not previously know should not normally worry them unduly—it is the penalty they expect to pay for the assurance that the search has been as exhaustive as feasible.

Exhaustive search is one type of search that drives tests for the success of indexing:

The now traditional parameters of ‘recall’, ‘precision’ and ‘fallout’ are clearly valid for systems in which success is measured in terms of the ability to retrieve all documents which have something to say on a particular topic—that is to say, in systems based on the ‘summarization’ approach.*

You could say that full-text indexing/searching is different from ‘summarization’ by a professional indexer, but is it? Or have we simply substituted non-professional indexers into the process?

With ‘summarization,’ a professional indexer chooses terms that represent the content of a document. With full-text searching, the terms chosen on an ad-hoc basis by a user come to represent a ‘summary’ of entire documents. And in both cases, all the documents so summarized are returned to the user, in other words, the search is exhaustive.

Google/Bing/Yahoo! searches are examples of exhaustive searches of little value. I can find two or three thousand (2000-3000) new pages of material relevant to topic map issues every day. Can you say information overload?

Or is that information volume overload? Out of the two or three thousand (2000-3000) pages per day, probably more like fifty to one hundred (50-100) are worth my attention. That is what “old-style” indexing brought to the professional researcher.
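
For concreteness, the “traditional parameters” Hutchins mentions reduce to simple set arithmetic; a minimal sketch with invented numbers:

```java
import java.util.Set;

/** The "traditional parameters" Hutchins mentions, as plain set arithmetic:
 *  recall = relevant retrieved / all relevant, precision = relevant retrieved /
 *  all retrieved, fallout = non-relevant retrieved / all non-relevant. */
public class RetrievalMeasures {
    public static void main(String[] args) {
        Set<Integer> relevant  = Set.of(1, 2, 3, 4, 5);            // documents actually about the topic
        Set<Integer> retrieved = Set.of(3, 4, 5, 6, 7, 8);         // documents the search returned
        int collectionSize = 100;

        long relevantRetrieved = retrieved.stream().filter(relevant::contains).count();
        double recall    = (double) relevantRetrieved / relevant.size();
        double precision = (double) relevantRetrieved / retrieved.size();
        double fallout   = (double) (retrieved.size() - relevantRetrieved)
                           / (collectionSize - relevant.size());

        System.out.printf("recall=%.2f precision=%.2f fallout=%.3f%n", recall, precision, fallout);
        // Exhaustive search drives recall toward 1.0; the cost shows up in precision and fallout.
    }
}
```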

September 27, 2011

A Faster LZ77-Based Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:19 am

A Faster LZ77-Based Index by Travis Gagie and Pawel Gawrychowski.

Abstract:

Suppose we are given an AVL-grammar with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases. Then we can add $\mathcal{O}(z \log \log z)$ words and obtain a compressed self-index for $S$ such that, given a pattern $P[1..m]$, we can list the occurrences of $P$ in $S$ in $\mathcal{O}(m^2 + (m + \mathrm{occ}) \log \log n)$ time.

Not the best abstract I have ever read. At least in terms of attracting the most likely audience to be interested.

I would have started with: “Indexing of genomes, which are 99.9% the same, can be improved in terms of searching, response times and reporting of secondary occurrences.” Then follow with the technical description of the contribution. Don’t make people work for a reason to read the paper.

Any advancement in indexing, but particularly in an area like genomics, is important to topic maps.


Update: See the updated version of this paper: A Faster Grammar-Based Self-Index.

September 26, 2011

Lucene and Solr’s CheckIndex to the Rescue!

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 7:03 pm

Lucene and Solr’s CheckIndex to the Rescue! by Rafał Kuć.

From the post:

While using Lucene and Solr we are used to a very high reliability. However, there may come a day when Solr will inform us that our index is corrupted, and we need to do something about it. Is the only way to repair the index to restore it from the backup or do full indexation? No – there is hope in the form of CheckIndex tool.

What is CheckIndex ?

CheckIndex is a tool available in the Lucene library, which allows you to check the files and create new segments that do not contain problematic entries. This means that this tool, with little loss of data is able to repair a broken index, and thus save us from having to restore the index from the backup (of course if we have it) or do the full indexing of all documents that were stored in Solr.

The question at the end of the article about when the last backup was run isn’t meant to be funny.

When I was training to be a NetWare sysadmin, more than a little while ago, one of the manuals advised that the #1 reason for sysadmins being fired was failure to maintain proper backups. I suspect that is probably still the case. Or at least I hope it is. There really is no excuse for failing to maintain proper backups.

Index external websites with Apache Nutch

Filed under: Indexing,Solr — Patrick Durusau @ 7:02 pm

Index external websites with Apache Nutch by Stefan Sprenger.

Walks through using Apache Nutch with Solr.

Comes at an opportune time because I have a data set (URIs) that I want to explore using a variety of methods. No one of which will be useful for all use cases.

If you need a mapping metaphor, think of it as setting off into unexplored territory and the map (read tool) I am using changes the landscape I will have to explore.

Probably not doing the first instalment this week but either late this week or early next.

Building Distributed Indexing for Solr: MurmurHash3 for Java

Filed under: Indexing,Java,Solr — Patrick Durusau @ 7:01 pm

Building Distributed Indexing for Solr: MurmurHash3 for Java by Yonik Seeley.

From the post:

Background

I needed a really good hash function for the distributed indexing we’re implementing for Solr. Since it will be used for partitioning documents, it needed to be really high quality (well distributed) since we don’t want uneven shards. It also needs to be cross-platform, so a client could calculate this hash value themselves if desired, to predict which node has a given document.

MurmurHash3

MurmurHash3 is one of the top favorite new hash function these days, being both really fast and of high quality. Unfortunately it’s written in C++, and a quick google did not yield any suitable high quality port. So I took 15 minutes (it’s small!) to port the 32 bit version, since it should be faster than the other versions for small keys like document ids. It works in 32 bit chunks and produces a 32 bit hash – more than enough for partitioning documents by hash code.

Something for your Solr friends.
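
To see why a well-distributed, cross-platform hash matters here, a small sketch of the routing decision itself. The hash function below is a placeholder (String.hashCode), not MurmurHash3; the point is that any client that can compute the same hash can predict the shard:

```java
/** Sketch of why a well-distributed 32-bit hash is enough for partitioning:
 *  the document id's hash, reduced modulo the shard count, decides which node
 *  owns the document. The hash function here is a placeholder (String.hashCode),
 *  NOT MurmurHash3; the routing, which any client can reproduce, is the point. */
public class ShardRouter {
    static int shardFor(String docId, int numShards) {
        int hash = docId.hashCode();                 // swap in a MurmurHash3 port for better distribution
        return (hash & 0x7fffffff) % numShards;      // mask the sign bit, then bucket
    }

    public static void main(String[] args) {
        int[] counts = new int[4];
        for (int i = 0; i < 100_000; i++) {
            counts[shardFor("doc-" + i, counts.length)]++;
        }
        for (int s = 0; s < counts.length; s++) {
            System.out.println("shard " + s + ": " + counts[s] + " docs");
        }
    }
}
```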

September 25, 2011

Artificial Intelligence Resources

Filed under: Artificial Intelligence,Indexing,Searching,Semantic Diversity,Topic Maps — Patrick Durusau @ 7:48 pm

Artificial Intelligence Resources

A collection of collections of resources on artificial intelligence. Useful but also illustrates a style of information delivery that has advantages over “search style foraging” and disadvantages as well.

Its biggest advantage over “search style foraging” is that it presents a manageable listing of resources and not several thousand links. Even very dedicated researchers are unlikely to follow more than a few hundred links, and even if you did, some of the material would be outdated by the time you reached it.

Another advantage is that one hopes (I haven’t tried all the links) that the resources have been vetted to some degree, with the superficial and purely advertising sites being filtered out. Results are more “hit” than “miss,” which with search results can be a very mixed bag.

But a manageable list is just that, manageable: the very link you need may have missed the cut-off point. It had to stop somewhere.

And you can’t know the author’s criteria for the listing. Their definition of “algorithm” may be broader or narrower than your own.

In the days of professional indexes, researchers learned a sense for the categories used by indexing services. At least that was a smaller set than the vocabulary range of every author.

How would you use topic maps to bridge the gap between those two solutions?

September 19, 2011

The Joy of Indexing

Filed under: Indexing,MongoDB — Patrick Durusau @ 7:55 pm

The Joy of Indexing by Kyle Banker.

From the post:

We spend quite a lot of time at 10gen supporting MongoDB users. The questions we receive are truly legion but, as you might guess, they tend to overlap. We get frequent queries on sharding, replica sets, and the idiosyncrasies of JavaScript, but the one subject that never fails to appear each day on our mailing list is indexing.

Now, to be clear, I’m not talking about how to create an index. That’s easy. The trouble runs much deeper. It’s knowing how indexes work and having the intuition to create the best indexes for your queries and your data set. Lacking this intuition, your production database will eventually slow to a crawl, you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.

This need not be your fate. You can understand indexing! All that’s required is the right mental model, and over the course of this series, that’s just what I hope to provide.

But caveat emptor: what follows is a thought experiment. To get the most out of this post, you can’t skim it. Read every word. Use your imagination. Think through the quizzes. Do this, and your indexing struggles may soon be no more.

Very useful post and one that anyone starting to create indexes by automated means needs to read.

Curious how readers with a background in indexing feel about the description?

What would you instruct a reader to do differently if they were manually creating an index to this cookbook?

September 14, 2011

Secondary Indexes in Riak

Filed under: Indexing,Riak — Patrick Durusau @ 7:02 pm

Secondary Indexes in Riak

Hey, “…alternate keys, one-to-many relationships, or many-to-many relationships…,” it sounds like they are playing the topic map song!

From the post:

Developers building an application on Riak typically have a love/hate relationship with Riak’s simple key/value-based approach to storing data. It’s great that anyone can grok the basics (3 simple operations, get/put/delete) quickly. It’s convenient that you can store anything imaginable as an object’s value: an integer, a blob of JSON data, an image, an MP3. And the distributed, scalable, failure-tolerant properties that a key/value storage model enables can be a lifesaver depending on your use case.

But things get much less rosy when faced with the challenge of representing alternate keys, one-to-many relationships, or many-to-many relationships in Riak. Historically, Riak has shifted these responsibilities to the application developer. The developer is forced to either find a way to fit their data into a key/value model, or to adopt a polyglot storage strategy, maintaining data in one system and relationships in another.

This adds complexity and technical risk, as the developer is burdened with writing additional bookkeeping code and/or learning and maintaining multiple systems.

That’s why we’re so happy about Secondary Indexes. Secondary Indexes are the first step toward solving these challenges, lifting the burden from the backs of developers, and enabling more complex data modeling in Riak. And the best part is that it ships in our 1.0 release, just a few weeks from now.

September 10, 2011

The Language Problem: Jaguars & The Turing Test

Filed under: Ambiguity,Authoring Topic Maps,Indexing,Language — Patrick Durusau @ 6:10 pm

The Language Problem: Jaguars & The Turing Test by Gord Hotchkiss.

The post begins innocently enough:

“I love Jaguars!”

When I ask you to understand that sentence, I’m requiring you to take on a pretty significant undertaking, although you do it hundreds of times each day without really thinking about it.

The problem comes with the ambiguity of words.

If you appreciate discussions of language, meaning and the shortcomings of our computing companions, you will really like this article and the promised follow-up posts.

Not to mention bringing into sharp relief the issues that topic map authors (or indexers) face when trying to specify a subject that will be recognized and used by N unknown users.

I suppose that is really the tricky part, or at least part of it: the communication channel for an index or topic map is only one way. There is no opportunity for the author to correct a reading/mis-reading. All that lies with the user/reader alone.

September 6, 2011

Berlin Buzzwords 2011 – Slides/Videos

Filed under: Indexing,NoSQL — Patrick Durusau @ 6:59 pm

Berlin Buzzwords 2011 – Slides/Videos

I listed the slides and presentations together and sorted the listing by author. A number of very good presentations.

BTW, congratulations to the organizers of Berlin Buzzwords! Truly awesome gathering of talent.

I created this listing to assist myself in mining the presentations. Please forward any corrections. Enjoy!

August 23, 2011

Lucene 4.0: The Revolution

Filed under: Indexing,Lucene — Patrick Durusau @ 6:46 pm

Lucene 4.0: The Revolution by Simon Willnauer.

From the post:

The near-limitless innovative potential of a thriving open source community often has to be tempered by the need for a steady roadmap with version compatibility. As a result, once the decision to break backward compatibility in Lucene 4.0 had been made, it opened the floodgates on a host of step changes, which, together, will deliver a product whose performance is unrecognisable from previous 3.x releases.

One of the most significant changes in Lucene 4.0 is the full switch to using bytes (UTF8) in place of text strings for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the ‘term dictionary’, used as a core part of the index, can now be loaded up to 30 times faster; it uses 10% of the memory; and search speeds are increased by removing the need for string conversion.

This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is ‘flexible indexing’. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields.

The performance enhancements through flexible indexing are highly case specific. However, flexible indexing introduces an entirely new dimension to the Lucene project. New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures. A new field of specialised codec development can take place independently from development of the Lucene kernel.

Looks like the time to be learning new features of Lucene 4.0 is now!

Flexible indexing! That sounds very cool.

August 22, 2011

Finite State Transducers in Lucene

Filed under: Indexing,Software — Patrick Durusau @ 7:42 pm

I found part 1 of this series at DZone, but there was no reference to part 2. I found part 2 by tracing the article back to its original blog post, where it was followed by part 2.

Using Finite State Transducers in Lucene

Finite State Transducers, Part 2

I won’t try to summarize the posts; they are short and heavy on links to more material. But I will quote this comment from the second article:

To test this, I indexed the first 10 million 1KB documents derived from Wikipedia’s English database download. The resulting RAM required for the FST was ~38% – 52% smaller (larger segments see more gains, as the FST “scales up” well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.

If using less RAM and faster lookups are of interest to you, these posts should be on your reading list.

August 16, 2011

Distributional Semantics

Filed under: Distributional Semantics,Indexing,Indirect Inference,Random Indexing — Patrick Durusau @ 7:06 pm

Distributional Semantics.

Trevor Cohen (co-author, with Roger Schvaneveldt and Dominic Widdows, of Reflective Random Indexing and indirect inference…) has a page on distributional semantics which starts with:

Empirical Distributional Semantics is an emerging discipline that is primarily concerned with the derivation of semantics (or meaning) from the distribution of terms in natural language text. My research in DS is concerned primarily with spatial models of meaning, in which terms are projected into high-dimensional semantic space, and an estimate of their semantic relatedness is derived from the distance between them in this space.

The relations derived by these models have many useful applications in biomedicine and beyond. A particularly interesting property of distributional semantics models is their capacity to recognize connections between terms that do not occur together in the same document, as this has implications for knowledge discovery. In many instances it is possible also to reveal a plausible pathway linking these terms by using the distances estimated by distributional semantic models to generate a network representation, and using Pathfinder networks (PFNETS) to reveal the most significant links in this network, as shown in the example below:

Links to projects, software and other cool stuff! Making a separate post on one of his software libraries.
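
A minimal sketch of the geometric idea in the quote, using a three-document toy corpus and cosine similarity; real distributional models add dimensionality reduction (e.g., random indexing), but the distance-in-space reasoning is the same. Note the non-zero similarity between two terms that never co-occur in the same document, which is the indirect-inference angle:

```java
import java.util.*;

/** Minimal sketch of the "terms as points in space" idea: build a term-by-term
 *  co-occurrence vector for each word from a toy corpus and compare words by
 *  cosine similarity. Real distributional models use dimensionality reduction
 *  (e.g. random indexing), but the geometry is the same. */
public class TermSpaceSketch {
    public static void main(String[] args) {
        List<String[]> docs = List.of(
                "aspirin reduces fever".split(" "),
                "ibuprofen reduces fever".split(" "),
                "aspirin thins blood".split(" "));

        // Count co-occurrence of each word with every other word in the same document.
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (String[] doc : docs) {
            for (String w : doc) {
                for (String c : doc) {
                    if (!w.equals(c)) {
                        vectors.computeIfAbsent(w, k -> new HashMap<>()).merge(c, 1, Integer::sum);
                    }
                }
            }
        }

        System.out.println("cosine(aspirin, ibuprofen) = "
                + cosine(vectors.get("aspirin"), vectors.get("ibuprofen")));
        // Non-zero although the two terms never co-occur in the same document:
        // they share the context words "reduces" and "fever".
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na  += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```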

