Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 28, 2012

Flexible Indexing in Hadoop

Filed under: Hadoop,Indexing — Patrick Durusau @ 6:31 pm

Flexible Indexing in Hadoop by Dmitriy Ryaboy.

Summarized by Russell Jurney as:

There was much excitement about Dmitriy Ryaboy’s talk about Flexible Indexing in Hadoop (slides available). Twitter has created a novel indexing system atop Hadoop to avoid “Looking for needles in haystacks with snowplows,” or – using mapreduce over lots of data to pick out a few records. Twitter Analytics’s new tool, Elephant Twin goes beyond folder/subfolder partitioning schemes used by many, for instance bucketizing data by /year/month/week/day/hour. Elephant Twin is a framework for creating indexes in Hadoop using Lucene. This enables you to push filtering down into Lucene, to return a few records and to dramatically reduce the records streamed and the time spent on jobs that only parse a small subset of your overall data. A huge boon for the Hadoop Community from Twitter!

The slides plus a slide-by-slide transcript of the presentation are available.

This goes in the opposite direction from some national security efforts, which create bigger haystacks for the sake of having bigger haystacks.

There are a number of legitimately large haystacks in medicine, physics, astronomy, chemistry and any number of other disciplines. Grabbing all phone traffic to avoid saying you chose fewer than 5,000 potential subjects of interest is just bad planning.

June 22, 2012

Virtual Documents: “Search the Impossible Search”

Filed under: Indexing,Search Data,Search Engines,Virtual Documents — Patrick Durusau @ 4:17 pm

Virtual Documents: “Search the Impossible Search”

From the post:

The solution was to build an indexing pipeline specifically to address this user requirement, by creating “virtual documents” about each member of staff. In this case, we used the Aspire content processing framework as it provided a lot more flexibility than the indexing pipeline of the incumbent search engine, and many of the components that were needed already existed in Aspire’s component library.

[graphic omitted]

Merging was done selectively. For example, documents were identified that had been authored by the staff member concerned and from those documents, certain entities were extracted including customer names, dates and specific industry jargon. The information captured was kept in fields, and so could be searched in isolation if necessary.

The result was a new class of documents, which existed only in the search engine index, containing extended information about each member of staff; from basic data such as their billing rate, location, current availability and professional qualifications, through to a range of important concepts and keywords which described their previous work, and customer and industry sector knowledge.
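To make the pattern concrete, here is a minimal Lucene-flavored sketch of building a “virtual document” for one staff member. The post’s project used the Aspire framework rather than raw Lucene, and the field names below are my invention:

import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class VirtualStaffDocument {
  // Fold values extracted from the documents a staff member authored into one
  // "virtual" document that exists only in the index, never in the source collection.
  static void indexStaffMember(IndexWriter writer, String staffId, String billingRate,
                               List<String> customers, List<String> jargon) throws Exception {
    Document virtual = new Document();
    virtual.add(new StringField("staff_id", staffId, Field.Store.YES));
    virtual.add(new StringField("billing_rate", billingRate, Field.Store.YES));
    for (String customer : customers) virtual.add(new TextField("customer", customer, Field.Store.YES));
    for (String term : jargon)        virtual.add(new TextField("jargon", term, Field.Store.NO));
    writer.addDocument(virtual);   // searchable in isolation, field by field, as the post describes
  }
}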

Another tool to put in your belt, but I wonder if there is a deeper lesson to be learned here.

Creating a “virtual” document unlike any that existed in the target collection, and indexing those “virtual” documents, was a clever solution.

But it retains the notion of a “container” or “document” that is examined in isolation from all other “documents.”

Is that necessary? What are we missing if we retain it?

I don’t have any answers to those questions but will be thinking about them.

Comments/suggestions?

June 16, 2012

Wavelet Trees for All

Filed under: Graphs,Indexing,Wavelet Trees — Patrick Durusau @ 1:51 pm

Wavelet Trees for All (free version) (official publication reference)

Abstract:

The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, full-text indexes, XML indexes, and general numeric sequences.

Good survey article, but it can be tough sledding depending on your math skills. Fortunately the paper covers enough uses, and has references to enough freely available applications of the technique, that I am sure you will find one that deepens your understanding of wavelet trees.
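If a concrete toehold helps before tackling the survey, here is a toy wavelet tree supporting rank queries, that is, how many times a symbol occurs in a prefix of the sequence. It uses naive bit counting where real implementations use low-overhead rank structures, and it is only meant to make the recursion visible:

class WaveletTree {
  private final char lo, hi;          // alphabet range covered by this node
  private final boolean[] bits;       // false = symbol routed left, true = routed right
  private final WaveletTree left, right;

  WaveletTree(char[] s, char lo, char hi) {
    this.lo = lo;
    this.hi = hi;
    if (lo == hi) { bits = null; left = null; right = null; return; }
    char mid = (char) ((lo + hi) / 2);
    bits = new boolean[s.length];
    StringBuilder l = new StringBuilder(), r = new StringBuilder();
    for (int i = 0; i < s.length; i++) {
      if (s[i] <= mid) l.append(s[i]); else { bits[i] = true; r.append(s[i]); }
    }
    left = new WaveletTree(l.toString().toCharArray(), lo, mid);
    right = new WaveletTree(r.toString().toCharArray(), (char) (mid + 1), hi);
  }

  // rank(c, i) = number of occurrences of c among the first i symbols of the sequence.
  int rank(char c, int i) {
    if (lo == hi) return i;                              // leaf: every position holds c
    char mid = (char) ((lo + hi) / 2);
    int ones = 0;
    for (int k = 0; k < i; k++) if (bits[k]) ones++;     // naive bit rank; real trees use compact rank structures
    return c <= mid ? left.rank(c, i - ones) : right.rank(c, ones);
  }

  public static void main(String[] args) {
    WaveletTree wt = new WaveletTree("abracadabra".toCharArray(), 'a', 'r');
    System.out.println(wt.rank('a', 11));   // 5: 'a' occurs five times in "abracadabra"
    System.out.println(wt.rank('b', 7));    // 1: 'b' occurs once among the first seven symbols
  }
}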

June 12, 2012

Dreams of Universality, Reality of Interdisciplinarity [Indexing/Mapping Pidgin]

Filed under: Complexity,Indexing,Mapping — Patrick Durusau @ 12:55 pm

Complex Systems Science: Dreams of Universality, Reality of Interdisciplinarity by Sebastian Grauwin, Guillaume Beslon, Eric Fleury, Sara Franceschelli, Jean-Baptiste Rouquier, and Pablo Jensen.

Abstract:

Using a large database (~ 215 000 records) of relevant articles, we empirically study the “complex systems” field and its claims to find universal principles applying to systems in general. The study of references shared by the papers allows us to obtain a global point of view on the structure of this highly interdisciplinary field. We show that its overall coherence does not arise from a universal theory but instead from computational techniques and fruitful adaptations of the idea of self-organization to specific systems. We also find that communication between different disciplines goes through specific “trading zones”, ie sub-communities that create an interface around specific tools (a DNA microchip) or concepts (a network).

If disciplines don’t understand each other…:

Where do the links come from then? In an illuminating analogy, Peter Galison [32] compares the difficulty of connecting scientific disciplines to the difficulty of communicating between different languages. History of language has shown that when two cultures are strongly motivated to communicate – generally for commercial reasons – they develop simplified languages that allow for simple forms of interaction. At first, a “foreigner talk” develops, which becomes a “pidgin” when social uses consolidate this language. In rare cases, the “trading zone” stabilizes and the expanded pidgin becomes a creole, initiating the development of an original, autonomous culture. Analogously, biologists may create a simplified and partial version of their discipline for interested physicists, which may develop to a full-blown new discipline such as biophysics. Specifically, Galison has studied [32] how Monte Carlo simulations developed in the postwar period as a trading language between theorists, experimentalists, instrument makers, chemists and mechanical engineers. Our interest in the concept of a trading zone is to allow us to explore the dynamics of the interdisciplinary interaction instead of ending analysis by reference to a “symbiosis” or “collaboration”.

My interest is in how to leverage “trading zones” for the purpose of indexing and mapping (as in topic maps).

Noting that “trading zones” are subject to empirical discovery and no doubt change over time.

Discovering and capitalizing on such “trading zones” will be a real value-add for users.

June 10, 2012

Citizen Archivist Dashboard [“…help the next person discover that record”]

Filed under: Archives,Crowd Sourcing,Indexing,Tagging — Patrick Durusau @ 8:15 pm

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

  • Tag – Tagging is a fun and easy way for you to help make National Archives records easier to find online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow “Tag It Tuesdays,” a weekly feature on the NARAtions Blog. [includes “missions” (sets of materials for tagging), rated as “beginner,” “intermediate,” and “advanced.” Or you can create your own mission.]
  • Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

    The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

    [A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

  • Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

    Here are just a few of the ways you may want to participate:

    • Create new pages and edit pre-existing pages
    • Share your research tips
    • Store useful information discovered during research
    • Expand upon a description in our online catalog

    Check out the “Getting Started” page. When you’re ready to edit, you’ll need to log in by creating a username and password.

  • Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

    Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

  • Index the Census – Citizen Archivists, you can help index the 1940 census!

    The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

    You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.

May 27, 2012

The Seven Deadly Sins of Solr

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 3:51 pm

The Seven Deadly Sins of Solr by Jay Hill.

From the post:

Working at Lucid Imagination gives me the opportunity to analyze and evaluate a great many instances of Solr implementations, running in some of the largest Fortune 500 companies as well as some of the smallest start-ups. This experience has enabled me to identify many common mistakes and pitfalls that occur, either when starting out with a new Solr implementation, or by not keeping up with the latest improvements and changes. Thanks to my colleague Simon Rosenthal for suggesting the title, and to Simon, Lance Norskog, and Tom Hill for helpful input and suggestions. So, without further ado…the Seven Deadly Sins of Solr.

Not recent, and to some degree Solr-specific.

You will encounter one or more of these “sins” with every IT solution, including topic maps.

This should be fancy printed, laminated and handed out as swag.

May 20, 2012

1940 US Census Indexing Progress Report—May 18, 2012

Filed under: Census Data,Indexing — Patrick Durusau @ 6:49 pm

1940 US Census Indexing Progress Report—May 18, 2012

From the post:

We’re finishing our 7th week of indexing and we are a breath away from having 40% of the entire collection indexed. I hear from so many people words of amazement at the things this indexing community has accomplished. In 7 weeks we’ve collectively indexed more than 55 million names. It is truly amazing. With 111,612 indexers now signed up to index and arbitrate, we have a formidable team making some great things happen. Let’s keep up the great work.

It is a popular data set, but that isn’t the whole story.

What do you think are the major factors that contribute to their success?

May 16, 2012

Lucene-1622

Filed under: Indexing,Lucene,Synonymy — Patrick Durusau @ 9:32 am

Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622

From the description:

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):

  • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
  • there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
  • if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.

I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.

This remains an open issue as of 16 May 2012.

It is also an important open issue.

Think about it.

As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.

Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.

The treatment of synonyms (synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms for both indexing and search, synonym expansion, a whole range of synonym features and capabilities) needs to “man up” to take on “big data.”
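If you want to experiment while the issue is open, the FST-based synonym support that did make it into Lucene is a reasonable starting point. Here is a minimal sketch of index-time, multi-word expansion; the class names are from a recent Lucene release, this is not the LUCENE-1622 patch, and the mapping is the “big apple” example from the issue:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class MultiWordSynonymAnalyzer extends Analyzer {
  private final SynonymMap synonyms;

  public MultiWordSynonymAnalyzer() throws IOException {
    SynonymMap.Builder builder = new SynonymMap.Builder(true);   // true = deduplicate entries
    // Map the multi-word phrase "big apple" to "new york city", keeping the original tokens.
    CharsRef input  = SynonymMap.Builder.join(new String[] {"big", "apple"}, new CharsRefBuilder());
    CharsRef output = SynonymMap.Builder.join(new String[] {"new", "york", "city"}, new CharsRefBuilder());
    builder.add(input, output, true);
    this.synonyms = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Expansion happens at indexing time, which is exactly the scenario LUCENE-1622 is about.
    TokenStream expanded = new SynonymGraphFilter(source, synonyms, true);
    return new TokenStreamComponents(source, expanded);
  }
}

Note that a sketch like this inherits the position and highlighting problems described in the issue; it only shows where the pieces plug in.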

May 7, 2012

Indexing Reverse Top-k Queries

Filed under: Indexing,Reverse Data Management,Top-k Query Processing — Patrick Durusau @ 7:19 pm

Indexing Reverse Top-k Queries by Sean Chester, Alex Thomo, S. Venkatesh, and Sue Whitesides.

Abstract:

We consider the recently introduced monochromatic reverse top-k queries which ask for, given a new tuple q and a dataset D, all possible top-k queries on D union {q} for which q is in the result. Towards this problem, we focus on designing indexes in two dimensions for repeated (or batch) querying, a novel but practical consideration. We present the insight that by representing the dataset as an arrangement of lines, a critical k-polygon can be identified and used exclusively to respond to reverse top-k queries. We construct an index based on this observation which has guaranteed worst-case query cost that is logarithmic in the size of the k-polygon.

We implement our work and compare it to related approaches, demonstrating that our index is fast in practice. Furthermore, we demonstrate through our experiments that a k-polygon is comprised of a small proportion of the original data, so our index structure consumes little disk space.

This was the article that made me start looking for resources on “reverse data management.”

Interesting in its own right but I commend it to your attention in part because of the recognition that interesting challenges lie ahead with higher-dimensional indexing.

If you think about it, “indexing” in the sense of looking for a simple string token is indexing in one dimension.

If you index a simple string token, but with the further requirement that it appear in JAMA (Journal of the American Medical Association), that is indexing in two dimensions.

If you index a simple string token, appearing in JAMA, but it must also be in an article with the “tag” cancer, now you are indexing in three dimensions.

A human indexer creating an annual index of cancer publications would move along those dimensions with ease.
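In a field-aware search engine the three dimensions become three conjoined clauses. A Lucene-flavored sketch, with invented field names:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ThreeDimensionQuery {
  // One clause per "dimension": the token itself, the journal it appears in, and a tag.
  static Query build() {
    return new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "metastasis")), BooleanClause.Occur.MUST)  // dimension 1: the token
        .add(new TermQuery(new Term("journal", "JAMA")), BooleanClause.Occur.MUST)     // dimension 2: the source
        .add(new TermQuery(new Term("tag", "cancer")), BooleanClause.Occur.MUST)       // dimension 3: the tag
        .build();
  }
}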

Topic maps are an attempt to capture the insight that allows automatic replication of the human indexer’s insight.

May 2, 2012

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform by Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone.

Abstract:

Motivation

The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.

Results

We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.

We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel ‘implicit sorting’ strategy that enables these benefits to be realised without the overhead of sorting the reads. With these techniques, a 45x coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).

This is more than 4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.

Important work for several reasons.

First, if the human genome is thought of as “big data,” it opens the possibility that compressed full text indexes can be built for other instances of “big data.”

Second, indexing is similar to topic mapping in the sense that pointers to information about a particular subject are gathered to a common location. Indexes often account for synonyms (see also) and distinguish the use of the same word for different subjects (polysemy).

Third, depending on the granularity of tokenizing and indexing, index entries should be capable of recombination to create new index entries.

Source code for this approach:

Code to construct the BWT and SAP-array on large genomic data sets is part of the BEETL library, available as a github repository at git@github.com:BEETL/BEETL.git.
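If you have never computed a BWT, a naive rotation-sort sketch makes the character clustering that drives the compression easy to see. This is strictly toy-scale; the whole point of the paper is computing the transform for huge collections without materializing and sorting rotations:

import java.util.Arrays;

public class NaiveBwt {
  // Build the BWT by sorting all rotations: O(n^2 log n) and toy-scale only.
  static String bwt(String s) {
    String t = s + "$";                                   // sentinel, assumed absent from s
    int n = t.length();
    String[] rotations = new String[n];
    for (int i = 0; i < n; i++) rotations[i] = t.substring(i) + t.substring(0, i);
    Arrays.sort(rotations);
    StringBuilder lastColumn = new StringBuilder();
    for (String r : rotations) lastColumn.append(r.charAt(n - 1));   // BWT = last column of the sorted rotations
    return lastColumn.toString();
  }

  public static void main(String[] args) {
    // Equal characters tend to cluster in the output, which is what the second-stage compressor exploits.
    System.out.println(bwt("GATTACA"));
  }
}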

Comments?

April 9, 2012

Iowa Government Gets a Digital Dictionary Provided By Access

Filed under: Indexing,Law,Legal Informatics,Thesaurus — Patrick Durusau @ 4:32 pm

Iowa Government Gets a Digital Dictionary Provided By Access

Whitney Grace writes:

How did we get by without the invention of the quick search to look up information? We used to use dictionaries, encyclopedias, and a place called the library. Access Innovations, Inc. has brought the Iowa Legislature General Assembly into the twenty-first century.

The write-up “Access Innovations, Inc. Creates Taxonomy for Iowa Code, Administrative Code and Acts” tells us the data management industry leader has built a thesaurus that allows the Legislature to search its library of proposed laws, bills, acts, and regulations. Users can also add their unstructured data to the thesaurus. Access used their Data Harmony software to provide subscription-based delivery and they built the thesaurus on MAIstro.

Sounds very much like a topic map-like project, doesn’t it? I will be following up for more details.

April 8, 2012

Context matters: Search can’t replace a high-quality index

Filed under: eBooks,Indexing,Marketing — Patrick Durusau @ 4:21 pm

Context matters: Search can’t replace a high-quality index

Joe Wikert writes:

I’ve never consulted an index in an ebook. From a digital content point of view, indexes seem to be an unnecessary relic of the print world. The problem with my logic is that I’m thinking of simply dropping a print index into an ebook, and that’s as shortsighted as thinking the future of ebooks in general is nothing more than quick-and-dirty conversions of print books. In this TOC podcast interview, Kevin Broccoli, CEO of BIM Publishing Services, talks about how indexes can and should evolve in the digital world.

Key points from the full video interview (below) include:

  • Why bother with e-indexes? — Searching for raw text strings completely removes context, which is one of the most valuable attributes of a good index. [Discussed at the 1:05 mark.]
  • Index mashups are part of the future — In the digital world you should be able to combine indexes from books on common topics in your library. That’s exactly what IndexMasher sets out to do. [Discussed at 3:37.]
  • Indexes with links — It seems simple but almost nobody is doing it. And as Kevin notes, wouldn’t it be nice for ebook retailers to offer something like this as part of the browsing experience? [Discussed at 6:24.]
  • Index as cross-selling tool — The index mashup could be designed to show live links to content you own but also include entries without links to content in ebooks you don’t own. Those entries could offer a way to quickly buy the other books, right from within the index. [Discussed at 7:28.]
  • Making indexes more dynamic — The entry for “Anderson, Chris” in the “Poke The Box” index on IndexMasher shows a simple step in this direction by integrating a Google and Amazon search into the index. [Discussed at 9:42.]

Apologies, but I left the links to the interview out to encourage you to visit the original. It is really worth your time.

Do these points sound like something a topic map could do? 😉

BTW, I am posting a note to IndexMasher and will advise. Sounds very interesting.

April 1, 2012

Stupid Solr tricks: Introduction (SST #0)

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 7:13 pm

Stupid Solr tricks: Introduction (SST #0)

Bill Dueber writes:

Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.

Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.

Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.

So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.

Definitely a series to watch, or to contribute to, or better yet, to start for your software package of choice!

Flexible Searching with Solr and Sunspot

Filed under: Indexing,Query Expansion,Solr — Patrick Durusau @ 7:13 pm

Flexible Searching with Solr and Sunspot.

Mike Pack writes:

Just about every type of datastore has some form of indexing. A typical relational database, such as MySQL or PostgreSQL, can index fields for efficient querying. Most document databases, like MongoDB, contain indexing as well. Indexing in a relational database is almost always done for one reason: speed. However, sometimes you need more than just speed; you need flexibility. That’s where Solr comes in.

In this article, I want to outline how Solr can benefit your project’s indexing capabilities. I’ll start by introducing indexing and expand to show how Solr can be used within a Rails application.

If you are a Ruby fan (or not), this post is a nice introduction to some of the power of Solr for indexing.

At the same time, it is a poster child for what is inflexible about Solr query expansion.

Mike uses the following example for synonyms/query expansion:

# citi is the stem of cities
citi => city

# copi is the stem of copies
copi => copy

Well, that works, no doubt, if those expansions are uniform across a body of texts. Depending on the size of the collection, that may or may not be the case. What is at issue is the uniformity of the expansion of strings.

We could say:

#cop is a synonym for the police
cop => police

Meanwhile, elsewhere in the collection we need:

#cop is the stem of copulate
cop => copulate

Without more properties to distinguish the two (or more) cases, we are going to get false positives in one case or the other.
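Within Lucene/Solr, the usual partial answer is to scope expansions by field rather than apply them globally. A sketch with PerFieldAnalyzerWrapper, where the field names are invented and the analyzers are stand-ins for whatever synonym chains you would really configure:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class FieldScopedExpansion {
  // Each field gets its own analysis chain, so an expansion such as "cop => police"
  // can be confined to the field where it is actually valid.
  static Analyzer build() {
    Map<String, Analyzer> perField = new HashMap<>();
    perField.put("police_reports", new EnglishAnalyzer());   // placeholder for a chain that expands "cop => police"
    perField.put("biology_notes",  new StandardAnalyzer());  // placeholder for a chain without that expansion
    return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
  }
}

That only helps when the collection is already partitioned along the property that matters; for genuinely mixed text you are back to needing more properties on the tokens themselves.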

March 25, 2012

Lucene Full Text Indexing with Neo4j

Filed under: Indexing,Lucene,Neo4j,Neo4jClient — Patrick Durusau @ 7:15 pm

Lucene Full Text Indexing with Neo4j by Romiko Derbynew.

From the post:

I spent some time working on full text search for Neo4j. The basic goals were as follows.

  • Control the pointers of the index
  • Full Text Search
  • All operations are done via Rest
  • Can create an index when creating a node
  • Can update and index
  • Can check if an index exists
  • When bootstrapping Neo4j in the cloud run Index checks
  • Query Index using full text search lucene query language.

Download:

This is based on Neo4jClient: http://nuget.org/List/Packages/Neo4jClient

Source Code at: http://hg.readify.net/neo4jclient/

Introduction

So with the above objectives, I decided to go with Manual Indexing. The main reason here is that I can put an index pointing to node A based on values in node B.

Imagine the following.

You have Node A with a list:

Surname, FirstName and MiddleName. However Node A also has a relationship to Node B which has other names, perhaps Display Names, Avatar Names and AKA’s.

So with manual indexing, you can have all the above entries for names in Node A and Node B point to Node A only. (emphasis added)

Not quite merging but it is an interesting take on creating a single point of reference.
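In the embedded Java API the same idea looks roughly like the sketch below. It uses the legacy manual-index API rather than the REST calls Romiko works through with Neo4jClient, and the property names are my own:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

public class ManualNameIndex {
  // Index name values taken from node B (the aliases) so that lookups resolve to node A (the person).
  static void indexAliases(GraphDatabaseService db, Node personA, Node aliasesB) {
    try (Transaction tx = db.beginTx()) {
      Index<Node> names = db.index().forNodes("names");                 // manual, Lucene-backed index
      names.add(personA, "name", personA.getProperty("surname"));       // value from A, pointer to A
      names.add(personA, "name", aliasesB.getProperty("displayName"));  // value from B, pointer still to A
      tx.success();
    }
  }
}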

BTW, search for Neo4j while you are at Romiko’s blog. Several very interesting posts and I am sure more are forthcoming.

March 22, 2012

Secondary Indices Have Arrived! (Hypertable)

Filed under: Hypertable,Indexing — Patrick Durusau @ 7:41 pm

Secondary Indices Have Arrived! (Hypertable)

From the post:

Until now, SELECT queries in Hypertable had to include a row key, row prefix or row interval specification in order to be fast. Searching for rows by specifying a cell value or a column qualifier involved a full table scan which resulted in poor performance and scaled badly because queries took longer as the dataset grew. With 0.9.5.6, we’ve implemented secondary indices that will make such SELECT queries lightning fast!

Hypertable supports two kinds of indices: a cell value index and a column qualifier index. This blog post explains what they are, how they work and how to use them.

I am glad to hear about the new indexing features, but how do “cell value indexes” and “column qualifier indexes” differ from secondary indexes, which the PostgreSQL 9.1 documentation describes as:

All indexes in PostgreSQL are what are known technically as secondary indexes; that is, the index is physically separate from the table file that it describes. Each index is stored as its own physical relation and so is described by an entry in the pg_class catalog. The contents of an index are entirely under the control of its index access method. In practice, all index access methods divide indexes into standard-size pages so that they can use the regular storage manager and buffer manager to access the index contents.

It would be helpful in evaluating new features to know when (if?) they are substantially the same as features known in other contexts.

March 14, 2012

New index statistics in Lucene 4.0

Filed under: Indexing,Lucene — Patrick Durusau @ 7:35 pm

New index statistics in Lucene 4.0

Mike McCandless writes:

In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.

Mike uses a simple example to illustrate the statistics available in Lucene 4.0.

Keyword Indexing for Books vs. Webpages

Filed under: Books,Indexing,Keywords,Search Engines — Patrick Durusau @ 7:35 pm

I was watching a lecture on keyword indexing that started off with a demonstration of an index to a book, which was being compared to indexing web pages. The statement was made that the keyword pointed the reader to a page where that keyword could be found, much like a search engine does for a web page.

Leaving aside the more complex roles that indexes for books play, such as giving alternative terms, classifying the nature of the occurrence of the term (definition, mentioned, footnote, etc.), cross-references, etc., I wondered if there is a difference between a page reference in a book index vs. a web page reference by a search engine?

In some 19th century indexes I have used, the page references are followed by a letter of the alphabet, to indicate that the page is divided into sections, sometimes as many as a – h or even higher. Mostly those are complex reference works, dictionaries, lexicons, works of that type, where the information is fairly dense. (Do you know of any modern examples of indexes where pages are divided? A note would be appreciated.)

I have the sense that an index of a book, without sub-dividing a page, is different from an index pointing to a web page. It may be a difference that has never been made explicit but I think it is important.

Some facts about word length on a “page”:

With a short amount of content, average book page length, the user has little difficulty finding an index term on a page. But the longer the web page, the less useful our instinctive (trained?) scan of the page becomes.

In part because part of the page scrolls out of view. As you may know, that doesn’t happen with a print book.

Scanning of a print book is different from scanning of a webpage. How to account for that difference I don’t know.

Before you suggest Ctrl-F, see Do You Ctrl-F?. What was it you were saying about Ctrl-F?

Web pages (or other electronic media) that don’t replicate the fixed display of book pages result in a different indexing experience for the reader.

If a search engine index could point into a page, it would still be different from a traditional index, but it would come closer to one.

(The W3C has steadfastly resisted any effective subpage pointing. See the sad history of XLink/XPointer. You will probably have to ask insiders but it is a well known story.)

BTW, in case you are interested in blog length, see: Bloggers: This Is How Long Your Posts Should Be. Informative and amusing.

March 2, 2012

Indexing Files via Solr and Java MapReduce

Filed under: Cloudera,Indexing,MapReduce,Solr — Patrick Durusau @ 8:04 pm

Indexing Files via Solr and Java MapReduce by Adam Smieszny.

From the post:

Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.

What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to Sematext for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.
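For a sense of where Solr plugs into a mapper, here is a bare-bones sketch using SolrJ. The class names are from a recent SolrJ release, the URL and field names are placeholders, and Adam’s tutorial is the thing to actually follow:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  private SolrClient solr;

  @Override
  protected void setup(Context context) {
    solr = new HttpSolrClient.Builder("http://localhost:8983/solr/shakespeare").build();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    SolrInputDocument doc = new SolrInputDocument();
    doc.setField("id", offset.toString());
    doc.setField("line_txt", line.toString());
    try {
      solr.add(doc);                      // batching and commit strategy omitted for brevity
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try { solr.commit(); solr.close(); } catch (Exception e) { throw new IOException(e); }
  }
}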

Looks like a nice weekend (if you are married, long night if not) project!

If you have the time, look over this post and report back on your experiences.

Particularly if you learn something new or see something others need to know about (such as other resources).

February 10, 2012

Dragsters, Drag Cars & Drag Racing Cars

I still remember the cover of Hot Rod magazine that announced (from memory) “The 6’s are here!” Don “The Snake” Prudhomme had broken the 200 mph barrier in a drag race. Other memories follow on from that one but I mention it to explain my interest in a recent Subject Authority Cooperative Program decision to not have a cross-reference from dragster (the term I would have used) to more recent terms, drag cars or drag racing cars.

The expected search (in this order) due to this decision is:

Cars (Automobiles) -> redirect to Automobiles -> Automobiles -> narrower term -> Automobiles, racing -> narrower term -> Dragsters

Adam L. Schiff, proposer of drag cars & drag racing cars, says below, “This just is not likely to happen.”

Question: Is there a relationship between users “work[ing] their way up and down hierarchies” and display of relationships methods? Who chooses which items will be the starting point to lead to other items? How do you integrate a keyword search into such a system?

Question: And what of the full phrase/sentence AI systems where keywords work less well? How does that work with relationship display systems?

Question: I wonder if the relationship display methods are closer to the up and down hierarchies, but with less guidance?

Adam’s Dragster proposal post in full:

Dragsters

Automobiles has a UF Cars (Automobiles). Since the UF already exists on the basic heading, it is not necessary to add it to Dragsters. The proposal was not approved.

Our proposal was to add two additional cross-references to Dragsters: Drag cars, and Drag racing cars. While I understand, in principle, the reasoning behind the rejection of these additional references, I do not see how it serves users. A user coming to a catalog to search for the subject “Drag cars” will now get nothing, no redirection to the established heading. I don’t see how the presence of a reference from Cars (Automobiles) to Automobiles helps any user who starts a search with “Drag cars”. Only if they begin their search with Cars would they get led to Automobiles, and then only if they pursue narrower terms under that heading would they find Automobiles, Racing, which they would then have to follow further down to Dragsters. This just is not likely to happen. Instead they will probably start with a keyword search on “Drag cars” and find nothing, or if lucky, find one or two resources and think they have it all. And if they are astute enough to look at the subject headings on one of the records and see “Dragsters”, perhaps they will then redo their search.

Since the proposed cross-refs do not begin with the word Cars, I do not at all see how a decision like this is in the service of users of our catalogs. I think that LCSH rules for references were developed when it was expected that users would consult the big red books and work their way up and down hierarchies. While some online systems do provide for such navigation, it is doubtful that many users take this approach. Keyword searching is predominant in our catalogs and on the Web. Providing as many cross-refs to established headings as we can would be desirable. If the worry is that the printed red books will grow to too many volumes if we add more variant forms that weren’t made in the card environment, then perhaps there needs to be a way to include some references in authority records but mark them as not suitable for printing in printed products.

PS: According to ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz, UF has the following definition:

used for (UF)

A phrase indicating a term (or terms) synonymous with an authorized subject heading or descriptor, not used in cataloging or indexing to avoid scatter. In a subject headings list or thesaurus of controlled vocabulary, synonyms are given immediately following the official heading. In the alphabetical list of indexing terms, they are included as lead-in vocabulary followed by a see or USE cross-reference directing the user to the correct heading. See also: syndetic structure.

I did not attempt to reproduce the extremely rich cross-linking in this entry but commend the entire resource to your attention, particularly if you are a library science student.

February 6, 2012

Uwe Says: is your Reader atomic?

Filed under: Indexing,Lucene — Patrick Durusau @ 6:59 pm

Uwe Says: is your Reader atomic? by Uwe Schindler.

From the blog:

Since Day 1 Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality; from the IndexWriter perspective this was desirable but when reading the index this caused several problems in the past. In reality a Lucene index isn’t a single index while logically treated as such. The latest developments in Lucene trunk try to expose reality for type-safety and performance, but before I go into details about Composite, Atomic and DirectoryReaders let me go back in time a bit.

If you don’t mind looking deep into the heart of indexing in Lucene, this is a post for you. Problems, both solved and remaining, are discussed. This could be your opportunity to contribute to the Lucene community.

January 30, 2012

1 Billion Insertions – The Wait is Over!

Filed under: iiBench,Indexing,InnoDB,Insertion,TokuDB — Patrick Durusau @ 8:02 pm

1 Billion Insertions – The Wait is Over! by Tim Callaghan.

From the post:

iiBench measures the rate at which a database can insert new rows while maintaining several secondary indexes. We ran this for 1 billion rows with TokuDB and InnoDB starting last week, right after we launched TokuDB v5.2. While TokuDB completed it in 15 hours, InnoDB took 7 days.

The results are shown below. At the end of the test, TokuDB’s insertion rate remained at 17,028 inserts/second whereas InnoDB had dropped to 1,050 inserts/second. That is a difference of over 16x. Our complete set of benchmarks for TokuDB v5.2 can be found here.

Kudos to the TokuDB team! Impressive performance!

Tim comments on iiBench:

iiBench [Indexed Insertion Benchmark] simulates a pattern of usage for always-on applications that:

  • Require fast query performance and hence require indexes
  • Have high data insert rates
  • Cannot wait for offline batch processing and hence require the indexes be maintained as data comes in

If this sounds familiar, could be an important benchmark to keep in mind.

BTW, do you know of any topic map benchmarks? Just curious.

January 27, 2012

A Full Table Scan of Indexing in NoSQL

Filed under: Indexing,NoSQL — Patrick Durusau @ 4:30 pm

A Full Table Scan of Indexing in NoSQL by Will LaForest (MongoDB).

One slide reads:

What Indexes Can Help Us Do

  • Find the “location” of data
    • Based upon a value
    • Based upon a range
    •  Geospatial
  • Fast checks for existence
    • Uniqueness enforcement
  • Sorting
  • Aggregation
    • Usually covering indexes

The next slide is titled: “Requisite Book Analogy” with an image of a couple of pages from an index.

So, let’s copy out some of those entries and see where they fit into Will’s scheme:

Bears, 75, 223
Beds, good, their moral influence, 184, 186
Bees, stationary civilisation of, 195
Beethoven on Handel, 18
Beginners in art, how to treat them, 195

The entry for Bears, I think, qualifies for “location” of data based on a value.

And I see sorting, but those two are the only aspects of Will’s indexing that I see.

Do you see more?

What I do see is that the index is expressing relationships between subjects (“Beethoven on Handel”) and commenting on what information awaits a reader (“Beds, good, their moral influence”).

A NoSQL index could replicate the strings of these entries but without the richness of this index.

For example, consider the entry:

Aurora Borealis like pedal notes in Handel’s bass, 83

One expects the entry on Handel to contain that reference as well as the one for “Beethoven on Handel.” (I have only the two pages in this image and, as far as I know, I haven’t seen this particular index before.)

Question: How would you use the indexes in MongoDB to represent the richness of these two pages?

Question: Where did MongoDB (or other NoSQL) indexing fail?

It is important to remember that indexes, prior to the auto-generated shallowness of recent decades, were highly skilled acts of authorship that were a value-add for readers.

January 24, 2012

How Google Code Search Worked

Filed under: Indexing,Regexes — Patrick Durusau @ 3:37 pm

Regular Expression Matching with a Trigram Index or How Google Code Search Worked by Russ Cox.

In the summer of 2006, I was lucky enough to be an intern at Google. At the time, Google had an internal tool called gsearch that acted as if it ran grep over all the files in the Google source tree and printed the results. Of course, that implementation would be fairly slow, so what gsearch actually did was talk to a bunch of servers that kept different pieces of the source tree in memory: each machine did a grep through its memory and then gsearch merged the results and printed them. Jeff Dean, my intern host and one of the authors of gsearch, suggested that it would be cool to build a web interface that, in effect, let you run gsearch over the world’s public source code. I thought that sounded fun, so that’s what I did that summer. Due primarily to an excess of optimism in our original schedule, the launch slipped to October, but on October 5, 2006 we did launch (by then I was back at school but still a part-time intern).

I built the earliest demos using Ken Thompson’s Plan 9 grep, because I happened to have it lying around in library form. The plan had been to switch to a “real” regexp library, namely PCRE, probably behind a newly written, code reviewed parser, since PCRE’s parser was a well-known source of security bugs. The only problem was my then-recent discovery that none of the popular regexp implementations – not Perl, not Python, not PCRE – used real automata. This was a surprise to me, and even to Rob Pike, the author of the Plan 9 regular expression library. (Ken was not yet at Google to be consulted.) I had learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob’s and Ken’s code. The idea that you wouldn’t use the guaranteed linear time algorithm had never occurred to me. But it turned out that Rob’s code in particular used an algorithm only a few people had ever known, and the others had forgotten about it years earlier. We launched with the Plan 9 grep code; a few years later I did replace it, with RE2.

Russ covers inverted indexes, trigrams, and regexes, with pointers to working code and examples of how to use the code searcher locally, for example on the Linux source code.

Extremely useful article as an introduction to indexes and regexes.
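The core data structure is simple enough to sketch: a posting list per trigram, with candidate documents found by intersecting the lists for the query’s trigrams. (As Russ describes, the real system extracts required trigrams from the regex and still runs the regex over the candidates to confirm matches.) A toy version:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrigramIndex {
  private final Map<String, Set<Integer>> postings = new HashMap<>();

  // Index a document: record every 3-character substring it contains.
  void add(int docId, String text) {
    for (int i = 0; i + 3 <= text.length(); i++) {
      postings.computeIfAbsent(text.substring(i, i + 3), k -> new HashSet<>()).add(docId);
    }
  }

  // Candidate documents for a literal query: intersect the posting lists of its trigrams.
  // The candidates still have to be checked against the actual query (or regex).
  Set<Integer> candidates(String literal) {
    Set<Integer> result = null;
    for (int i = 0; i + 3 <= literal.length(); i++) {
      Set<Integer> docs = postings.getOrDefault(literal.substring(i, i + 3), Collections.emptySet());
      if (result == null) result = new HashSet<>(docs);
      else result.retainAll(docs);
    }
    return result == null ? Collections.emptySet() : result;
  }
}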

MongoDB Indexing in Practice

Filed under: Indexing,MongoDB — Patrick Durusau @ 3:36 pm

MongoDB Indexing in Practice

From the post:

With the right indexes in place, MongoDB can use its hardware efficiently and serve your application’s queries quickly. In this article, based on chapter 7 of MongoDB in Action, author Kyle Banker talks about refining and administering indexes. You will learn how to create, build and backup MongoDB indexes.

Indexing is closely related to topic maps, and the more you learn about indexes, the better the topic maps you will write.

Take for example the treatment of “multiple keys” in this post.

What that means is that multiple entries in an index can point at the same document.
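Here is what that looks like from the Java driver, as a sketch with the current driver API rather than the 2012-era one Kyle’s chapter uses, and with collection and field names of my own choosing:

import java.util.Arrays;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MultikeyExample {
  public static void main(String[] args) {
    MongoCollection<Document> people = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("demo").getCollection("people");

    // Indexing an array-valued field produces a "multikey" index: one index entry per
    // array element, every entry pointing back at the same document.
    people.insertOne(new Document("name", "Ada")
        .append("aliases", Arrays.asList("Ada Lovelace", "Countess of Lovelace")));
    people.createIndex(Indexes.ascending("aliases"));

    // Either alias finds the same document.
    System.out.println(people.find(Filters.eq("aliases", "Countess of Lovelace")).first());
  }
}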

Not that big of a step to multiple ways to identify the same subject.

Granting that in Kyle’s example none of his “keys” really identifies the subject; they are more isa, usedWith, usedIn type associations.

January 20, 2012

Simon says: Single Byte Norms are Dead!

Filed under: Indexing,Lucene — Patrick Durusau @ 9:19 pm

Simon says: Single Byte Norms are Dead!

From the post:

Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written down to the index. Lucene trades some precision loss for space on disk and eventually in memory, since norms are loaded into memory per field upon first access.

In lots of cases this precision loss is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing you end up writing your own auxiliary data structure or “fork” Lucene for your app and mess with the source.

The upcoming version of Lucene has already added support for a lot more scoring models, like Okapi BM25, language models, divergence from randomness models and information-based models.

The abstractions added to Lucene to implement those models already opens the door for applications that either want to roll their own “awesome” scoring model or modify the low level scorer implementations. Yet, norms are still one byte!

Don’t worry! The post has a happy ending!

Read on if you want to be on the cutting edge of Lucene work.

Thanks Lucene Team!

January 14, 2012

Extract meta concepts through co-occurrences analysis and graph theory

Filed under: Classification,co-occurrence,Indexing — Patrick Durusau @ 7:36 pm

Extract meta concepts through co-occurrences analysis and graph theory

Cristian Mesiano writes:

During the Christmas period I finally had the chance to read some papers about probabilistic latent semantic analysis and its applications in automatic classification and indexing.

The main concept behind “latent semantics” rests on the assumption that words that occur close together in a text are related to the same semantic construct.

Based on this principle, LSA (and partially also PLSA) builds a matrix to keep track of the co-occurrences of words in the text, and it assigns a score to these co-occurrences, considering their distribution in the corpus as well.

Often a TF-IDF score is used to rank the words.

Anyway, I was wondering whether this technique could also be useful for extracting key concepts from the text.

Basically I thought: “in LSA we consider some statistics over the co-occurrences, so why not consider the links among the co-occurrences as well?”
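The co-occurrence bookkeeping itself is easy to sketch: slide a window over the token stream and count pairs, keeping the counts as edge weights. Window size and tokenization here are arbitrary choices of mine, not Cristian’s:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceGraph {
  // Count co-occurrences of distinct terms inside a sliding window; counts become edge weights.
  static Map<String, Integer> edges(List<String> tokens, int window) {
    Map<String, Integer> weights = new HashMap<>();
    for (int i = 0; i < tokens.size(); i++) {
      for (int j = i + 1; j < Math.min(tokens.size(), i + window); j++) {
        String a = tokens.get(i), b = tokens.get(j);
        if (a.equals(b)) continue;
        String edge = a.compareTo(b) < 0 ? a + "--" + b : b + "--" + a;   // undirected edge key
        weights.merge(edge, 1, Integer::sum);
      }
    }
    return weights;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("media", "network", "society", "media", "network");
    System.out.println(edges(tokens, 3));   // prints the weighted edges of the toy token stream
  }
}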

Using the first three chapters of “The Media in the Network Society” by Gustavo Cardoso, Cristian creates a series of graphs.

Cristian promises his opinion on the classification of texts using this approach.

In the meantime, what’s yours?

January 9, 2012

Searching relational content with Lucene’s BlockJoinQuery

Filed under: Indexing,Lucene — Patrick Durusau @ 1:32 pm

Searching relational content with Lucene’s BlockJoinQuery

Mike McCandless writes:

Lucene’s 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.

Most search engines can’t directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
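The trick is that the child documents and their parent are indexed as one contiguous block and joined back together at query time. A sketch with class names from a recent Lucene release (the 3.4.0 post spells the query BlockJoinQuery) and invented field names:

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

public class BlockJoinSketch {
  // Children first, parent last: addDocuments() writes the whole block contiguously.
  static void indexCompany(IndexWriter writer) throws Exception {
    Document job1 = new Document();
    job1.add(new StringField("skill", "java", Field.Store.NO));
    Document job2 = new Document();
    job2.add(new StringField("skill", "lucene", Field.Store.NO));
    Document company = new Document();
    company.add(new StringField("type", "company", Field.Store.NO));
    company.add(new StringField("name", "Acme", Field.Store.YES));
    writer.addDocuments(Arrays.asList(job1, job2, company));
  }

  // Match child (job) documents, then join up to their parent (company) documents.
  static ToParentBlockJoinQuery companiesWithLuceneJobs() {
    return new ToParentBlockJoinQuery(
        new TermQuery(new Term("skill", "lucene")),
        new QueryBitSetProducer(new TermQuery(new Term("type", "company"))),
        ScoreMode.Avg);
  }
}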

Mike covers how to index relational content with Lucene 3.4.0 as well as the current limitations on that relational indexing. Current work is projected to resolve some of those limitations.

This feature will be immediately useful in a number of contexts.

Even more promising is the development of thinking about indexing as more than term -> document. Both sides of that operator need more granularity.

January 7, 2012

Distributed Indexing – SolrCloud

Filed under: Distributed Indexing,Indexing,SolrCloud — Patrick Durusau @ 4:07 pm

Distributed Indexing – SolrCloud

Not for the faint of heart but I noticed that progress is being made on distributed indexing for the SolrCloud project.

Whether you are a hard core coder or someone who is interested in using this feature (read feedback), now would be a good time to start paying attention to this work.

I added a new category for “Distributed Indexing” because this isn’t only going to come up for Solr. And I suspect there are aspects of “distributed indexing” that are going to be applicable to distributed topic maps as well.

January 4, 2012

Hadoop for Archiving Email – Part 2

Filed under: Hadoop,Indexing,Lucene,Solr — Patrick Durusau @ 9:40 am

Hadoop for Archiving Email – Part 2 by Sunil Sitaula.

From the post:

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.

