## Archive for the ‘Indexing’ Category

### Labels and Schema Indexes in Neo4j

Tuesday, May 14th, 2013

Labels and Schema Indexes in Neo4j by Tareq Abedrabbo.

From the post:

Neo4j recently introduced the concept of labels and their sidekick, schema indexes. Labels are a way of attaching one or more simple types to nodes (and relationships), while schema indexes make it possible to automatically index labelled nodes by one or more of their properties. Those indexes are then implicitly used by Cypher as secondary indexes and to infer the starting point(s) of a query.

I would like to shed some light in this blog post on how these new constructs work together. Some details will be inevitably specific to the current version of Neo4j and might change in the future but I still think it’s an interesting exercise.

Before we start though I need to populate the graph with some data. I’m more into cartoons for toddlers than second-rate sci-fi and therefore Peppa Pig shall be my universe.

So let’s create some labeled graph resources.
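
To make the mechanics concrete, here is a small pure-Python sketch (not Neo4j code, and labels and property names invented) of the behavior described: a label groups nodes, and a schema index on a (label, property) pair lets a query find its starting points by lookup rather than by scanning every node.

```python
# Pure-Python sketch (not Neo4j code) of how a schema index behaves:
# a label groups nodes, and an index on (label, property) lets a query
# find its starting points without scanning every node.

class Graph:
    def __init__(self):
        self.nodes = []    # list of (labels, properties)
        self.indexes = {}  # (label, property) -> {value: [node ids]}

    def create_node(self, labels, props):
        node_id = len(self.nodes)
        self.nodes.append((set(labels), props))
        # Keep any matching schema indexes up to date automatically.
        for (label, prop), index in self.indexes.items():
            if label in labels and prop in props:
                index.setdefault(props[prop], []).append(node_id)
        return node_id

    def create_index(self, label, prop):
        index = {}
        for node_id, (labels, props) in enumerate(self.nodes):
            if label in labels and prop in props:
                index.setdefault(props[prop], []).append(node_id)
        self.indexes[(label, prop)] = index

    def find(self, label, prop, value):
        # With an index this is a dictionary lookup; without one we
        # fall back to scanning all labelled nodes.
        if (label, prop) in self.indexes:
            return self.indexes[(label, prop)].get(value, [])
        return [nid for nid, (labels, props) in enumerate(self.nodes)
                if label in labels and props.get(prop) == value]

g = Graph()
g.create_index("Character", "name")
peppa = g.create_node(["Character"], {"name": "Peppa"})
george = g.create_node(["Character"], {"name": "George"})
```

The sketch also makes the post's point visible: the index machinery only needs a key to group on, and labels are one way of supplying that key.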

Nice review of the impact of the new label + schema index features in Neo4j.

I am still wondering: why can’t Neo4j’s “simple types” be added to nodes and edges without the additional machinery of labels?

Why not simply allow users to declare properties to be indexed and used by Cypher for queries?

That would create a generalized mechanism requiring no changes to the data model.

I have a question pending with the Neo4j team on this issue and will report back with their response.

### How Impoverished is the “current world of search?”

Wednesday, May 8th, 2013

Internet Content Is Looking for You

From the post:

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the text for which a user is searching. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

(…)

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches constitute context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person with subject matter knowledge beyond the average reader’s, who is therefore able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

### You Say Beowulf, I Say Biowulf [Does Print Shape Digital?]

Monday, May 6th, 2013

You Say Beowulf, I Say Biowulf by Julian Harrison.

From the post:

Students of medieval manuscripts will know that it’s always instructive to consult the originals, rather than to rely on printed editions. There are many aspects of manuscript culture that do not translate easily onto the printed page — annotations, corrections, changes of scribe, the general layout, the decoration, ownership inscriptions.

Beowulf is a case in point. Only one manuscript of this famous Old English epic poem has survived, which is held at the British Library (Cotton MS Vitellius A XV). The writing of this manuscript was divided between two scribes, the first of whom terminated their stint with the first three lines of f. 175v, ending with the words “sceaden mæl scyran”; their counterpart took over at this point, implying that an earlier exemplar lay behind their text, from which both scribes copied.

(…)

Another distinction between those two scribes, perhaps less familiar to modern students of the text, is the varying way in which they spell the name of the eponymous hero Beowulf. On 40 occasions, Beowulf’s name is spelt in the conventional manner (the first is found in line 18 of the standard editions, the last in line 2510). However, in 7 separate instances, the name is instead spelt “Biowulf” (“let’s call the whole thing off”), the first case coming in line 1987 of the poem.

I think you will enjoy the post, to say nothing of the images of the manuscript.

My topic map concern is with:

There are many aspects of manuscript culture that do not translate easily onto the printed page — annotations, corrections, changes of scribe, the general layout, the decoration, ownership inscriptions.

I take it that non-facsimile publication in print loses some of the richness of the manuscript.

My question is: To what extent have we duplicated the limitations of print media in digital publications?

For example, a book may have more than one index, but not more than one index of the same kind.

That is, you can’t find a book that has multiple distinct subject indexes. Not surprising, considering the printing cost of duplicate subject indexes, but we don’t have that limitation with electronic indexes.

Or do we?

In my experience anyway, electronic indexes mimic their print forefathers. Each electronic index stands on its own, even if each index is of the same work.

Assume we have a Spanish and English index, for the casual reader, to the plays of Shakespeare. Even in electronic form, I assume they would be created and stored as separate indexes.
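
A sketch of the alternative: keep the two language-specific indexes separate, as print would force, but merge them on demand through a mapping that records which entries are about the same subject. All names here are invented for illustration.

```python
# Two language-specific subject indexes over the same plays, stored
# separately (as in print) but merged on demand via a subject mapping.

english_index = {"jealousy": ["Othello"], "ambition": ["Macbeth"]}
spanish_index = {"celos": ["Othello", "The Winter's Tale"],
                 "ambición": ["Macbeth"]}

# Which entries are about the same subject, across the two indexes.
same_subject = {"jealousy": "celos", "ambition": "ambición"}

def merged_lookup(term):
    """Union of locators for a subject, whichever index holds them."""
    hits = set(english_index.get(term, []))
    counterpart = same_subject.get(term)
    if counterpart:
        hits.update(spanish_index.get(counterpart, []))
    return sorted(hits)
```

The point is that the merge is cheap once the subject mapping exists; what print could never afford, electronic indexes can do at lookup time.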

But isn’t that simply replicating what we would experience with a print edition?

Can you think of other cases where our experience with print media has shaped our choices with digital artifacts?

### Indexing Millions Of Documents…

Monday, April 29th, 2013

Indexing Millions Of Documents Using Tika And Atomic Update by Patricia Gorla.

From the post:

On a recent engagement, we were posed with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the text metadata.

Streaming large volumes of data into Solr is nothing new, but this dataset posed a unique challenge: Each patent document’s translation resided in a separate file, and the location of each translation file was unknown at runtime. This meant that for every document processed we wouldn’t know where its match would be. Furthermore, the translations would arrive in batches, to be added as they come. And lastly, the project needed to be open to different languages and different file formats in the future.

Our options for dealing with inconsistent data came down to: cleaning all data and organizing it before processing, or building an ingester robust enough to handle different situations.

We opted for the latter and built an ingester that would process each file individually and index the documents with an atomic update (new in Solr 4). To detect and extract the text metadata we chose Apache Tika. Tika is a document-detection and content-extraction tool useful for parsing information from many different formats.

On the surface Tika offers a simple interface to retrieve data from many sources. Our use case, however, required a deeper extraction of specific data. Using the built-in SAX parser allowed us to push Tika beyond its normal limits, and analyze XML content according to the type of information it contained.
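
The ingestion pattern can be simulated in a few lines of plain Python (field names and ids invented): each arriving file contributes only the fields it knows about, and the stored document is updated in place, roughly what Solr 4’s atomic `{"field": {"set": value}}` update does.

```python
# Simulation of Solr 4 atomic updates: each batch carries partial data
# for a document id, and a "set" merges fields into the stored doc.

index = {}  # doc id -> stored fields

def atomic_update(doc_id, fields):
    """Merge fields into an existing document, creating it if new
    (roughly what {"field": {"set": value}} does in Solr 4)."""
    doc = index.setdefault(doc_id, {"id": doc_id})
    doc.update(fields)

# Batch 1: the original patent text arrives first.
atomic_update("JP-123", {"text_ja": "original text"})
# Batch 2: the translation arrives later, in a separate file, without
# us knowing in advance where (or when) it would show up.
atomic_update("JP-123", {"text_en": "translated text"})
```

Because neither batch needs to know whether its counterpart has arrived, the ingester stays robust to out-of-order and late-arriving translations.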

No magic bullet but an interesting use case (patents in multiple languages).

### Broccoli: Semantic Full-Text Search at your Fingertips

Friday, April 19th, 2013

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.
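
A toy version of that idea, assuming nothing beyond the quoted description: each context (say, a sentence) yields index items for its words and for the entities occurring in it, so a query can ask which entities co-occur with a word.

```python
# Toy context-list index: one posting per word occurrence per context,
# plus the entities occurring in that same context.

contexts = [
    ("broccoli is an edible plant", ["Broccoli"]),
    ("venus flytraps are not edible", ["Venus flytrap"]),
    ("granite is not edible", []),
]

context_list = {}  # word -> list of (context id, entities in context)
for cid, (text, entities) in enumerate(contexts):
    for word in text.split():
        context_list.setdefault(word, []).append((cid, entities))

def entities_cooccurring_with(word):
    """Entities that occur in some context together with `word`."""
    hits = set()
    for cid, entities in context_list.get(word, []):
        hits.update(entities)
    return sorted(hits)
```

This is what lets a query like “edible plants” resolve to instances: the word posting and the entity posting sit in the same list, keyed by context.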

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

### Ultimate library challenge: taming the internet

Saturday, April 6th, 2013

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of results from archiving only 100 sites).

A welcome venture, particularly since the content gathered by the project will be made available to the public.

An unanswerable question, but I do wonder: how would we view Greek drama if all of it had been preserved?

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

### Indexing PDF for OSINT and Pentesting [or Not!]

Saturday, April 6th, 2013

Indexing PDF for OSINT and Pentesting by Alejandro Nolla.

From the post:

Most of us, when conducting OSINT tasks or gathering information for preparing a pentest, draw on Google hacking techniques like site:company.acme filetype:pdf “for internal use only” or something similar to search for potential sensitive information uploaded by mistake. At other times, a customer will ask us to find out if through negligence they have leaked this kind of sensitive information and we proceed to make some google hacking fu.

But, what happens if we don’t want to make these queries against Google and, furthermore, follow links from search results that could potentially leak referrers? Sure, we could download the documents and review them manually on a local machine, but that’s boring and time consuming. Here is where Apache Solr comes into play for processing documents and creating an index of them to give us almost real-time searching capabilities.

A nice outline of using Solr for internal security testing of PDF files.

At the same time, a nice outline of using Solr for external security testing of PDF files.

You can sweep sites for new PDF files on a periodic basis and retain only those meeting particular criteria.
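
The retain-by-criteria step might look like this (a pure-Python sketch; phrases, filenames, and the extracted text are invented, and the extraction itself would come from Tika or a PDF library):

```python
# Sketch of the retain-by-criteria step: keep only documents whose
# extracted text matches phrases worth a second look.

SENSITIVE_PHRASES = ["for internal use only", "confidential",
                     "do not distribute"]

def worth_keeping(extracted_text):
    text = extracted_text.lower()
    return any(phrase in text for phrase in SENSITIVE_PHRASES)

docs = {
    "a.pdf": "Quarterly report. For Internal Use Only.",
    "b.pdf": "Public press release.",
}
kept = [name for name, text in docs.items() if worth_keeping(text)]
```

Everything that passes the filter goes into the Solr index; everything else is discarded before it wastes index space.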

Low-grade ore, but even low-grade ore can have a small diamond every now and again.

### ElasticSearch: Text analysis for content enrichment

Saturday, March 30th, 2013

ElasticSearch: Text analysis for content enrichment by Jaibeer Malik.

From the post:

Taking an example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by any search solution plays a very big role in it. As a search user, I would prefer some typical search behavior for my query to apply automatically:

• should look for synonyms matching my query text
• should match singular and plural words, or words sounding similar to the entered query text
• should not allow searching on protected words
• should allow search for words mixed with numeric or special characters
• should not allow search on html tags
• should allow search text based on proximity of the letters and number of matching letters

Enriching the content here would be to add the above search capabilities to your content while indexing and searching for the content.

I thought the “…look for synonyms matching my query text…” might get your attention.

Not quite a topic map because there isn’t any curation of the search results, saving the next searcher time and effort.

But in order to create and maintain a topic map, you are going to need expansion of your queries by synonyms.

You will take the results of those expanded queries and fashion them into a topic map.
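
A minimal sketch of such expansion (the synonym table here is invented; in ElasticSearch it would live in the analyzer configuration):

```python
# Minimal query expansion by synonyms; the synonym table stands in for
# the analyzer configuration an ElasticSearch index would carry.

SYNONYMS = {"tv": {"television", "telly"}, "couch": {"sofa", "settee"}}

def expand(query_terms):
    """Return the query terms plus any synonyms."""
    expanded = set()
    for term in query_terms:
        expanded.add(term)
        expanded.update(SYNONYMS.get(term, set()))
    return expanded

def search(query_terms, documents):
    terms = expand(query_terms)
    return [doc for doc in documents
            if terms & set(doc.lower().split())]

docs = ["flat screen television for sale", "red sofa lightly used"]
```

The expanded result set is the raw material; deciding which of those hits are really about the same subject is the curation step a machine can’t do for you.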

Think of it this way:

Machines can rapidly harvest, even sort content at your direction.

What they can’t do is curate the results of their harvesting.

That requires a secret ingredient.

That would be you.

I first saw this at DZone.

### Build a search engine in 20 minutes or less

Thursday, March 28th, 2013

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.
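
The core of the model fits in a few lines: term-frequency vectors and cosine similarity. This is a generic sketch (not Ogorek’s code, and the documents are invented), without the TF-IDF weighting a real engine would add.

```python
# Vector space model in miniature: documents and queries become
# term-frequency vectors; relevance is the cosine of the angle
# between them.

import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["the quick brown fox", "the lazy brown dog",
        "stock market report"]
vectors = [tf_vector(d) for d in docs]

def rank(query):
    q = tf_vector(query)
    scored = sorted(((cosine(q, v), d) for v, d in zip(vectors, docs)),
                    reverse=True)
    return [d for score, d in scored if score > 0]
```

Note what the vectors carry: counts of word occurrences, nothing more. Which is exactly the point below about semantics.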

Enjoy!

PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.

### Master Indexing and the Unified View

Monday, March 25th, 2013

Master Indexing and the Unified View by David Loshin.

From the post:

1) Identity resolution – The master data environment catalogs the set of representations that each unique entity exhibits in the original source systems. Applying probabilistic aggregation and/or deterministic rules allows the system to determine that the data in two or more records refers to the same entity, even if the original contexts are different.

2) Data quality improvement – Linking records that share data about the same real-world entity enables the application of business rules to improve the quality characteristics of one or more of the linked records. This doesn’t specifically mean that a single “golden copy” record must be created to replace all instances of the entity’s data. Instead, depending on the scenario and quality requirements, the accessibility of the different sources and the ability to apply those business rules at the data user’s discretion will provide a consolidated view that best meets the data user’s requirements at the time the data is requested.

3) Inverted mapping – Because the scope of data linkage performed by the master index spans the breadth of both the original sources and the collection of data consumers, it holds a unique position to act as a map for a standardized canonical representation of a specific entity to the original source records that have been linked via the identity resolution processes.

In essence this allows you to use a master data index to support federated access to original source data while supporting the application of data quality rules upon delivery of the data.

It’s been a long day but does David’s output have all the attributes of a topic map?

1. Identity resolution – Two or more representations of the same subject
2. Data quality improvement – A consolidated view of the data, based on a subject, presented to the user
3. Inverted mapping – Navigation based on a specific entity into original source records
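
A deterministic-rule sketch of points 1 and 3 (sources, field names, and the matching rule are all invented): records from two systems resolve to one entity key, the master index keeps the inverted mapping back to the source records, and the unified view is assembled on request.

```python
# Identity resolution with a deterministic rule (normalized name +
# zip code), plus the inverted mapping back to source records.

def entity_key(record):
    """Deterministic rule: same normalized name and zip => same entity."""
    return (record["name"].strip().lower(), record["zip"])

sources = {
    "crm":     [{"name": "Ann Smith ", "zip": "30301",
                 "phone": "555-0101"}],
    "billing": [{"name": "ann smith",  "zip": "30301",
                 "email": "ann@example.com"}],
}

master = {}  # entity key -> [(source, record)]: the inverted mapping
for source, records in sources.items():
    for record in records:
        master.setdefault(entity_key(record), []).append((source, record))

def unified_view(key):
    """Consolidated view: merge fields from all linked source records."""
    view = {}
    for _, record in master[key]:
        view.update(record)
    return view
```

The unified view is built at request time from the linked sources, which is David’s point: no “golden copy” record has to replace the originals.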

### Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar)

Friday, March 22nd, 2013

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar). Presenter: Erik Hatcher, Lucene/Solr Committer and PMC member.

Date: Wednesday, March 27, 2013
Time: 10:00am Pacific Time

From the signup page:

Lucene/Solr 4 is a groundbreaking shift from previous releases. Solr 4.0 dramatically improves scalability, performance, reliability, and flexibility. Lucene 4 has been extensively upgraded. It now supports near real-time (NRT) capabilities that allow indexed documents to be rapidly visible and searchable. Additional Lucene improvements include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage.

The improvements in Lucene have automatically made Solr 4 substantially better. But Solr has also been considerably improved and magnifies these advances with a suite of new “SolrCloud” features that radically improve scalability and reliability.

In this Webinar, you will learn:

• What are the Key Feature Enhancements of Lucene/Solr 4, including the new distributed capabilities of SolrCloud
• How to Use the Improved Administrative User Interface
• How Sharding has been improved
• What are the improvements to GeoSpatial Searches, Highlighting, Advanced Query Parsers, Distributed search support, Dynamic core management, Performance statistics, and searches for rare values, such as Primary Key

Great way to get up to speed on the latest release of Lucene/Solr!

### Using Solr’s New Atomic Updates

Friday, March 15th, 2013

Using Solr’s New Atomic Updates by Scott Stults.

From the post:

A while ago we created a sample index of US patent grants roughly 700k documents big. Adjacently we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.

You see, we wanted to give our UI the ability to flip through those thumbnails and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then tried to pull down subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn’t scale because a good portion of our requests were for non-existent resources.

Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?

A new feature in Solr and one that I suspect will be handy. Updating an index of associations, for example.

### A Peek Behind the Neo4j Lucene Index Curtain

Friday, March 15th, 2013

A Peek Behind the Neo4j Lucene Index Curtain by Max De Marzi.

Max suggests using a copy of your Neo4j database for this exercise.

Could be worth your while to go exploring.

And you will learn something about Lucene in the bargain.

### Learning Hash Functions Using Column Generation

Monday, March 11th, 2013

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.

Abstract:

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.
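
This is not CGHash, but the family resemblance can be shown with its simplest data-independent cousin: hyperplane similarity hashing, where vectors at a small angle tend to agree on most hash bits. The hyperplanes are fixed literals here for reproducibility; in practice they would be drawn at random (or, as in CGHash, learned from proximity comparisons).

```python
# Hyperplane similarity hashing: one bit per hyperplane records which
# side of it a vector falls on, so nearby vectors share most bits.

HYPERPLANES = [
    [1.0, -0.5, 0.2],
    [-0.3, 0.8, 0.5],
    [0.6, 0.6, -0.7],
    [-0.2, -0.4, 0.9],
]

def sim_hash(vec):
    """One bit per hyperplane: which side of it does vec fall on?"""
    return tuple(int(sum(h * x for h, x in zip(plane, vec)) >= 0)
                 for plane in HYPERPLANES)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

close_a = [1.0, 0.8, 0.1]
close_b = [0.9, 0.9, 0.0]   # nearly the same direction as close_a
far_c   = [-1.0, 0.2, 0.9]  # points somewhere else entirely
```

For topic maps, the interesting move would be replacing the geometric “nearby” with a learned or declared notion of sameness, which is exactly what data-dependent methods like CGHash pursue.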

I first saw this in a tweet by Stefano Bertolo.

### A.nnotate

Monday, March 4th, 2013

A.nnotate

From the homepage:

A.nnotate is an online annotation, collaboration and indexing system for documents and images, supporting PDF, Word and other document formats. Instead of emailing different versions of a document back and forth you can now all comment on a single read-only copy online. Documents are displayed in high quality with fonts and layout just like the printed version. It is easy to use and runs in all common web browsers, with no software or plugins to install.

Hosted solutions are available for individuals and workgroups. For enterprise users the full system is available for local installation. Special discounts apply for educational use. A.nnotate technology can also be used to enhance existing document and content management systems with high quality online document viewing, annotation and collaboration facilities.

I suppose that is one way to solve the “index merging” problem.

Everyone use a common document.

Doesn’t help if a group starts with different copies of the same document.

Or if other indexes from other documents need to be merged with the present document.

Not to mention merging indexes/annotations separate from any particular document instance.

Still, a step away from the notion of a document as a static object.

Which is a good thing.

I first saw this in a tweet by Stian Danenbarger.

### MongoDB + Fractal Tree Indexes = High Compression

Friday, March 1st, 2013

MongoDB + Fractal Tree Indexes = High Compression by Tim Callaghan.

You may have heard that MapR Technologies broke the MinuteSort record by sorting 15 billion 100-byte records in 60 seconds. They used 2,103 virtual instances in the Google Compute Engine, each with four virtual cores and one virtual disk, totaling 8,412 virtual cores and 2,103 virtual disks. Google Compute Engine, MapR Break MinuteSort Record.

So, the next time you have 8,412 virtual cores and 2,103 virtual disks, you know what is possible.

But if you have less firepower than that, you will need to be clever:

One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has an open ticket from 2009 titled “Option to Store Data Compressed” with Fix Version/s planned but not scheduled. The ticket has a lot of comments, mostly from MongoDB users explaining their use-cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more in the ticket.

In prior blogs we’ve written about significant performance advantages when using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size creates another advantage as these algorithms tend to compress large blocks better than small ones.

Given the interest in compression for MongoDB and our capabilities to address this functionality, we decided to do a benchmark to measure the compression achieved by MongoDB + Fractal Tree Indexes using each available compression type. The benchmark loads 51 million documents into a collection and measures the size of all files in the file system (–dbpath).
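
You can get a feel for the algorithm trade-off with the standard library alone. This toy measurement (invented data, not the benchmark’s 51 million documents) compares zlib and LZMA ratios on repetitive JSON; quicklz has no stdlib binding, so it is omitted.

```python
# Toy compression-ratio measurement on repetitive JSON documents,
# comparing two of the algorithm families mentioned in the post.

import json
import lzma
import zlib

docs = [{"user": f"user{i}", "status": "active", "score": i % 10}
        for i in range(1000)]
raw = json.dumps(docs).encode("utf-8")

ratios = {
    "zlib": len(raw) / len(zlib.compress(raw)),
    "lzma": len(raw) / len(lzma.compress(raw)),
}
```

Repetitive field names and values are why document stores compress so well, and why large blocks (which hold more repetition) tend to compress better than small ones.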

More benchmarks are to follow, and you should remember that all benchmarks are just that: benchmarks.

Investigate software based on benchmarks; purchase software based on your own testing.

### Drill Sideways faceting with Lucene

Monday, February 25th, 2013

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]

….

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
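
The recap above reduces to one rule, sketched here with invented data: when counting facet values for a field, drop that field’s own filter (so sibling values stay visible) but keep every other filter in force.

```python
# Drill-sideways facet counting: for the field being counted, ignore
# that field's own filter; keep all other filters.

from collections import Counter

products = [
    {"brand": "Samsung", "type": "LED"},
    {"brand": "Samsung", "type": "LCD"},
    {"brand": "Sony",    "type": "LED"},
    {"brand": "Vizio",   "type": "LED"},
]

def facet_counts(field, filters):
    counts = Counter()
    for doc in products:
        # Apply every filter EXCEPT the one on `field` itself.
        others = {f: v for f, v in filters.items() if f != field}
        if all(doc[f] == v for f, v in others.items()):
            counts[doc[field]] += 1
    return dict(counts)

# The user has filtered brand=Samsung and type=LED; the Brand facet
# still shows the sibling brands, ready for multi-select.
selected = {"brand": "Samsung", "type": "LED"}
```

This is why the Brand field at Amazon doesn’t disappear after you check a brand: its counts are computed as if your brand filter weren’t there.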

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

### Text processing (part 2): Inverted Index

Sunday, February 24th, 2013

Text processing (part 2): Inverted Index by Ricky Ho.

From the post:

This is the second part of my text processing series. In this blog, we’ll look into how text documents can be stored in a form that can be easily retrieved by a query. I’ll use the popular open source Apache Lucene index for illustration.

Not only do you get to learn about inverted indexes but some Lucene in the bargain.
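
The core structure, in brief (a generic sketch with invented documents, not Ho’s code): a map from term to the set of documents containing it, with conjunctive queries answered by intersecting posting lists.

```python
# A minimal inverted index: term -> set of document ids, with AND
# queries answered by posting-list intersection.

docs = {
    0: "the cat sat on the mat",
    1: "the dog sat on the log",
    2: "cats and dogs",
}

inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted.setdefault(term, set()).add(doc_id)

def search_and(*terms):
    """Intersect posting lists; empty if any term is unseen."""
    postings = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Lucene adds positions, skip lists, and compression on top, but this term-to-postings shape is the heart of it.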

### Indexing StackOverflow In Solr

Saturday, February 23rd, 2013

Indexing StackOverflow In Solr by John Berryman.

From the post:

One thing I really like about Solr is that it’s super easy to get started. You just download Solr, fire it up, and then after following the 10-minute tutorial you’ll have a basic understanding of indexing, updating, searching, faceting, filtering, and generally using Solr. But you’ll soon get bored of playing with the 50 or so demo documents. So quit insulting Solr with this puny, measly, wimpy dataset; index something of significance and watch what Solr can do.

One of the most approachable large datasets is the StackExchange data set which most notably includes all of StackOverflow, but also contains many of the other StackExchange sites (Cooking, English Grammar, Bicycles, Games, etc.) So if StackOverflow is not your cup of tea, there’s bound to be a data set in there that jives more with your interests.

Once you’ve pulled down the data set, then you’re just moments away from having your own SolrExchange index. Simply unzip the dataset that you’re interested in (7-zip format zip files), pull down this git repo which walks you through indexing the data, and finally, just follow the instructions in the README.md.

Interesting data set for Solr.

More importantly, a measure of how easy it needs to be to get started with software.

Software like topic maps.

Suggestions?

### Machine Biases: Stop Word Lists

Friday, February 22nd, 2013

Oracle Text Search doesn’t work on some words (Stackoverflow)

From the post:

I am using Oracle’s Text Search for my project. I created a ctxsys.context index on my column and inserted one entry “Would you like some wine???”. I executed the query

select guid, text, score(10) from triplet where contains (text, ‘Would’, 10) > 0

it gave me no results. Querying ‘you’ and ‘some’ also return zero results. Only ‘like’ and ‘wine’ matches the record. Does Oracle consider you, would, some as stop words?? How can I let Oracle match these words? Thank you.

I mention this because Oracle Text Workaround for Stop Words List (Beyond Search) reports this as “interesting.”

Hardly. Stop lists are a common feature of text searching.

Moreover, the stop list in question dates back to MySQL 3.23.

Section 11.9.4, Full-Text Stopwords, excludes “you” and “would” from indexing.
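
The Stackoverflow behavior is easy to reproduce outside Oracle; any indexer with a stop list acts the same way, because stop words are dropped at index time and so have no postings to match at query time. The stop list below is an abbreviated stand-in, not the full MySQL default.

```python
# Why "Would" finds nothing: stop words are dropped at index time,
# so there is nothing for them to match at query time.

STOP_WORDS = {"you", "would", "some", "the", "a", "of"}  # abbreviated

def tokenize(text):
    return [t.strip("?!.,").lower() for t in text.split()]

def index_terms(text):
    return {t for t in tokenize(text) if t not in STOP_WORDS}

row = index_terms("Would you like some wine???")

def matches(query):
    return tokenize(query)[0] in row
```

The fix, in any engine, is the same: index with an empty (or customized) stop list, not a different query.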

Suggestions of other stop word lists that exclude “you” and “would”?

### NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Wednesday, February 20th, 2013

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depends on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.
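The “Good Indexes, Fast Queries” thesis above can be illustrated with a toy secondary index (a sketch, not TokuDB or MongoDB code): keeping keys in sorted order lets a query binary-search instead of scanning every record.

```python
import bisect
import random

# Some records with an "age" field we want to query by.
records = [{"id": i, "age": random.randrange(100)} for i in range(10_000)]

# A "secondary index" on age: sorted (age, id) pairs.
index = sorted((r["age"], r["id"]) for r in records)
keys = [k for k, _ in index]

def ids_with_age(age):
    """Index lookup: O(log n) binary search plus O(matches) walk."""
    lo = bisect.bisect_left(keys, age)
    hi = bisect.bisect_right(keys, age)
    return [doc_id for _, doc_id in index[lo:hi]]

# Same answer as a full scan, without touching every record.
assert sorted(ids_with_age(42)) == sorted(
    r["id"] for r in records if r["age"] == 42
)
```

The engineering pain the post describes is not this lookup but keeping such structures updated on disk under heavy writes, which is where B-trees struggle and Fractal Tree Indexes claim their advantage.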

Did somebody declare February 20th to be performance release day?

Did I miss that memo?

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is that the faster I get an unhelpful result from a search, the sooner I can refine the search.

But that’s an assumption on my part.

Is that really true?

### Concurrency Improvements in TokuDB v6.6 (Part 1)

Friday, February 1st, 2013

Concurrency Improvements in TokuDB v6.6 (Part 1)

From the post:

With TokuDB v6.6 out now, I’m excited to present one of my favorite enhancements: concurrency within a single index. Previously, while there could be many SQL transactions in-flight at any given moment, operations inside a single index were fairly serialized. We’ve been working on concurrency for a few versions, and things have been getting a lot better over time. Today I’ll talk about what to expect from v6.6. Next time, we’ll see why.

Impressive numbers as always!

Should get you interested in learning how this was done as an engineering matter. (That’s in part 2.)

### Getting real-time field values in Lucene

Sunday, January 27th, 2013

Getting real-time field values in Lucene by Mike McCandless.

From the post:

We know Lucene’s near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It’s simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.

I saw a webinar by Mike McCandless that is probably the only webinar I would ever repeat watching.

Organized, high quality technical content, etc.

Compare that to a recent webinar I watched that spent fifty-five (55) minutes reviewing information known to anyone who could say the software’s name. The speaker then lamented the lack of time to get into substantive issues.

When you see a webinar like Mike’s, drop me a line. We need to promote that sort of presentation over the other.

### Apache Lucene 4.1 and Apache Solr™ 4.1 available

Thursday, January 24th, 2013

Lucene CHANGES.text

Solr CHANGES.txt

That’s getting the new year off to a great start!

### A new Lucene highlighter is born [The final inch problem]

Monday, January 7th, 2013

A new Lucene highlighter is born Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.
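The offset mechanism described above is easy to see in miniature. Here is a sketch (not Lucene’s PostingsHighlighter) that records each token’s start/end character offsets at analysis time and uses them to wrap matches in the original text:

```python
import re

def tokenize_with_offsets(text):
    """Emit (token, start_offset, end_offset) triples, like
    Lucene's OffsetAttribute does per token."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def highlight(text, query_terms):
    """Wrap query-term occurrences using their character offsets."""
    spans = [(s, e) for tok, s, e in tokenize_with_offsets(text)
             if tok in query_terms]
    out, last = [], 0
    for s, e in spans:
        out.append(text[last:s])
        out.append("<b>" + text[s:e] + "</b>")
        last = e
    out.append(text[last:])
    return "".join(out)

print(highlight("Lucene highlighters need offsets.", {"offsets", "lucene"}))
# <b>Lucene</b> highlighters need <b>offsets</b>.
```

The failure modes Mike mentions follow directly: if an analyzer reports wrong offsets, the wrapped spans land in the wrong place in the original text.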

An interesting addition to the highlighters in Lucene.

### Superlinear Indexes

Tuesday, December 18th, 2012

Superlinear Indexes

From the webpage:

Multidimensional and String Indexes for Streaming Data

The Superlinear Index project is investigating data structures and algorithms for maintaining superlinear indexes on out-of-core storage (such as disk drives) with high incoming data rates. To understand what a superlinear index is, consider a linear index, which provides a total order on keys. A superlinear index is more complex than a total order. Examples of superlinear indexes include multidimensional indexes and full-text indexes.

A number of publications but none in 2012.

I will be checking to see if the project is still alive.

### Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications

Sunday, December 16th, 2012

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

A bit dated (2010) but I think you will find this interesting reading.

A couple of snippets to tempt you into reading the full post:

Consider this: you’re looking for information and immediately search the documents at your disposal to find the answer. Are you the first person who conducted this search? If you are in a reasonably large organization, given the scope and mix of electronic communications today, there could be more than 10 other employees looking for the same answer. Unearthing documents, one employee at a time, may not be the best way of tapping into that collective intellect and maximizing resources across an organization. Wouldn’t it make more sense to tap into existing discussions taking place across the network—over email, voice and increasingly video communications?

and,

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

Add a little vocabulary mapping with topic maps, toss and serve!
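The vocabulary-based tagging Cisco describes can be sketched in a few lines (a toy, not Cisco Pulse): each message is tagged with the controlled-vocabulary terms it mentions, so expertise becomes searchable by term.

```python
# Illustrative controlled vocabulary; a real deployment would be
# organization-specific and much larger.
VOCABULARY = {"routing", "firewall", "latency"}

def tag(message):
    """Tag a message with the vocabulary terms it contains."""
    words = set(message.lower().split())
    return sorted(VOCABULARY & words)

messages = {
    "alice": "Seeing high latency after the firewall change",
    "bob": "Lunch at noon?",
}
tags = {who: tag(text) for who, text in messages.items()}
print(tags)  # {'alice': ['firewall', 'latency'], 'bob': []}
```

The topic map angle is then mapping those vocabulary terms to the subjects they identify, across departments that use different terms for the same thing.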

### Looking at a Plaintext Lucene Index

Saturday, December 8th, 2012

Looking at a Plaintext Lucene Index by Florian Hopf.

From the post:

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consist of several binary files that you can’t really inspect if you don’t use tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

Good starting point for learning more about Lucene indexes.
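The appeal of SimpleTextCodec is easy to demonstrate in miniature. This sketch (Python, not Lucene’s actual file format) writes the same postings a binary codec would hold, but as inspectable plain text:

```python
import io

def write_plaintext_index(index, out):
    """Write an inverted index in a human-readable, SimpleText-like
    layout: one term per line, its documents indented below it."""
    for term in sorted(index):
        out.write(f"term {term}\n")
        for doc_id in sorted(index[term]):
            out.write(f"  doc {doc_id}\n")

index = {"lucene": {0, 2}, "codec": {1}}
buf = io.StringIO()
write_plaintext_index(index, buf)
print(buf.getvalue())
```

Nothing about the logical index changes; only the on-disk encoding does, which is exactly the separation Lucene 4’s Codec API introduces.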

### News and the Overloaded Consumer

Thursday, December 6th, 2012

From the post:

Every day, a new app or service arrives with the promise of helping people cut down on the flood of information they receive. It’s the natural result of living in a time when an ever-increasing number of news providers push a constant stream of headlines at us every day.

But what if it’s the ways we choose to read the news — not the glut of news providers — that make us feel overwhelmed? An interesting new study out of the University of Texas looks at the factors that contribute to the concept of information overload, and found that, for some people, the platform on which news is being consumed can make all the difference between whether you feel overwhelmed.

The study, “News and the Overloaded Consumer: Factors Influencing Information Overload Among News Consumers” was conducted by Avery Holton and Iris Chyi. They surveyed more than 750 adults on their digital consumption habits and perceptions of information overload. On the central question of whether they feel overloaded with the amount of news available, 27 percent said “not at all”; everyone else reported some degree of overload.

The results imply that the more constrained the platform for delivering content, the less overwhelmed users feel. Reading news on a cell phone, for example, sits at the constrained end; the links and videos on Facebook are at the other extreme.

Which makes me curious about information interfaces in general and topic map interfaces in particular.

Does the traditional topic map interface (think Omnigator) contribute to a feeling of information overload?

If so, how would you alter that display to offer the user less information by default but allow its expansion upon request?

Compare to a book index, which offers sparse information on a subject, that can be expanded by following a pointer to fuller treatment of a subject.

I don’t think replicating a print index with hyperlinks in place of traditional references is the best solution but it might be a starting place for consideration.

### Lucene with Zing, Part 2

Wednesday, November 21st, 2012

Lucene with Zing, Part 2 by Mike McCandless.

From the post:

When I last tested Lucene with the Zing JVM the results were impressive: Zing’s fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene’s 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory’s max throughput was now better than RAMDirectory’s (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory’s performance, and lower the pressure on concurrent mark/sweep’s GC to the point where pause times were so much lower in GA compared to alpha?

Mike updates his prior experience with Lucene and Zing.

Covers the use gcLogAnalyser and Fragger to understand “why” his performance test results changed from the alpha to GA releases.

Insights into both Lucene and Zing.