Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 14, 2014

Apache MarkMail

Filed under: Indexing,MarkLogic,Searching — Patrick Durusau @ 6:56 pm

Apache MarkMail

Just in case you don’t have your own index of the 10+ million messages in Apache mailing list archives, this is the site for you.


I ran across it today while debugging an error in a Solr config file.

If I could add one thing to MarkMail it would be software release date facets. Posts are not limited by release dates but I suspect a majority of posts between release dates are about the current release. Enough so that I would find it a useful facet.


March 9, 2014

Lucene 4 Essentials for Text Search and Indexing

Filed under: Indexing,Java,Lucene,Searching — Patrick Durusau @ 5:06 pm

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short! 😉

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.


PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.

March 7, 2014

Using Lucene’s search server to search Jira issues

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:02 pm

Using Lucene’s search server to search Jira issues by Michael McCandless.

From the post:

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene’s capabilities, I point them to this application so they can see for themselves.

Recently, I’ve made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I’ve been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene’s current modules in a server context with minimal “thin server” additional source code.

Separately, to test this new Lucene based server, and to complete the “dog food,” I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira’s REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

Of particular interest to me because OASIS is about to start using JIRA 6.2 (the version in use at Apache).

I haven’t looked closely at the documentation for JIRA 6.2.

Thoughts on where it has specific weaknesses that are addressed by Michael’s solution?

February 9, 2014

January 13, 2014

January 4, 2014

Writing a full-text search engine using Bloom filters

Filed under: Bloom Filters,Indexing,Search Engines — Patrick Durusau @ 2:32 pm

Writing a full-text search engine using Bloom filters by Stavros Korokithakis.

A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.

How an inverted index works

An inverted index is a data structure that basically maps every word in every document to the ID of the document it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, that is only document with ID 3).

However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there. What static site generators do is generate every required file when generating your site, then making those available for the client to download. Usually, search-engine plugins limit themselves to tags and titles, to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy, use a Bloom filter!
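
To make the contrast concrete, here is a minimal Python sketch of both approaches: an inverted index with posting-list intersection, and one small Bloom filter per document. The corpus, filter size and hash count are my assumptions for illustration, not values from Stavros's post, and the salted SHA-256 hashing is just one convenient way to get several hash positions.

```python
import hashlib

# Toy corpus; document IDs and texts are made up for illustration.
DOCS = {
    1: "python tutorial for beginners",
    3: "python on the raspberry pi",
    7: "raspberry jam recipe",
}


def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            index.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_inverted(index, terms):
    """Intersect the posting lists of all query terms."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


class BloomFilter:
    """Fixed-size bit array with several hash positions per word."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive the positions by salting one hash function with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def build_filters(docs):
    """Build one Bloom filter per document."""
    filters = {}
    for doc_id, text in docs.items():
        bloom = BloomFilter()
        for word in text.split():
            bloom.add(word)
        filters[doc_id] = bloom
    return filters


def search_bloom(filters, terms):
    """Return documents whose filter claims to contain every term."""
    return sorted(
        doc_id
        for doc_id, bloom in filters.items()
        if all(term in bloom for term in terms)
    )
```

The inverted index returns exactly the matching documents. The Bloom filters never miss a matching document but can return an occasional false positive, which is the price paid for a fixed, small number of bits per page.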

An interesting alternative to indexing a topic map with an inverted index.

I mention it in part because of one of the “weaknesses” of Bloom filters for searching:

You can’t weight pages by relevance, since you don’t know how many times a word appears in a page, all you know is whether it appears or not. You may or may not care about this.

Unlike documents, which are more or less relevant due to word occurrences, topic maps cluster information about a subject into separate topics (or proxies if you prefer).

That being the case, no one topic/proxy is more “relevant” than another. The question is whether this topic/proxy represents the subject you want.

Or to put it another way, topics/proxies have already been arranged by “relevance” by a topic map author.

If a topic map interface gives you hundreds or thousands of “relevant” topics/proxies, how are you any better off than a more traditional search engine?

If you need to look like you are working, go search any of the social media sites for useful content. It’s there, the difficulty is going to be finding it.

December 19, 2013

indexing in Neo4j – an overview

Filed under: Graphs,Indexing,Neo4j — Patrick Durusau @ 7:13 pm

indexing in Neo4j – an overview by Stefan Armbruster.

From the post:

Neo4j as a graph database features indexing as the preferred way to find start points for graph traversals. Over the years multiple different indexing approaches have been added. The goal of this article is to give an overview on this to avoid confusion esp. for those who just recently got started with Neo4j.

A graph database using a property graph model stores its data in nodes, relationships and properties. In Neo4j 2.0 this model was amended with labels.

A very nice summary of the indexing mechanisms in Neo4j.

After all, if you write something down and then can’t find it, what good is it?


November 25, 2013

How-to: Index and Search Data with Hue’s Search App

Filed under: Hue,Indexing,Interface Research/Design,Solr — Patrick Durusau @ 4:32 pm

How-to: Index and Search Data with Hue’s Search App

From the post:

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Don’t be discouraged by the speed of the presenter in the video.

I suspect he is more than “familiar” with Hue, Solr and the Yelp dataset. 😉

Like all great “how-to” guides you get a very positive outcome.

A positive outcome with minimal effort may be essential reinforcement for new technologies.

November 16, 2013


Filed under: Indexing,Lucene,Luke — Patrick Durusau @ 7:24 pm

CLue – Command Line tool for Apache Lucene by John Wang.

From the webpage:

When working with Lucene, it is often useful to inspect an index.

Luke is awesome, but often times it is not feasible to inspect an index on a remote machine using a GUI. That’s where Clue comes in. You can ssh into your production box and inspect your index using your favorite shell.

Another important feature for Clue is the ability to interact with other Unix commands via piping, e.g. grep, more etc.

[New in 0.0.4 Release]

  • Add ability to investigate indexes on HDFS
  • Add command to dump the index
  • Add command to import from a dumped index
  • Add configuration support, now you can configure Clue to run your own custom code
  • Add index trimming functionality: sometimes you want a smaller index to work with
  • lucene 4.5.1 upgrade

Definitely a tool to investigate for adding to your tool belt!

November 15, 2013

November 9, 2013

Full-Text Indexing PDFs in Javascript

Filed under: Indexing,Javascript,PDF — Patrick Durusau @ 8:35 pm

Full-Text Indexing PDFs in Javascript by Gary Sieling.

From the post:

Mozilla Labs received a lot of attention lately for a project impressive in its ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications to the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and its effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem. (Emphasis added.)
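
The ranking idea in the refresher (count words per document, discount words that appear everywhere) is essentially tf-idf. A minimal Python sketch follows, with a made-up three-document corpus and whitespace tokenization. It is not the indexer Gary describes, and it skips the stemming step entirely.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
DOCS = {
    "a": "the pdf spec is complex",
    "b": "the pdf pdf indexer",
    "c": "the lucene library",
}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming in this sketch)."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and per-term document frequencies."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def rank(query, term_counts, doc_freq):
    """Score documents by the sum of tf * idf over the query terms."""
    n_docs = len(term_counts)
    scores = {}
    for doc_id, counts in term_counts.items():
        total = 0.0
        for term in tokenize(query):
            if counts[term]:
                # A term found in every document gets idf = log(1) = 0.
                idf = math.log(n_docs / doc_freq[term])
                total += counts[term] * idf
        if total > 0:
            scores[doc_id] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this corpus “the” occurs in every document, so it contributes nothing to any score, while “pdf” separates the documents that actually discuss PDFs.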

The need for a full-text indexer without using one of the major indexing packages had not occurred to me.

Access to the user’s machine might be limited by time, for example. You would not want to waste cycles spinning up a major indexer when you don’t know the installed software.

Something to add to your USB stick. 😉

November 6, 2013

Introduction to Information Retrieval

Filed under: Classification,Indexing,Information Retrieval,Probalistic Models,Searching — Patrick Durusau @ 5:10 pm

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

A bit dated now (2008) but the underlying principles of information retrieval remain the same.

I have a hard copy but the additional materials and ability to cut-n-paste will make this a welcome resource!

We’d be pleased to get feedback about how this book works out as a textbook, what is missing, or covered in too much detail, or what is simply wrong. Please send any feedback or comments to: informationretrieval (at) yahoogroups (dot) com

Online resources

Apart from small differences (mainly concerning copy editing and figures), the online editions should have the same content as the print edition.

The following materials are available online. The date of last update is given in parentheses.

Information retrieval resources

A list of information retrieval resources is also available.

Introduction to Information Retrieval: Table of Contents

Front matter (incl. table of notations) pdf

01 Boolean retrieval pdf html

02 The term vocabulary & postings lists pdf html

03 Dictionaries and tolerant retrieval pdf html

04 Index construction pdf html

05 Index compression pdf html

06 Scoring, term weighting & the vector space model pdf html

07 Computing scores in a complete search system pdf html

08 Evaluation in information retrieval pdf html

09 Relevance feedback & query expansion pdf html

10 XML retrieval pdf html

11 Probabilistic information retrieval pdf html

12 Language models for information retrieval pdf html

13 Text classification & Naive Bayes pdf html

14 Vector space classification pdf html

15 Support vector machines & machine learning on documents pdf html

16 Flat clustering pdf html Resources.

17 Hierarchical clustering pdf html

18 Matrix decompositions & latent semantic indexing pdf html

19 Web search basics pdf html

20 Web crawling and indexes pdf html

21 Link analysis pdf html

Bibliography & Index pdf

bibtex file bib

October 11, 2013


Filed under: Bitmap Indexes,FastBit,Indexing — Patrick Durusau @ 4:44 pm

FastBit: An Efficient Compressed Bitmap Index Technology

From the webpage:

FastBit is an open-source data processing library following the spirit of NoSQL movement. It offers a set of searching functions supported by compressed bitmap indexes. It treats user data in the column-oriented manner similar to well-known database management systems such as Sybase IQ, MonetDB, and Vertica. It is designed to accelerate user’s data selection tasks without imposing undue requirements. In particular, the user data is NOT required to be under the control of FastBit software, which allows the user to continue to use their existing data analysis tools.


The FastBit software is distributed under the Less GNU Public License (LGPL). The software is available at The most recent release is FastBit ibis1.3.7; it comes as a source tar ball named fastbit-ibis1.3.7.tar.gz. The latest development version is available from

Other items of interest:

FastBit related publications

The most recent entry in this list is 2011. A quick search of the ACM Digital Library (for fastBit) found seventeen (17) articles for 2012 – 2013.

FastBit Users Guide

From the users guide:

This package implements a number of different bitmap indexes compressed with Word-Aligned Hybrid code. These indexes differ in their bitmap encoding methods and binning options. The basic bitmap index compressed with WAH has been shown to answer one-dimensional queries in time that is proportional to the number of hits in theory. In a number of performance measurements, WAH compressed indexes were found to be much more efficient than other indexes [CIKM 2001] [SSDBM 2002] [DOLAP 2002]. One of the crucial step in achieving these efficiency is to be able to perform bitwise OR operations on a large compressed bitmaps efficiently without decompression [VLDB 2004]. Numerous other bitmap encodings and binning strategies are implemented in this software package, please refer to indexSpec.html for descriptions on how to access these indexes and refer to our publications for extensive studies on these methods. FastBit was primarily developed to test these techniques for improving compressed bitmap indexes. Even though, it has grown to include a small number other useful data analysis functions, its primary strength is still in having a diversity of efficient compressed bitmap indexes.
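
For readers who have not met bitmap indexes before, here is a minimal Python sketch of the basic idea: one bitmap per distinct column value, with range queries answered by OR-ing bitmaps together. It is uncompressed, so it shows nothing of WAH, and the class and method names are mine, not FastBit's API.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap per distinct value; bit i is set when row i has that value."""

    def __init__(self, values):
        self.num_rows = len(values)
        self.bitmaps = defaultdict(int)  # Python ints serve as bitmaps
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row

    def equals(self, value):
        """Bitmap of rows whose value equals `value`."""
        return self.bitmaps.get(value, 0)

    def in_range(self, low, high):
        """Bitmap of rows with low <= value <= high, built with bitwise OR."""
        result = 0
        for value, bitmap in self.bitmaps.items():
            if low <= value <= high:
                result |= bitmap
        return result

    def rows(self, bitmap):
        """Expand a bitmap into the list of row numbers it selects."""
        return [row for row in range(self.num_rows) if (bitmap >> row) & 1]
```

Conditions on different columns combine with a bitwise AND of the resulting bitmaps, which is why doing these operations directly on compressed bitmaps matters so much for performance.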

Just in case you want to follow up on the use of FastBit in RaptorDB.

September 30, 2013

Lucene now has an in-memory terms dictionary…

Filed under: Indexing,Lucene — Patrick Durusau @ 7:05 pm

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code by Mike McCandless.

From the post:

Last year, Han Jiang’s Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.
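
The ordinal idea can be illustrated without an FST at all. In the Python sketch below a sorted list stands in for the FST (a big simplification), and metadata lives in a separate array referenced by ordinal. All of the names are hypothetical and none of this is Lucene's API.

```python
import bisect


class OrdTermsDictionary:
    """Sorted terms plus a parallel metadata array indexed by ordinal."""

    def __init__(self, terms_with_metadata):
        # The sorted list is a stand-in for the FST in this sketch.
        self.terms = sorted(terms_with_metadata)
        # Metadata is stored outside the term structure, referenced by ord.
        self.metadata = [terms_with_metadata[term] for term in self.terms]

    def ord(self, term):
        """Return the ordinal of `term`, or -1 if it is not present."""
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return i
        return -1

    def seek_exact(self, ordinal):
        """Return the term stored at the given ordinal."""
        return self.terms[ordinal]

    def lookup(self, term):
        """Return the metadata for `term`, or None if it is not present."""
        ordinal = self.ord(term)
        return None if ordinal < 0 else self.metadata[ordinal]
```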

Lucene continues to improve, rapidly!

September 26, 2013

Email Indexing Using Cloudera Search [Stepping Beyond “Hello World”]

Filed under: Cloudera,Email,Indexing — Patrick Durusau @ 6:00 pm

Email Indexing Using Cloudera Search by Jeff Shmain

From the post:

Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?

Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.

A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.

In this post, you’ll learn how to set up Apache Flume for near-real-time indexing and MapReduce for batch indexing of email documents. Note that although this post focuses on email data, there is no reason why the same concepts could not be applied to instant messages, voice transcripts, or any other data (both structured and unstructured).

If you want a beyond “Hello World” introduction to: Flume, Solr, Cloudera Morphlines, HDFS, Hue’s Search application, and Cloudera Search, this is the post for you.

With the added advantage that you can apply the basic principles in this post as you expand your knowledge of the Hadoop ecosystem.

September 8, 2013

Postgres and Full Text Indexes

Filed under: Indexing,PostgreSQL,Solr — Patrick Durusau @ 4:06 pm

After reading Jeff Larson’s account of his text mining adventures in ProPublica’s Jeff Larson on the NSA Crypto Story, I encountered a triplet of posts from Gary Sieling on Postgres and full text indexes.

In order of appearance:

Fixing Issues Where Postgres Optimizer Ignores Full Text Indexes

GIN vs GiST For Faceted Search with Postgres Full Text Indexes

Querying Multiple Postgres Full-Text Indexes

If Postgres and full text indexing are project requirements, these are must read posts.

Gary does note in the middle post that Solr with default options (no tuning) outperforms Postgres.

Solr would have been the better option for Jeff Larson when compared to Postgres.

But the difference in that case is a contrast between structured data and “dumpster data.”

It appears that the hurly-burly race to enable “connecting the dots” post-9/11:

Structural barriers to performing joint intelligence work. National intelligence is still organized around the collection disciplines of the home agencies, not the joint mission. The importance of integrated, all-source analysis cannot be overstated. Without it, it is not possible to “connect the dots.” No one component holds all the relevant information.

Yep, #1 with a bullet problem.

Response? From the Manning and Snowden leaks, one can only guess that “dumpster data” is the preferred solution.

By “dumpster data” I mean that data from different sources, agencies, etc., are simply dumped into a large data store.

No wonder the NSA runs 600,000 queries a day or about 20 million queries a month. That is a lot of data dumpster diving.

Secrecy may be hiding that data from the public, but poor planning is hiding it from the NSA.

August 22, 2013

Indexing use cases and technical strategies [Hadoop]

Filed under: Hadoop,HDFS,Indexing — Patrick Durusau @ 6:02 pm

Indexing use cases and technical strategies

From the post:

In this post, let us look at 3 real life indexing use cases. While Hadoop is commonly used for distributed batch index building, it is desirable to optimize the index capability in near real time. We look at some practical real life implementations where the engineers have successfully worked out their technology stack combinations using different products.

Resources on:

  1. Near Real Time index at eBay
  2. Distributed indexing strategy at Trovit
  3. Incremental Processing by Google’s Percolator

Presentations and a paper for the weekend!

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

Filed under: Concept Detection,Indexing,Searching,Solr — Patrick Durusau @ 7:08 pm

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965”~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.
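
Sujit's code is in Scala and lives in Solr token filters, but the longest-match annotation and query rewrite can be sketched in a few lines of Python. The two concept IDs below come from the rewritten query in the quoted post. The synonym entry, the “breast cancer patients” phrase and its ID, and every function name are assumptions of mine for illustration.

```python
# Hypothetical slice of a taxonomy: phrase (as a tuple of words) -> concept ID.
TAXONOMY = {
    ("breast", "cancer"): "2790981",
    ("breast", "neoplasm"): "2790981",  # synonym assumed to share the ID
    ("radiotherapy",): "2791965",
    ("breast", "cancer", "patients"): "9999999",  # invented ID
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in TAXONOMY)


def annotate(text):
    """Replace the longest recognized phrases with concept IDs, left to right."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + length])
            if phrase in TAXONOMY:
                tokens.append(TAXONOMY[phrase])
                i += length
                break
        else:
            # No concept starts here; keep the plain word (mixed queries).
            tokens.append(words[i])
            i += 1
    return tokens


def rewrite_proximity_query(phrase, slop):
    """Rewrite a phrase into a proximity query over concept IDs and words."""
    return '"{}"~{}'.format(" ".join(annotate(phrase)), slop)
```

Because the longest phrase wins, “breast cancer patients” never produces the “breast cancer” ID, which is the precision gain described in the post. Unrecognized words pass through unchanged, giving the mixed queries mentioned there.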

So if Solr4 can make documents smarter, can the same be said about topics?

Recalling that “document” for Solr is defined by your indexing, not some arbitrary byte count.

As we are indexing topics we could add information to topics to make merging more robust.

One possible topic map flow being:

Index -> addToTopics -> Query -> Results -> Merge for Display.


July 30, 2013

To index is to translate

Filed under: Indexing,Ontology,Translation — Patrick Durusau @ 6:50 pm

To index is to translate by Fran Alexander.

From the post:

Living in Montreal means I am trying to improve my very limited French and in trying to communicate with my Francophone neighbours I have become aware of a process of attempting to simplify my thoughts and express them using the limited vocabulary and grammar that I have available. I only have a few nouns, fewer verbs, and a couple of conjunctions that I can use so far and so trying to talk to people is not so much a process of thinking in English and translating that into French, as considering the basic core concepts that I need to convey and finding the simplest ways of expressing relationships. So I will say something like “The sun shone. It was big. People were happy” because I can’t properly translate “We all loved the great weather today”.

This made me realise how similar this is to the process of breaking down content into key concepts for indexing. My limited vocabulary is much like the controlled vocabulary of an indexing system, forcing me to analyse and decompose my ideas into simple components and basic relationships. This means I am doing quite well at fact-based communication, but my storytelling has suffered as I have only one very simple emotional register to work with. The best I can offer is a rather laconic style with some simple metaphors: “It was like a horror movie.”

It is regularly noted that ontology work in the sciences has forged ahead of that in the humanities, and the parallel with my ability to express facts but not tell stories struck me. When I tell my simplified stories I rely on shared understanding of a broad cultural context that provides the emotional aspect – I can use the simple expression “horror movie” because the concept has rich emotional associations, connotations, and resonances for people. The concept itself is rather vague, broad, and open to interpretation, so the shared understanding is rather thin. The opposite is true of scientific concepts, which are honed into precision and a very constrained definitive shared understanding. So, I wonder how much of sense that I can express facts well is actually an illusion, and it is just that those factual concepts have few emotional resonances.

Is mapping a process of translation?

Are translations always less rich than the source?

Or are translations as rich but differently rich?

Lucene 4 Performance Tuning

Filed under: Indexing,Lucene,Performance,Searching — Patrick Durusau @ 6:47 pm

From the description:

Apache Lucene has undergone a major overhaul influencing many of the key characteristics dramatically. New features and modification allow for new as well as fundamentally different ways of tuning the engine for best performance.

Tuning performance is essential for almost every Lucene based application these days – Search & Performance are almost synonyms. Knowing the details of the underlying software provides the basic tools to get the best out of your application. Knowing the limitations can save you and your company a massive amount of time and money. This talk tries to explain design decisions made in Lucene 4 compared to older versions and provide technical details on how those implementations and design decisions can help to improve the performance of your application. The talk will mainly focus on core features like: Realtime & Batch Indexing, Filter and Query performance, Highlighting and Custom Scoring.

The talk will contain a lot of technical details that require a basic understanding of Lucene, datastructures and algorithms. You don’t need to be an expert to attend but be prepared for some deep dive into Lucene. Attendees don’t need to be direct Lucene users, the fundamentals provided in this talk are also essential for Apache Solr or elasticsearch users.

If you want to catch some of the highlights of Lucene 4, this is the presentation for you!

It will be hard to not go dig deeper in a number of areas.

The new codec features were particularly impressive!

July 15, 2013

Why Unique Indexes are Bad [Caveat on Fractal Tree(R) Indexes]

Filed under: Fractal Trees,Indexing,TokuDB — Patrick Durusau @ 2:12 pm

Why Unique Indexes are Bad by Zardosht Kasheff.

From the post:

Before creating a unique index in TokuMX or TokuDB, ask yourself, “does my application really depend on the database enforcing uniqueness of this key?” If the answer is ANYTHING other than yes, do not declare the index to be unique. Why? Because unique indexes may kill your write performance. In this post, I’ll explain why.

Unique indexes are a strange beast: they have no impact on standard databases that use B-Trees, such as MongoDB and MySQL, but may be horribly painful for databases that use write optimized data structures, like TokuMX’s Fractal Tree(R) indexes. How? They essentially drag the Fractal Tree index down to the B-Tree’s level of performance.

When a user declares a unique index, the user tells the database, “please help me and enforce uniqueness on this index.” So, before doing any insertion into a unique index, the database must first verify that the key being inserted does not already exist. If the possible location of the key is not in memory, which may happen if the working set does not fit in memory, then the database MUST perform an I/O to bring into memory the contents of the potential location (be it a leaf node in a tree, or an offset into a memory mapped file), in order to check whether the key exists in that location.
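
The read-before-write cost is easy to see in a toy model. In the Python sketch below an append-only list stands in for a write-optimized structure and a counter stands in for disk reads. The class and function names are hypothetical, and nothing here reflects how TokuMX or TokuDB is actually implemented.

```python
class DuplicateKeyError(Exception):
    """Raised when a unique insert finds the key already present."""


class SimulatedStore:
    """Append-only store that counts lookups as stand-ins for disk reads."""

    def __init__(self):
        self.entries = []  # (key, value) pairs in arrival order
        self.reads = 0

    def contains(self, key):
        self.reads += 1  # each existence check may cost an I/O
        return any(existing == key for existing, _ in self.entries)

    def append(self, key, value):
        self.entries.append((key, value))


def insert_non_unique(store, key, value):
    """Blind write: nothing has to be read before the insert."""
    store.append(key, value)


def insert_unique(store, key, value):
    """Unique insert: the store must look for the key before writing."""
    if store.contains(key):
        raise DuplicateKeyError(key)
    store.append(key, value)
```

Three non-unique inserts cost zero reads. Three unique inserts cost three, and if the working set does not fit in memory, each of those reads can be a trip to disk.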


Zardosht closes by recommending that, if your application does require unique indexes, you consider rewriting it so that it doesn’t.


Not a mark against Fractal Tree(R) indexes but certainly a consideration in deciding to adopt technology using them.

Would be nice if this type of information could be passed along as more than sysadmin lore.

Like a plugin for your browser that at your request highlights products or technologies of interest and on mouse-over displays known limitations or bugs.

The sort of things that vendors are loath to disclose.

July 6, 2013

Norch- a search engine for node.js

Filed under: Indexing,node-js,Search Engines — Patrick Durusau @ 4:30 pm

Norch- a search engine for node.js by Fergus McDowall.

From the post:

Norch is a search engine written for Node.js. Norch uses the Node search-index module which is in turn written using the super fast levelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server, that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited robust feature set, that can be used to set up a freetext search engine for most enterprise scenarios.

Currently Norch features

  • Full text search
  • Stopword removal
  • Faceting
  • Filtering
  • Relevance weighting (tf-idf)
  • Field weighting
  • Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here

See: for various details and instructions.

Interesting but I am curious what advantage Norch offers over Solr or Elasticsearch, for example?

July 3, 2013

Triggers for Apache HBase

Filed under: HBase,Indexing,Triggers — Patrick Durusau @ 9:00 am

Cloudera Search over Apache HBase: A Story of Collaboration by Steven Noels.

Great background story on NGDATA’s development of triggers and indexing updates for Apache HBase (built for their Lily product), work that now underlies Cloudera Search.

From the post:

In this most recent edition, we introduced an order of magnitude performance improvement: a cleaner, more efficient, and fault-tolerant code path with no write performance penalty on HBase. In the interest of modularity, we decoupled the trigger and indexing component from Lily, making it into a stand-alone, collaborative open source project that is now underpinning both Cloudera Search HBase support as well as Lily.

This made sense for us, not just because we believe in HBase and its community but because our customers in Banking, Media, Pharma and Telecom have unqualified expectations for both the scalability and resilience of Lily. Outsourcing some part of that responsibility towards the infrastructure tier is efficient for us. We are very pleased with the collaboration, innovation, and quality that Cloudera has produced by working with us and look forward to a continued relationship that combines joint development in a community oriented way with responsible stewardship of the infrastructure code base we build upon.

Our HBase Triggering and Indexing software can be found on GitHub at:

Do you have any indexing or update side-effect needs for HBase? Tell us your thoughts on this solution.

June 30, 2013

Solr Authors, A Suggestion

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 3:01 pm

I am working my way through a recent Solr publication. It reproduces some, but not all of the output of queries.

But it remains true that the output of queries is a sizeable portion of the text.

Suggestion: Could the queries be embedded in PDF text as hyperlinks?

Thus: http://localhost:8983/solr/select?q=*:*&indent=yes.

If I have Solr running, etc., the full results show up in my browser and save page space. Perhaps resulting in room for more analysis or examples.
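An author generating such links from a list of queries might use something like the sketch below. The helper name is hypothetical, and the base URL is simply the one from the example above.

```python
from urllib.parse import urlencode

def solr_select_url(query, base="http://localhost:8983/solr/select", **params):
    """Build a Solr select URL for use as a hyperlink target."""
    all_params = {"q": query, **params}
    # Keep '*' and ':' unescaped so the link reads like the query in the text.
    return base + "?" + urlencode(all_params, safe="*:")

print(solr_select_url("*:*", indent="yes"))
```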

There may be a very good reason to not follow my suggestion so it truly is a suggestion.

If there is a question of verifying the user’s results, perhaps a separate PDF of results keyed to the text?

That could be fuller results and at the same time allow the text to focus on substantive material.

Elasticsearch and Joining

Filed under: ElasticSearch,Indexing,Joins — Patrick Durusau @ 1:51 pm

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, never was originally intended or optimized for. One of the many requests that come up pretty quickly is the whish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.

A variety of options are explored, including some new features of Elasticsearch.
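To make the example question concrete, here is the simplest possible answer: an application-side join in plain Python. It is a sketch with made-up data and a made-up function name, and it uses no Elasticsearch API; it only shows the two steps any such join has to perform.

```python
documents = [
    {"id": 1, "text": "joining data in elasticsearch"},
    {"id": 2, "text": "lucene query basics"},
    {"id": 3, "text": "elasticsearch parent and child types"},
]
comments = [
    {"doc_id": 1, "author": "alice", "text": "nice overview"},
    {"doc_id": 2, "author": "alice", "text": "thanks"},
    {"doc_id": 3, "author": "bob", "text": "what about nested?"},
]

def docs_matching_with_comment_by(term, author, documents, comments):
    """Ids of documents that contain `term` and carry a comment by `author`."""
    # Step 1: collect the ids of documents the author has commented on.
    commented = {c["doc_id"] for c in comments if c["author"] == author}
    # Step 2: keep documents that match the term and appear in that set.
    return [d["id"] for d in documents
            if term in d["text"].split() and d["id"] in commented]
```

With two separate lookups like this, the set of ids from step 1 has to be carried into step 2, which is exactly what becomes expensive when an author has commented on very many documents.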

Would you model documents with comments as an association?

Would you query on roles when searching for such a comment by a specific author on such a document?

June 29, 2013

Indexing data in Solr…

Filed under: Apache Camel,Indexing,Solr — Patrick Durusau @ 12:44 pm

Indexing data in Solr from disparate sources using Camel by Bilgin Ibryam.

From the post:

Apache Solr is ‘the popular, blazing fast open source enterprise search platform’ built on top of Lucene. In order to do a search (and find results) there is the initial requirement of data ingestion usually from disparate sources like content management systems, relational databases, legacy systems, you name it… Then there is also the challenge of keeping the index up to date by adding new data, updating existing records, removing obsolete data. The new sources of data could be the same as the initial ones, but could also be sources like twitter, AWS or rest endpoints.

Solr can understand different file formats and provides fair amount of options for data indexing:

  1. Direct HTTP and remote streaming – allows you to interact with Solr over HTTP by posting a file for direct indexing or the path to the file for remote streaming.
  2. DataImportHandler – is a module that enables both full and incremental delta imports from relational databases or file system.
  3. SolrJ – a java client to access Solr using Apache Commons HTTP Client.

But in real life, indexing data from different sources with millions of documents, dozens of transformations, filtering, content enriching, replication, parallel processing requires much more than that. One way to cope with such a challenge is by reinventing the wheel: write few custom applications, combine them with some scripts or run cronjobs. Another approach would be to use a tool that is flexible and designed to be configurable and plugable, that can help you to scale and distribute the load with ease. Such a tool is Apache Camel which has also a Solr connector now.


Avoid reinventing the wheel: check mark

Robust software: check mark

Name recognition of Lucene/Solr: check mark

Name recognition of Camel: check mark

Do you see any negatives?

BTW, the examples that round out Bilgin’s post are quite useful!

June 26, 2013

Scaling Through Partitioning and Shard Splitting in Solr 4 (Webinar)

Filed under: Indexing,Search Engines,Solr — Patrick Durusau @ 3:28 pm

Scaling Through Partitioning and Shard Splitting in Solr 4 by Timothy Potter.

Date: Thursday, July 18, 2013
Time: 10:00am Pacific Time

From the post:

Over the past several months, Solr has reached a critical milestone of being able to elastically scale-out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we’ve scaled our largest Solr 4 index to nearly 900M documents and growing. As our index grows, so does our need to manage this growth.

In practice, it’s common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you’ll learn about new features in Solr to help manage large-scale clusters. Specifically, we’ll cover data partitioning and shard splitting.

Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We’ll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.

Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.
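The two ideas in the announcement, routing by hash and splitting a shard in two, can be sketched in a few lines. This is an illustration only: CRC32 stands in for whatever hash Solr uses, and the shard names, ranges, and function names are assumptions, not Solr's API.

```python
import zlib

HASH_SPACE = 2 ** 32  # CRC32 yields values in [0, 2**32)

def route(routing_key, shards):
    """Return the shard whose [lo, hi) hash range covers the routing key."""
    h = zlib.crc32(routing_key.encode("utf-8"))
    for name, (lo, hi) in shards.items():
        if lo <= h < hi:
            return name
    raise LookupError("no shard covers hash %d" % h)

def split_shard(shards, name):
    """Replace one shard with two that each own half of its hash range."""
    lo, hi = shards[name]
    mid = (lo + hi) // 2
    new = {k: v for k, v in shards.items() if k != name}
    new[name + "_0"] = (lo, mid)
    new[name + "_1"] = (mid, hi)
    return new

shards = {"shard1": (0, HASH_SPACE // 2), "shard2": (HASH_SPACE // 2, HASH_SPACE)}
```

Splitting only touches the documents of the shard being split; everything routed to the other shards stays where it was.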

Just in time for when you finish your current Solr reading! 😉

Definitely on the calendar!

June 24, 2013

Hybrid Indexes for Repetitive Datasets

Filed under: Data Structures,Indexing — Patrick Durusau @ 2:09 pm

Hybrid Indexes for Repetitive Datasets by H. Ferrada, T. Gagie, T. Hirvola, and S. J. Puglisi.


Advances in DNA sequencing mean databases of thousands of human genomes will soon be commonplace. In this paper we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we preprocess the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show this also significantly reduces query times.
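The abstract leans on the LZ77 parse, so a small sketch of that step alone may help. This is a naive greedy factorization written for clarity, not speed; it is not the authors' implementation and does not build the filtered text or the index.

```python
def lz77_parse(text):
    """Greedy LZ77 factorization: each phrase is a literal or a copy of earlier text."""
    phrases = []
    i = 0
    while i < len(text):
        best_len, best_src = 0, -1
        for j in range(i):
            k = 0
            # Overlapping copies are allowed: the source may run past position i.
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append(("literal", text[i]))
            i += 1
        else:
            phrases.append(("copy", best_src, best_len))
            i += best_len
    return phrases

def lz77_decode(phrases):
    """Rebuild the text from the phrases, copying one character at a time."""
    out = []
    for phrase in phrases:
        if phrase[0] == "literal":
            out.append(phrase[1])
        else:
            _, src, length = phrase
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

On highly repetitive input the parse is tiny compared to the text: "ACGT" repeated fifty times needs only four literals and a single copy phrase.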

Need another repetitive data set?

Have you thought about topic maps?

If there is to be any merging in a topic map there are multiple topics that represent the same subjects.

This technique may be overkill for topic maps with little merging, but for the endless repetition you find in linked data versions of Wikipedia data, it would be quite useful.

That might knock down the “Some-Smallish-Number” of triples count, and so would be disfavored.

On the other hand, there are other data sets with massive replication (think phone records) where fast querying could be an advantage.

June 21, 2013

TokuMX: High Performance for MongoDB

Filed under: Fractal Trees,Indexing,MongoDB,Tokutek — Patrick Durusau @ 6:20 pm

TokuMX: High Performance for MongoDB

From the webpage:

TokuMX™ for MongoDB is here!

Tokutek, whose Fractal Tree® indexing technology has brought dramatic performance and scalability to MySQL and MariaDB users, now brings those same benefits to MongoDB users.

TokuMX is open source performance-enhancing software for MongoDB that make MongoDB more performant in large application with demanding requirements. In addition to replacing B-tree indexing with more modern technology, TokuMX adds transaction support, document-level locking for concurrent writes, and replication.

You have seen the performance specs on Fractal Tree indexing.

Now they are available for MongoDB!

Configure Solr on Ubuntu, the quickest way

Filed under: Indexing,Solr,Topic Maps — Patrick Durusau @ 5:51 pm

Configure Solr on Ubuntu, the quickest way

From the webpage:

Note: I used the wiki page Ubuntu-10.04-lts-server as basis of this tutorial.
More infos on the general installation at :

One of the most efficient way to deploy a Solr server is to encapsulate it in a Java servlet, the Apache Foundation (the provider of Solr) brought to us Tomcat, a powerfull http server written in Java.

I thought you might find this useful.

With the various advances in indexing, I am beginning to wonder in what way a topic map “backend” differs from an index.

And if it doesn’t (or by much), what can indexing structures teach us about faster topic maps?
