Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 21, 2014

Apache Lucene/Solr 4.8.1 (Bug Fixes)

Filed under: Lucene,Solr — Patrick Durusau @ 1:23 pm

From the Lucene News:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.8.1 and Apache Solr 4.8.1.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Both releases contain a number of bug fixes.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

It’s upgrade time again!

May 17, 2014

Building a Recipe Search Site…

Filed under: ElasticSearch,Lucene,Search Engines,Solr — Patrick Durusau @ 4:32 pm

Building a Recipe Search Site with Angular and Elasticsearch by Adam Bard.

From the post:

Have you ever wanted to build a search feature into an application? In the old days, you might have found yourself wrangling with Solr, or building your own search service on top of Lucene — if you were lucky. But, since 2010, there’s been an easier way: Elasticsearch.

Elasticsearch is an open-source storage engine built on Lucene. It’s more than a search engine; it’s a true document store, albeit one emphasizing search performance over consistency or durability. This means that, for many applications, you can use Elasticsearch as your entire backend. Applications such as…

Think of this as a snapshot of the capabilities of most search solutions.

Which makes this a great baseline for answering the question: What does your app do that Elasticsearch + Angular cannot?

That’s a serious question.

Responses that don’t count include:

  1. My app is written in the Linear B programming language.
  2. My app uses a Post-Pre-NOSQL DB engine.
  3. My app will bring freedom and health to the WWW.
  4. (insert your reason)

You can say all those things if you like, but the convincing point for users is going to be exceeding their expectations about current solutions.

Do the best you can with Elasticsearch and Angular and use that as your basepoint for comparison.

May 13, 2014

Choosing a fast unique identifier (UUID) for Lucene

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:44 am

Choosing a fast unique identifier (UUID) for Lucene by Michael McCandless.

From the post:

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java’s UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.
….

Excellent tips for creating identifiers for Lucene! Complete with tests and an explanation for the possible choices.
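
To make the comparison concrete, here is a minimal sketch of the two id strategies the post benchmarks. The class, counter and field names are mine for illustration; only UUID.randomUUID() and the StringField id pattern are standard Lucene/Java usage.

    import java.util.UUID;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;

    public class IdChoice {
        private static long counter = 0; // hypothetical app-level sequence

        static Document withId(String id) {
            Document doc = new Document();
            // Index the id as a single token so the document can later be
            // retrieved, replaced or deleted by term.
            doc.add(new StringField("id", id, Field.Store.YES));
            return doc;
        }

        public static void main(String[] args) {
            // Version 4 UUIDs are random, so consecutive ids share no common
            // prefix: the slow case in the post's benchmarks.
            Document slow = withId(UUID.randomUUID().toString());

            // Sequential (or timestamp-prefixed, Flake-style) ids arrive in
            // nearly sorted order, which the post found considerably faster.
            Document fast = withId(String.format("%016x", counter++));
        }
    }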

Enjoy!

May 10, 2014

Parameterizing Queries in Solr and Elasticsearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:19 pm

Parameterizing Queries in Solr and Elasticsearch by Rafał Kuć.

From the post:

We all know how good it is to have abstraction layers in the software we create. We tend to abstract implementation from the method contracts using interfaces, we use n-tier architectures so that we can abstract and divide different system layers from each other. This is very good – when we change one piece, we don’t need to touch the other parts that only know about method contracts, APIs, etc. Why not do the same with search queries? Can we even do that in Elasticsearch and Solr? We can and I’ll show you how to do that.

The problem

Imagine that we have a query, a complicated one, with boosts, sorts, facets and so on. However, in most cases the query is pretty static when it comes to its structure and the only things that change are one of the filters in the query (actually a filter value) and the query entered by the user. I guess such a situation rings a bell for anyone who has developed a search application. Of course we can include the whole query in the application itself and reuse it. But in that case, a change to boosts, for example, requires us to redeploy the application or a configuration file. And if more than a single application uses the same query, then we need to change them all.

What if we could make the change on the search server side only and let the application pass only the necessary data? That would be nice, but it requires us to do some work on the search server side.

For the purpose of the blog post, let’s assume that we want to have a query that:

  • searches for documents with terms entered by the user,
  • limits the searches to a given category,
  • displays facet results for the price ranges

This is a simple example, so that the queries are easy to understand. So, in a perfect world we would only need to provide the user query and a category identifier to the search engine.

It is encouraging to see someone give solutions to the same search problem from Solr and Elasticsearch perspectives.

Not to mention that I think you will find this very useful.
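
For a concrete picture of the client side once the query template lives on the search server, here is a minimal SolrJ sketch. The handler name and the cat.id parameter are hypothetical stand-ins for whatever you configure in solrconfig.xml; Rafał's post shows the actual Solr and Elasticsearch mechanisms.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ParameterizedSearch {
        public static void main(String[] args) throws Exception {
            // "/byCategory" is a hypothetical request handler whose defaults
            // (boosts, price range facets, the category filter template)
            // live in solrconfig.xml on the search server.
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");
            SolrQuery query = new SolrQuery("user entered terms");
            query.setRequestHandler("/byCategory");
            query.set("cat.id", "1234"); // the only other value that varies
            QueryResponse response = solr.query(query);
            System.out.println(response.getResults().getNumFound() + " hits");
        }
    }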

April 28, 2014

Apache Lucene/Solr 4.8.0 Available!

Filed under: Lucene,Solr — Patrick Durusau @ 7:55 pm

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.8.0 and Apache Solr 4.8.0.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Both releases now require Java 7 or greater (recommended is Oracle Java 7 or OpenJDK 7, minimum update 55; earlier versions have known JVM bugs affecting Lucene and Solr). In addition, both are fully compatible with Java 8.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • All index files now store end-to-end checksums, which are now validated during merging and reading. This ensures that corruptions caused by any bit-flipping hardware problems or bugs in the JVM can be detected earlier. For full detection be sure to enable all checksums during merging (it’s disabled by default).
  • Lucene has a new Rescorer/QueryRescorer API to perform second-pass rescoring or reranking of search results using more expensive scoring functions after first-pass hit collection (see the sketch after this list).
  • AnalyzingInfixSuggester now supports near-real-time autosuggest.
  • Simplified impact-sorted postings (using SortingMergePolicy and EarlyTerminatingCollector) to use Lucene’s Sort class to express the sort order.
  • Bulk scoring and normal iterator-based scoring were separated, so some queries can do bulk scoring more effectively.
  • Switched to MurmurHash3 to hash terms during indexing.
  • IndexWriter now supports updating of binary doc value fields.
  • HunspellStemFilter now uses 10 to 100x less RAM. It also loads all known OpenOffice dictionaries without error.
  • Lucene now also fsyncs the directory metadata on commits, if the operating system and file system allow it (Linux, MacOSX are known to work).
  • Lucene now uses Java 7 file system functions under the hood, so index files can be deleted on Windows, even when readers are still open.
  • A serious bug in NativeFSLockFactory was fixed, which could allow multiple IndexWriters to acquire the same lock. The lock file is no longer deleted from the index directory even when the lock is not held.
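
A minimal sketch of the new Rescorer API flagged in the list above, assuming you already have an IndexSearcher and two queries; the weight and hit counts are illustrative:

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryRescorer;
    import org.apache.lucene.search.TopDocs;

    public class RescoreSketch {
        // Cheap query over the whole index, expensive query over only the
        // top 500 hits; the rescore query's score is weighted 2x and the
        // best 100 of the reranked hits are returned.
        public static TopDocs search(IndexSearcher searcher, Query cheap, Query expensive)
                throws IOException {
            TopDocs firstPass = searcher.search(cheap, 500);
            return QueryRescorer.rescore(searcher, firstPass, expensive, 2.0, 100);
        }
    }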

Highlights of the Solr release include:

  • <fields> and <types> tags have been deprecated from schema.xml. There is no longer any reason to keep them in the schema file; they may be safely removed. This allows intermixing of <fieldType>, <field> and <copyField> definitions if desired.
  • The new {!complexphrase} query parser supports wildcards, ORs etc. inside Phrase Queries.
  • New Collections API CLUSTERSTATUS action reports the status of collections, shards, and replicas, and also lists collection aliases and cluster properties.
  • Added managed synonym and stopword filter factories, which enable synonym and stopword lists to be dynamically managed via REST API.
  • JSON updates now support nested child documents, enabling {!child} and {!parent} block join queries.
  • Added ExpandComponent to expand results collapsed by the CollapsingQParserPlugin, as well as the parent/child relationship of nested child documents.
  • Long-running Collections API tasks can now be executed asynchronously; the new REQUESTSTATUS action provides status.
  • Added a hl.qparser parameter to allow you to define a query parser for hl.q highlight queries.
  • In Solr single-node mode, cores can now be created using named configsets.
  • New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the “TTL” expression, as well as automatically deleting expired documents on a periodic basis.

All exciting additions, except that today I finished configuring Tomcat7/Solr/Nutch, where Solr = 4.7.2.

Sigh, well, I suppose that was just a trial run. 😉

April 26, 2014

Solr 4.8 Features

Filed under: Lucene,Solr — Patrick Durusau @ 6:57 pm

Solr 4.8 Features by Yonik Seeley.

Yonik reviews the coming new features for Solr 4.8:

  • Complex Phrase Queries
  • Indexing Child Documents in JSON
  • Expand Component
  • Named Config Sets
  • Stopwords and Synonyms REST API

Do you think traditional publishing models work well for open source projects that evolve as rapidly as Solr?

I first saw this in a tweet by Martin Grotzke.

April 19, 2014

Apache Lucene/Solr 4.7.2

Filed under: Lucene,Solr — Patrick Durusau @ 6:23 pm

Apache Lucene 4.7.2

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Fixes potential index corruption, LUCENE-5574.

Apache Solr 4.7.2

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

In view of possible index corruption, I would not take this as an optional upgrade.

April 12, 2014

Testing Lucene’s index durability after crash or power loss

Filed under: Indexing,Lucene — Patrick Durusau @ 8:08 pm

Testing Lucene’s index durability after crash or power loss by Mike McCandless.

From the post:

One of Lucene’s useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.
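
In miniature, the contract Mike is testing looks like this (my sketch, not his test code):

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class DurableIndexing {
        // Once commit() returns, these documents survive a JVM kill or
        // power loss; anything indexed after the last successful commit
        // may be lost, but the index itself stays intact and readable.
        static void indexBatch(IndexWriter writer, Iterable<Document> docs)
                throws IOException {
            for (Document doc : docs) {
                writer.addDocument(doc);
            }
            writer.commit(); // the durability point
        }
    }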

If anyone at your startup is writing an indexing engine, be sure to pass this post from Mike along.

Ask them for a demonstration of equal durability of the index before using their work instead of Lucene.

You have enough work to do without replicating (poorly) work that already has enterprise level reliability.

April 2, 2014

Apache Lucene/Solr 4.7.1

Filed under: Lucene,Solr — Patrick Durusau @ 3:16 pm

Apache Lucene 4.7.1

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Apache Solr 4.7.1

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

Fixes include a bad memory leak (SOLR-5875), so upgrading is advised.

March 9, 2014

Lucene 4 Essentials for Text Search and Indexing

Filed under: Indexing,Java,Lucene,Searching — Patrick Durusau @ 5:06 pm

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short! 😉

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.

You?

PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.

March 7, 2014

Using Lucene’s search server to search Jira issues

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:02 pm

Using Lucene’s search server to search Jira issues by Michael McCandless.

From the post:

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene’s capabilities, I point them to this application so they can see for themselves.

Recently, I’ve made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I’ve been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene’s current modules in a server context with minimal “thin server” additional source code.

Separately, to test this new Lucene based server, and to complete the “dog food,” I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira’s REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

Of particular interest to me because OASIS is about to start using JIRA 6.2 (the version in use at Apache).

I haven’t looked closely at the documentation for JIRA 6.2.

Thoughts on where it has specific weaknesses that are addressed by Michael’s solution?

February 26, 2014

Lucene and Solr 4.7

Filed under: Lucene,Solr — Patrick Durusau @ 2:53 pm

Lucene and Solr 4.7

From the post:

Today the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server – version 4.7. This is another release from the 4.x branch, bringing new functionality and bugfixes.

Apache Lucene 4.7 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.7 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Release note for Apache Lucene 4.7 can be found at: http://wiki.apache.org/lucene-java/ReleaseNote47, Solr release notes can be found at: http://wiki.apache.org/solr/ReleaseNote47.

Time to upgrade, again!

February 24, 2014

Index and Search Multilingual Documents in Hadoop

Filed under: Hadoop,Lucene,Solr — Patrick Durusau @ 4:27 pm

Index and Search Multilingual Documents in Hadoop by Justin Kestelyn.

From the post:

Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.

Cloudera Search brings full-text, interactive search, and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS and Apache HBase, and other projects in CDH. Because it’s integrated with CDH, Cloudera Search brings the same fault tolerance, scale, visibility, and flexibility of your other Hadoop workloads to search, and allows for a number of indexing, access control, and manageability options.

In this post, you’ll learn how to use Cloudera Search and RBL-JE to index and search documents. Since Cloudera takes care of the plumbing for distributed search and indexing, the only work needed to incorporate Basis Technology’s linguistics is loading the software and configuring your Solr collections.

You may have guessed by the way the introduction is worded that Rosette Base Linguistics isn’t free. I checked at the website but found no pricing information. Not to mention that the coverage looks spotty:

  • Arabic
  • Chinese (simplified)
  • Chinese (traditional)
  • English
  • Japanese
  • Korean
If your multilingual needs fall in one or more of those languages, this may work for you.

On the other hand, for indexing and searching multilingual text, you should compare Solr, which has factories for the following languages:

  • Arabic
  • Brazilian Portuguese
  • Bulgarian
  • Catalan
  • Chinese
  • Simplified Chinese
  • CJK
  • Czech
  • Danish
  • Dutch
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hebrew, Lao, Myanmar, Khmer
  • Hindi
  • Indonesian
  • Italian
  • Irish
  • Kuromoji (Japanese)
  • Latvian
  • Norwegian
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Thai
  • Turkish

Source: Solr Wiki.

February 19, 2014

Why Not AND, OR, And NOT?

Filed under: Boolean Operators,Lucene,Searching,Solr — Patrick Durusau @ 3:20 pm

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist without it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator’” they aren’t going to be very happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile’”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times).

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

As if you needed more proof that knowing “how” a search system is constructed is as important as knowing the surface syntax.

A great post that gives examples to illustrate each of the issues.

In case you are wondering about the December 28, 2011 date on the post, the advice still holds: BooleanClause.Occur is unchanged as of Lucene 4.6.1.
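
If you want to see the mapping in code, here is a minimal sketch (mine, not from the post) of how the prefix operators line up with BooleanClause.Occur:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class PrefixOperators {
        public static BooleanQuery example() {
            // "alligator crocodile -swamp": a bare term is Occur.SHOULD
            // (optional but scored), "+" maps to Occur.MUST and "-" maps
            // to Occur.MUST_NOT. Documents matching both optional terms
            // rank higher, which plain boolean set logic cannot express.
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("body", "alligator")), Occur.SHOULD);
            query.add(new TermQuery(new Term("body", "crocodile")), Occur.SHOULD);
            query.add(new TermQuery(new Term("body", "swamp")), Occur.MUST_NOT);
            return query;
        }
    }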

January 29, 2014

Apache Lucene 4.6.1 and Apache Solr™ 4.6.1

Filed under: Lucene,Solr — Patrick Durusau @ 3:49 pm

Download: Apache Lucene 4.6.1

Lucene:

Lucene CHANGES.txt

New features include: FreeTextSuggester (Lucene-5214), New Document Dictionary (Lucene-5221), and twenty-one (21) others!

Solr:

Download: Apache Solr™ 4.6.1

Solr CHANGES.txt

New features include: support for AnalyzingInfixSuggester (Solr-5167), new field type EnumField (Solr-5084), and fifteen (15) others!

Not that I will know anytime soon but I am curious how well the AnalyzingInfixSuggester would work with Akkadian.

January 25, 2014

Use Cases for Taming Text, 2nd ed.

Filed under: Lucene,Mahout,MALLET,OpenNLP,Solr,Stanford NLP — Patrick Durusau @ 5:31 pm

Use Cases for Taming Text, 2nd ed. by Grant Ingersoll.

From the post:

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in production and are willing to write about it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine learning using Mahout, OpenNLP or MALLET — ideally you are using combinations of 2 or more of these to solve your problems. We are especially interested in large scale use cases in eCommerce, Advertising, social media analytics, fraud, etc.

The writing process is fairly straightforward. A section roughly equates to somewhere between 3 – 10 pages, including diagrams/pictures. After writing, there will be some feedback from editors and us, but otherwise the process is fairly simple.

In order to participate, you must have permission from your company to write on the topic. You would not need to divulge any proprietary information, but we would want enough information for our readers to gain a high-level understanding of your use case. In exchange for your participation, you will have your name and company published on that section of the book as well as in the acknowledgments section. If you have a copy of Lucene in Action or Mahout In Action, it would be similar to the use case sections in those books.

Cool!

I am guessing the second edition isn’t going to take as long as the first. 😉

Couldn’t be in better company as far as co-authors.

See the post for the contact details.

Searching in Solr, Analyzing Results and CJK

Filed under: CJK,Lucene,Solr — Patrick Durusau @ 5:08 pm

Searching in Solr, Analyzing Results and CJK

From the post:

In my recently completed twelve post series on Chinese, Japanese and Korean (CJK) with Solr for Libraries, my primary objective was to make information available to others in an expeditious manner. However, the organization of the topics is far from optimal for readers, and the series is too long for easy skimming for topics of interest. Therefore, I am providing this post as a sort of table of contents into the previous series.
Introduction

In Fall 2013, we rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library “catalog” built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in multiple languages, period.

If you are interested in improving searching, or in improving your methodology when working on searching, these posts provide a great deal of information. Analysis of Solr result relevancy figured heavily in this work, as did testing: relevancy/acceptance/regression testing against a live Solr index, unit testing, and integration testing. In addition, there was testing by humans, which was well managed and produced searches that were turned into automated tests. Many of the blog entries contain useful approaches for debugging Solr relevancy and for test driven development (TDD) of new search behavior.
….

Excellent!

I am sure many of the issues addressed here will be relevant should anyone decide to create a Solr index to the Assyrian Dictionary of the Oriental Institute of the University of Chicago (CAD).

Quite serious. At least I would be interested at any rate.

January 23, 2014

Foundation…

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 8:29 pm

Foundation: Learn and Play with Elasticsearch

I have posted about several of the articles here but missed posting about the homepage for this site.

Take a close look at Play. It offers you the opportunity to alter documents and search settings, online experimentation I would call it, with ElasticSearch.

The idea of simple, interactive play with search software is a good one.

I wonder how that would translate into an interface for the same thing for topic maps?

The immediacy of feedback along with a non-complex interface would be selling points to me.

You will also find some twenty-five articles (as of today) ranging from beginner to more advanced topics on ElasticSearch.

Finding long tail suggestions…

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:26 pm

Finding long tail suggestions using Lucene’s new FreeTextSuggester by Mike McCandless.

From the post:

Lucene’s suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user’s query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one in action at the Lucene/Solr Jira search application (e.g., try “python”) that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and you have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix’s search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.
….

Since you are unlikely to be Google, Mike goes on to show how FreeTextSuggester can ride to your rescue!

As always, Mike’s post is a pleasure to read.
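
For a feel of the API, a minimal sketch of building and querying the suggester; the training file name is hypothetical (one phrase per line) and the version constant assumes Lucene 4.6:

    import java.io.FileReader;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.suggest.Lookup.LookupResult;
    import org.apache.lucene.search.suggest.analyzing.FreeTextSuggester;
    import org.apache.lucene.util.Version;

    public class FreeTextDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            FreeTextSuggester suggester = new FreeTextSuggester(analyzer);
            // queries.txt: one past query or phrase per line.
            suggester.build(new PlainTextDictionary(new FileReader("queries.txt")));
            // Unlike the closed-set suggesters, this predicts likely next
            // words from n-gram statistics of the training text.
            List<LookupResult> results = suggester.lookup("free te", false, 5);
            for (LookupResult r : results) {
                System.out.println(r.key + " (" + r.value + ")");
            }
        }
    }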

January 22, 2014

Build Your Own Custom Lucene Query And Scorer

Filed under: Lucene,Search Engines — Patrick Durusau @ 8:03 pm

Build Your Own Custom Lucene Query And Scorer by Doug Turnbull.

From the post:

Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other.

Well for those extremely specialized cases we can prescribe a little out-patient surgery to your Solr install – building your own Lucene Query.

This Is The Nuclear Option

Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths:

Not for the faint of heart!

On the other hand, Doug’s list of options to try before writing a custom Lucene query and scorer makes a great checklist of tweaking options.

You could stop there and learn a great deal. Or you can opt to continue for what Doug calls “the educational experience.”
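
For contrast, one of those pre-nuclear options in sketch form: a CustomScoreQuery keeps the standard query classes and only reshapes the score. Field names here are illustrative, not from Doug's post:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.queries.function.FunctionQuery;
    import org.apache.lucene.queries.function.valuesource.FloatFieldSource;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ScoreReshaping {
        public static Query popularityBoosted() {
            // Multiply the text score by a stored "popularity" float field,
            // no custom Query or Scorer subclassing required.
            Query base = new TermQuery(new Term("body", "alligator"));
            FunctionQuery popularity = new FunctionQuery(new FloatFieldSource("popularity"));
            return new CustomScoreQuery(base, popularity);
        }
    }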

January 21, 2014

Geospatial (distance) faceting…

Filed under: Facets,Geographic Data,Georeferencing,Lucene — Patrick Durusau @ 7:32 pm

Geospatial (distance) faceting using Lucene’s dynamic range facets by Mike McCandless.

From the post:

There have been several recent, quiet improvements to Lucene that, taken together, have made it surprisingly simple to add geospatial distance faceting to any Lucene search application, for example:

  < 1 km (147)
  < 2 km (579)
  < 5 km (2775)

Such distance facets, which allow the user to quickly filter their search results to those that are close to their location, have become especially important lately since most searches are now from mobile smartphones.

In the past, this has been challenging to implement because it’s so dynamic and so costly: the facet counts depend on each user’s location, and so cannot be cached and shared across users, and the underlying math for spatial distance is complex.

But several recent Lucene improvements now make this surprisingly simple!

As always, Mike is right on the edge, so wait for Lucene 4.7 to try his code out or download the current source.

Distance might not be the only consideration. What if you wanted the shortest distance that did not intercept a known patrol? Or a known patrol within some window of variation?

Distance is still going to be a factor, but the search required may be more complex than distance alone.
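
For a sense of the counting side, here is a sketch against the facet module's dynamic ranges. Mike's post derives the value from a haversine expression computed at search time; a plain indexed distance_km field stands in for that here, and the API details assume Lucene 4.7/4.8:

    import java.io.IOException;

    import org.apache.lucene.facet.FacetResult;
    import org.apache.lucene.facet.Facets;
    import org.apache.lucene.facet.FacetsCollector;
    import org.apache.lucene.facet.range.DoubleRange;
    import org.apache.lucene.facet.range.DoubleRangeFacetCounts;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;

    public class DistanceFacets {
        public static FacetResult count(IndexSearcher searcher) throws IOException {
            FacetsCollector fc = new FacetsCollector();
            searcher.search(new MatchAllDocsQuery(), fc);
            // The ranges are built per request from the user's location,
            // so the counts cannot be cached or shared across users.
            Facets facets = new DoubleRangeFacetCounts("distance_km", fc,
                new DoubleRange("< 1 km", 0.0, true, 1.0, false),
                new DoubleRange("< 2 km", 0.0, true, 2.0, false),
                new DoubleRange("< 5 km", 0.0, true, 5.0, false));
            return facets.getTopChildren(10, "distance_km");
        }
    }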

January 13, 2014

Multi level composite-id routing in SolrCloud

Filed under: Lucene,SolrCloud — Patrick Durusau @ 7:57 pm

Multi level composite-id routing in SolrCloud by Anshum Gupta.

From the post:

SolrCloud over the last year has evolved into a rather intelligent system with a lot of interesting and useful features going in. One of them has been the work for intelligent routing of documents and queries.

SolrCloud started off with a basic hash based routing in 4.0. It then got interesting with the composite id router being introduced with 4.1 which enabled smarter routing of documents and queries to achieve things like multi-tenancy and co-location. With 4.7, the 2-level composite id routing will be expanded to work for 3-levels (SOLR-5320).

A good post about how document routing generally works can be found here. Now, let’s look at how the composite-id routing extends to 3-levels and how we can really use it to query specific documents in our corpus.

An important thing to note here is that the 3-level router only extends the 2-level one. It’s the same router and the same Java class, i.e. you don’t really need to ‘set it up’.

Where would you want to use the multi-level composite-id router?

The multi-level implementation further extends the support for multi tenancy and co-location of documents provided by the already existing composite-id router. Consider a scenario where a single setup is used to host data for multiple applications (or departments) and each of them have a set of users. Each user further has documents associated with them. Using a 3-level composite-id router, a user can route the documents to the right shards at index time without having to really worry about the actual routing. This would also enable users to target queries for specific users or applications using the shard.keys parameter at query time.
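
In practice the routing rides entirely on the document id. A small sketch (the application, user and document values are made up):

    import org.apache.solr.common.SolrInputDocument;

    public class CompositeIdSketch {
        public static SolrInputDocument routedDoc() {
            // 3-level composite id: app!user!doc. The router hashes each
            // level, so all documents for application "crm" and user "u42"
            // co-locate, with no separate routing field to maintain.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "crm!u42!order-1001");
            return doc;
        }
        // Query time, restricted to that application+user slice:
        //   /select?q=status:open&shard.keys=crm!u42!
    }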

Does that sound related to topic maps?

What if you remembered that “document” for Lucene means:

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.

Probably not an efficient way to handle multiple identifiers but that depends on your use case.

December 14, 2013

Using Lucene Similarity in Item-Item Recommenders

Filed under: Lucene,Recommendation,Similarity — Patrick Durusau @ 5:47 pm

Using Lucene Similarity in Item-Item Recommenders by Sujit Pal.

From the post:

Last week, I implemented 4 (of 5) recommenders from the Programming Assignments of the Introduction to Recommender Systems course on Coursera, but using Apache Mahout and Scala instead of Lenskit and Java. This week, I implement an Item Item Collaborative Filtering Recommender that uses Lucene (more specifically, Lucene’s More Like This query) as the item similarity provider.

By default, Lucene stores document vectors keyed by terms, but can be configured to store term vectors by setting the field attribute TermVector.YES. In case of text documents, words (or terms) are the features which are used to compute similarity between documents. I am using the same dataset as last week, where movies (items) correspond to documents and movie tags correspond to the words. So we build a movie “document” by preprocessing the tags to form individual tokens and concatenating them into a tags field in the index.

Three scenarios are covered. The first two are similar to the scenarios covered with the item-item collaborative filtering recommender from last week, where the user is on a movie page, and we need to (a) predict the rating a user would give a specified movie and (b) find movies similar to a given movie. The third scenario is recommending movies to a given user. We describe each algorithm briefly, and how Lucene fits in.
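
The Lucene piece of this is compact. A minimal sketch of the More Like This step, using the same tags field as the post (the tuning values are my guesses):

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class SimilarMovies {
        // Given the Lucene doc id of a movie, find movies with similar tags.
        public static TopDocs similar(IndexReader reader, Analyzer analyzer, int docId)
                throws IOException {
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] {"tags"}); // tag tokens are the features
            mlt.setAnalyzer(analyzer);
            mlt.setMinTermFreq(1); // a tag usually occurs once per document
            mlt.setMinDocFreq(2);  // ignore tags unique to a single movie
            Query query = mlt.like(docId);
            return new IndexSearcher(reader).search(query, 10);
        }
    }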

I’m curious how easy/difficult it would be to re-purpose similarity algorithms to detect common choices in avatar characteristics, acquisitions, interaction with others, goals, etc.?

Thinking that while obvious repetitions (gender, age, names, etc.) are easy enough to avoid, there are other, more subtle characteristics of interaction with others that would be far harder to be aware of, much less mask effectively.

It would require a lot of data on interaction but I assume that isn’t all that difficult to whistle up on any of the major systems.

If you have any pointers to that sort of research, forward them along. I will be posting a collection of pointers and will credit anyone who wants to be credited.

December 13, 2013

Implementing a Custom Search Syntax…

Filed under: Lucene,Patents,Solr — Patrick Durusau @ 8:33 pm

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled by John Berryman.

Description:

In a recent project with the United States Patent and Trademark Office, OpenSource Connections was asked to prototype the next generation of patent search – using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr’s QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.

One part of the task was to re-implement a thirty (30) year old query language on modern software. (Ouch!)

Uses parboiled to parse the query syntax.

On parboiled:

parboiled is a mixed Java/Scala library providing for lightweight and easy-to-use, yet powerful and elegant parsing of arbitrary input text based on Parsing expression grammars (PEGs). PEGs are an alternative to context free grammars (CFGs) for formally specifying syntax, they make a good replacement for regular expressions and generally have quite a few advantages over the “traditional” way of building parsers via CFGs. parboiled is released under the Apache License 2.0.

Covers a plugin for the custom query language.

Great presentation, although one where you will want to be following the slides (below the video).
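
For a flavor of Parboiled, here is a toy rule for a single BRS-style proximity construct, term ADJ term. This is my sketch; the project's actual grammar is far richer:

    import org.parboiled.BaseParser;
    import org.parboiled.Parboiled;
    import org.parboiled.Rule;
    import org.parboiled.parserunners.ReportingParseRunner;
    import org.parboiled.support.ParsingResult;

    // Rules are ordinary Java methods; Parboiled rewrites the class into
    // a parser at runtime, so it must not be final or private.
    class BrsToyParser extends BaseParser<Object> {
        Rule Query() {
            return Sequence(Term(), ' ', "ADJ", ' ', Term(), EOI);
        }

        Rule Term() {
            return OneOrMore(CharRange('a', 'z'));
        }
    }

    public class BrsToyDemo {
        public static void main(String[] args) {
            BrsToyParser parser = Parboiled.createParser(BrsToyParser.class);
            ParsingResult<Object> result =
                new ReportingParseRunner<Object>(parser.Query()).run("lucene ADJ solr");
            System.out.println(result.matched); // true
        }
    }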

Fast range faceting…

Filed under: Facets,Lucene — Patrick Durusau @ 3:47 pm

Fast range faceting using segment trees and the Java ASM library by Mike McCandless.

From the post:

In Lucene’s facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user’s current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit we first calculate the value (the distance, the age, the price) to be aggregated, and then look up which ranges match that value and increment their counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges. But this is inefficient!
...
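
To make the baseline concrete, the linear scan amounts to this (my sketch):

    public class RangeCounter {
        // O(N) per hit: scan every range, increment any that contain the
        // hit's value. A segment tree over the 2*N range endpoints answers
        // the same question in O(log N) per hit, which is Mike's approach.
        static void countLinear(double value, double[] mins, double[] maxs, int[] counts) {
            for (int i = 0; i < mins.length; i++) {
                if (value >= mins[i] && value < maxs[i]) {
                    counts[i]++;
                }
            }
        }
    }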

Mike lays out a more efficient approach that hasn’t been folded into Lucene yet.

I like the example of distance from a user as an example of distance as a dynamic facet.

Distance issues are common with mobile devices, but most of those are merchants trying to sell you something.

Not a public database use case, but what if you had an alternative map of a metropolitan area? Where the distance issue was to caches, safe houses, contacts, etc.?

You are double thumbing your mobile device just like everyone else but yours is displaying different data.

You could get false information that is auto-corrected by a local app. 😉

You may have heard the old saying:

God made men, but Sam Colt made them equal.

We may need to add IT to that list.

December 9, 2013

Building Client-side Search Applications with Solr

Filed under: Lucene,Search Interface,Searching,Solr — Patrick Durusau @ 7:46 pm

Building Client-side Search Applications with Solr by Daniel Beach.

Description:

Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast-paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the-box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.

If you need a compelling reason to watch this video, check out:

Global Patent Search Network.

What is the Global Patent Search Network?

As a result of cooperative effort between the United States Patent and Trademark Office (USPTO) and State Intellectual Property Office (SIPO) of the People’s Republic of China, Chinese patent documentation is now available for search and retrieval from the USPTO website via the Global Patent Search Network. This tool will enable the user to search Chinese patent documents in the English or Chinese language. The data available include fulltext Chinese patents and machine translations. Also available are full document images of Chinese patents which are considered the authoritative Chinese patent document. Users can search documents including published applications, granted patents and utility models from 1985 to 2012.

Something over four (4) million patents.

Try the site, then watch the video.

Software mentioned: Spyglass, Ember.js.

Introducing Luwak,…

Filed under: Java,Lucene,Searching — Patrick Durusau @ 5:04 pm

Introducing Luwak, a library for high-performance stored queries by Charlie Hull.

From the post:

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.

We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

That may sound odd, using the article as the query, but be aware that Charlie reports “speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware.”

Perhaps not “big data speed” but certainly enough speed to get your attention.

Charlie mentions in his Dublin slides that Luwak could be used to “Add metadata to items based on their content.”

That’s one use case, but creating topics/associations out of content would be another.
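
If the upside-down part is hard to picture, the conceptual core can be shown with plain Lucene's MemoryIndex; what Luwak adds is the pre-filtering that avoids running most of the stored queries at all. A sketch, assuming Lucene 4.6:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class SearchUpsideDown {
        // The news article becomes a one-document, in-memory index and
        // each stored query is executed against it.
        public static boolean matches(String articleText, Query storedQuery) {
            MemoryIndex index = new MemoryIndex();
            index.addField("text", articleText, new StandardAnalyzer(Version.LUCENE_46));
            return index.search(storedQuery) > 0.0f;
        }
    }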

November 27, 2013

Apache Lucene and Solr 4.6.0!

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 11:37 am

Apache Lucene and Solr 4.6.0 are out!

From the announcement:

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html.

Both releases contain a number of bug fixes.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

While it is fair to say that “both releases contain a number of bug fixes,” I think that gives the wrong impression.

The Lucene 4.6.0 release has 23 new features versus 5 bug fixes and Solr 4.6.0 has 17 new features versus 14 bug fixes. Closer, but 40 new features total versus 22 bug fixes sounds good to me! 😉

Just to whet your appetite for looking at the detailed change lists:

LUCENE-5294 Suggester Dictionary implementation that takes expressions as term weights

From the description:

It could be an extension of the existing DocumentDictionary (which takes terms, weights and (optionally) payloads from the stored documents in the index). The only exception being that instead of taking the weights for the terms from the specified weight fields, it could compute the weights using a user-defined expression that uses one or more NumericDocValuesField values from the document.

Example:
let the document have

  • product_id
  • product_name
  • product_popularity
  • product_profit

Then this implementation could be used with an expression of “0.2*product_popularity + 0.8*product_profit” to determine the weights of the terms for the corresponding documents (optionally along with a payload (product_id))

You may remember I pointed out Mike McCandless’ blog post on this issue.

SOLR-5374 Support user configured doc-centric versioning rules

From the description:

The existing optimistic concurrency features of Solr can be very handy for ensuring that you are only updating/replacing the version of the doc you think you are updating/replacing, w/o the risk of someone else adding/removing the doc in the mean time – but I’ve recently encountered some situations where I really wanted to be able to let the client specify an arbitrary version, on a per document basis, (ie: generated by an external system, or perhaps a timestamp of when a file was last modified) and ensure that the corresponding document update was processed only if the “new” version is greater than the “old” version – w/o needing to check exactly which version is currently in Solr. (ie: If a client wants to index version 101 of a doc, that update should fail if version 102 is already in the index, but succeed if the currently indexed version is 99 – w/o the client needing to ask Solr what the current version)

November 20, 2013

Dublin Lucene Revolution 2013 (videos/slides)

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 7:46 pm

Dublin Lucene Revolution 2013 (slides/presentations)

I had confidence that LuceneRevolution wouldn’t abandon non-football fans in the U.S. for Thanksgiving or Black Friday!

My faith has been vindicated!

I’ll create a sorted list of the presentations by author and title, to post here tomorrow.

In the meantime, I wanted to relieve your worry about endless hours of sports or shopping next week. 😉

…Scorers, Collectors and Custom Queries

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:30 pm

Lucene Search Essentials: Scorers, Collectors and Custom Queries by Mikhail Khludnev.

From the description:

My team is building a next-generation eCommerce search platform for a major online retailer with quite challenging business requirements. Turns out, the default Lucene toolbox doesn’t ideally fit those challenges. Thus, the team had to hack deep into the Lucene core to achieve our goals. We accumulated quite a deep understanding of Lucene search internals and want to share our experience. We will start with an API overview, and then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls and common performance problems.

Don’t be frightened of the slide count at 179!

Multiple slides are used with single illustrations to demonstrate small changes.

Having said that, this is a “close to the metal” type presentation.

Worth your time but read along carefully.

Don’t miss the extremely fine index on slide 18.

Follow http://www.lib.rochester.edu/index.cfm?PAGE=489 for images of pages that go with the index. This copy of Fasciculus Temporum dates from 1480.
