Archive for the ‘Lucene’ Category

10 Reasons to Choose Apache Solr Over Elasticsearch

Saturday, November 12th, 2016

10 Reasons to Choose Apache Solr Over Elasticsearch by Grant Ingersoll.

From the post:

Hey, clickbait title aside, I get it, Elasticsearch has been growing. Kudos to the project for tapping into a new set of search users and use cases like logging, where they are making inroads against the likes of Splunk in the IT logging market. However, there is another open source, Lucene-based search engine out there that is quite mature, more widely deployed and still growing, granted without a huge marketing budget behind it: Apache Solr. Despite what others would have you believe, Solr is quite alive and well, thank you very much. And I’m not just saying that because I make a living off of Solr (which I’m happy to declare up front), but because the facts support it.

For instance, in the Google Trends arena (see below or try the query yourself), Solr continues to hold a steady recurring level of interest even while Elasticsearch has grown. Dissection of these trends (which are admittedly easy to game, so I’ve tried to keep them simple), show Elasticsearch is strongest in Europe and Russia while Solr is strongest in the US, China, India, Brazil and Australia. On the DB-Engines ranking site, which factors in Google trends and other job/social metrics, you’ll see both Elasticsearch and Solr are top 15 projects, beating out a number of other databases like HBase and Hive. Solr’s mailing list is quite active (~280 msgs per week compared to ~170 per week for Elasticsearch) and it continues to show strong download numbers via Maven repository statistics. Solr as a codebase continues to innovate (which I’ll cover below) as well as provide regular, stable releases. Finally, Lucene/Solr Revolution, the conference my company puts on every year, continues to set record attendance numbers.

Not so much an “us versus them” piece as tantalizing facts about Solr 6 that will leave you wanting to know more.

Grant invites you to explore the Solr Quick Start if one or more of his ten points capture your interest.

Timely because with a new presidential administration about to take over in Washington, D.C., there will be:

  • Data leaks as agencies vie with each other
  • Data leaks due to inexperienced staffers
  • Data leaks to damage one side or in retaliation
  • Data leaks from foundations and corporations
  • others

If 2016 was the year of “false news” then 2017 is going to be the year of the “government data leak.”

Left unexplored, except for headline-suitable quips found with grep, leaks may not prove significant.

On the other hand, using Solr 6 can enable you to weave a coherent narrative from diverse resources.

But you will have to learn Solr 6 to know for sure.

Enjoy!

Apache Lucene 6.2.1 and Apache Solr 6.2.1 Available [Presidential Data Leaks]

Thursday, September 22nd, 2016

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.2.1

Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.2.1

If you aren’t using Lucene/Solr 6.2, here’s your chance to grab the latest bug fixes as well!

Data leaks will accelerate as the US presidential election draws to a close.

What’s your favorite tool for analysis and delivery of data dumps?

Enjoy!

The Iraq Inquiry (Chilcot Report) [4.5x longer than War and Peace]

Wednesday, July 6th, 2016

The Iraq Inquiry

To give a rough sense of the depth of the Chilcot Report, the executive summary runs 150 pages. The report appears in twelve (12) volumes, not including video testimony, witness transcripts, documentary evidence, contributions and the like.

Cory Doctorow reports a Guardian project to crowd source collecting facts from the 2.6 million word report. The Guardian observes the Chilcot report is “…almost four-and-a-half times as long as War and Peace.”

Manual reading of the Chilcot report is doable, but unlikely to yield all of the connections that exist between participants, witnesses, evidence, etc.

How would you go about making the Chilcot report and its supporting evidence more amenable to navigation and analysis?

The Report

The Evidence

Other Material

Unfortunately, sections within volumes were not numbered according to their volume. For example, volume 2 starts with section 3.3 and ends with 3.5, whereas volume 4 only contains sections beginning with “4.,” while volume 5 starts with section 5 but also contains sections 6.1 and 6.2. Nothing can be done about it except to be aware that section numbers don’t correspond to volume numbers.

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Friday, April 8th, 2016

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.0.0
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.0.0

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored="false" to preserve sort order on multi-valued fields that have both stored="true" and docValues="true".
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-’. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.
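For the curious, here is a minimal sketch of how a Parallel SQL request URL might be put together. As I read the Solr 6 documentation, SQL statements are sent to a collection's /sql handler in a stmt parameter; treat that as an assumption to check against the reference manual. The localhost base URL, the collection name "leaks" and the field "agency" are hypothetical and used for illustration only. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def solr_sql_url(base_url, collection, statement):
    """Build a URL for a collection's /sql handler (assumed endpoint and parameter name)."""
    return f"{base_url}/{collection}/sql?" + urlencode({"stmt": statement})


# Hypothetical base URL, collection and field names, for illustration only.
url = solr_sql_url(
    "http://localhost:8983/solr",
    "leaks",
    "SELECT DISTINCT agency FROM leaks LIMIT 10",
)
print(url)
```

Building the URL separately from sending it keeps the sketch testable and makes it easy to swap in whatever HTTP client you prefer.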

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)

Happy searching!

Apache Lucene 5.3.1, Solr 5.3.1 Available

Thursday, September 24th, 2015

Apache Lucene 5.3.1, Solr 5.3.1 Available

From the post:

The Lucene PMC is pleased to announce the release of Apache Lucene 5.3.1 and Apache Solr 5.3.1

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/5.3.1
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/5.3.1

Highlights of this Lucene release include:

Bug Fixes

  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

Highlights of this Solr release include:

Bug Fixes

  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes “*” now works properly in ‘web’ mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with ‘/admin/’ prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter’s $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

Time to upgrade!

Enjoy!

Apache Lucene 5.2, Solr 5.2 Available

Tuesday, June 9th, 2015

From the news:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/5.2.0 and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/5.2.0

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Enjoy!

PS: Also see the Reference Guide for Solr 5.2.

NLP4L

Saturday, May 30th, 2015

NLP4L

From the webpage:

NLP4L is a natural language processing tool for Apache Lucene written in Scala. The main purpose of NLP4L is to use the NLP technology to improve Lucene users’ search experience. Lucene/Solr, for example, already provides its users with auto-complete and suggestion functions for search keywords. Using NLP technology, NLP4L development members may be able to present better keywords. In addition, NLP4L provides functions to collaborate with existing machine learning tools, including one to directly create document vector from a Lucene index and write it to a LIBSVM format file.

As NLP4L processes document data registered in the Lucene index, you can directly access a word database normalized by powerful Lucene Analyzer and use handy search functions. Being written in Scala, NLP4L excels at trying ad hoc interactive processing as well.
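To make the "document vector to LIBSVM file" idea concrete, here is a minimal sketch in Python. It does not use NLP4L or Lucene; the token lists stand in for analyzed terms read from an index, and the function names and labels are hypothetical. It only relies on the LIBSVM line layout of a label followed by ascending index:value pairs.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct term a 1-based feature index, in sorted term order."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: i for i, term in enumerate(terms, start=1)}


def to_libsvm_line(label, tokens, vocab):
    """Format one document as '<label> <index>:<count> ...' with ascending indices."""
    counts = Counter(t for t in tokens if t in vocab)
    features = sorted((vocab[t], c) for t, c in counts.items())
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in features])


# Toy "index": two already-tokenized documents with hypothetical class labels.
docs = [["lucene", "search", "lucene"], ["scala", "search"]]
labels = [1, 0]
vocab = build_vocabulary(docs)
lines = [to_libsvm_line(label, doc, vocab) for label, doc in zip(labels, docs)]
print("\n".join(lines))
```

Raw term counts are used as feature values here; a real pipeline would more likely write TF-IDF weights.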

The documentation is currently in Japanese with a TOC for the English version. Could be interesting if you want to try your hand at translation and/or working from the API Docs.

Enjoy!

Apache Lucene 5.1.0, Solr 5.1.0 Available

Thursday, May 7th, 2015

From the news:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0 and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0

Both releases contain a number of new features, bug fixes, and optimizations.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

See also Solr 5.1 Features by Yonik Seeley.

Of particular interest, Streaming Aggregation For SolrCloud (new in Solr 5.1) by Joel Bernstein.

Enjoy!

Full-Text Search in Javascript (Part 1: Relevance Scoring)

Wednesday, April 1st, 2015

Full-Text Search in Javascript (Part 1: Relevance Scoring) by Burak Kanber.

From the post:

Full-text search, unlike most of the topics in this machine learning series, is a problem that most web developers have encountered at some point in their daily work. A client asks you to put a search field somewhere, and you write some SQL along the lines of WHERE title LIKE %:query%. It’s convincing at first, but then a few days later the client calls you and claims that “search is broken!”

Of course, your search isn’t broken, it’s just not doing what the client wants. Regular web users don’t really understand the concept of exact matches, so your search quality ends up being poor. You decide you need to use full-text search. With some MySQL fidgeting you’re able to set up a FULLTEXT index and use a more evolved syntax, the “MATCH() … AGAINST()” query.

Great! Problem solved. For smallish databases.

As you hit the hundreds of thousands of records, you notice that your database is sluggish. MySQL just isn’t great at full-text search. So you grab ElasticSearch, refactor your code a bit, and deploy a Lucene-driven full-text search cluster that works wonders. It’s fast and the quality of results is great.

Which leads you to ask: what the heck is Lucene doing so right?

This article (on TF-IDF, Okapi BM-25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.
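For a taste of what relevance scoring involves, here is a minimal sketch of Okapi BM25 in Python. It is not the code from the post and not Lucene's implementation. The IDF variant, the defaults k1=1.2 and b=0.75, and the toy documents are my assumptions, chosen because they are common choices.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in docs against the query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["lucene", "search"],
    ["lucene", "lucene", "lucene", "lucene", "index", "merge", "heap", "codec"],
    ["tennis", "score"],
]
scores = bm25_scores(["lucene"], docs)
print(scores)
```

Note that the second document mentions the term four times but does not score four times higher: saturation and length normalization are the point of BM25.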

Illustration of search engine concepts in Javascript with code for download. You can tinker to your heart’s delight.

Enjoy!

PS: Part 2 is promised in the next “several” weeks. Will be watching for it.

Apache Lucene 5.0.0

Sunday, February 22nd, 2015

Apache Lucene 5.0.0

For the impatient:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.txt

From the post:

Highlights of the Lucene release include:

Stronger index safety

  • All file access now uses Java’s NIO.2 APIs which give Lucene stronger index safety in terms of better error handling and safer commits.
  • Every Lucene segment now stores a unique id per-segment and per-commit to aid in accurate replication of index files.
  • During merging, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

  • Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index.
  • Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.
  • The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.
  • 5.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap.
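An aside on the sparse bitset point: the idea that heap should track the number of set bits, not the number of documents, can be sketched in a few lines of Python. This is not how RoaringDocIdSet or SparseFixedBitSet are implemented. The class below is a hypothetical illustration that only allocates a 64-bit word when a bit inside it is set.

```python
class SparseBitSet:
    """Stores only the 64-bit words that contain at least one set bit."""

    WORD_BITS = 64

    def __init__(self, length):
        self.length = length
        self.words = {}  # word index -> integer bit mask

    def set(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        word, bit = divmod(i, self.WORD_BITS)
        self.words[word] = self.words.get(word, 0) | (1 << bit)

    def get(self, i):
        word, bit = divmod(i, self.WORD_BITS)
        return bool((self.words.get(word, 0) >> bit) & 1)

    def next_set_bit(self, start):
        """Return the smallest set bit >= start, or -1 (the 'advance' operation)."""
        first_word = start // self.WORD_BITS
        for word in sorted(w for w in self.words if w >= first_word):
            mask = self.words[word]
            base = word * self.WORD_BITS
            for bit in range(self.WORD_BITS):
                if (mask >> bit) & 1 and base + bit >= start:
                    return base + bit
        return -1


# Ten million "documents", three of them matching: only three words are stored.
bits = SparseBitSet(10_000_000)
for doc_id in (3, 70, 9_999_999):
    bits.set(doc_id)
print(len(bits.words), bits.next_set_bit(4))
```

A dense bitset for the same index would need roughly 156,000 words regardless of how many documents match.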

Other features

  • FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.
  • Tokenizers and Analyzers no longer require Reader on init.
  • NormsFormat now gets its own dedicated NormsConsumer/Producer
  • SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene’s core.
  • PostingsFormat now uses a “pull” API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.
  • New DateRangeField type enables Indexing and searching of date ranges, particularly multi-valued ones.
  • A new ExitableDirectoryReader extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.
  • Suggesters from multi-valued field can now be built as DocumentDictionary now enumerates each value separately in a multi-valued field.
  • ConcurrentMergeScheduler detects whether the index is on SSD or not and does a better job defaulting its settings. This only works on Linux for now; other OS’s will continue to use the previous defaults (tuned for spinning disks).
  • Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate.
  • CustomAnalyzer has been added that allows to configure analyzers like you do in Solr’s index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.
  • Memory index now supports payloads.
  • Added a filter cache with a usage tracking policy that caches filters based on frequency of use.
  • The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields.
  • Stored fields are merged more efficiently, especially when upgrading from previous versions or using SortingMergePolicy

More goodness to start your week!

Solr 5.0 Will See Another RC – But Docs Are Available

Friday, February 13th, 2015

I saw a tweet from Anshum Gupta today saying:

Though the vote passed, seems like there’s need for another RC for #Apache #Lucene / #Solr 5.0. Hopefully we’d be third time lucky.

To brighten your weekend prospects, the Apache Solr Reference Guide for Solr 5.0 is available.

With another Solr RC on the horizon, now would be a good time to spend some time with the reference guide, both to learn the new features and to smooth out any infelicities in the documentation.

Draft Lucene 5.0 Release Highlights

Friday, January 23rd, 2015

Draft Lucene 5.0 Release Highlights

Just a draft of Lucene 5.0 release notes but it is a signal that the release is getting closer!

Or as the guy said in Star Wars, “…almost there!” Hopefully with happier results.

Update: My bad, I forgot to include the Solr 5.0 draft release notes as well!

http://wiki.apache.org/solr/ReleaseNote50

Solr 5 Preview (Podcast) [Update on Solr 5 Release Target Date]

Tuesday, January 6th, 2015

Solr 5 Preview with Anshum Gupta and Tim Potter

Description:

Solr committers Anshum Gupta and Tim Potter tell us about the upcoming Solr 5 release. We discuss making Solr “easy to start, easy to finish” while continuing to add improvements and stability for experienced users. Hear more about SolrCloud hardening, clusterstate improvements, the schema and solrconfig APIs, easier ZooKeeper management, improved flexible and schemaless indexing, and overall ease-of-use improvements.

Some notes:

Focus in Solr 5 development has been on ease of use. Directory layout of Solr install has been changed. 5.0 gets rid of the war file. Stand alone application. Don’t have to add parts to it. Don’t need Tomcat. Distributed IDF management. (Documents used to score differently based on shard where they reside. Not so in 5.0 (SOLR-1632)) API access to config files. Not schema-less so much but smarter about doing reasonable things by default.
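The distributed IDF note deserves a small illustration. The sketch below is not Solr code; it assumes a classic smoothed IDF formula and two hypothetical shards with made-up statistics for a single term. It shows why the same term can weigh very differently depending on which shard a document lives in, and how pooling the statistics removes the difference.

```python
import math


def idf(doc_count, doc_freq):
    """Classic smoothed inverse document frequency (an assumed formula for this sketch)."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical per-shard statistics for one term: common on shard1, rare on shard2.
shards = {
    "shard1": {"doc_count": 1000, "doc_freq": 500},
    "shard2": {"doc_count": 1000, "doc_freq": 5},
}

# Local IDF: each shard scores with only its own statistics.
local_idf = {name: idf(s["doc_count"], s["doc_freq"]) for name, s in shards.items()}

# Global IDF: statistics are summed across shards before scoring.
total_docs = sum(s["doc_count"] for s in shards.values())
total_freq = sum(s["doc_freq"] for s in shards.values())
global_idf = idf(total_docs, total_freq)

print(local_idf, global_idf)
```

With local statistics, a matching document on shard2 gets a much larger term weight than an identical document on shard1. With global statistics, both get the same weight.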

The one missing question?

What is the anticipated release date for Solr 5?

I did look at the roadmap for 5.0, “No release date.” As of today, 228 of 313 issues have been resolved.

Here’s an open issue that may interest some of you: Create a shippable tutorial integrated with running Solr instance. That’s SOLR-6808 for those following in your hymn books.

Enjoy!


Update: Solr 5 is targeted for late January 2015! Hot damn!

Apache Lucene™ 5.0.0 is coming!

Monday, November 17th, 2014

Apache Lucene™ 5.0.0 is coming! by Michael McCandless.

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it’s done when it’s done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Michael has a great list and explanation of changes you will be seeing in Lucene 5.0.0. Pick your favorite(s) to follow and/or contribute to the next release.

Solr/Lucene 5.0 (December, 2014)

Wednesday, November 12th, 2014

Just so you know, email traffic suggests a release candidate for Solr/Lucene 5.0 will appear in December, 2014.

If you are curious, see the unreleased Solr Reference Guide (for Solr 5.0).

If you are even more curious, see the issues targeted for Solr 5.0.

OK, I have to admit that not everyone uses Solr so see also the issues targeted for Lucene 5.0.

Nothing like a pre-holiday software drop to provide leisure activities for the holidays!

Understanding Information Retrieval by Using Apache Lucene and Tika

Saturday, October 25th, 2014

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 1

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 2

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 3

by Ana-maria Mihalceanu.

From part 1:

In this tutorial, the Apache Lucene and Apache Tika frameworks will be explained through their core concepts (e.g. parsing, mime detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.

Throughout this tutorial, you will learn:

  • how to use Apache Tika’s API and its most relevant functions
  • how to develop code with Apache Lucene API and its most important modules
  • how to integrate Apache Lucene and Apache Tika in order to build your own piece of software that stores and retrieves information efficiently. (project code is available for download)

Part 1 introduces you to Apache Lucene and Apache Tika and concludes by covering automatic extraction of metadata from files with Apache Tika.

Part 2 covers extracting/indexing of content, along with stemming, boosting and scoring. (If any of that sounds unfamiliar, this isn’t the best tutorial for you.)

Part 3 details the highlighting of fragments when they match a search query.

A good tutorial on the parts of Apache Lucene and Apache Tika that it covers, but it offers no coverage of information retrieval as such. For example, part 3 talks about increasing search “efficiency” without any consideration of what “efficiency” might mean in a particular search context.

A tutorial that used Apache Lucene and Tika to illuminate issues in information retrieval, rather than coding up an indexing/searching application with no discussion of the potential choices and tradeoffs, would be a much better one.

Apache Lucene and Solr 4.10

Sunday, September 21st, 2014

Apache Lucene and Solr 4.10

From the post:

Today Apache Lucene and Solr PMC announced another version of Apache Lucene library and Apache Solr search server numbered 4.10. This is a next release continuing the 4th version of both Apache Lucene and Apache Solr.

Here are some of the changes that were made comparing to the 4.9:

Lucene

  • Simplified Version handling for analyzers
  • TermAutomatonQuery was added
  • Optimizations and bug fixes

Solr

  • Ability to automatically add replicas in SolrCloud mode in HDFS
  • Ability to export full results set
  • Distributed support for facet.pivot
  • Optimizations and bugfixes from Lucene 4.9

Full changes list for Lucene can be found at http://wiki.apache.org/lucene-java/ReleaseNote410. Full list of changes in Solr 4.10 can be found at: http://wiki.apache.org/solr/ReleaseNote410.

Apache Lucene 4.10 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.10 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Please remember that the mirrors are just starting to update so not all of them will contain the 4.10 version of Lucene and Solr.

A belated note about Apache Lucene and Solr 4.10.

I must have been distracted by the continued fumbling with the Ebola crisis. I no longer wonder how the international community would respond to an actual world wide threat. In a word, ineffectively.

Elastic Search: The Definitive Guide

Saturday, September 6th, 2014

Elastic Search: The Definitive Guide by Clinton Gormley and Zachary Tong.

From “why we wrote this book:”

We wrote this book because Elasticsearch needs a narrative. The existing reference documentation is excellent… as long as you know what you are looking for. It assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics.

This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters.

We have taken a problem based approach: this is the problem, how do I solve it, and what are the trade-offs of the alternative solutions? We start with the basics and each chapter builds on the preceding ones, providing practical examples and explaining the theory where necessary.

The existing reference documentation explains how to use features. We want this book to explain why and when to use various features.

An important guide/reference for Elastic Search but the “why” for this book is important as well.

Reference documentation is absolutely essential but so is documentation that eases the learning curve in order to promote adoption of software or a technology.

Read this both for Elastic Search as well as one model for writing a “why” and “when” book for other technologies.

Scoring tennis using finite-state automata

Saturday, August 30th, 2014

Scoring tennis using finite-state automata by Michael McCandless.

From the post:

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like “love” (0 points), “15” (1 point), “30” (2 points), “40” (3 points), “deuce” (3 or more points each, and the players are tied), “all” (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

And you thought that CS course in automata wasn’t going to be useful. 😉

Michael goes on to say:

FSA minimization saved only 3 states for the game of tennis, resulting in a 10% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since it saves substantial disk and RAM for Lucene applications!
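If you want to play along at home, here is a minimal sketch of a single tennis game as a state machine in Python. It is my own hand-written transition function, not Michael's automaton and not built with Lucene's automaton classes, so the number of states will not match his figures. The state labels are hypothetical.

```python
def next_state(state, winner):
    """Transition function: state is a label or an (a, b) point tuple; winner is 'A' or 'B'."""
    if state in ("game A", "game B"):
        raise ValueError("game is already over")
    if state == "deuce":
        return f"advantage {winner}"
    if state in ("advantage A", "advantage B"):
        # The player with the advantage wins the game; otherwise back to deuce.
        return f"game {winner}" if state.endswith(winner) else "deuce"
    a, b = state
    if winner == "A":
        a += 1
    else:
        b += 1
    if a == 4:
        return "game A"
    if b == 4:
        return "game B"
    if a == 3 and b == 3:
        return "deuce"
    return (a, b)


def run(points):
    """Feed a string of point winners, e.g. 'ABAB', through the machine from love-all."""
    state = (0, 0)
    for winner in points:
        state = next_state(state, winner)
    return state


print(run("ABABABA"))
```

Because (3, 3) is folded into "deuce", a player can only reach four points from a state where the opponent has at most two, which is exactly the "win by two" rule.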

A funny introduction with a serious purpose!

Introducing Splainer…

Monday, August 25th, 2014

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

Splainer is a step towards addressing two problems:

From the post:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query parameters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a development history, and more, give Quepid a spin for free!

Improving the information content of the tokens you are searching is another way to improve search results.

A new proximity query for Lucene, using automatons

Tuesday, August 5th, 2014

A new proximity query for Lucene, using automatons by Michael McCandless.

From the post:


As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop.

This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that’s done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.
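To get a feel for matching on term positions with an automaton, here is a minimal sketch in Python. It is not TermAutomatonQuery and uses no Lucene classes. It simulates a small hand-built automaton directly rather than determinizing it, and the wildcard label, state numbers and function name are hypothetical.

```python
ANY = "*"  # wildcard label: matches any single token (used here to model slop)


def matches(transitions, start, accept, tokens):
    """Return True if some contiguous run of tokens leads from start to an accept state."""
    active = set()
    for token in tokens:
        active.add(start)  # a match may begin at any token position
        reached = set()
        for state in active:
            for label, target in transitions.get(state, ()):
                if label == token or label == ANY:
                    reached.add(target)
        if reached & accept:
            return True
        active = reached
    return False


# "quick" followed by "fox", allowing at most one arbitrary token in between.
transitions = {
    0: [("quick", 1)],
    1: [("fox", 2), (ANY, 3)],
    3: [("fox", 2)],
}
accept = {2}

print(matches(transitions, 0, accept, "the quick brown fox".split()))
```

Adding a second wildcard hop would raise the allowed slop to two tokens, which shows how the automaton states slop explicitly, one transition at a time.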

Michael walks through current proximity queries before diving into the new proximity query for Lucene 4.10.

As always, this is a real treat!

Side by side with Elasticsearch and Solr

Sunday, August 3rd, 2014

Side by side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe.

Abstract:

We all know that Solr and Elasticsearch are different, but what those differences are and which solution is the best fit for a particular use case is a frequent question. We will try to make those differences clear, not by showing slides and compare them, but by showing online demo of both Elasticsearch and Solr:

  • Set up and start both search servers. See what you need to prepare and launch Solr and Elasticsearch
  • Index data right after the server was started using the “schemaless” mode
  • Create index structure and modify it using the provided API
  • Explore different query use cases
  • Scale by adding and removing nodes from the cluster, creating indices and managing shards. See how that affects data indexing and querying.
  • Monitor and administer clusters. See what metrics can be seen out of the box, how to get them and what tools can provide you with the graphical view of all the goodies that each search server can provide.

Slides

Very impressive split-screen comparison of Elasticsearch and Solr by two presenters on the same data set.

I first saw this at: Side-By-Side with Solr and Elasticsearch : A Comparison by Charles Ditzel.

Elasticsearch 1.3.1 released

Friday, August 1st, 2014

Elasticsearch 1.3.1 released by Clinton Gormley.

From the post:

Today, we are happy to announce the bugfix release of Elasticsearch 1.3.1, based on Lucene 4.9. You can download it and read the full changes list here: Elasticsearch 1.3.1.

Enjoy!

Multi-Term Synonyms [Bags of Properties?]

Wednesday, July 30th, 2014

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.

From the post:

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length increment. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:

Posts like this one are a temptation to sign off Twitter and read the ticket feeds for Lucene/Solr instead. Seriously.
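The position problem Ted describes can be illustrated with a toy Python model. It covers only the "position length is dropped" half of the story, and the token values are a hand-built, hypothetical analysis of "dns is up" with the synonym dns → "domain name system"; the actual output of a Lucene 4.x analysis chain differs in detail.

```python
from collections import namedtuple

Token = namedtuple("Token", "term start length")

# Hypothetical token graph: "dns" spans the same three positions
# as its multi-term synonym "domain name system".
GRAPH = [
    Token("dns", 0, 3),
    Token("domain", 0, 1),
    Token("name", 1, 1),
    Token("system", 2, 1),
    Token("is", 3, 1),
    Token("up", 4, 1),
]


def phrase_matches(tokens, phrase, use_length):
    """Check whether the terms of `phrase` occur back to back.

    With use_length=False the next term is expected at start + 1,
    which models an index that stores only start positions.
    With use_length=True the next term is expected at start + length.
    """
    def follows(pos, remaining):
        if not remaining:
            return True
        return any(
            follows(t.start + (t.length if use_length else 1), remaining[1:])
            for t in tokens
            if t.term == remaining[0] and t.start == pos
        )

    return any(follows(start, list(phrase)) for start in {t.start for t in tokens})


# Ignoring position length, "dns" appears to be followed by "name", not "is".
print(phrase_matches(GRAPH, ["dns", "is"], use_length=False))  # False
print(phrase_matches(GRAPH, ["dns", "is"], use_length=True))   # True
```

When every synonym is a single token, all lengths are 1 and the two modes agree, which matches the observation in the quote that the problem disappears in that case.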

Ted proposes a workaround to the multi-term synonym problem using the auto phrasing tokenfilter. Equally important is his conclusion:

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can improve another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Or more precisely:

…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…

Ambiguity at the token level remains, even if for particular cases phrases can be treated as semantic entities.

Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties,” where the lowliest token or a higher semantic unit can be indexed as a bag of properties?

Imagine indexing these properties* for a single token:

  • string: value
  • pubYear: value
  • author: value
  • journal: value
  • keywords: value

Would that suffice to distinguish a term in a medical journal from Vanity Fair?

Ambiguity is predicated upon a lack of information.

That should be suggestive of a potential cure.

*(I’m not suggesting that all of those properties or even most of them would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)
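A rough Python sketch of the idea, with most properties defaulted from the indexed source as the footnote suggests. Everything here is illustrative: the function names, the journal title and the author values are made up, and nothing in it is a Lucene or Solr API.

```python
from collections import ChainMap

# Invented placeholder metadata for two indexed sources.
MEDICAL_SOURCE = {
    "pubYear": 2014,
    "author": "A. Clinician",
    "journal": "Hypothetical Medical Journal",
    "keywords": ("microbiology",),
}
MAGAZINE_SOURCE = {
    "pubYear": 2014,
    "author": "A. Columnist",
    "journal": "Vanity Fair",
    "keywords": ("society",),
}


def bag_of_properties(token, source_defaults, **overrides):
    """Return the property bag for one token.

    Only the string (plus any explicit overrides) is stored per token;
    everything else is looked up in the defaults of its source.
    """
    return ChainMap({"string": token, **overrides}, source_defaults)


def distinguishable(bag_a, bag_b,
                    keys=("pubYear", "author", "journal", "keywords")):
    """True if any of the listed properties differ between the two bags."""
    return any(bag_a.get(key) != bag_b.get(key) for key in keys)


medical_term = bag_of_properties("culture", MEDICAL_SOURCE)
magazine_term = bag_of_properties("culture", MAGAZINE_SOURCE)

# Same string, different bags: the extra properties separate the two uses.
print(medical_term["string"] == magazine_term["string"])
print(distinguishable(medical_term, magazine_term))
```

Because the per-token map holds only the string, the cost of carrying the other properties is one shared dictionary per source rather than one per token.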

I first saw this in a tweet by SolrLucene.

Solr’s New AnalyticsQuery API

Tuesday, July 29th, 2014

Solr’s New AnalyticsQuery API by Joel Bernstein.

From the post:

In Solr 4.9 there is a new AnalyticsQuery API that allows developers to plug custom analytic logic into Solr. The AnalyticsQuery class provides a clean and simple API that gives developers access to all the rich functionality in Lucene and is strategically placed within Solr’s distributed search architecture. Using this API you can harness all the power of Lucene and Solr to write custom analytic logic.

Not all the detail you are going to want but a good start towards using the new AnalyticsQuery API in Solr 4.9.

The AnalyticsQuery API is an example of why I wonder about projects with custom search solutions (read: not Lucene-based).

If you have any doubts, default to a Lucene-based search solution.

CouchDB-Lucene 1.0 Release

Sunday, July 6th, 2014

CouchDB-Lucene

From the release page:

  • Upgrade to Lucene 4.9.0
  • Upgrade to Tika 1.5
  • Use the full OOXML Schemas from Apache POI, to make Tika able to parse Office documents that use exotic features
  • Allow search by POST (using form data)

+1! to incorporating Lucene in software as opposed to re-rolling basic indexing.

Apache Lucene/Solr 4.9.0 Released!

Saturday, June 28th, 2014

From the announcement:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.9.0 and Apache Solr 4.9.0.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Combined Solr and Lucene Javadoc 4.8.0

Sunday, June 8th, 2014

Combined Solr and Lucene Javadoc 4.8.0

A resource built by Solr Start using … you guessed it, Solr. 😉

From the Solr Start homepage:

Welcome to the collection of resources to make Apache Solr more comprehensible to beginner and intermediate users. While Solr is very easy to start with, tuning it is – like for any search engine – fairly complex. This website will try to make this simpler by compiling information and creating tools to accelerate learning Solr. The currently available resources are linked in the menubar above. More resources will be coming shortly.

If you would like to be notified of such new resources, get early access and receive exclusive discounts on commercial tools, join the mailing list below:

I’m curious.

This Javadocs resource will be very useful but obviously Javadocs are missing something or else there would be fewer presentations, papers, blogs, etc., on issues covered by the Javadocs.

Yes?

While I applaud the combined index of Lucene and Solr Javadocs, what would an index have to cover beyond the Javadocs to be really useful to you?

I first saw this in a tweet by SolrStart.

Introducing the Solr Scale Toolkit

Saturday, June 7th, 2014

Introducing the Solr Scale Toolkit by Timothy Potter.

From the post:

SolrCloud is a set of features in Apache Solr that enable elastic scaling of distributed search indexes using sharding and replication. One of the hurdles to adopting SolrCloud has been the lack of tools for deploying and managing a SolrCloud cluster. In this post, I introduce the Solr Scale Toolkit, an open-source project sponsored by LucidWorks (www.lucidworks.com), which provides tools and guidance for deploying and managing SolrCloud in cloud-based platforms such as Amazon EC2. In the last section, I use the toolkit to run some performance benchmarks against Solr 4.8.1 to see just how “scalable” Solr really is.

Motivation

When you download a recent release of Solr (4.8.1 is the latest at the time of this writing), it’s actually quite easy to get a SolrCloud cluster running on your local workstation. Solr allows you to start an embedded ZooKeeper instance to enable “cloud” mode using a simple command-line option: -DzkRun. If you’ve not done this before, I recommend following the instructions provided by the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/SolrCloud

Once you’ve worked through the out-of-the-box experience with SolrCloud, you quickly realize you need tools to help you automate deployment and system administration tasks across multiple servers. Moreover, once you get a well-configured cluster running, there are ongoing system maintenance tasks that also should be automated, such as doing rolling restarts, performing off-site backups, or simply trying to find an error message across multiple log files on different servers.

Until now, most organizations had to integrate SolrCloud operations into an existing environment using tools like Chef or Puppet. While those are still valid approaches, the Solr Scale Toolkit provides a simple, Python-based solution that is easy to install and use to manage SolrCloud. In the remaining sections of this post, I walk you through some of the key features of the toolkit and encourage you to follow along. To begin there’s a little setup that is required to use the toolkit.

If you are looking to scale Solr, Timothy’s post is the right place to start!

Take serious heed of the following advice:

One of the most important tasks when planning to use SolrCloud is to determine how many servers you need to support your index(es). Unfortunately, there’s not a simple formula for determining this because there are too many variables involved. However, most experienced SolrCloud users do agree that the only way to determine computing resources for your production cluster is to test with your own data and queries. So for this blog, I’m going to demonstrate how to provision the computing resources for a small cluster but you should know that the same process works for larger clusters. In fact, the toolkit was developed to enable large-scale testing of SolrCloud. I leave it as an exercise for the reader to do their own cluster-size planning.

If anyone offers you a fixed-rate SolrCloud, you should know they have calculated the cluster to be good for them, and if possible, good for you.

You have been warned.

Elasticsearch 1.2.0 and 1.1.2 released

Saturday, May 24th, 2014

Elasticsearch 1.2.0 and 1.1.2 released by Clinton Gormley.

From the post:

Today, we are happy to announce the release of Elasticsearch 1.2.0, based on Lucene 4.8.1, along with a bug fix release Elasticsearch 1.1.2.

You can download them and read the full change lists here:

Elasticsearch 1.2.0 is a bumper release, containing over 300 new features, enhancements, and bug fixes. You can see the full changes list in the Elasticsearch 1.2.0 release notes, but we will highlight some of the important ones below:

Highlights of the more important changes for Elasticsearch 1.2.0:

  • Java 7 required
  • dynamic scripting disabled by default
  • field data and filter caches
  • gateways removed
  • indexing and merging
  • aggregations
  • context suggester
  • improved deep scrolling
  • field value factor

See Clinton’s post or the release notes for more complete coverage. (Aggregation looks particularly interesting.)