Archive for the ‘LucidWorks’ Category

10 Reasons to Choose Apache Solr Over Elasticsearch

Saturday, November 12th, 2016

10 Reasons to Choose Apache Solr Over Elasticsearch by Grant Ingersoll.

From the post:

Hey, clickbait title aside, I get it, Elasticsearch has been growing. Kudos to the project for tapping into a new set of search users and use cases like logging, where they are making inroads against the likes of Splunk in the IT logging market. However, there is another open source, Lucene-based search engine out there that is quite mature, more widely deployed and still growing, granted without a huge marketing budget behind it: Apache Solr. Despite what others would have you believe, Solr is quite alive and well, thank you very much. And I’m not just saying that because I make a living off of Solr (which I’m happy to declare up front), but because the facts support it.

For instance, in the Google Trends arena (see below or try the query yourself), Solr continues to hold a steady recurring level of interest even while Elasticsearch has grown. Dissection of these trends (which are admittedly easy to game, so I’ve tried to keep them simple), show Elasticsearch is strongest in Europe and Russia while Solr is strongest in the US, China, India, Brazil and Australia. On the DB-Engines ranking site, which factors in Google trends and other job/social metrics, you’ll see both Elasticsearch and Solr are top 15 projects, beating out a number of other databases like HBase and Hive. Solr’s mailing list is quite active (~280 msgs per week compared to ~170 per week for Elasticsearch) and it continues to show strong download numbers via Maven repository statistics. Solr as a codebase continues to innovate (which I’ll cover below) as well as provide regular, stable releases. Finally, Lucene/Solr Revolution, the conference my company puts on every year, continues to set record attendance numbers.

Not so much an “us versus them” piece as tantalizing facts about Solr 6 that will leave you wanting to know more.

Grant invites you to explore the Solr Quick Start if one or more of his ten points capture your interest.

Timely because with a new presidential administration about to take over in Washington, D.C., there will be:

  • Data leaks as agencies vie with each other
  • Data leaks due to inexperienced staffers
  • Data leaks to damage one side or in retaliation
  • Data leaks from foundations and corporations
  • others

If 2016 was the year of “false news” then 2017 is going to be the year of the “government data leak.”

Left unexplored except for headline suitable quips found with grep, leaks may not be significant.

On the other hand, using Solr 6 can enable you to weave a coherent narrative from diverse resources.

But you will have to learn Solr 6 to know for sure.


Solr 5 Preview (Podcast) [Update on Solr 5 Release Target Date]

Tuesday, January 6th, 2015

Solr 5 Preview with Anshum Gupta and Tim Potter


Solr committers Anshum Gupta and Tim Potter tell us about the upcoming Solr 5 release. We discuss making Solr “easy to start, easy to finish” while continuing to add improvements and stability for experienced users. Hear more about SolrCloud hardening, clusterstate improvements, the schema and solrconfig APIs, easier ZooKeeper management, improved flexible and schemaless indexing, and overall ease-of-use improvements.

Some notes:

Focus in Solr 5 development has been on ease of use. Directory layout of Solr install has been changed. 5.0 gets rid of the war file. Stand alone application. Don’t have to add parts to it. Don’t need Tomcat. Distributed IDF management. (Documents used to score differently based on shard where they reside. Not so in 5.0 (SOLR-1632)) API access to config files. Not schema-less so much but smarter about doing reasonable things by default.

The one missing question?

What is the anticipated release date for Solr 5?

I did look at the roadmap for 5.0, “No release date.” As of today, 228 of 313 issues have been resolved.

Here’s an open issue that may interest some of you: Create a shippable tutorial integrated with running Solr instance. That’s SOLR-6808 for those following in your hymn books.


Update: Solr 5 is targeted for late January 2015! Hot damn!

Solr Cluster

Friday, December 20th, 2013

Solr Cluster

From the webpage:

Join us weekly for tips and tricks, product updates and Q&A on topics you suggest. Guest appearances from Lucene/Solr committers and PMC members. Send questions to

So far:

#1 Entity Recognition

Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc. Entity recognition is usually built using either linguistic grammar-based techniques or statistical models.

#2 On Enterprise and Intranet Search

What use is search to an enterprise? What is the purpose of intranet search? How hard is it to implement? In this episode we speak with LucidWorks consultant Evan Sayer about the benefits of internal search and how to prepare your business data to best take advantage of full-text search.

Well, the lead in music isn’t Beaker Street, but it’s not that long.

I think the discussion would be easier to follow with a webpage with common terms and an outline of the topic for the day.

Has real potential so I urge you to listen, send in questions and comments.

Webinar: Trubo-Charging Solr

Monday, October 7th, 2013

Turbo-charge your Solr instance with Entity Recognition, Business Rules and a Relevancy Workbench by Yann Yu.

Date: Thursday, October 17, 2013
Time: 10:00am Pacific Time

From the post:

LucidWorks has three new modules available in the Solr Marketplace that run on top of your existing Solr or LucidWorks Search instance. Join us for an overview of each module and learn how implementing one, two or all three will turbo-charge your Solr instance.

  • Business Rules Engine: Out of the box integration with Drools, the popular open-source business rules engine is now available for Solr and LucidWorks Search. With the LucidWorks Business Rules module, developers can write complex rules using declarative syntax with very little programming. Data can be modified, cleaned and enriched through multiple permutations and combinations.
  • Relevancy Workbench: Experiment with different search parameters to understand the impact of these changes to search results. With intuitive, color-code and side-by-side comparisons of results for different sets of parameters, users can quickly tune their application to produce the results they need. The Relevancy Workbench encourages experimentation with a visual “before and after” view of the results of parameter changes.
  • Entity Recognition: Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc.

All of these modules will be of interest to topic mappers who are processing bulk data.

Lucene/Solr Revolution EU 2013

Saturday, June 22nd, 2013

Lucene/Solr Revolution EU 2013

November 4 -7, 2013
Dublin, Ireland

Abstract Deadline: August 2, 2013.

From the webpage:

LucidWorks is proud to present Lucene/Solr Revolution EU 2013, the biggest open source conference dedicated to Apache Lucene/Solr.

The conference, held in Dublin, Ireland on November 4-7, will be packed with technical sessions, developer content, user case studies, and panels. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology.

From the call for papers:

The Call for Papers for Lucene/Solr Revolution EU 2013 is now open.

Lucene/Solr Revolution is the biggest open source conference dedicated to Apache Lucene/Solr. The great content delivered by speakers like you is the heart of the conference. If you are a practitioner, business leader, architect, data scientist or developer and have something important to share, we welcome your submission.

We are particularly interested in compelling use cases and success stories, best practices, and technology insights.

Don’t be shy!

Lucene/Solr Revolution 2013 San Diego (Video Index)

Thursday, June 20th, 2013

Videos from Lucene/Solr Revolution 2013 San Diego (April 29th – May 2nd, 2013)

Sorted by author, duplicates removed, etc.

These videos merit far more views than they have today. Pass this list along.

Work through the videos and related docs. There are governments out there that want useful search results.

James Atherton, Search Team Lead, 7digital Implementing Search with Solr at 7digital

A usage/case study, describing our journey as we implemented Lucene/Solr, the lessons we learned along the way and where we hope to go in the future.How we implemented our instant search/search suggest. How we handle trying to index 400 million tracks and metadata for over 40 countries, comprising over 300GB of data, and about 70GB of indexes. Finally where we hope to go in the future.

Ben Brown, Software Architect, Cerner Corporation Brahe – Mass scale flexible indexing

Our team made their first foray into Solr building out Chart Search, an offering on top of Cerner's primary EMR to help make search over a patient's chart smarter and easier. After bringing on over 100 client hospitals and indexing many tens of billions of clinical documents and discrete results we've (thankfully) learned a couple of things.

The traditional hashed document ID over many shards and no easily accessible source of truth doesn't make for a flexible index.
Learn the finer points of the strategy where we shifted our source of truth to HBase. How we deploy new indexes with the click of a button, take an existing index and expand the number of shards on the fly, and several other fancy features we enabled.

Paul Doscher, CEO LucidWorks Lucene Revolution 2013, Opening Remarks – Paul Doscher, CEO LucidWorks

Ted Dunning, Chief Application Architect, MapR & Grant Ingersoll, Chief Technology Officer, LucidWorks Crowd-sourced intelligence built into Search over Hadoop

Search has quickly evolved from being an extension of the data warehouse to being run as a real time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like SolR, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in realtime.

Stephane Gamard, Chief Technology Officer, Searchbox How to make a simple cheap high-availability self-healing Solr cluster

In this presentation we aim to show how to make a high availability Solr cloud with 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure which is self healing using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relation between collections, Solr cores, shardes, and cluster nodes. We continue by an introduction to Solr 4.x clustering using zookeeper with a particular emphasis on cluster state status/monitoring and solr collection configuration. The core of our presentation will be demonstrated using a live cluster.

We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto generate new nodes, attach them to the cluster, and assign them shardes (selecting between missing shardes or replication for HA). We will show that using a high replication factor it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management of the architecture. Future work discussions, which might be engaged using an open source effort, include monitoring activity of individual nodes as to scale the cluster according to traffic and usage.

Trey Grainger, Search Technology Development Manager, CareerBuilder Building a Real-time, Big Data Analytics Platform with Solr

Having "big data" is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.

At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You'll also get a sneak peak at some new faceting capabilities just wrapping up development including distributed pivot facets and percentile/stats faceting, which will be open-sourced.

The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.

Chris Hostetter (aka Hoss) Stump The Chump: Get On The Spot Solutions To Your Real Life Lucene/Solr Challenges

Got a tough problem with your Solr or Lucene application? Facing
challenges that you'd like some advice on? Looking for new approaches to
overcome a Lucene/Solr issue? Not sure how to get the results you
expected? Don't know where to get started? Then this session is for you.

Now, you can get your questions answered live, in front of an audience of
hundreds of Lucene Revolution attendees! Back again by popular demand,
"Stump the Chump" at Lucene Revolution 2013 puts Chris Hostetter (aka Hoss) in the hot seat to tackle questions live.

All you need to do is send in your questions to us here at You can ask anything you like, but consider
topics in areas like: Data modelling Query parsing Tricky faceting Text analysis Scalability

You can email your questions to Please
describe in detail the challenge you have faced and possible approach you
have taken to solve the problem. Anything related to Solr/Lucene is fair game.

Our moderator, Steve Rowe, will will read the questions, and Hoss have to formulate a solution on the spot. A panel of judges will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question – and for those deemed to have "Stumped the Chump".

Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd Building a Near Real-time Search Engine and Analytics for logs using Solr

Consolidation and Indexing of logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since the log events mostly have a small size of around 200 bytes to few KBs, makes it more difficult to handle because lesser the size of a log event, more the number of documents to index. In this session, we will discuss the challenges faced by us and solutions developed to overcome them. The list of items that will be covered in the talk are as follows.

Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.

Mikhail Khludnev, eCommerce Search Platform, Grid Dynamics Concept Search for eCommerce with Solr

This talk describes our experience in eCommerce Search: challenges which we've faced and the chosen approaches. It's not indented to be a full description of implementation, because too many details need to be touched. This talk is more like problem statement and general solutions description, which have a number of points for technical or even academic discussion. It's focused on text search use-case, structures (or scoped) search is out of agenda as well as faceted navigation.

Hilary Mason, Chief Scientist, bitly Search is not a solved problem.

Remi Mikalsen, Search Engineer, The Norwegian Centre for ICT in Education Multi-faceted responsive search, autocomplete, feeds engine and logging

Learn how leverages open source technologies to deliver a blazing fast multi-faceted responsive search experience and a flexible and efficient feeds engine on top of Solr 3.6. Among the key open source projects that will be covered are Solr, Ajax-Solr, SolrPHPClient, Bootstrap, jQuery and Drupal. Notable highlights are ajaxified pivot facets, multiple parents hierarchical facets, ajax autocomplete with edge-n-gram and grouping, integrating our search widgets on any external website, custom Solr logging and using Solr to deliver Atom feeds. is a governmental website that collects, normalizes and publishes study information for related to secondary school and higher education in Norway. With 1.2 million visitors each year and 12.000 indexed documents we focus on precise information and a high degree of usability for students, potential students and counselors.

Mark Miller, Software Engineer, Cloudera SolrCloud: the 'Search First' NoSQL database

As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliath's collide? Or will they remain specialized while intermingling — two sides of the same coin.
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, Search ecosystem evolve. If you are interested in Big Data, NoSQL, distributed systems, CAP theorem and other hype filled terms, than this talk may be for you.

Dragan Milosevic, Senior Architect, zanox Analytics in OLAP with Lucene and Hadoop

Analytics powered by Hadoop is powerful tool and this talk addresses its application in OLAP built on top of Lucene. Many applications use Lucene indexes also for storing data to alleviate challenges concerned with external data sources. The analyses of queries can reveal stored fields that are in most cases accessed together. If one binary compressed field replaces those fields, amount of data to be loaded is reduced and processing of queries is boosted. Furthermore, documents that are frequently loaded together can be identified. If those documents are saved in almost successive positions in Lucene stored files, benefits from file-system caches are improved and loading of documents is noticeably faster.

Large-scale searching applications typically deploy sharding and partition documents by hashing. The implemented OLAP has shown that such hash-based partitioning is not always an optimal one. An alternative partitioning, supported by analytics, has been developed. It places documents that are frequently used together in same shards, which maximizes the amount of work that can be locally done and reduces the communication overhead among searchers. As an extra bonus, it also identifies slow queries that typically point to emerging trends, and suggests the addition of optimized searchers for handling similar queries.

Christian Moen, Software Engineer, Atilika Inc. Language support and linguistics in Lucene/Solr and its eco-system

In search, language handling is often key to getting a good search experience. This talk gives an overview of language handling and linguistics functionality in Lucene/Solr and best-practices for using them to handle Western, Asian and multi-language deployments. Pointers and references within the open source and commercial eco-systems for more advanced linguistics and their applications are also discussed.

The presentation is mix of overview and hands-on best-practices the audience can benefit immediately from in their Lucene/Solr deployments. The eco-system part is meant to inspire how more advanced functionality can be developed by means of the available open source technologies within the Apache eco-system (predominantly) while also highlighting some of the commercial options available.

Chandra Mouleeswaran, Co Chair at, ThreatMetrix Rapid pruning of search space through hierarchical matching

This talk will present our experiences in using Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc., handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction, by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. Details on introducing a hierarchical search procedure that systematically divides the search space into manageable partitions, yet maintaining precision, will be presented.

Kathy Phillips, Enterprise Search Services Manager/VP, Wells Fargo & Co. & Tom Lutmer, eBusiness Systems Consultant, Enterprise Search Services team, Wells Fargo & Co Beyond simple search — adding business value in the enterprise

What is enterprise search? Is it a single search box that spans all enterprise resources or is it much more than that? Explore how enterprise search applications can move beyond simple keyword search to add unique business value. Attendees will learn about the benefits and challenges to different types of search applications such as site search, interactive search, search as business intelligence, and niche search applications. Join the discussion about the possibilities and future direction of new business applications within the enterprise.

David Piraino and Daniel Palmer, Chief Imaging Information Officers, Imaging Institute Cleveland Clinic, Cleveland Clinic Next Generation Electronic Medical Records and Search: A Test Implementation in Radiology

Most patient specifc medical information is document oriented with varying amounts of associated meta-data. Most of pateint medical information is textual and semi-structured. Electronic Medical Record Systems (EMR) are not optimized to present the textual information to users in the most understandable ways. Present EMRs show information to the user in a reverse time oriented patient specific manner only. This talk discribes the construction and use of Solr search technologies to provide relevant historical information at the point of care while intepreting radiology images.

Radiology reports over a 4 year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine if ""similar"" historical cases were found . The results were evaluated by the number of searches that returned any result in less than 3 seconds and the number of cases that illustrated the questioned diagnosis in the top 10 results returned as determined by a bone and joint radiologist. Also methods to better optimize the search results were reviewed.

An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples and the lowest match search showed 2 out of 10 cases that were good examples.The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine with focus on point of care applications.

Timothy Potter, Architect, Big Data Analytics, Dachis Group Scaling up Solr 4.1 to Power Big Search in Social Media Analytics

My presentation focuses on how we implemented Solr 4.1 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 500,000,000 documents and is growing by 3 to 4 million documents per day.

The presentation will include details about:

Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper
Operations concerns like how to handle a failed node and monitoring
How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput
Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4.1 is scalable, stable, and is production ready. (note: we are in production on 18 nodes in EC2 with a recent nightly build off the branch_4x).

Ingo Renner, Software Engineer, Infield Design CMS Integration of Apache Solr – How we did it.

TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and gaining traction in the U.S., too.

TYPO3 is a good example of how to integrate Solr with a CMS. The challenges we faced are typical of any CMS integration. We came up with solutions and ideas to these challenges and our hope is that they might be of help for other CMS integrations as well.

That includes content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and facetting, access restrictions, result presentation, and how to keep all these things flexible and re-usable for many different sites.

For all these things we used a couple additional Apache projects and we would like to show how we use them and how we contributed back to them while building our Solr integration.

David Smiley, Software Systems Engineer, Lead, MITRE Lucene / Solr 4 Spatial Deep Dive

Lucene's former spatial contrib is gone and in its place is an entirely new spatial module developed by several well-known names in the Lucene/Solr spatial community. The heart of this module is an approach in which spatial geometries are indexed using edge-ngram tokenized geohashes searched with a prefix-tree/trie recursive algorithm. It sounds cool and it is! In this presentation, you'll see how it works, why it's fast, and what new things you can do with it. Key features are support for multi-valued fields, and indexing shapes with area — even polygons, and support for various spatial predicates like "Within". You'll see a live demonstration and a visual representation of geohash indexed shapes. Finally, the session will conclude with a look at the future direction of the module.

David Smiley, Software Systems Engineer, Lead, MITRE Text Tagging with Finite State Transducers

OpenSextant is an unstructured-text geotagger. A core component of OpenSextant is a general-purpose text tagger that scans a text document for matching multi-word based substrings from a large dictionary. Harnessing the power of Lucene's state-of-the-art finite state transducer (FST) technology, the text tagger was able to save over 40x the amount of memory estimated for a leading in-memory alternative. Lucene's FSTs are elusive due to their technical complexity but overcoming the learning curve can pay off handsomely.

Marc Sturlese, Architect, Backend engineer, Trovit Batch Indexing and Near Real Time, keeping things fast

In this talk I will explain how we combine a mixed architecture using Hadoop for batch indexing and Storm, HBase and Zookeeper to keep our indexes updated in near real time.Will talk about why we didn't choose just a default Solr Cloud and it's real time feature (mainly to avoid hitting merges while serving queries on the slaves) and the advantages and complexities of having a mixed architecture. Both parts of the infrastucture and how they are coordinated will be explained with details.Finally will mention future lines, how we plan to use Lucene real time feature.

Tyler Tate, Cofounder, TwigKit Designing the Search Experience

Search is not just a box and ten blue links. Search is a journey: an exploration where what we encounter along the way changes what we seek. But in order to guide people along this journey, we must understand both the art and science of search.In this talk Tyler Tate, cofounder of TwigKit and coauthor of the new book Designing the Search Experience, weaves together the theories of information seeking with the practice of user interface design, providing a comprehensive guide to designing search.Pulling from a wealth of research conducted over the last 30 years, Tyler begins by establishing a framework of search and discovery. He outlines cognitive attributes of users—including their level of expertise, cognitive style, and learning style; describes models of information seeking and how they've been shaped by theories such as information foraging and sensemaking; and reviews the role that task, physical, social, and environmental context plays in the search process.

Tyler then moves from theory to practice, drawing on his experience of designing 50+ search user interfaces to provide practical guidance for common search requirements. He describes best practices and demonstrates reams of examples for everything from entering the query (including the search box, as-you-type suggestions, advanced search, and non-textual input), to the layout of search results (such as lists, grids, maps, augmented reality, and voice), to result manipulation (e.g. pagination and sorting) and, last but not least, the ins-and-outs of faceted navigation. Through it all, Tyler also addresses mobile interface design and how responsive design techniques can be used to achieve cross-platform search.This intensive talk will enable you to create better search experiences by equipping you with a well-rounded understanding of the theories of information seeking, and providing you with a sweeping survey of search user interface best practices.

Troy Thomas, Senior Manager, Internet Enabled Services, Synopsys & Koorosh Vakhshoori, Software Architect,Synopsys Make your GUI Shine with AJAX-Solr

With AJAX-Solr, you can implement widgets like faceting, auto-complete, spellchecker and pagination quickly and elegantly. AJAX-Solr is a JavaScript library that uses the Solr REST-like API to display search results in an interactive user interface. Come learn why we chose AJAX-Solr and Solr 4 for the SolvNet search project. Get an overview of the AJAX-Solr framework (Manager, Parameters, Widgets and Theming). Get a deeper understanding of the technical concepts using real-world examples. Best practices and lessons learned will also be presented.

Adrian Trenaman, Senior Software Engineer, Gilt Groupe Personalized Search on the Largest Flash Sale Site in America

Gilt Groupe is an innovative online shopping destination offering its members special access to the most inspiring merchandise, culinary offerings, and experiences every day, many at insider prices. Every day new merchandising is offered for sale at discounts of up to 70%. Sales start at 12 noon EST resulting in an avalanche of hits to the site, so delivering a rich user experience requires substantial technical innovation.

Implementing search for a flash-sales business, where inventory is limited and changes rapidly as our sales go live to a stampede of members every noon, poses a number of technical challenges. For example, with small numbers of fast moving inventory we want to be sure that search results reflect those products we still have available for sale. Also, personalizing search — where search listings may contain exclusive items that are available only to certain users — was also a big challenge

Gilt has built out keyword search using Scala, Play Framework and Apache Solr / Lucene. The solution, which involves less than 4,000 lines of code, comfortably provides search results to members in under 40ms. In this talk, we'll give a tour of the logical and physical architecture of the solution, the approach to schema definition for the search index, and how we use custom filters to perform personalization and enforce product availability windows. We'll discuss lessons learnt, and describe how we plan to adopt Solr to power sale, brand, category and search listings throughout all of Gilt's estate.

Doug Turnbull, Search and Big Data Architect, OpenSource Connections State Decoded: Empowering The Masses with Open Source State Law Search

The Law has traditionally been a topic dominated by an elite group of experts. Watch how State Decoded has transformed the law from a scary, academic topic to a friendly resource that empowers everyone using Apache Solr. This talk is a call to action for discovery and design to break open ivory towers of expertise by baking rich discovery into your UI and data structures.

LucidWorks™ Teams with MapR™… [Not 26% but 5-6% + not from Big Data]

Wednesday, February 20th, 2013

LucidWorks™ Teams with MapR™ Technologies to Offer Best-in-Class Big Data Analytics Solution

Performance Day just keeps on going!

From the press release:

REDWOOD CITY, Calif. – February 20, 2013 – Big Data provides a very real opportunity for organizations to drive business decisions by utilizing new information that has yet to be tapped. However, it is increasingly apparent that organizations are struggling to make effective use of this new multi-structured content for data-driven decision-making. According to a report from the Economist Intelligence Unit, the challenge is not so much the volume, but instead it is the pressing need to analyze and act on Big Data in real-time.

Existing business intelligence (BI) tools have simply not been designed to provide spontaneous search on multi-structured data in motion. Responding directly to this need, LucidWorks, the company transforming the way people access information, and MapR Technologies, the Hadoop technology leader, today announced the integration between LucidWorks Search™ and MapR. Available now, the combined solution allows organizations to easily search their MapR Distributed File System (DFS) in a natural way to discover actionable insights from information maintained in Hadoop.

“Organizations that wait to address big data until this evolution is well under way will lose out competitively in their vertical markets, compared to organizations that have aggressively pursued big data flexibility. Aggressive organizations will demonstrate faster, more accurate analysis and decisions relating to their tactical operations and strategic planning.”

  • Source: Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016, Gartner Group

Integration Solution Highlights

  • Combines the best of Big Data with Search with an integrated and fully distributed solution
  • Supports a pre-defined MapR target data source within LucidWorks Search
  • Enables users to create and configure the MapR data source directly from the LucidWorks Search administration console
  • Leverages enterprise security features offered by both MapR and LucidWorks Search

The Economist Intelligence Unit study found that global companies experienced a 26 percent improvement in performance over the last three years when big data analytics were applied to the decision-making process. And now, those data-savvy executives are forecasting a 41 percent improvement over the next three years. The integration between LucidWorks Search and MapR makes it easier to put Big Data analytics in motion.

I’m really excited about this match up but you know I can’t simply let claims like “…global companies experienced a 26 percent improvement in performance….” slide by. 😉

If you go read the report,
The Deciding Factor: Big Data & Decision Making
, you will find at page six (6):

On average, survey participants say that big data has improved their organisations’ performance in the past three years by 26%, and they are optimistic that it will improve performance by an average of 41% in the next three years. While “performance” in this instance is not rigorously specified, it is a useful gauge of mood.

The measured difference in performance, from:

firms that emphasise decision-making based on data and analytics performed 5-6% better—as measured by output and performance—than those that rely on intuition and experience for decision-making.

So, not 26% but 5-6% measured and the 5-6% is for decision-making on data and analytics, not big data.

You don’t find code written at either LucidWorks or MapR that is “close enough.” Both have well deserved reputations for clean code and hard work.

Why should communications fall short of that mark?

Searching for Dark Data

Tuesday, February 19th, 2013

Searching for Dark Data by Paul Doscher.

From the post:

We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation. The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.

Welcome to the world of Dark Data

Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it.

Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark.

Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.

The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.

Effective exploration of Dark Data will require something different from search tools that depend upon:

  • Pre-specified semantics (RDF) because Dark Data has no pre-specified semantics.
  • Structure because Dark Data has no structure.

Effective exploration of Dark Data will require:

Machine assisted-Interactive searching with gifted and grounded semantic comparators (people) creating pathways, tunnels and signposts into the wilderness of Dark Data.

I first saw this at: Delving into Dark Data.

Reflective Intelligence and Unnatural Acts

Thursday, December 13th, 2012

I wasn’t in the best of shape today but did manage to attend the webinar: Crowd Sourcing Reflected Intelligence Using Search and Big Data.

Not a lot of detail but there were two topics that caught my attention.

The first was “reflective intelligence,” that is a system that reflects the intelligence of the users back to other users.

Intelligence derived from tracking “clicks,” search terms, etc.

Question: How does your topic map solution “reflect” the intelligence of its users?

That is how do responses “improve” (by some measure) as a result of user interaction.

Could be measuring user behavior, what links do they select for particular query terms. (That is an example from the webinar.) Or could be users adding information, perhaps even suggesting/voting on merges.

The second riff that got my attention was a description of the software under discussion as:

“I don’t have to do unnatural acts.”

Is that like the Papa John’s “better ingredients?” Taken to imply that other pizzas use sub-par ingredients?

Or in this case, other software solutions require “unnatural acts?”

Interesting selling point.

What unusual properties would you claim for topic maps or topic map software?

Crowd Sourcing Reflected Intelligence Using Search and Big Data [Webinar]

Monday, December 3rd, 2012

Crowd Sourcing Reflected Intelligence Using Search and Big Data

Date: December 13, 2012

Time: 10:00 am PT / 1:00 pm ET

From the webpage:

Anyone interested in drawing insights from their Big Data repository/project/application should attend this informative webinar brought to you by MapR and LucidWorks. LucidWorks Search is a development platform that accelerates and simplifies building highly secure, scalable, and cost-effective search applications.

This webinar will show:

  • how search users’ search behavior can be mined
  • how big data analytics can be applied to that raw data
  • how to redeploy that data back to the users to improve their experience

Experts from MapR and Lucidworks will show the strengths of combining the easiest, most dependable and fastest distribution for Hadoop with the real-time, ad hoc data accessibility of LucidWorks Search to provide analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and user behavior.

Speakers: Grant Ingersoll, Chief Scientist for LucidWorks and Ted Dunning, Chief Application Architect for MapR.

I have seen Grant on video and it was great. If Ted is anywhere close to as good as Grant, this is going to be a webinar to remember!

LucidWorks Announces Lucene Revolution 2013

Sunday, November 18th, 2012

LucidWorks Announces Lucene Revolution 2013 by Paul Doscher, CEO of LucidWorks.

From the webpage:

LucidWorks, the trusted name in Search, Discovery and Analytics, today announced that Lucene Revolution 2013 will take place at The Westin San Diego on April 29 – May 2, 2013. Many of the brightest minds in open source search will convene at this 4th annual Lucene Revolution to discuss topics and trends driving the next generation of search. The conference will be preceded by two days of Apache Lucene, Solr and Big Data training.

BTW, the call for papers opened up on November 12, 2012, but you still have time left:

Jan. 13, 2013: CFP closes
Feb 1, 2013: Speakers notified

Searching Big Data’s Open Source Roots

Monday, October 22nd, 2012

Searching Big Data’s Open Source Roots by Nicole Hemsoth.

Nicole talks to Grant Ingersoll, Chief Scientist at LucidWorks, about the open source roots of big data.

No technical insights but a nice piece to pass along to the c-suite. Investment in open source projects can pay rich dividends. So long as you don’t need them next quarter. 😉

And a snapshot of where we are now, which is on the brink of new tools and capabilities in search technologies.

Proximity Operators [LucidWorks]

Thursday, August 16th, 2012

Proximity Operators

From the webpage:

A proximity query searches for terms that are either near each other or occur in a specified order in a document rather than simply whether they occur in a document or not.

You will use some of these operators more than others but having a bookmark to the documentation will prove to be useful.

Lucid Imagination become LucidWorks [Man Bites Dog Story]

Monday, August 13th, 2012

Lucid Imagination becomes LucidWorks

Soft news except for the note about the soon to appear (September, 2012).

And the company listening to users refer to it as LucidWorks and deciding to change the name of the company from Lucid Imagination to LucidWorks.

Sort of a man bites dog sort of story don’t your think?

Hurray for LucidWorks!

Makes me curious about the site. Likely to listen to users there as well.

Lucene Eurocon / ApacheCon Europe

Wednesday, August 8th, 2012

Lucene Eurocon / ApacheCon Europe November 5-8 | Sinsheim, Germany

From a post I got today from Lucid Imagination:

Lucid Imagination and the Apache Foundation have agreed to co-locate Lucid’s Apache Lucene EuroCon with ApacheCon Europe being held this November 5-8 in Sinsheim, Germany. Lucene EuroCon at ApacheCon Europe will cover the breadth and depth of search innovation and application. The dedicated track will bring together Apache Lucene/Solr committers and technologists from around the world to offer compelling presentations that share future directions for the project and technical implementation experiences. Topic examples include channeling the flood of structured and unstructured data into faster, more cost-effective Lucene/Solr search applications that span a host of sectors and industries.

Some of the most talented Lucene/Solr developers gather each year at Apache Lucene EuroCon to share best practices and create next-generation search applications. Coupling Apache Lucene EuroCon with this year’s ApacheCon Europe offers a great benefit to the community at large. The combined attendees benefit from expert trainings and in-depth sessions, real-world case studies, excellent networking and the opportunity to connect with the industry’s leading minds.

Call For Papers Deadline is August 13

The Call for Papers for ApacheCon has been extended to August 13, 2012, and can be found on the ApacheCon website. As always, proceeds from Apache Lucene EuroCon benefit The Apache Software Foundation. We encourage all Lucene/Solr committers and developers who have a technical story to tell to submit an abstract. Apache Lucene/Solr has a rich community of developers. Supporting ApacheCon Europe by submitting your abstract and sharing your story is important for maintaining this important and thriving community.

Just so you don’t think this is a search only event, papers are welcome on:

  • Apache Daily – Tools frameworks and components used on a daily basis
  • ApacheEE – Java enterprise projects
  • Big Data – Cassandra, Hadoop, HBase, Hive, Kafka, Mahout, Pig, Whirr, ZooKeeper and friends
  • Camel in Action – All things Apache Camel, from their problems to their solutions
  • Cloud – Cloud-related applications of a broad range of Apache projects
  • Linked Data – (need a concise caption for this track)
  • Lucene, SOLR and Friends – Learn about important web search technologies from the experts
  • Modular Java Applications – Using Felix, ACE, Karaf, Aries and Sling to deploy modular Java applications to public and private cloud environments
  • NoSQL Database – Use cases and recent developments in Cassandra, HBase, CouchDBa and Accumulo
  • OFBiz – The Apache Enterprise Automation project
  • Open Office – Open Office and the Apache Content Ecosystem
  • Web Infrastructure – HTTPD, TomCat and Traffic Server, the heart of many Internet projects

Submissions are welcome from any developer or user of Apache projects. First-time speakers are just as welcome as experienced ones, and we will do our best to make sure that speakers get all the help they need to give a great presentation.

Reducing Software Highway Friction

Thursday, June 7th, 2012

Lucid Imagination Search Product Offered in Windows Azure Marketplace

From the post:

Ease of use and flexibility are two key business drivers that are fueling the rapid adoption of cloud computing. The ability to disconnect an application from its supporting architecture provides a new level of business agility that has never before been possible. To ease the move towards this new realm of computing, integrated platforms have begun emerge that make cloud computing easier to adopt and leverage.

Lucid Imagination, a trusted name in Search, Discovery and Analytics, today announced that its LucidWorks Cloud product has been selected by Microsoft Corp. to be offered as a Search-as-a-Service product in Microsoft’s Windows Azure Marketplace. LucidWorks Cloud is a full cloud service version of its LucidWorks Enterprise platform. LucidWorks Cloud delivers full open source Apache Lucene/Solr community innovation with support and maintenance from the world’s leading experts in open source search. An extensible platform architected for developers, LucidWorks Cloud is the only Solr distribution that provides security, abstraction and pre-built connectors for essential enterprise data sources – along with dramatic ease of use advantages in a well-tested, integrated and documented package.

Example use cases for LucidWorks Cloud include Search-as-a-Service for websites, embedding search into SaaS product offerings, and Prototyping and developing cloud-based search-enabled applications in general.


Highlights of LucidWorks Cloud Search-as-a-Service

  • Sign-up for a plan and start building your search application in minute
  • Well-organized UI makes Apache Lucene/Solr innovation easier to consume and more adaptable to constant change
  • Create multiple search collections and manage them independently
  • Configure index and query settings, fields, stop words, synonyms for each collection
  • Built-in support for Hadoop, Microsoft SharePoint and traditional online content types
  • An open connector framework is available to customize access to other data sources
  • REST API automates and integrates search as a service with an application
  • Well-instrumented dashboard for infrastructure administration, monitoring and reporting
  • Monitored 24×7 by Lucid Development Operations insuring minimum downtime

Source: PR Newswire (

I find this deeply encouraging.

It is a step towards a diverse but reduced friction software highway.

The user community is not well served by uniform models for data, software or UIs.

The user community can be well served by a reduced friction software highway as they move data from application to application.

Microsoft has taken a large step towards a reduced friction software highway today. And it is appreciated!

Different ways to make auto suggestions with Solr

Monday, June 4th, 2012

Different ways to make auto suggestions with Solr

From the post:

Nowadays almost every website has a full text search box as well as the auto suggestion feature in order to help users to find what they are looking for, by typing the least possible number of characters possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typo errors. That’s a meaningful example which contains multi-term suggestions depending on the most popular queries, combined with spelling correction.

Starts with seven (7) questions you should ask yourself about auto-suggestions and then covers four methods for implementing them in Solr.

You can have the typical word completion seen in most search engines or you can be more imaginative, using custom dictionaries.

Lucene conference touches many areas of growth in search

Monday, May 14th, 2012

Lucene conference touches many areas of growth in search by Andy Oram.

From the post:

With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.

Hence the important of search engines like Solr, based on the Lucene library. Both are open source Apache projects, maintained by Lucid Imagination, a company founded to commercialize the underlying technology. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the ways much of the computer industry is headed.

Andy’s summary of the conference will make you wonder two things:

  1. Why weren’t you at the Lucene Revolution conference this year?
  2. Where are the videos from Lucene Revolution 2012?

I won’t ever be able to answer #1 but will post an answer to #2 as soon as it is available.

Dark Data

Sunday, May 13th, 2012

Lucid Imagination Combines Search, Analytics and Big Data to Tackle the Problem of Dark Data

This post was too well written to break up as quotes/excerpts. I am re-posting it in full.

Organizations today have little to no idea how much lost opportunity is hidden in the vast amounts of data they’ve collected and stored.  They have entered the age of total data overload driven by the sheer amount of unstructured information, also called “dark” data, which is contained in their stored audio files, text messages, e-mail repositories, log files, transaction applications, and various other content stores.  And this dark data is continuing to grow, far outpacing the ability of the organization to track, manage and make sense of it.

Lucid Imagination, a developer of search, discovery and analytics software based on Apache Lucene and Apache Solr technology, today unveiled LucidWorks Big Data. LucidWorks Big Data is the industry’s first fully integrated development stack that combines the power of multiple open source projects including Hadoop, Mahout, R and Lucene/Solr to provide search, machine learning, recommendation engines and analytics for structured and unstructured content in one complete solution available in the cloud.

Tweet This: Lucid Imagination combines #search, analytics and #BigData in complete stack. Beta now open

With LucidWorks Big Data, Lucid Imagination equips technologists and business users with the ability to initially pilot Big Data projects utilizing technologies such as Apache Lucene/Solr, Mahout and Hadoop, in a cloud sandbox. Once satisfied, the project can remain in the cloud, be moved on premise or executed within a hybrid configuration.  This means they can avoid the staggering overhead costs and long lead times associated with infrastructure and application development lifecycles prior to placing their Big Data solution into production.

The product is now available in beta. To sign up for inclusion in the beta program, visit

Dark Data Problem Is Real

How big is the problem of dark data? The total amount of digital data in the world will reach 2.7 zettabytes in 2012, a 48 percent increase from 2011.* 90 percent of this data will be unstructured or “dark” data. Worldwide, 7.5 quintillion bytes of data, enough to fill over 100,000 Libraries of Congress get generated every day. Conversely, that deep volume of data can serve to help predict the weather, uncover consumer buying patterns or even ease traffic problems – if discovered and analyzed proactively.

“We see a strong opportunity for search to play a key role in the future of data management and analytics,” said Matthew Aslett, research manager, data management and analytics, 451 Research. “Lucid’s Big Data offering, and its combination of large-scale data storage in Hadoop with Lucene/Solr-based indexing and machine-learning capabilities, provides a platform for developing new applications to tackle emerging data management challenges.”

LucidWorks Big Data

Data analytics has traditionally been the domain of business intelligence technologies. Most of these tools, however, have been designed to handle structured data such as SQL, and cannot easily tap into the broad range of data types that can be used in a Big Data application. With the announcement of LucidWorks Big Data, organizations will be able to utilize a single platform for their Big Data search, discovery and analytics needs. LucidWorks Big Data is the only complete platform that:

  • Combines the real time, ad hoc data accessibility of LucidWorks (Lucene/Solr) with compute and storage capabilities of Hadoop
  • Delivers commonly used analytic capabilities along with Mahout’s proven, scalable machine learning algorithms for deeper insight into both content and users
  • Tackles data, both big and small with ease, seamlessly scaling while minimizing the impact of provisioning Hadoop, LucidWorks and other components
  • Supplies a single, coherent, secure and well documented REST API for both application integration and administration
  • Offers fault tolerance with data safety baked in
  • Provides choice and flexibility, via on premise, cloud hosted or hybrid deployment solutions
  • Is tested, integrated and fully supported by the world’s leading experts in open source search.
  • Includes powerful tools for configuration, deployment, content acquisition, security, and search experience that is packaged in a convenient, well-organized application

Lucid Imagination’s Open Search Platform uncovers real-time insights from any enterprise data, whether structured in databases, unstructured in formats such as emails or social channels, or semi-structured from sources such as websites.  The company’s rich portfolio of enterprise-grade solutions is based on the same proven open source Apache Lucene/Solr technology that powers many of the world’s largest e-commerce sites. Lucid Imagination’s on-premise and cloud platforms are quicker to deploy, cost less than competing products and are more easily tailored to specific needs than business intelligence solutions because they leverage innovation from the open source community.  

“We’re allowing a broad set of enterprises to test and implement data discovery and analysis projects that have historically been the province of large multinationals with large data centers. Cloud computing and LucidWorks Big Data finally level the field,” said Paul Doscher, CEO of Lucid Imagination. “Large companies, meanwhile, can use our Big Data stack to reduce the time and cost associated with evaluating and ultimately implementing big data search, discovery and analysis. It’s their data – now they can actually benefit from it.”

LucidWorks 2.1

Thursday, April 26th, 2012

LucidWorks 2.1

There are times, not very often, when picking only a few features to report would be unfair to a product.

This is one of those times.

I have reproduced the description of LucidWorks 2.1 as it appears on the Lucid Imagination site:

LucidWorks 2.1 new features list:

Enhancement Areas Key Benefits

Includes the latest Lucene/Solr 4.0

  • Near Real Time
  • Fault Tolerance and High Availability
  • Data Durability
  • Centralized Configuration
  • Elasticity

Business Rules

  • Integrate your business processes and rules with the user search experience
  • Examples: Landing Pages, provide targeted search results per user, etc.
  • Framework to integrate with your BRMS (Business Rules Management System)
  • OOB integration with leading open source BRMS – Drools

Upgrade and Migrations

  • Lucid can help upgrade customers from Solr 3.x to 4.0 or older Solr versions to LucidWorks 2.1
  • Upgrades for existing LucidWorks customers on previous versions of LucidWorks to LucidWorks 2.1

Enhanced Connector Framework

  • Easily build integrations to index data from any application or data sources
  • Framework supports REST API driven integration, generates dynamic configuration UI, and allows admins to schedule the new connectors
  • Connectors available to crawl large amounts of HDFS data, integrate twitter updates into index, and CMIS connector to support CMS systems like Alfresco, etc.

Efficient Crawl of Large web content

  • OOB integration for Nutch  (open source)
  • Helps crawl Webscale data into your index

REST API and UI Enhancements

  • Supports memory and cache settings, schema less configuration using Dynamic fields from UI
  • Subject Matter Experts can create Best Bets for improved search experience

Key features and benefits of LucidWorks search platform

  • Streamlined search configuration, optimization and operations: Well-organized UI makes Solr innovation easier to consume, better adapting to constant change.
  • Enterprise-grade, business critical manageability Includes tools for infrastructure administration, monitoring and reporting so your search application can thrive within a well-defined, well-managed operational environment; includes upgradability across successive releases. We can help migrate Solr installations to LucidWorks 2.1.
  • Broad-based content acquisition Access big data and enterprise content faster and more securely with built-in support for Hadoop and Amazon S3, along with Sharepoint and traditional online content types – plus a new open connector framework to customize access to other data sources
  • Versatile access and data security Flexible, resilient built-in security simplifies getting search connected right to the right data and content
  • Advanced search experience enhancements Powerful, innovative search capabilities deliver faster, better, more useful results for a richer user experience; easily integrates into your application and infrastructure; REST API automates and integrates search as a service with your application.
  • Open source power and innovation Complete, supported release of Lucene/Solr 4.0, including latest innovations in Near Real Time search, distributed indexing and more versatile field faceting over and above Apache Lucene/Solr 3.x; all the flexibility of open source, packaged for business-critical development, maintenance and deployment
  • Cost-effective commercial grade expertise & Global 24×7 Support a range of annual support subscriptions including bundled services, consulting, training and certification from the world’s leading experts in Lucene/Solr open source.

Optimizing Findability in Lucene and Solr

Sunday, January 1st, 2012

Optimizing Findability in Lucene and Solr

From the post:

To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.

Table of Contents

Planning for Findability
Knowing your Content
Knowing your Users
Garbage In, Garbage Out
Analyzing your Analysis
Stemming In Greater Detail
Query Techniques for Better Search
Navigation Hints
Final Thoughts

by Grant Ingersoll

You know when a blog post starts off with a table of contents it is long. Fortunately in this case, it is also very good. By one of the principal architects of Lucene, Grant Ingersoll.

A good start on developing findability skills but as the post points out, a lot of it will depend on your knowledge of what “findability” means to your users. Only you can answer that question.

LucidWorks Enterprise 2.0.1 Release

Friday, December 30th, 2011

LucidWorks Enterprise 2.0.1 Release

From the post:

LucidWorks Enterprise 2.0.1 is an interim bug-fix release. We’ve have resolved couple of critical bugs and LDAP integration issues. The list of issues resolved with this updates are available here.

Relevancy Driven Development with Solr

Thursday, December 1st, 2011

Relevancy Driven Development with Solr by Robin Bramley.

From the post:

The relevancy of search engine results is very subjective so therefore testing the relevancy of queries is also subjective. One technique that exists in the information retrieval field is the use of judgement lists; an alternative approach discussed here is to follow the Behaviour Driven Development methodology employing user story acceptance criteria – I’ve been calling this Relevancy Driven Development or RDD for short.

I’d like to thank Eric Pugh for a great discussion on search engine testing and for giving me a guest slot in his ‘Better Search Engine Testing‘ talk* at Lucene EuroCon Barcelona 2011 to mention RDD. The first iteration of Solr-RDD combines my passion for automated testing with my passion for Groovy by leveraging EasyB (a Groovy BDD testing framework).

The Solr-RDD GitHub site comes closer to the expectations of the project:

The aim of RDD is to allow the business users to gain confidence in the relevancy of the search query results.

The trick is that the business users can use a constrained data set, define a query and the results they expect in the order that they expect.

Well…, maybe. Two things of concern:

First, a user would have to “know” the data extremely well to formulate queries in that sort of detail, and

Second, it does not appear to leave any room for unexpected information that might also be useful to the user.

Perhaps this is a technique that works well with very well known data sets with few if any unexpected results.

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Tuesday, November 8th, 2011

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.



Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon? 😉

On slide 4 (I said to review the slides), Grant presents four overlapping areas, starting with Documents: Models, Feature Selection; Content Relationships: Page Rank, etc., Organization; Queries: Phrases, NLP; User Interaction: Clicks, Ratings/Reviews, Learning to Rank, Social Graph; and the intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant say to mine that intersection is a loop composed of: Search -> Discovery -> Analytics -> (back to Search). All of which involve processing of data that has been collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like, may find interesting data, may not.

My point being that there is no way to capture the user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson. So that the next user who searches for LBJ will find the information contributed by a prior user. Such as distinguishing application of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is in just the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

What if a colleague found a bug in program code, but you have to start at the beginning of the program and work your way there. Good use of your time? To reset with every user? That is what happens with search, nearly a complete reset. (Not complete because of page rank, etc. but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*”Clicks” are not insights. Could be an unsteady hand, DTs, etc.

Solr and LucidWorks Enterprise: When to use each

Wednesday, September 28th, 2011

Solr and LucidWorks Enterprise: When to use each

From the post:

If LucidWorks Enterprise is built on Solr, how do you know which one to use when for your own circumstances? This article describes the difference between using straight Solr, using the LucidWorks Enterprise user interface, and using LucidWorks Enterprise’s ReST API for accomplishing various common tasks so you can see which fits your situation at a given moment.

In today’s world, building the perfect product is a lot like trying to repair a set of train tracks while the train is barreling down on you. The world just keeps moving, with great ideas and new possibilities tempting you every day. And to make things worse, innovation doesn’t just show its face for you; it regularly visits your competitors as well.

That’s why you use open source software in the first place. You have smart people; does it make sense to have them building search functionality when Apache Solr already provides it? Of course not. You’d rather rely on the solid functionality that’s already been built by the community of Solr developers, and let your people spend their time building innovation into your own products. It’s simply a more efficient use of resources.

But what if you need search-related functionality that’s not available in straight Solr? In some cases, you may be able to fill those holes and lighten your load with LucidWorks Enterprise. Built on Solr, LucidWorks Enterprise starts by simplifying the day-to-day use tasks involved in using Solr, and then moves on to adding additional features that can help free up your development team for work on your own applications. But how do you know which path would be right for you?

Since I posted the LucidWorks 2.0 announcement yesterday, I thought this might be helpful in terms of its evaluation. I did not see a date on it but it looks current enough.