Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 14, 2013

Solr vs ElasticSearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 7:14 pm

Solr vs ElasticSearch by Ryan Tabora.

Ryan evaluates Solr and ElasticSearch (both based on Lucene) in these categories:

  1. Foundations
  2. Coordination
  3. Shard Splitting
  4. Automatic Shard Rebalancing
  5. Schema
  6. Schema Creation
  7. Nested Typing
  8. Queries
  9. Distributed Group By
  10. Percolation Queries
  11. Community
  12. Vendor Support

As Ryan points out, making a choice between Solr and ElasticSearch requires detailed knowledge of your requirements.

If you are a developer, I would suggest following Lucene, as well as Solr and ElasticSearch.

No one tool is going to be the right tool for every job.

July 9, 2013

The Blur Project: Marrying Hadoop with Lucene

Filed under: Hadoop,Lucene — Patrick Durusau @ 3:40 pm

The Blur Project: Marrying Hadoop with Lucene by Aaron McCurry.

From the post:

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

(…)

Blur was initially released on GitHub as an Apache-licensed project and was then accepted into the Apache Incubator in July 2012, with Patrick Hunt as its champion. Since then, Blur as a software project has matured and become much more stable. One of the major milestones over the past year has been the upgrade to Lucene 4, which has brought many new features and massive performance gains.

Recently there has been some interest in folding some of Blur’s code (HDFSDirectory and BlockCache) back into the Lucene project for others to utilize. This is an exciting development that legitimizes some of the approaches we have taken to date. We are in conversations with some members of the Lucene community, such as Mark Miller, to figure out how we can best work together to benefit both the fledgling Blur project and the better known and more widely used Lucene project.

Blur’s community is small but growing. Our project goals are to continue to grow our community and graduate from the Incubator project. Our technical goals are to continue to add features that perform well at scale while maintaining the fault tolerance that is required of any modern distributed system.

We welcome your contributions at http://incubator.apache.org/blur/!

Another exciting Apache project that needs contributors!

July 8, 2013

Querying ElasticSearch – A Tutorial and Guide

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 6:59 pm

Querying ElasticSearch – A Tutorial and Guide by Rufus Pollock.

From the post:

ElasticSearch is a great open-source search tool that’s built on Lucene (like Solr) but is natively JSON + RESTful. It’s been used quite a bit at the Open Knowledge Foundation over the last few years. Plus, as it’s easy to set up locally, it’s an attractive option for digging into data on your local machine.

While its general interface is pretty natural, I must confess I’ve sometimes struggled to find my way around ElasticSearch’s powerful, but also quite complex, query system and the associated JSON-based “query DSL” (domain specific language).

This post therefore provides a simple introduction and guide to querying ElasticSearch that provides a short overview of how it all works together with a good set of examples of some of the most standard queries.

(…)
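Rufus works the query DSL over HTTP with curl. If you would rather drive the same queries from Java, here is a hedged sketch using the TransportClient of that era (0.90.x); the host, index and field names are my assumptions, not taken from the post:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class QueryExample {
    public static void main(String[] args) {
        // Connect to a local node (0.90-era client; later versions differ).
        TransportClient client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        // Roughly equivalent to POSTing {"query": {"match": {"title": "open data"}}}
        // to /datasets/_search with the JSON query DSL.
        SearchResponse response = client.prepareSearch("datasets")
            .setQuery(QueryBuilders.matchQuery("title", "open data"))
            .setSize(10)
            .execute()
            .actionGet();
        System.out.println("total hits: " + response.getHits().getTotalHits());
        client.close();
    }
}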

This is a very nice introduction to ElasticSearch.

Read, bookmark and pass it along!

June 30, 2013

Solr Authors, A Suggestion

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 3:01 pm

I am working my way through a recent Solr publication. It reproduces some, but not all of the output of queries.

But it remains true that the output of queries is a sizeable portion of the text.

Suggestion: Could the queries be embedded in PDF text as hyperlinks?

Thus: http://localhost:8983/solr/select?q=*:*&indent=yes.

If I have Solr running, etc., the full results show up in my browser and save page space. Perhaps resulting in room for more analysis or examples.

There may be a very good reason to not follow my suggestion so it truly is a suggestion.

If there is a question of verifying the user’s results, perhaps a separate PDF of results keyed to the text?

That could be fuller results and at the same time allow the text to focus on substantive material.

June 27, 2013

Apache Solr volume 1 -….

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 1:13 pm

Apache Solr V[olume] 1 – Introduction, Features, Recency Ranking and Popularity Ranking by Ramzi Alqrainy.

I amended the title to expand “v” to “volume.” Just seeing the “v” made me think “version.” Not true in this case.

Nothing new or earthshaking but a nice overview of Solr.

It is a “read along” slide deck so the absence of a presenter won’t impair its usefulness.

June 23, 2013

A new Lucene suggester based on infix matches

Filed under: Lucene,Search Engines — Patrick Durusau @ 8:39 am

A new Lucene suggester based on infix matches by Michael McCandless.

From the post:

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, has been an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.
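If you want to try the API itself, here is a minimal sketch assuming the Lucene 4.4-era signatures from LUCENE-4845 (treat the constructor and build() details as assumptions and check the javadocs):

import java.io.File;
import java.io.FileInputStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.search.suggest.Lookup.LookupResult;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.util.Version;

public class InfixDemo {
    public static void main(String[] args) throws Exception {
        AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
            Version.LUCENE_44, new File("/tmp/infix-index"),
            new StandardAnalyzer(Version.LUCENE_44));
        // suggestions.txt: one "suggestion<TAB>weight" pair per line.
        suggester.build(new FileDictionary(new FileInputStream("suggestions.txt")));
        // Matches tokens anywhere in the suggestion, hence "infix":
        // 5 results, all terms required, highlighting on.
        for (LookupResult result : suggester.lookup("lucene sugg", 5, true, true)) {
            System.out.println(result.key + " (weight=" + result.value + ")");
        }
    }
}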

Lucene is a flagship open source project. It just keeps pushing the boundaries of its area of interest.

June 22, 2013

Lucene/Solr Revolution EU 2013

Filed under: Conferences,Lucene,LucidWorks,Solr — Patrick Durusau @ 4:49 pm

Lucene/Solr Revolution EU 2013

November 4-7, 2013
Dublin, Ireland

Abstract Deadline: August 2, 2013.

From the webpage:

LucidWorks is proud to present Lucene/Solr Revolution EU 2013, the biggest open source conference dedicated to Apache Lucene/Solr.

The conference, held in Dublin, Ireland on November 4-7, will be packed with technical sessions, developer content, user case studies, and panels. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology.

From the call for papers:

The Call for Papers for Lucene/Solr Revolution EU 2013 is now open.

Lucene/Solr Revolution is the biggest open source conference dedicated to Apache Lucene/Solr. The great content delivered by speakers like you is the heart of the conference. If you are a practitioner, business leader, architect, data scientist or developer and have something important to share, we welcome your submission.

We are particularly interested in compelling use cases and success stories, best practices, and technology insights.

Don’t be shy!

June 20, 2013

Lucene/Solr Revolution 2013 San Diego (Video Index)

Filed under: Lucene,LucidWorks,Solr — Patrick Durusau @ 6:28 pm

Videos from Lucene/Solr Revolution 2013 San Diego (April 29th – May 2nd, 2013)

Sorted by author, duplicates removed, etc.

These videos merit far more views than they have today. Pass this list along.

Work through the videos and related docs. There are governments out there that want useful search results.


James Atherton, Search Team Lead, 7digital Implementing Search with Solr at 7digital

A usage/case study, describing our journey as we implemented Lucene/Solr, the lessons we learned along the way and where we hope to go in the future. How we implemented our instant search/search suggest. How we handle trying to index 400 million tracks and metadata for over 40 countries, comprising over 300GB of data, and about 70GB of indexes. Finally, where we hope to go in the future.


Ben Brown, Software Architect, Cerner Corporation Brahe – Mass scale flexible indexing

Our team made their first foray into Solr building out Chart Search, an offering on top of Cerner's primary EMR to help make search over a patient's chart smarter and easier. After bringing on over 100 client hospitals and indexing many tens of billions of clinical documents and discrete results we've (thankfully) learned a couple of things.

The traditional hashed document ID over many shards and no easily accessible source of truth doesn't make for a flexible index.
Learn the finer points of the strategy where we shifted our source of truth to HBase. How we deploy new indexes with the click of a button, take an existing index and expand the number of shards on the fly, and several other fancy features we enabled.


Paul Doscher, CEO LucidWorks Lucene Revolution 2013, Opening Remarks – Paul Doscher, CEO LucidWorks


Ted Dunning, Chief Application Architect, MapR & Grant Ingersoll, Chief Technology Officer, LucidWorks Crowd-sourced intelligence built into Search over Hadoop

Search has quickly evolved from being an extension of the data warehouse to being run as a real-time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like Solr, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in real time.


Stephane Gamard, Chief Technology Officer, Searchbox How to make a simple cheap high-availability self-healing Solr cluster

In this presentation we aim to show how to make a high-availability Solr cloud with 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure which is self-healing using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relation between collections, Solr cores, shards, and cluster nodes. We continue with an introduction to Solr 4.x clustering using ZooKeeper, with a particular emphasis on cluster state status/monitoring and Solr collection configuration. The core of our presentation will be demonstrated using a live cluster.

We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto-generate new nodes, attach them to the cluster, and assign them shards (selecting between missing shards or replication for HA). We will show that using a high replication factor it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management of the architecture. Future work discussions, which might be engaged using an open source effort, include monitoring activity of individual nodes so as to scale the cluster according to traffic and usage.


Trey Grainger, Search Technology Development Manager, CareerBuilder Building a Real-time, Big Data Analytics Platform with Solr

Having "big data" is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.

At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You'll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.

The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.


Chris Hostetter (aka Hoss) Stump The Chump: Get On The Spot Solutions To Your Real Life Lucene/Solr Challenges

Got a tough problem with your Solr or Lucene application? Facing challenges that you'd like some advice on? Looking for new approaches to overcome a Lucene/Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you.

Now, you can get your questions answered live, in front of an audience of hundreds of Lucene Revolution attendees! Back again by popular demand, "Stump the Chump" at Lucene Revolution 2013 puts Chris Hostetter (aka Hoss) in the hot seat to tackle questions live.

All you need to do is send in your questions to us at stump@lucenerevolution.org. You can ask anything you like, but consider topics in areas like: data modelling, query parsing, tricky faceting, text analysis, scalability.

You can email your questions to stump@lucenerevolution.org. Please describe in detail the challenge you have faced and the approach you have taken to solve the problem. Anything related to Solr/Lucene is fair game.

Our moderator, Steve Rowe, will read the questions, and Hoss will have to formulate a solution on the spot. A panel of judges will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question – and for those deemed to have "Stumped the Chump".


Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd Building a Near Real-time Search Engine and Analytics for logs using Solr

Consolidating and indexing logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since log events are mostly small, around 200 bytes to a few KB, they are harder to handle: the smaller the log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk will cover the following items:

Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.


Mikhail Khludnev, eCommerce Search Platform, Grid Dynamics Concept Search for eCommerce with Solr

This talk describes our experience in eCommerce search: the challenges we have faced and the approaches we chose. It is not intended to be a full description of the implementation, because too many details would need to be touched. The talk is more a problem statement and a description of general solutions, which have a number of points for technical or even academic discussion. It is focused on the text search use case; structured (or scoped) search is out of scope, as is faceted navigation.


Hilary Mason, Chief Scientist, bitly Search is not a solved problem.


Remi Mikalsen, Search Engineer, The Norwegian Centre for ICT in Education Multi-faceted responsive search, autocomplete, feeds engine and logging

Learn how utdanning.no leverages open source technologies to deliver a blazing fast multi-faceted responsive search experience and a flexible and efficient feeds engine on top of Solr 3.6. Among the key open source projects that will be covered are Solr, Ajax-Solr, SolrPHPClient, Bootstrap, jQuery and Drupal. Notable highlights are ajaxified pivot facets, multiple-parent hierarchical facets, ajax autocomplete with edge-n-gram and grouping, integrating our search widgets on any external website, custom Solr logging and using Solr to deliver Atom feeds. utdanning.no is a governmental website that collects, normalizes and publishes study information related to secondary school and higher education in Norway. With 1.2 million visitors each year and 12,000 indexed documents we focus on precise information and a high degree of usability for students, potential students and counselors.


Mark Miller, Software Engineer, Cloudera SolrCloud: the 'Search First' NoSQL database

As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling — two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, the CAP theorem and other hype-filled terms, then this talk may be for you.


Dragan Milosevic, Senior Architect, zanox Analytics in OLAP with Lucene and Hadoop

Analytics powered by Hadoop is a powerful tool, and this talk addresses its application in OLAP built on top of Lucene. Many applications use Lucene indexes also for storing data, to alleviate challenges concerned with external data sources. Analysis of queries can reveal stored fields that are in most cases accessed together. If one binary compressed field replaces those fields, the amount of data to be loaded is reduced and the processing of queries is boosted. Furthermore, documents that are frequently loaded together can be identified. If those documents are saved in nearly adjacent positions in Lucene stored files, the benefit from file-system caches improves and the loading of documents is noticeably faster.

Large-scale search applications typically deploy sharding and partition documents by hashing. The implemented OLAP system has shown that such hash-based partitioning is not always optimal. An alternative partitioning, supported by analytics, has been developed. It places documents that are frequently used together in the same shards, which maximizes the amount of work that can be done locally and reduces the communication overhead among searchers. As an extra bonus, it also identifies slow queries that typically point to emerging trends, and suggests the addition of optimized searchers for handling similar queries.


Christian Moen, Software Engineer, Atilika Inc. Language support and linguistics in Lucene/Solr and its eco-system

In search, language handling is often key to getting a good search experience. This talk gives an overview of language handling and linguistics functionality in Lucene/Solr and best-practices for using them to handle Western, Asian and multi-language deployments. Pointers and references within the open source and commercial eco-systems for more advanced linguistics and their applications are also discussed.

The presentation is mix of overview and hands-on best-practices the audience can benefit immediately from in their Lucene/Solr deployments. The eco-system part is meant to inspire how more advanced functionality can be developed by means of the available open source technologies within the Apache eco-system (predominantly) while also highlighting some of the commercial options available.


Chandra Mouleeswaran, Co Chair at Intellifest.org, ThreatMetrix Rapid pruning of search space through hierarchical matching

This talk will present our experiences in applying Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc. handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. Details on introducing a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision will be presented.


Kathy Phillips, Enterprise Search Services Manager/VP, Wells Fargo & Co. & Tom Lutmer, eBusiness Systems Consultant, Enterprise Search Services team, Wells Fargo & Co Beyond simple search — adding business value in the enterprise

What is enterprise search? Is it a single search box that spans all enterprise resources or is it much more than that? Explore how enterprise search applications can move beyond simple keyword search to add unique business value. Attendees will learn about the benefits and challenges to different types of search applications such as site search, interactive search, search as business intelligence, and niche search applications. Join the discussion about the possibilities and future direction of new business applications within the enterprise.


David Piraino and Daniel Palmer, Chief Imaging Information Officers, Imaging Institute Cleveland Clinic, Cleveland Clinic Next Generation Electronic Medical Records and Search: A Test Implementation in Radiology

Most patient-specific medical information is document oriented with varying amounts of associated meta-data. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present textual information to users in the most understandable ways. Present EMRs show information to the user in a reverse-time-oriented, patient-specific manner only. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.

Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine if "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds, and by the number of cases that illustrated the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.

An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples and the lowest match search showed 2 out of 10 cases that were good examples. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine with focus on point of care applications.


Timothy Potter, Architect, Big Data Analytics, Dachis Group Scaling up Solr 4.1 to Power Big Search in Social Media Analytics

My presentation focuses on how we implemented Solr 4.1 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 500,000,000 documents and is growing by 3 to 4 million documents per day.

The presentation will include details about:

Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper
Operations concerns like how to handle a failed node and monitoring
How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput
Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4.1 is scalable, stable, and production ready. (Note: we are in production on 18 nodes in EC2 with a recent nightly build off the branch_4x.)


Ingo Renner, Software Engineer, Infield Design CMS Integration of Apache Solr – How we did it.

TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and gaining traction in the U.S., too.

TYPO3 is a good example of how to integrate Solr with a CMS. The challenges we faced are typical of any CMS integration. We came up with solutions and ideas to these challenges and our hope is that they might be of help for other CMS integrations as well.

That includes content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and faceting, access restrictions, result presentation, and how to keep all these things flexible and re-usable for many different sites.

For all these things we used a couple additional Apache projects and we would like to show how we use them and how we contributed back to them while building our Solr integration.


David Smiley, Software Systems Engineer, Lead, MITRE Lucene / Solr 4 Spatial Deep Dive

Lucene's former spatial contrib is gone and in its place is an entirely new spatial module developed by several well-known names in the Lucene/Solr spatial community. The heart of this module is an approach in which spatial geometries are indexed using edge-ngram tokenized geohashes searched with a prefix-tree/trie recursive algorithm. It sounds cool and it is! In this presentation, you'll see how it works, why it's fast, and what new things you can do with it. Key features are support for multi-valued fields, and indexing shapes with area — even polygons, and support for various spatial predicates like "Within". You'll see a live demonstration and a visual representation of geohash indexed shapes. Finally, the session will conclude with a look at the future direction of the module.


David Smiley, Software Systems Engineer, Lead, MITRE Text Tagging with Finite State Transducers

OpenSextant is an unstructured-text geotagger. A core component of OpenSextant is a general-purpose text tagger that scans a text document for matching multi-word based substrings from a large dictionary. Harnessing the power of Lucene's state-of-the-art finite state transducer (FST) technology, the text tagger was able to save over 40x the amount of memory estimated for a leading in-memory alternative. Lucene's FSTs are elusive due to their technical complexity but overcoming the learning curve can pay off handsomely.


Marc Sturlese, Architect, Backend engineer, Trovit Batch Indexing and Near Real Time, keeping things fast

In this talk I will explain how we combine a mixed architecture using Hadoop for batch indexing and Storm, HBase and ZooKeeper to keep our indexes updated in near real time. I will talk about why we didn't choose just a default SolrCloud and its real-time feature (mainly to avoid hitting merges while serving queries on the slaves) and the advantages and complexities of having a mixed architecture. Both parts of the infrastructure, and how they are coordinated, will be explained in detail. Finally, I will mention future directions, such as how we plan to use Lucene's real-time features.


Tyler Tate, Cofounder, TwigKit Designing the Search Experience

Search is not just a box and ten blue links. Search is a journey: an exploration where what we encounter along the way changes what we seek. But in order to guide people along this journey, we must understand both the art and science of search. In this talk Tyler Tate, cofounder of TwigKit and coauthor of the new book Designing the Search Experience, weaves together the theories of information seeking with the practice of user interface design, providing a comprehensive guide to designing search. Pulling from a wealth of research conducted over the last 30 years, Tyler begins by establishing a framework of search and discovery. He outlines cognitive attributes of users—including their level of expertise, cognitive style, and learning style; describes models of information seeking and how they've been shaped by theories such as information foraging and sensemaking; and reviews the role that task, physical, social, and environmental context plays in the search process.

Tyler then moves from theory to practice, drawing on his experience of designing 50+ search user interfaces to provide practical guidance for common search requirements. He describes best practices and demonstrates reams of examples for everything from entering the query (including the search box, as-you-type suggestions, advanced search, and non-textual input), to the layout of search results (such as lists, grids, maps, augmented reality, and voice), to result manipulation (e.g. pagination and sorting) and, last but not least, the ins-and-outs of faceted navigation. Through it all, Tyler also addresses mobile interface design and how responsive design techniques can be used to achieve cross-platform search. This intensive talk will enable you to create better search experiences by equipping you with a well-rounded understanding of the theories of information seeking, and providing you with a sweeping survey of search user interface best practices.


Troy Thomas, Senior Manager, Internet Enabled Services, Synopsys & Koorosh Vakhshoori, Software Architect, Synopsys Make your GUI Shine with AJAX-Solr

With AJAX-Solr, you can implement widgets like faceting, auto-complete, spellchecker and pagination quickly and elegantly. AJAX-Solr is a JavaScript library that uses the Solr REST-like API to display search results in an interactive user interface. Come learn why we chose AJAX-Solr and Solr 4 for the SolvNet search project. Get an overview of the AJAX-Solr framework (Manager, Parameters, Widgets and Theming). Get a deeper understanding of the technical concepts using real-world examples. Best practices and lessons learned will also be presented.


Adrian Trenaman, Senior Software Engineer, Gilt Groupe Personalized Search on the Largest Flash Sale Site in America

Gilt Groupe is an innovative online shopping destination offering its members special access to the most inspiring merchandise, culinary offerings, and experiences every day, many at insider prices. Every day new merchandise is offered for sale at discounts of up to 70%. Sales start at 12 noon EST, resulting in an avalanche of hits to the site, so delivering a rich user experience requires substantial technical innovation.

Implementing search for a flash-sales business, where inventory is limited and changes rapidly as our sales go live to a stampede of members every noon, poses a number of technical challenges. For example, with small numbers of fast-moving inventory we want to be sure that search results reflect those products we still have available for sale. Personalizing search, where search listings may contain exclusive items that are available only to certain users, was also a big challenge.

Gilt has built out keyword search using Scala, Play Framework and Apache Solr / Lucene. The solution, which involves less than 4,000 lines of code, comfortably provides search results to members in under 40ms. In this talk, we'll give a tour of the logical and physical architecture of the solution, the approach to schema definition for the search index, and how we use custom filters to perform personalization and enforce product availability windows. We'll discuss lessons learnt, and describe how we plan to adopt Solr to power sale, brand, category and search listings throughout all of Gilt's estate.


Doug Turnbull, Search and Big Data Architect, OpenSource Connections State Decoded: Empowering The Masses with Open Source State Law Search

The Law has traditionally been a topic dominated by an elite group of experts. Watch how State Decoded has transformed the law from a scary, academic topic to a friendly resource that empowers everyone using Apache Solr. This talk is a call to action for discovery and design to break open ivory towers of expertise by baking rich discovery into your UI and data structures.

June 19, 2013

Screaming fast Lucene searches using C++ via JNI

Filed under: C/C++,Lucene — Patrick Durusau @ 6:26 pm

Screaming fast Lucene searches using C++ via JNI by Michael McCandless.

From the post:

At the end of the day, when Lucene executes a query, after the initial setup the true hot-spot is usually rather basic code that decodes sequential blocks of integer docIDs, term frequencies and positions, matches them (e.g. taking union or intersection for BooleanQuery), computes a score for each hit and finally saves the hit if it’s competitive, during collection.

Even apparently complex queries like FuzzyQuery or WildcardQuery go through a rewrite process that reduces them to much simpler forms like BooleanQuery.

Lucene’s hot-spots are so simple that optimizing them by porting them to native C++ (via JNI) was too tempting!

So I did just that, creating the lucene-c-boost github project, and the resulting speedups are exciting:

(…)

Speedups range from 0.7X to 7.8X.

Read Michael’s post for explanations, warnings, caveats, etc.

But it is exciting news!

Fundamentals of Information Retrieval: Illustration with Apache Lucene

Filed under: Information Retrieval,Lucene — Patrick Durusau @ 10:54 am

Fundamentals of Information Retrieval: Illustration with Apache Lucene by Majirus FANSI.

From the description:

Information Retrieval is becoming the principal means of access to information. It is now common for web applications to provide an interface for free-text search. In this talk we start by describing the scientific underpinnings of information retrieval. We review the main models on which search tools are based, i.e. the Boolean model and the Vector Space Model. We illustrate our talk with a web application based on Lucene. We show that Lucene combines both the Boolean and vector space models.

The presentation will give an overview of what Lucene is, and where and how it can be used. We will cover the basic Lucene concepts (index, directory, document, field, term), text analysis (tokenizing, token filtering, stop words), indexing (how to create an index, how to index documents), and searching (how to run keyword, phrase, Boolean and other queries). We'll inspect Lucene indices with Luke.

After this talk, the attendee will get the fundamentals of IR as well as how to apply them to build a search application with Lucene.

I am assuming that the random lines in the background of the slides are an artifact of the recording. Quite annoying.

Otherwise, a great presentation!

June 18, 2013

Apache Lucene / Solr 4.3.1 Release!

Filed under: Lucene,Solr — Patrick Durusau @ 12:34 pm

Lucene 4.3.1: Lucene CHANGES.txt

Solr 4.3.1: Solr CHANGES.txt

There was a time when we all waited endlessly for bug fixes and patches.

Now that open source projects deliver them on a routine basis, have your upgrade habits changed?

Just curious.

June 17, 2013

Semantic Diversity – Special Characters

Filed under: Lucene,Neo4j,Programming,Semantic Diversity — Patrick Durusau @ 8:16 am

neo4j/cypher/Lucene: Dealing with special characters by Mark Needham.

Mark outlines how to handle “special characters” in Lucene (the indexer for Neo4j), only to find that the escape character for a Lucene query is also a special character for Cypher, which itself must be escaped.
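For the Lucene half there is a built-in escaper, and the backslash doubling below sketches Mark’s point that Lucene’s escape character must then itself be escaped for Cypher. The package path is the Lucene 4.x one; the Lucene bundled with older Neo4j releases keeps QueryParser elsewhere:

import org.apache.lucene.queryparser.classic.QueryParser;

public class EscapeDemo {
    public static void main(String[] args) {
        String raw = "AB-110 (v2)";
        // Escapes Lucene query syntax characters such as -, (, ), *, ? and \
        String lucene = QueryParser.escape(raw);      // AB\-110 \(v2\)
        // Inside a Cypher string literal the backslash must itself be escaped:
        String cypher = lucene.replace("\\", "\\\\"); // AB\\-110 \\(v2\\)
        System.out.println(lucene);
        System.out.println(cypher);
    }
}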

There is a chart of “special” characters in Mastering Regular Expressions by Jeffrey E. F. Friedl, but it doesn’t cover all the internal parsing choices software makes.

Over the last sixty plus years there has been little progress towards a common set of “special” characters in computer science.

Handling of “special” characters lies at the heart of accessing data and all programs have code to account for them.

With no common agreement on “special” characters, what reason would you offer to expect convergence elsewhere?

June 10, 2013

Search-As-You-Type With Solr

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 9:53 am

Search-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Suggest-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Search-As-You-Type.

Several years back, Google introduced an interesting new interface for their search called Search-As-You-Type. Basically, as you type in the search box, the result set is continually updated with better and better search results. By this point, everyone is used to Google’s Search-As-You-Type, but for some reason I have yet to see any of our clients use this interface. So I thought it would be cool to take a stab at this with Solr.

Let’s get started. First things first, download Solr and spin up Solr’s example.

cd solr-4.2.0/example
java -jar start.jar

Next click this link and POOF! you will have the following documents indexed:

  • There’s nothing better than a shiny red apple on hot summer day.
  • Eat an apple!
  • I prefer a Grannie Smith apple over Fuji.
  • Apricots is kinda like a peach minus the fuzz.

(Kinda cool how that link works, isn’t it?) Now let’s work on the strategy. Let’s assume that the user is going to search for “apple”. When the user types “a” what should we do? In a normal index, there are a buzillion things that start with “a”, so maybe we should just do nothing. Next, “ap”: depending upon how large your index is, two letters may be a reasonably small set to start providing feedback to your users. The goal is to provide Solr with appropriate information so that it continuously comes back with the best results possible.
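Here is a hedged SolrJ 4.x sketch of that strategy against the stock example core. Plain prefix queries are the crudest version (John’s post refines the query well beyond this), and the field name and the two-character threshold are my assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchAsYouType {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // Simulate keystrokes: skip the lone "a", then prefix-query each state.
        for (String typed : new String[] {"ap", "app", "appl", "apple"}) {
            SolrQuery query = new SolrQuery("text:" + typed + "*");
            query.setRows(3);
            QueryResponse rsp = solr.query(query);
            System.out.println(typed + " -> "
                + rsp.getResults().getNumFound() + " hits");
        }
    }
}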

Good demonstration that how you form a query makes a large difference in the result you get.

June 9, 2013

Build your own finite state transducer

Filed under: FSTs,Lucene — Patrick Durusau @ 3:26 pm

Build your own finite state transducer by Michael McCandless.

From the post:

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn’t figure out how to use Lucene’s crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:

(…)

Mike’s post, Lucene finite state transducer (FST), summarizes the potential for FSTs in Lucene.
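If you would rather build one in code than in the web application, here is a sketch following the example in Lucene’s FST package documentation (4.x era; the PositiveIntOutputs factory method has changed signature between releases, so treat the details as assumptions):

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDemo {
    public static void main(String[] args) throws Exception {
        // Inputs must be added in sorted order.
        String[] inputs = {"mop", "moth", "pop", "star", "stop", "top"};
        long[] outputs = {0, 1, 2, 3, 4, 5};
        PositiveIntOutputs fstOutputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, fstOutputs);
        BytesRef scratchBytes = new BytesRef();
        IntsRef scratchInts = new IntsRef();
        for (int i = 0; i < inputs.length; i++) {
            scratchBytes.copyChars(inputs[i]);
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputs[i]);
        }
        FST<Long> fst = builder.finish();
        System.out.println(Util.get(fst, new BytesRef("moth"))); // prints 1
    }
}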

HTRT? Be good with your tools. Be very good with your tools.

Build Your Own Lucene Codec!

Filed under: Indexing,Lucene — Patrick Durusau @ 3:09 pm

Build Your Own Lucene Codec! by Doug Turnbull.

From the post:

I’ve been having a lot of fun hacking on a Lucene Codec lately. My hope is to create a Lucene storage layer based on FoundationDB – a new distributed and transactional key-value store. It’s a fun opportunity to learn about both FoundationDB and low-level Lucene details.

But before we get into all that fun technical stuff, there’s some work we need to do. Our goal is going to be to get MyFirstCodec to work! Here’s the source code:

(…)

From the Lucene 4.1 documentation: Codec (class in org.apache.lucene.codecs) encodes/decodes an inverted index segment.
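For the flavor of it, the shortest route to a working custom codec in Lucene 4.x is FilterCodec, which delegates everything to an existing codec until you override one of its formats. A hedged sketch, not Doug’s actual code:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;

// A do-nothing codec that delegates every format to the current default.
// Overriding, say, postingsFormat() is where the interesting work starts.
public class MyFirstCodec extends FilterCodec {
    public MyFirstCodec() {
        super("MyFirstCodec", Codec.getDefault());
    }
}

To use it, list the class in META-INF/services/org.apache.lucene.codecs.Codec so SPI can find it, then call indexWriterConfig.setCodec(new MyFirstCodec()).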

How good do you want to be with your tools?

June 8, 2013

Are You Near Me?

Filed under: Geographic Data,Georeferencing,GIS,Lucene — Patrick Durusau @ 1:52 pm

Lucene 4.X is a great tool for analyzing cellphone location data (Did you really think only the NSA has it?).

Chilamakuru Vishnu gets us started with a code-heavy post and the promise of:

My Next Blog Post will talk about how to implement advanced spatial queries like

geoIntersecting – where one polygon intersects with another polygon/line.

geoWithIn – where one polygon lies completely within another polygon.

Or you could obtain geolocation data by other means.

I first saw this at DZone.
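While waiting on that post, here is a hedged sketch of the basic “points within N km” case using the Lucene 4 spatial module and Spatial4j; the field name, coordinates, and radius are my assumptions:

import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.distance.DistanceUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Query;
import org.apache.lucene.spatial.SpatialStrategy;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.query.SpatialArgs;
import org.apache.lucene.spatial.query.SpatialOperation;

public class NearMe {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;
        SpatialStrategy strategy = new RecursivePrefixTreeStrategy(
            new GeohashPrefixTree(ctx, 11), "location");

        // Index time: add the point's fields (x = longitude, y = latitude).
        Document doc = new Document();
        for (Field f : strategy.createIndexableFields(ctx.makePoint(-77.03, 38.89))) {
            doc.add(f);
        }

        // Search time: everything within 5 km of the same point.
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects,
            ctx.makeCircle(-77.03, 38.89,
                DistanceUtils.dist2Degrees(5, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
        Query query = strategy.makeQuery(args);
        // Feed doc to an IndexWriter and query to an IndexSearcher as usual.
    }
}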

June 6, 2013

Search is Not a Solved Problem

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 1:58 pm

From the description:

The brief idea behind this talk is that search is not a solved problem — there is still a big opportunity for building search (and finding?) capabilities for the kinds of questions that current products fail to solve. For example, why do search engines just return a list of sorted URLs, but give me no information about the themes that are consistent across them?

Hmmm, “…themes that are consistent across them?”

Do you think she means subjects across URLs?

😉

Important point: What people post isn’t the same content that they consume!

June 5, 2013

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

Filed under: Cloudera,Hadoop,Lucene,Solr — Patrick Durusau @ 2:41 pm

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.

From the post:

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:

(…)

Did you catch the line:

Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

Does that make you feel better about scale issues?

Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.

A serious step up in capabilities.

May 30, 2013

Getting Started with ElasticSearch: Part 1 – Indexing

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 2:35 pm

Getting Started with ElasticSearch: Part 1 – Indexing by Florian Hopf.

From the post:

ElasticSearch is gaining huge momentum, with large installations like GitHub and Stack Overflow switching to it for its search capabilities. Its distributed nature makes it an excellent choice for large datasets with high availability requirements. In this two-part article I’d like to share what I learned building a small Java application just for search.

The example I am showing here is part of an application I am using for talks to show the capabilities of Lucene, Solr and ElasticSearch. It’s a simple webapp that can search on user group talks. You can find the sources on GitHub.

Some experience with Solr can be helpful when starting with ElasticSearch but there are also times when it’s best to not stick to your old knowledge.
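For a flavor of part 1, here is a hedged sketch of indexing a single talk document with the Java client of that era; the index, type and field names are made up for illustration:

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.xcontent.XContentFactory;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        TransportClient client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        // Roughly PUT /talks/talk/1 with a small JSON document.
        IndexResponse response = client.prepareIndex("talks", "talk", "1")
            .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .field("title", "Getting Started with ElasticSearch")
                    .field("speaker", "Florian Hopf")
                .endObject())
            .execute()
            .actionGet();
        System.out.println("indexed, version " + response.getVersion());
        client.close();
    }
}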

As rapidly as Solr, Lucene and ElasticSearch are developing, old knowledge can be an issue for any of them.

May 27, 2013

Automatically Acquiring Synonym Knowledge from Wikipedia

Filed under: Lucene,Solr,Synonymy,Wikipedia — Patrick Durusau @ 7:36 pm

Automatically Acquiring Synonym Knowledge from Wikipedia by Koji Sekiguchi.

From the post:

Synonym search sure is convenient. However, in order for an administrator to allow users to use these convenient search functions, he or she has to provide them with a synonym dictionary (the CSV file described above). New words are created every day and so are new synonyms. A synonym dictionary might be prepared with huge effort by a person in charge, but is sometimes left unmaintained as time goes by or when that person’s position is taken over.

That is a reason people start longing for automatic creation of synonym dictionaries. That request drove me to write the system I will explain below. This system learns synonym knowledge from a “dictionary corpus” and outputs “original word – synonym” combinations of high similarity to a CSV file, which in turn can be applied to the SynonymFilter of Lucene/Solr as is.

This “dictionary corpus” is a corpus that contains entries consisting of “keywords” and their “descriptions”. An electronic dictionary is exactly a dictionary corpus, and so is Wikipedia, which you are familiar with and which is easily accessible.

Let’s look at a method to use the Japanese version of Wikipedia to automatically get synonym knowledge.

A richer representation of synonyms, one that includes domain or scope, would be more robust.

On the other hand, some automatic generation of synonyms is better than no synonyms at all.

Take this as a good place to start but not as a destination for synonym generation.

May 24, 2013

How Does A Search Engine Work?…

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:59 pm

How Does A Search Engine Work? An Educational Trek Through A Lucene Postings Format by Doug Turnbull.

From the post:

A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.

The centerpiece of a Lucene codec is its postings format. Postings are a commonly thrown-around word in the Lucene space. A postings format is the representation of the inverted search index – the core data structure used to look up documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:
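You can reproduce Doug’s walk-through at home by pointing an IndexWriter at the SimpleText codec and opening the resulting index files in a text editor. A sketch against Lucene 4.3 (SimpleText is educational only, never production):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleTextDemo {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
        config.setCodec(new SimpleTextCodec()); // postings written as plain text
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/simpletext-index")), config);
        Document doc = new Document();
        doc.add(new TextField("body", "how does a search engine work", Store.YES));
        writer.addDocument(doc);
        writer.close(); // now read the files in /tmp/simpletext-index
    }
}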

A first step towards moving beyond being a search engine result consumer.

May 22, 2013

Dynamic faceting with Lucene

Filed under: Faceted Search,Facets,Indexing,Lucene,Search Engines — Patrick Durusau @ 2:08 pm

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch‘s and Solr‘s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.

May 14, 2013

Eating dog food with Lucene

Filed under: Lucene,Solr — Patrick Durusau @ 4:22 pm

Eating dog food with Lucene by Michael McCandless.

From the post:

Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I’ve been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira’s search whenever I need to go find an issue.

It’s currently running at jirasearch.mikemccandless.com and it’s still quite rough (feedback welcome!).

Now there’s a way to learn the details!

Makes me think about the poor search capabilities at an SDO I frequent.

Could be a way to spend some quality time with Lucene and Solr.

Will have to give it some thought.

May 6, 2013

Apache Lucene / Solr 4.3 Release!

Filed under: Lucene,Solr — Patrick Durusau @ 6:57 pm

See Lucene Changes.txt.

See Solr Changes.txt.

More good news for a Monday!

April 25, 2013

Client-side search

Filed under: Javascript,Lucene — Patrick Durusau @ 3:07 pm

Client-side search by Gene Golovchinsky.

From the post:

When we rolled out the CHI 2013 previews site, we got a couple of requests for being able to search the site with keywords. Of course interfaces for search are one of my core research interests, so that request got me thinking. How could we do search on this site? The problem with the conventional approach to search is that it requires some server-side code to do the searching and to return results to the client. This approach wouldn’t work for our simple web site, because from the server’s perspective, our site was static — just a few HTML files, a little bit of JavaScript, and about 600 videos. Using Google to search the site wouldn’t work either, because most of the searchable content is located on two pages, with hundreds of items on each page. So what to do?

I looked around briefly trying to find some client-side indexing and retrieval code, and struck out. Finally, I decided to take a crack at writing a search engine in JavaScript. Now, before you get your expectations up, I was not trying to re-implement Lucene in JavaScript. All I wanted was some rudimentary keyword search capability. Building that in JavaScript was not so difficult.

One simplifying assumption I could make was that my document collection was static: sorry, the submission deadline for the conference has passed. Thus, I could have a static index that could be made available to each client, and all the client needed to do was match and rank.

Each of my documents had a three character id, and a set of fields. I didn’t bother with the fields, and just lumped everything together in the index. The approach was simple, again due to lots of assumptions. I treated the inverted index as a hash table that maps keywords onto lists of document ids. OK, document ids and term frequencies. Including positional information is an exercise left to the reader.
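Gene’s scheme is compact enough to sketch in a page. Here is the same hash-table-of-postings idea in Java rather than his JavaScript; the tokenizing and ranking choices are mine, for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyIndex {
    // term -> (docId -> term frequency): the inverted index as a hash table.
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    public void add(String docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Rank matching docs by summed term frequency: crude, but enough
    // for "rudimentary keyword search capability".
    public List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            postings.getOrDefault(token, Collections.<String, Integer>emptyMap())
                    .forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<String> docs = new ArrayList<>(scores.keySet());
        docs.sort((a, b) -> scores.get(b) - scores.get(a));
        return docs;
    }
}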

A refreshing reminder that simplified requirements can lead to successful applications.

Or to put it another way, not every application has to meet every possible use case.

For example, I might want to have a photo matching application that only allows users to pick match/no match for any pair of photos.

Not why, what reasons for match/no match, etc.

But it does capture the user’s identity in an association saying photo # and photo # are of the same person.

That doesn’t provide any basis for automated comparison of those judgments, but not every judgment is required to do so.

I am starting to think of subject identification as a continuum of practices, some of which enable more reuse than others.

Which of those you choose, depends upon your requirements, your resources and other factors.

April 16, 2013

How To Debug Solr With Eclipse

Filed under: Eclipse,Lucene,Solr — Patrick Durusau @ 11:49 am

How To Debug Solr With Eclipse by Doug Turnbull.

From the post:

Recently I was puzzled by some behavior Solr was showing me. I scratched my head and called over a colleague. We couldn’t quite figure out what was going on. Well Solr is open source so… next stop – Debuggersville!

Running Solr in the Eclipse debugger isn’t hard, but there are many scattered user group posts and blog articles that you’ll need to manually tie together into a coherent picture. So let me do you the favor of tying all of that info together for you here.

This looks very useful.

Curious of there are any statistical function debuggers?

That step you through the operations and show the state of values as they change?

Thinking that could be quite useful as a sanity test when the numbers just don’t jive.

April 10, 2013

Apache Lucene and Solr 4.2.1

Filed under: Lucene,Solr — Patrick Durusau @ 5:11 am

Bug fix releases for Apache Lucene and Solr.

Apache Lucene 4.2.1: Changes; Downloads.

Apache Solr 4.2.1: Changes; Downloads.

April 8, 2013

Beginners Guide To Enhancing Solr/Lucene Search…

Filed under: Lucene,Mahout,Solr — Patrick Durusau @ 4:33 pm

Beginners Guide To Enhancing Solr/Lucene Search With Mahout’s Machine Learning by Doug Turnbull.

From the post:

Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing — calculating and exploiting the semantic relationships between keywords. While we were there, I realized a lot of people could benefit from a bigger-picture, less in-depth point of view outside of our specific story. In general, where do Mahout and Solr fit together? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I’d blog about how you too can start to put these pieces together to simultaneously exploit Solr’s search and Mahout’s machine learning capabilities.

The root of how this all works is a slightly obscure feature of Lucene-based search — term vectors. Lucene-based search applications give you the ability to generate term vectors from documents in the search index. It’s a feature often turned on for specific search features, but other than that it can appear to be a weird, opaque feature to beginners. What is a term vector, you might ask? And why would you want one?

You know my misgivings about metric approaches to non-metric data (such as semantics) but there is no denying that Latent Semantic Indexing can be useful.

Think of Latent Semantic Indexing as a useful tool.

A saw is a tool too but not every cut made with a saw is a correct one.

Yes?

March 29, 2013

How NoSQL Paid Off for Telenor

Filed under: Lucene,Marketing,Neo4j,Solr — Patrick Durusau @ 4:07 am

How NoSQL Paid Off for Telenor by Sebastian Verheughe and Katrina Sponheim.

A presentation I encountered while searching for something else.

Makes a business case for Lucene/Solr and Neo4j solutions to improve customer access to data.

As opposed to a “making the world a better place” case.

What information process/need have you encountered where you can make a business case for topic maps?

March 28, 2013

Build a search engine in 20 minutes or less

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 7:15 pm

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.
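If you want the one formula at the heart of it: documents and queries become term-weight vectors, and ranking is by the cosine of the angle between them,

\mathrm{sim}(q,d) = \cos\theta = \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert \, \lVert\vec{d}\rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \sqrt{\sum_i d_i^2}}

Everything else (tf-idf weighting, normalization) is choosing the q_i and d_i.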

Enjoy!

PS: One thing to keep in mind: semantics do not map to vector space. We can model word occurrences in vector space, but occurrences are not semantics.
