Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 22, 2013

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar)

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 10:34 am

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar). Presenter: Erik Hatcher, Lucene/Solr Committer and PMC member.

Date: Wednesday, March 27, 2013
Time: 10:00am Pacific Time

From the signup page:

Lucene/Solr 4 is a groundbreaking shift from previous releases. Solr 4.0 dramatically improves scalability, performance, reliability, and flexibility. Lucene 4 has been extensively upgraded. It now supports near real-time (NRT) capabilities that allow indexed documents to be rapidly visible and searchable. Additional Lucene improvements include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage.

The improvements in Lucene have automatically made Solr 4 substantially better. But Solr has also been considerably improved and magnifies these advances with a suite of new “SolrCloud” features that radically improve scalability and reliability.

In this Webinar, you will learn:

  • The key feature enhancements of Lucene/Solr 4, including the new distributed capabilities of SolrCloud
  • How to use the improved administrative user interface
  • How sharding has been improved
  • The improvements to geospatial search, highlighting, advanced query parsers, distributed search support, dynamic core management, performance statistics, and searches for rare values, such as primary keys

Great way to get up to speed on the latest release of Lucene/Solr!

March 21, 2013

elasticsearch 0.90.0.RC1 Released

Filed under: ElasticSearch,Lucene,Searching — Patrick Durusau @ 2:08 pm

elasticsearch 0.90.0.RC1 Released by Shay Banon.

From the post:

elasticsearch version 0.90.0.RC1 is out, the first release candidate for the 0.90 release. You can download it here.

This release includes an upgrade to Lucene 4.2, many improvements to the suggester feature (including its own dedicated API), another round of memory improvements to field data (long type will now automatically “narrow” to the smallest type when loaded to memory) and several bug fixes. Upgrading to it from previous beta releases is highly recommended. (inserted URL to release notes)

Just to keep you on the cutting edge of search technology!

March 19, 2013

MongoDB 2.4 Release

Filed under: Lucene,MongoDB,NoSQL,Searching,Solr — Patrick Durusau @ 1:11 pm

MongoDB 2.4 Release

From the webpage:

Developer Productivity

  • Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
  • Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
  • Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).
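The capped-array feature deserves a closer look. In MongoDB 2.4 it surfaces as the $push update operator combined with the $each, $sort and $slice modifiers; here is a rough Python emulation of the behavior (the function name and document shape are mine, not MongoDB's API):

```python
def capped_push(arr, new_items, key, limit):
    # Append the new entries, sort descending by score, and keep only
    # the top `limit` - a fixed, sorted list, as for a leaderboard.
    merged = arr + list(new_items)
    merged.sort(key=key, reverse=True)
    return merged[:limit]

leaderboard = [{"user": "ann", "score": 90}, {"user": "bob", "score": 70}]
leaderboard = capped_push(leaderboard,
                          [{"user": "cal", "score": 85}],
                          key=lambda d: d["score"], limit=2)
# "ann" (90) and "cal" (85) remain; "bob" fell off the end.
```

The point of having the database do this in one atomic update, rather than in application code as above, is that concurrent writers can't race each other past the cap.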

Operations

  • Hash-Based Sharding simplifies deployment of large MongoDB systems.
  • Working Set Analyzer makes capacity planning easier for ops teams.
  • Improved Replication increases resiliency and reduces administration.
  • Mongo Client creates an intuitive, consistent feature set across all drivers.
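Hash-based sharding is easy to picture: instead of range-partitioning on the shard key (which piles monotonically increasing keys, like ObjectIds, onto one shard), the key is hashed first and the hash decides placement. A minimal sketch of the routing idea, not MongoDB's actual implementation:

```python
import hashlib

def shard_for(shard_key, num_shards):
    # Hash the key and bucket the hash; insertion-ordered keys
    # scatter across shards instead of all landing on the last one.
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Sequential ids still spread out:
placements = {shard_for(i, 4) for i in range(100)}
```

The trade-off, of course, is that range queries on the shard key now have to hit every shard.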

Performance

  • Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
  • V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

  • On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.



Security
….

  • Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
  • Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

MongoDB Text Search: Experimental Feature in MongoDB 2.4 says:

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?

March 16, 2013

Lux

Filed under: Lucene,Saxon,Solr,XQuery,XSLT — Patrick Durusau @ 7:51 pm

Lux

From the readme:

Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

At its core, Lux provides XML-aware indexing, an XQuery 1.0 optimizer that rewrites queries to use the indexes, and a function library for interacting with Lucene via XQuery. These capabilities are tightly integrated with Solr, and leverage its application framework in order to deliver a REST service and application server.

The REST service is accessible to applications written in almost any language, but it will be especially convenient for developers already using Solr, for whom Lux operates as a Solr plugin that provides query services using the same REST APIs as other Solr search plugins, but using a different query language (XQuery). XML documents may be inserted (and updated) using standard Solr REST calls: XML-aware indexing is triggered by the presence of an XML-aware field in a document. This means that existing application frameworks written in many different languages are positioned to use Lux as a drop-in capability for indexing and querying semi-structured content.

The application server is a great way to get started with Lux: it provides the ability to write a complete application in XQuery and XSLT with data storage backed by Lucene.

If you are looking for experience with XQuery and Lucene/Solr, look no further!

May be a good excuse for me to look at defining equivalence statements using XQuery.

I first saw this in a tweet by Michael Kay.

March 15, 2013

A Peek Behind the Neo4j Lucene Index Curtain

Filed under: Indexing,Lucene,Neo4j — Patrick Durusau @ 4:02 pm

A Peek Behind the Neo4j Lucene Index Curtain by Max De Marzi.

Max suggests using a copy of your Neo4j database for this exercise.

Could be worth your while to go exploring.

And you will learn something about Lucene in the bargain.

March 12, 2013

Elasticsearch and Joining

Filed under: ElasticSearch,Joins,Lucene — Patrick Durusau @ 2:40 pm

Elasticsearch and Joining by Felix Hürlimann.

From the post:

With the success of elasticsearch, people, including us, start to explore the possibilities and mightiness of the system. Including border cases for which the underlying core, Lucene, was never originally intended or optimized. One of the many requests that comes up pretty quickly is the wish for joining data across types or indexes, similar to an SQL join clause that combines records from two or more tables in a database. Unfortunately full join support is not (yet?) available out of the box. But there are some possibilities and some attempts to solve parts of the issue. This post is about summarizing some of the ideas in this field.

To illustrate the different ideas, let’s work with the following example: we would like to index documents and comments with a one to many relationship between them. Each comment has an author and we would like to answer the question: Give me all documents that match a certain query and a specific author has commented on it.
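Absent native joins, the usual workaround is a two-pass, application-side join: one query against the comments to collect parent document ids, then the document query filtered to those ids. A toy sketch, with plain Python lists standing in for the two Elasticsearch types (field names are illustrative):

```python
documents = [{"id": 1, "text": "lucene faceting"},
             {"id": 2, "text": "lucene joins"}]
comments  = [{"doc_id": 1, "author": "John Doe", "text": "nice"},
             {"doc_id": 2, "author": "Jane Roe", "text": "hmm"}]

def docs_matching(query, author):
    # Pass 1: collect the ids of documents this author commented on.
    commented = {c["doc_id"] for c in comments if c["author"] == author}
    # Pass 2: run the document query, filtered to those ids.
    return [d for d in documents
            if query in d["text"] and d["id"] in commented]
```

In real Elasticsearch the first pass can return many ids, so the second query's terms filter grows accordingly; that is exactly the scaling concern the post goes on to discuss.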

The latest beta release of Elasticsearch is described as:

If you have more complex requirements for joins, a new feature introduced in the latest beta release may help you. It allows for a kind of join by looking up filter terms in another index or type. This then allows, e.g., for queries like ‘Show me all comments from documents that relate to this document and the author is John Doe’.

The “looking up” in a different index or type sounds quite interesting.

Have you looked at the new beta of Elasticsearch?

Apache Lucene/Solr 4.2 Out!

Filed under: Lucene,Solr — Patrick Durusau @ 1:40 pm

Apache Lucene 4.2

Download

Changes

Apache Solr 4.2

Download

Changes

See the Lucene homepage for a summary of the 4.2 changes in Lucene and Solr.

Warning: Reference good only until the next Lucene/Solr release. 😉

March 10, 2013

Solr: Custom Ranking with Function Queries

Filed under: Lucene,Ranking,Solr — Patrick Durusau @ 8:42 pm

Solr: Custom Ranking with Function Queries by Sujit Pal.

From the post:

Solr has had support for Function Queries since version 3.1, but until sometime last week, I did not have a use for it. Which is probably why when I would read about Function Queries, they would seem like a nice idea, but not interesting enough to pursue further.

Most people get introduced to Function Queries through the bf parameter in the DisMax Query Parser or through the geodist function in Spatial Search. So far, I haven’t had the opportunity to personally use either feature in a real application. My introduction to Function Queries was through a problem posed to me by one of my coworkers.

The problem was as follows. We want to be able to customize our search results based on what a (logged-in) user tells us about himself or herself via their profile. This could be gender, age, ethnicity and a variety of other things. On the content side, we can annotate the document with various features corresponding to these profile features. For example, we can assign a score to a document that indicates its appeal/information value to males versus females that would correspond to the profile’s gender.

So the idea is that if we know that the profile is male, we should boost the documents that have a high male appeal score and deboost the ones that have a high female appeal score, and vice versa if the profile is female. This idea can be easily extended for multi-category features such as ethnicity as well. In this post, I will describe a possible implementation that uses Function Queries to rerank search results using male/female appeal document scores.
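Stripped of Solr syntax, the reranking idea reduces to folding whichever appeal field matches the profile into the relevance score. A hedged sketch (the field names and the multiplicative combination are my assumptions, not Sujit's exact function query):

```python
def rerank(results, gender):
    # Pick the appeal field that matches the profile and multiply it
    # into the relevance score, the way a Solr boost function would.
    appeal = "male_appeal" if gender == "male" else "female_appeal"
    return sorted(results, key=lambda d: d["score"] * d[appeal],
                  reverse=True)

results = [
    {"id": "a", "score": 1.0, "male_appeal": 0.2, "female_appeal": 0.9},
    {"id": "b", "score": 1.0, "male_appeal": 0.8, "female_appeal": 0.3},
]
# A male profile sees "b" first; a female profile sees "a" first.
```

The advantage of doing this inside Solr rather than in application code, as here, is that the boost participates in ranking before pagination, so page one is already personalized.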

Does your topic map deliver information based on user characteristics?

Have you re-invented the ranking or are you using an off-the-shelf solution?

March 9, 2013

Solr + “flash sale site”

Filed under: Lucene,Solr — Patrick Durusau @ 11:56 am

How Solr powers search on America’s largest flash sale site by Ade Trenaman.

The post caught my attention with “flash sale,” which I had to look up. 😉

Even after discovering it means “deal of the day,” the slides were interesting.

Especially the commentary on synonym lists!

What someone else considers to be a synonym may not be one for your audience.

February 25, 2013

Drill Sideways faceting with Lucene

Filed under: Facets,Indexing,Lucene,Searching — Patrick Durusau @ 5:13 am

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]

….

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
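The recap above can be captured in a few lines: to drill sideways on a field, you count its facet values with every filter applied except the one on that field itself. A small sketch, with dictionaries standing in for indexed documents:

```python
def facet_counts(docs, filters, facet_field, sideways=True):
    # Drill sideways: drop the filter on the facet's own field so its
    # sibling values stay visible; plain drill down keeps every filter.
    active = {f: v for f, v in filters.items()
              if not (sideways and f == facet_field)}
    counts = {}
    for d in docs:
        if all(d.get(f) == v for f, v in active.items()):
            counts[d[facet_field]] = counts.get(d[facet_field], 0) + 1
    return counts

tvs = [{"brand": "Sony", "size": "40"},
       {"brand": "LG",   "size": "40"},
       {"brand": "Sony", "size": "50"}]
filters = {"brand": "Sony", "size": "40"}
# Sideways on brand: both Sony and LG stay visible for 40" sets.
```

With sideways=False the LG count vanishes as soon as you pick Sony, which is exactly the disappearing-facet behavior the post is arguing against.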

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

February 24, 2013

Text processing (part 2): Inverted Index

Filed under: Indexing,Lucene — Patrick Durusau @ 5:41 pm

Text processing (part 2): Inverted Index by Ricky Ho.

From the post:

This is the second part of my text processing series. In this blog, we’ll look into how text documents can be stored in a form that can be easily retrieved by a query. I’ll use the popular open source Apache Lucene index for illustration.
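The structure Ricky describes can be sketched in a dozen lines: map each term to the list of documents containing it, so a query looks up terms instead of scanning every document. (A toy version; Lucene's real postings add positions, offsets, skip lists and heavy compression.)

```python
def build_inverted_index(docs):
    # term -> list of ids of the documents containing it,
    # in ascending doc id order since we scan docs in order.
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def search(index, term):
    return index.get(term.lower(), [])

index = build_inverted_index(["Lucene in Action",
                              "Solr in Action"])
```

Intersecting the sorted postings lists of several terms gives you boolean AND queries, which is where skip lists start paying off.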

Not only do you get to learn about inverted indexes but some Lucene in the bargain.

That’s not a bad deal!

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial)

Filed under: Finite State Automata,Lucene — Patrick Durusau @ 5:36 pm

Lucene 4 Finite State Automata In 10 Minutes (Intro & Tutorial) by Doug Turnbull.

From the post:

This article is intended to help you bootstrap your ability to work with Finite State Automata (note automata == plural of automaton). Automata are a unique data structure, requiring a bit of theory to process and understand. Hopefully what’s below can give you a foundation for playing with these fun and useful Lucene data structures!

Motivation, Why Automata?

When working in search, a big part of the job is making sense of loosely-structured text. For example, suppose we have a list of about 1000 valid first names and 100,000 last names. Before ingesting data into a search application, we need to extract first and last names from free-form text.

Unfortunately the data sometimes has full names in the format “LastName, FirstName” like “Turnbull, Doug”. In other places, however, full names are listed “FirstName LastName” like “Doug Turnbull”. Add a few extra representations, and making sense of which strings represent valid names becomes a chore.

This becomes especially troublesome when we’re depending on these as natural identifiers for looking up or joining across multiple data sets. Each data set might textually represent the natural identifier in subtly different ways. We want to capture the representations across multiple data sets to ensure our join works properly.

So… What’s a text jockey to do when faced with such annoying inconsistencies?

You might initially think “regular expression”. Sadly, a normal regular expression can’t help in this case. Just trying to write a regular expression that allows a controlled vocabulary of 100k valid last names but nothing else is non-trivial. Not to mention the task of actually using such a regular expression.

But there is one tool that looks promising for solving this problem: Lucene 4.0’s new Automaton API. Let’s explore what this API has to offer by first reminding ourselves about a bit of CS theory.
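The core trick is that an automaton can accept exactly a closed vocabulary, something a hand-written regex over 100,000 last names cannot sanely do. A minimal Python stand-in for the idea, a trie walked character by character (Lucene's Automaton API layers determinization, minimization and unions on top of this):

```python
class TrieAutomaton:
    """Accepts exactly the strings in a controlled vocabulary."""
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True  # accept-state marker

    def accepts(self, s):
        # Walk one character per transition; fail fast on any
        # character with no outgoing edge.
        node = self.root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

names = TrieAutomaton(["Turnbull", "Doe"])
# names.accepts("Turnbull") is True; the typo "Turnbul" is rejected.
```

Because shared prefixes collapse into shared states, 100k names compress far better than the equivalent alternation regex would.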

Are you motivated?

I am!

See John Berryman’s comment about matching patterns of words.

Then think about finding topics, associations and occurrences in free form data.

Or creating a collection of automata as a tool set for building topic maps.

February 19, 2013

Searching for Dark Data

Filed under: Dark Data,Lucene,LucidWorks — Patrick Durusau @ 7:42 am

Searching for Dark Data by Paul Doscher.

From the post:

We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation. The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.

Welcome to the world of Dark Data

Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it.

Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark.

Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.

The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.

Effective exploration of Dark Data will require something different from search tools that depend upon:

  • Pre-specified semantics (RDF) because Dark Data has no pre-specified semantics.
  • Structure because Dark Data has no structure.

Effective exploration of Dark Data will require:

Machine-assisted, interactive searching with gifted and grounded semantic comparators (people) creating pathways, tunnels and signposts into the wilderness of Dark Data.

I first saw this at: Delving into Dark Data.

February 17, 2013

Developing Your Own Solr Filter part 2

Filed under: Lucene,Solr — Patrick Durusau @ 8:17 pm

Developing Your Own Solr Filter part 2

From the post:

In the previous entry “Developing Your Own Solr Filter” we’ve shown how to implement a simple filter and how to use it in Apache Solr. Recently, one of our readers asked if we can extend the topic and show how to write more than a single token into the token stream. We decided to go for it and extend the previous blog entry about filter implementation.
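The gist of writing more than one token into the stream: a filter can stack extra terms at the same position by giving them a position increment of zero, which is how synonym stacking works. A toy generator version in Python (real Solr filters do this via incrementToken() and the PositionIncrementAttribute):

```python
def multi_token_filter(tokens, extras):
    # Yield (term, position_increment) pairs; an increment of 0 stacks
    # the extra token at the same position as the token before it.
    for tok in tokens:
        yield (tok, 1)
        for extra in extras.get(tok, []):
            yield (extra, 0)

stream = list(multi_token_filter(["fast", "car"],
                                 {"fast": ["quick", "rapid"]}))
# [("fast", 1), ("quick", 0), ("rapid", 0), ("car", 1)]
```

Getting the increments right matters: phrase queries like "fast car" only keep matching if the stacked synonyms share the original token's position.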

What better way to start the week!

February 11, 2013

New Book: ElasticSearch Server!

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 3:51 pm

New Book: ElasticSearch Server!

In the blog post dedicated to the Solr 4.0 Cookbook we gave a small hint that the cookbook was not the only project occupying our free time. Today we can officially say that a few months of hard work are slowly coming to an end – we can announce a new book about one of the greatest pieces of open-source software – the ElasticSearch Server book!

The ElasticSearch Server book describes the most important and commonly used features of ElasticSearch (at least from our perspective). Examples of topics discussed:

  • ElasticSearch installation and configuration
  • Static and dynamic index structure creation
  • Querying ElasticSearch with Query DSL explained
  • Using filters
  • Faceting
  • Routing
  • Indexing data that is not flat

BTW, some wag posted a comment saying a Solr blog should not talk about ElasticSearch.

I bet they don’t see the sunshine very often from that position either. 😉

January 30, 2013

A Data Driven Approach to Query Expansion in Question Answering

Filed under: Lucene,Query Expansion,TREC — Patrick Durusau @ 8:44 pm

A Data Driven Approach to Query Expansion in Question Answering by Leon Derczynski, Jun Wang, Robert Gaizauskas, and Mark A. Greenwood.

Abstract:

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions.

In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method.

Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

Work on query expansion in natural language answering systems. Closely related to synonymy.

Query expansion tools could be useful prompts for topic map authors seeking terms for mapping.

January 27, 2013

Getting real-time field values in Lucene

Filed under: Indexing,Lucene — Patrick Durusau @ 5:41 pm

Getting real-time field values in Lucene by Mike McCandless.

From the post:

We know Lucene’s near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It’s simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.
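The pattern behind LiveFieldValues is simple enough to sketch: keep recent writes in a side map, serve get() from that map first, and prune it once a refresh makes those documents visible to the searcher. A rough Python analogue (this only mimics the idea; the real class hangs off your SearcherManager or NRTManager as described above, and also handles deletions):

```python
class LiveFieldValues:
    def __init__(self):
        self._pending = {}     # doc id -> value, not yet searchable
        self._searchable = {}  # stands in for the refreshed index

    def add(self, doc_id, value):
        # Writes are visible to get() immediately, before any refresh.
        self._pending[doc_id] = value

    def refresh(self):
        # A new searcher opened: pending values are now searchable,
        # so the side map can be pruned.
        self._searchable.update(self._pending)
        self._pending.clear()

    def get(self, doc_id):
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable.get(doc_id)
```

The side map stays small because it only ever holds the writes since the last refresh, which is why this beats reopening the searcher on every lookup.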

I saw a webinar by Mike McCandless that is probably the only webinar I would ever repeat watching.

Organized, high quality technical content, etc.

Compare that to a recent webinar I watched that spent fifty-five (55) minutes reviewing information known to anyone who could say the software’s name. The speaker then lamented the lack of time to get into substantive issues.

When you see a webinar like Mike’s, drop me a line. We need to promote that sort of presentation over the other.

January 25, 2013

Make your Filters Match: Faceting in Solr [Surveillance By and For The Public?]

Filed under: Facets,Lucene,Solr — Patrick Durusau @ 8:17 pm

Make your Filters Match: Faceting in Solr by Florian Hopf.

From the post:

Facets are a great search feature that let users easily navigate to the documents they are looking for. Solr makes it really easy to use them though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what is happening when you are passing in filter queries for faceting. Also, I’ll show how you can leverage local params to choose a different query parser when selecting facet values.

Introduction

Facets are a way to display categories next to a user’s search results, often with a count of how many results are in each category. The user can then select one of those facet values to retrieve only those results that are assigned to this category. This way he doesn’t have to know what category he is looking for when entering the search term as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.

Solr brought faceting to the Lucene world and arguably the feature was an important driving factor for its success (Lucene 3.4 introduced faceting as well). Facets can be built from terms in the index, custom queries and ranges, though in this post we will only look at field facets.

Excellent introduction to facets in Solr.

The amount of enterprise quality indexing and search software that is freely available, makes me wonder why the average citizen worries about privacy?

There are far more average citizens than denizens of c-suites, government offices, and the like.

Shouldn’t they be the ones worrying about what the rest of us are compiling together?

Instead of secret, Stasi-like archives, a public archive, with the observations of ordinary citizens.

January 24, 2013

Apache Lucene 4.1 and Apache Solr™ 4.1 available

Filed under: Indexing,Lucene,Searching,Solr — Patrick Durusau @ 8:10 pm

Lucene 4.1 can be downloaded from: http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.text

Solr 4.1 can be downloaded from: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr CHANGES.txt

That’s getting the new year off to a great start!

January 7, 2013

A new Lucene highlighter is born [The final inch problem]

Filed under: Indexing,Lucene,Searching,Synonymy — Patrick Durusau @ 10:27 am

A new Lucene highlighter is born by Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.
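Those start/end offsets are the whole game: a highlighter uses them to wrap matched spans in the original text, so case and punctuation survive intact. A toy illustration of offset-based highlighting (whitespace tokenization stands in for a real analyzer):

```python
def highlight(text, query_terms, pre="<b>", post="</b>"):
    # Record each token's start/end character offsets - what Lucene's
    # OffsetAttribute carries - and splice tags into the ORIGINAL text.
    out, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        if tok.lower().strip(".,") in query_terms:
            out.append(text[pos:start] + pre + text[start:end] + post)
        else:
            out.append(text[pos:end])
        pos = end
    return "".join(out)

# highlight("Lucene is fast.", {"lucene"}) -> "<b>Lucene</b> is fast."
```

If an analyzer reports wrong offsets, the tags land mid-word or out of bounds, which is exactly the "incorrect highlights or exceptions" failure mode Mike mentions.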

An interesting addition to the highlighters in Lucene.

Be sure to follow the link to Mike’s comments about the limitations on SynonymFilter and the difficulty of correction.

December 16, 2012

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications

Filed under: Indexing,Lucene,Searching — Patrick Durusau @ 8:12 pm

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

A bit dated (2010) but I think you will find this interesting reading.

A couple of snippets to tempt you into reading the full post:

Consider this: you’re looking for information and immediately search the documents at your disposal to find the answer. Are you the first person who conducted this search? If you are in a reasonably large organization, given the scope and mix of electronic communications today, there could be more than 10 other employees looking for the same answer. Unearthing documents, one employee at a time, may not be the best way of tapping into that collective intellect and maximizing resources across an organization. Wouldn’t it make more sense to tap into existing discussions taking place across the network—over email, voice and increasingly video communications?

and,

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

Add a little vocabulary mapping with topic maps, toss and serve!

December 13, 2012

Taming Text [Coming real soon now!]

Filed under: Lucene,Mahout,Solr,Text Mining — Patrick Durusau @ 3:14 pm

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the position of being the second longest running MEAP project. (He didn’t say who was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. 😉

December 9, 2012

Fun with Lucene’s faceted search module

Filed under: Faceted Search,Lucene,Search Engines,Searching — Patrick Durusau @ 8:16 pm

Fun with Lucene’s faceted search module by Mike McCandless.

From the post:

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene’s facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice “getting started” examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I’m sure there are more…

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.

Take some time over the holidays to play with faceted searches in Lucene.
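To make the post concrete, here is a rough sketch of the index-and-count flow Mike describes. Caveat: the facet API shifted noticeably between the 4.0, 4.1 and 4.2 releases, so the class names, packages and constructors below (FacetFields, FacetSearchParams, FacetsCollector.create) are my reading of the 4.1-era API and should be checked against the release you actually run:

```java
import java.util.Arrays;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.index.FacetFields;
import org.apache.lucene.facet.search.CountFacetRequest;
import org.apache.lucene.facet.search.FacetResult;
import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.search.params.FacetSearchParams;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FacetSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory indexDir = new RAMDirectory(), taxoDir = new RAMDirectory();

    // Indexing: facet paths go into the main index plus a sidecar taxonomy index.
    IndexWriter iw = new IndexWriter(indexDir,
        new IndexWriterConfig(Version.LUCENE_41, new WhitespaceAnalyzer(Version.LUCENE_41)));
    DirectoryTaxonomyWriter tw = new DirectoryTaxonomyWriter(taxoDir);
    FacetFields facetFields = new FacetFields(tw);

    Document doc = new Document();
    // Facets are hierarchical: "Author/Lisa" is a path, not a flat label.
    facetFields.addFields(doc, Arrays.asList(new CategoryPath("Author", "Lisa")));
    iw.addDocument(doc);
    iw.close();
    tw.close();

    // Searching: collect per-category counts alongside the hits.
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
    TaxonomyReader tr = new DirectoryTaxonomyReader(taxoDir);
    FacetSearchParams fsp = new FacetSearchParams(
        new CountFacetRequest(new CategoryPath("Author"), 10));
    FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), tr);
    searcher.search(new MatchAllDocsQuery(), fc);
    for (FacetResult res : fc.getFacetResults()) {
      System.out.println(res);
    }
  }
}
```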

December 8, 2012

Looking at a Plaintext Lucene Index

Filed under: Indexing,Lucene — Patrick Durusau @ 5:24 pm

Looking at a Plaintext Lucene Index by Florian Hopf.

From the post:

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can’t really inspect if you don’t use tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

Good starting point for learning more about Lucene indexes.
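For the curious, switching an index over to the plaintext codec is a one-line configuration change in Lucene 4.x. A minimal sketch, assuming the lucene-core and lucene-codecs jars are on the classpath and using a placeholder index path:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PlainTextIndex {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        // Write all index files in human-readable form -- for learning and
        // debugging only, as Florian notes, not for production use.
        cfg.setCodec(new SimpleTextCodec());

        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(new File("/tmp/plaintext-index")), cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "hello plaintext codec", Field.Store.YES));
            writer.addDocument(doc);
        }
        // The resulting index files in /tmp/plaintext-index open in any text editor.
    }
}
```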

November 21, 2012

Lucene with Zing, Part 2

Filed under: Indexing,Java,Lucene,Zing JVM — Patrick Durusau @ 9:22 am

Lucene with Zing, Part 2 by Mike McCandless.

From the post:

When I last tested Lucene with the Zing JVM the results were impressive: Zing’s fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene’s 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory’s max throughput was now better than RAMDirectory’s (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory’s performance, and lower the pressure on concurrent mark/sweep’s GC to the point where pause times were so much lower in GA compared to alpha?

Mike updates his prior experience with Lucene and Zing.

Covers his use of gcLogAnalyser and Fragger to understand why his performance test results changed between the alpha and GA releases.

Insights into both Lucene and Zing.

Have you considered loading your topic map into RAM?
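If you do want to try the RAMDirectory side of Mike’s experiment on your own index, the whole-index copy is short in Lucene 4. A minimal sketch (the index path is a placeholder, and keep Mike’s caveat in mind: a large RAMDirectory puts real pressure on the garbage collector):

```java
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.RAMDirectory;

public class RamLoadedIndex {
    public static void main(String[] args) throws Exception {
        // Copy every file of an on-disk index into heap-resident buffers.
        Directory onDisk = FSDirectory.open(new File("/path/to/index")); // placeholder path
        Directory inRam = new RAMDirectory(onDisk, IOContext.READ);

        // Searches now hit RAM only -- but the whole index lives on the Java
        // heap, which is exactly the GC stress Mike is measuring.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(inRam));
        System.out.println("maxDoc=" + searcher.getIndexReader().maxDoc());
    }
}
```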

November 18, 2012

LucidWorks Announces Lucene Revolution 2013

Filed under: Conferences,Lucene,LucidWorks — Patrick Durusau @ 4:50 pm

LucidWorks Announces Lucene Revolution 2013 by Paul Doscher, CEO of LucidWorks.

From the webpage:

LucidWorks, the trusted name in Search, Discovery and Analytics, today announced that Lucene Revolution 2013 will take place at The Westin San Diego on April 29 – May 2, 2013. Many of the brightest minds in open source search will convene at this 4th annual Lucene Revolution to discuss topics and trends driving the next generation of search. The conference will be preceded by two days of Apache Lucene, Solr and Big Data training.

BTW, the call for papers opened on November 12, 2012, but you still have time left: http://lucenerevolution.org/2013/call-for-papers

Jan. 13, 2013: CFP closes
Feb 1, 2013: Speakers notified

October 22, 2012

Searching Big Data’s Open Source Roots

Filed under: BigData,Hadoop,Lucene,LucidWorks,Mahout,Open Source,Solr — Patrick Durusau @ 1:56 pm

Searching Big Data’s Open Source Roots by Nicole Hemsoth.

Nicole talks to Grant Ingersoll, Chief Scientist at LucidWorks, about the open source roots of big data.

No technical insights but a nice piece to pass along to the c-suite. Investment in open source projects can pay rich dividends. So long as you don’t need them next quarter. 😉

And a snapshot of where we are now, which is on the brink of new tools and capabilities in search technologies.

October 12, 2012

Lucene 4.0 and Solr 4.0 Released!

Filed under: Lucene,Solr — Patrick Durusau @ 3:29 pm

Lucene 4.0 and Solr 4.0 Released!

I am omitting the lengthy list of new features, both those added since the beta and those since the last release.

There will be more complete reviews elsewhere, and a full list of changes ships with the distribution. No need for me to repeat it here.

Grab copies of both and enjoy the weekend!

October 2, 2012

Solr vs ElasticSearch: Part 3 – Searching

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:21 am

Solr vs ElasticSearch: Part 3 – Searching by Rafał Kuć.

From the post:

In the last two parts of the series we looked at the general architecture and how data can be handled in both Apache Solr 4 (aka SolrCloud) and ElasticSearch and what the language handling capabilities of both enterprise search engines are like. In today’s post we will discuss one of the key parts of any search engine – the ability to match queries to documents and retrieve them.

  • Solr vs. ElasticSearch: Part 1 – Overview
  • Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  • Solr vs. ElasticSearch: Part 3 – Searching
  • Solr vs. ElasticSearch: Part 4 – Faceting
  • Solr vs. ElasticSearch: Part 5 – API Usage Possibilities

Definitely a series to follow.

September 29, 2012

Lucene’s new analyzing suggester [Can You Say Synonym?]

Filed under: Lucene,Synonymy — Patrick Durusau @ 3:49 pm

Lucene’s new analyzing suggester by Mike McCandless.

From the post:

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, have been a standard, essential search feature ever since Google set a high bar by going live with it just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I’m describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you’d process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what’s suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:

One of the use cases that Mike mentions is use of the AnalyzingSuggester to suggest synonyms of terms entered by a user.

That presumes that you know the target of the search and likely synonyms that occur in it.

Use standard synonym sets and you will get standard synonym results.

Develop custom synonym sets and you can deliver time/resource saving results.
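As a sketch of how that might look in code (assuming the 4.1 suggest module; `targets.txt` is a hypothetical tab-separated “suggestion&lt;TAB&gt;weight” file, and the analyzer is where you would hang a SynonymFilter for the synonym use case):

```java
import java.io.FileInputStream;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;
import org.apache.lucene.util.Version;

public class SuggestSketch {
    public static void main(String[] args) throws Exception {
        // The analyzer is applied to both the targets and the typed prefix;
        // add a SynonymFilter to its chain and synonyms of what the user
        // types will match the indexed targets.
        AnalyzingSuggester suggester =
            new AnalyzingSuggester(new StandardAnalyzer(Version.LUCENE_41));

        // targets.txt (hypothetical): one suggestion per line with an optional
        // tab-separated weight, e.g. mined from query logs.
        suggester.build(new FileDictionary(new FileInputStream("targets.txt")));

        // Top 5 completions for the prefix the user has typed so far.
        List<Lookup.LookupResult> results = suggester.lookup("luc", false, 5);
        for (Lookup.LookupResult r : results) {
            System.out.println(r.key + " (weight=" + r.value + ")");
        }
    }
}
```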
