Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 18, 2013

Solr Query Parsing

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

Solr Query Parsing by Erik Hatcher.

From the description:

Interpreting what the user meant and what they ideally would like to find is tricky business. This talk will cover useful tips and tricks to better leverage and extend Solr’s analysis and query parsing capabilities to more richly parse and interpret user queries.

It may just be me, but shouldn’t a Solr presentation hit the ground running, trusting that you have some background on the subject at hand?

I won’t name names or topics, but when a presentation starts off with the same basics covered in any number of other talks, it’s hard to stay interested.

That’s not a problem here, even working just from the slides of Erik’s presentation!

Highly recommended!

November 17, 2013

Spelling isn’t a subject…

Filed under: Lucene,Solr — Patrick Durusau @ 8:39 pm

Have you seen Alec Baldwin’s teacher commercial?

A student suggests spelling as a subject and Alec responds: “Spelling isn’t a subject, spell-check, that’s a program, right?”

In Spellchecking in Trovit by Xavier Sanchez Loro, you will find that spell-check is more than a “program.”

Especially in a multi-language environment where the goal isn’t just correct spelling but delivery of relevant information to users.

From the post:

This post aims to explain the implementation and use case for spellchecking in the Trovit search engine that we will be presenting at the Lucene/Solr Revolution EU 2013 [1]. Trovit [2] is a classified ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using SOLR [3] and Lucene [4] in order to help our users to better find the desired ads and to avoid the dreaded 0 results as much as possible (obviously, whilst still reporting back relevant information to the user). As such, our goal is not pure orthographic correction, but also to suggest correct searches for a certain site.

Our approach: Contextual Spellchecking

One key element in the spellchecking process is choosing the right dictionary, one with a relevant vocabulary for the type of information included in each site. Our approach is specializing the dictionaries based on user’s search context. Our search contexts are composed of country (with a default language) and vertical (determining the type of ads and vocabulary). Each site’s document corpus has a limited vocabulary, reduced to the type of information, language and terms included in each site’s ads. Using a more generalized approach is not suitable for our needs, since a unique vocabulary for each language (regardless of the vertical) is not as precise as specialized vocabularies for each language and vertical. We have observed drastic differences in the type of terms included in the indexes and the semantics of each vertical. Terms that are relevant in one context are meaningless in another one (e.g. “chalet” is not a relevant word in cars vertical, but is a highly relevant word for homes vertical). As such, Trovit’s spellchecking implementation exhibits very different vocabularies for each site, even when supporting the same language.

I like the emphasis on “contextual” spellchecking.

Sounds a lot like “contextual” subject recognition.

Yes?

Walking through this post in detail is an excellent exercise!
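
If you want to experiment with the idea, here is a minimal sketch of per-context spellchecking using Lucene’s spell module. The context key, paths and field name are my inventions for illustration, not Trovit’s; their actual pipeline is described in the post.

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ContextualSpellcheck {
        public static void main(String[] args) throws IOException {
            // One spellcheck index per (country, vertical) context, e.g. Spanish homes.
            String context = "es-homes";
            IndexReader reader = DirectoryReader.open(
                FSDirectory.open(new File("indexes/" + context)));
            SpellChecker spell = new SpellChecker(
                FSDirectory.open(new File("spell/" + context)));

            // Build the dictionary from the vocabulary actually present in this
            // site's ads, not from a general-purpose word list for the language.
            spell.indexDictionary(
                new LuceneDictionary(reader, "ad_text"),
                new IndexWriterConfig(Version.LUCENE_45,
                    new StandardAnalyzer(Version.LUCENE_45)),
                true);

            // "chalet" is a plausible correction in the homes vertical, noise in cars.
            for (String suggestion : spell.suggestSimilar("calet", 5)) {
                System.out.println(suggestion);
            }
            spell.close();
            reader.close();
        }
    }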

November 16, 2013

CLue

Filed under: Indexing,Lucene,Luke — Patrick Durusau @ 7:24 pm

CLue – Command Line tool for Apache Lucene by John Wang.

From the webpage:

When working with Lucene, it is often useful to inspect an index.

Luke is awesome, but often it is not feasible to inspect an index on a remote machine using a GUI. That’s where Clue comes in. You can ssh into your production box and inspect your index using your favorite shell.

Another important feature of Clue is the ability to interact with other Unix commands via piping, e.g. grep, more, etc.

[New in 0.0.4 Release]

  • Add ability to investigate indexes on HDFS
  • Add command to dump the index
  • Add command to import from a dumped index
  • Add configuration support, now you can configure Clue to run your own custom code
  • Add index trimming functionality: sometimes you want a smaller index to work with
  • lucene 4.5.1 upgrade

Definitely a tool to investigate for adding to your tool belt!

November 14, 2013

Querying rich text with Lux

Filed under: Lucene,Query Language,XML,XQuery — Patrick Durusau @ 11:17 am

Querying rich text with Lux – XQuery for Lucene by Michael Sokolov.

Slide deck that highlights features of Lux, which is billed on its homepage as:

Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

Not surprisingly, I am in favor of using XML to provide context for data.

You can get a better feel for Lux by:

Reading Indexing Queries in Lux by Michael Sokolov (Balisage 2013)

Visiting the Lux homepage: http://luxdb.org

Downloading Lux Source: http://github.com/msokolov/lux

BTW, Michael does have experience with XML-based content: safaribooksonline.com, oed.com, degruyter.com, oxfordreference.com and others.

PS: Remember, any comments on XQuery 3.0 are due by November 19, 2013.

November 12, 2013

Using Solr to Search and Analyze Logs

Filed under: Hadoop,Log Analysis,logstash,Lucene,Solr — Patrick Durusau @ 4:07 pm

Using Solr to Search and Analyze Logs by Radu Gheorghe.

From the description:

Since we’ve added Solr output for Logstash, indexing logs via Logstash has become a possibility. But what if you are not using (only) Logstash? Are there other ways you can index logs in Solr? Oh yeah, there are! The following slides are from the Lucene Revolution conference that just took place in Dublin, where we talked about indexing and searching logs with Solr.

Slides only, but a very good set of slides.

Radu’s post reminds me that I overlooked logs in the Hadoop ecosystem when describing semantic diversity (Hadoop Ecosystem Configuration Woes?).

Or for that matter, how do you link up the logs with particular configuration or job settings?

Emails to the support desk and sticky notes don’t seem equal to the occasion.

November 6, 2013

elasticsearch 1.0.0.beta1 released

Filed under: ElasticSearch,Lucene,Search Engines,Searching — Patrick Durusau @ 8:04 pm

elasticsearch 1.0.0.beta1 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback are essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning “Elasticsearch”.

Percolation has been supported by Elasticsearch for a long time. In the current implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have the more percolation you can do.

After reading the news release I understand why Twitter traffic on the elasticsearch release surged today. 😉

A new major feature with each beta release? That should attract some attention.

Not to mention “distributed percolation.”

Getting closer to a result being the “result” at X time on the system clock.
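
If you want to kick the tires, here is a minimal sketch of percolation with the 1.0-era Java client, following the pattern in the Elasticsearch documentation. The index, type, field and alert names are mine, and given the beta warning above, the API may shift before 1.0 final.

    import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    import static org.elasticsearch.index.query.QueryBuilders.matchQuery;

    import org.elasticsearch.action.percolate.PercolateResponse;
    import org.elasticsearch.client.Client;

    public class PercolateSketch {
        public static void run(Client client) throws Exception {
            // Register an alert. Queries now live in the .percolator type of the
            // data index itself, so they are distributed across shards with the data.
            client.prepareIndex("news", ".percolator", "elasticsearch-alert")
                  .setSource(jsonBuilder()
                      .startObject()
                          .field("query", matchQuery("body", "elasticsearch"))
                      .endObject())
                  .setRefresh(true)
                  .execute().actionGet();

            // "Search reversed": which registered queries match this new article?
            PercolateResponse response = client.preparePercolate()
                .setIndices("news")
                .setDocumentType("article")
                .setSource(jsonBuilder()
                    .startObject()
                        .startObject("doc")
                            .field("body", "A newspaper article mentioning Elasticsearch")
                        .endObject()
                    .endObject())
                .execute().actionGet();

            for (PercolateResponse.Match match : response) {
                System.out.println("matched query: " + match.getId());
            }
        }
    }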

October 27, 2013

Tiny Data: Rapid development with Elasticsearch

Filed under: ElasticSearch,Lucene,Ruby — Patrick Durusau @ 6:52 pm

Tiny Data: Rapid development with Elasticsearch by Leslie Hawthorn.

From the post:

Today we’re pleased to bring you the story of the creation of SeeMeSpeak, a Ruby application that allows users to record gestures for those learning sign language. Florian Gilcher, one of the organizers of the Berlin Elasticsearch User Group participated in a hackathon last weekend with three friends, resulting in this brand new open source project using Elasticsearch on the back end. (Emphasis in original.)

Project:

Sadly, there are almost no good learning resources for sign language on the internet. If material is available, licensing is a hassle, or both the licensing and the material are poorly documented. Documenting sign language yourself is also hard, because producing and collecting videos is difficult. You need third-party recording tools, video conversion and manual categorization. That’s a sad state in a world where every notebook has a usable camera built in!

Our idea was to leverage modern browser technologies to provide an easy recording function and a quick interface to categorize the recorded words. The result is SeeMeSpeak.

Two lessons here:

  1. Data does not have to be “big” in order to be important.
  2. Browsers are very close to being the default UI for users.

Lucene Image Retrieval LIRE

Filed under: Image Recognition,Lucene — Patrick Durusau @ 6:40 pm

Lucene Image Retrieval LIRE by Mathias Lux.

From the post:

Today I gave a talk on LIRE at the ACM Multimedia conference in the open source software competition, currently taking place in Barcelona. It gave me the opportunity to present a local installation of the LIRE Solr plugin and the possibilities thereof. Find the slides of the talk at slideshare: LIRE presentation at the ACM Multimedia Open Source Software Competition 2013

The Solr plugin itself is fully functional for Solr 4.4 and the source is available at https://bitbucket.org/dermotte/liresolr. There is a markdown document README.md explaining what can be done with the plugin and how to actually install it. Basically it can do content-based search, content-based re-ranking of text searches, and brings along a custom field implementation & sub-linear search based on hashing.

There is a demo site as well.

See also: LIRE: open source image retrieval in Java.

If you plan on capturing video feeds from traffic cams or other sources, to link up with other data, image recognition is in your future.

You can start with a no-bid research contract or with LIRE and Lucene.

Your call.
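
For a feel of the library side, here is a minimal sketch of content-based search with LIRE’s Java API. The paths are made up and the class names are from the LIRE documentation of this era, so verify them against the version you download.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import javax.imageio.ImageIO;

    import net.semanticmetadata.lire.DocumentBuilder;
    import net.semanticmetadata.lire.DocumentBuilderFactory;
    import net.semanticmetadata.lire.ImageSearchHits;
    import net.semanticmetadata.lire.ImageSearcher;
    import net.semanticmetadata.lire.ImageSearcherFactory;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class LireSketch {
        public static void main(String[] args) throws Exception {
            // Index side: extract a CEDD feature vector into a Lucene document.
            BufferedImage img = ImageIO.read(new File("images/example.jpg"));
            DocumentBuilder builder = DocumentBuilderFactory.getCEDDDocumentBuilder();
            Document doc = builder.createDocument(img, "images/example.jpg");
            // ... add doc to an IndexWriter as usual ...

            // Search side: rank indexed images by visual similarity to a query image.
            IndexReader reader = DirectoryReader.open(
                FSDirectory.open(new File("lire-index")));
            ImageSearcher searcher = ImageSearcherFactory.createCEDDImageSearcher(10);
            ImageSearchHits hits = searcher.search(ImageIO.read(new File("query.jpg")), reader);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.score(i) + ": "
                    + hits.doc(i).getValues(DocumentBuilder.FIELD_NAME_IDENTIFIER)[0]);
            }
            reader.close();
        }
    }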

October 25, 2013

Collection Aliasing:…

Filed under: BigData,Cloudera,Lucene,Solr — Patrick Durusau @ 7:29 pm

Collection Aliasing: Near Real-Time Search for Really Big Data by Mark Miller.

From the post:

The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.

Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.

We wanted to make sure that one simple scaling strategy that has been commonly used in the past for large amounts of time-series data would be fairly simple to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on — data that is continuously being generated and is easily associated with a current timestamp.

One of the keys to this strategy is a feature that Cloudera recently contributed to Solr: collection aliasing. The approach involves using collection aliases to juggle collections in a very scalable little “dance.” The architecture has some limitations, but for the right use cases, it’s an extremely scalable option. I also think there are some areas of the dance that we can still add value to, but you can already do quite a bit with the current functionality.

A great post if you have really big data. 😉

Seriously, it is a great post and introduction to collection aliases.

On the other hand, I do wonder what routine Abbott and Costello would do with the variations on big, bigger, really big, etc., data.

Suggestions welcome!
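
If you want to try the feature, the Collections API exposes aliasing directly. Here is a minimal sketch with made-up collection and alias names; the CREATEALIAS action takes an alias name and a comma-separated list of collections.

    import java.net.URL;
    import java.util.Scanner;

    public class CreateAliasSketch {
        public static void main(String[] args) throws Exception {
            // Point the "logs" search alias at the two newest monthly collections;
            // a separate single-collection alias can serve as the indexing target.
            String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATEALIAS"
                + "&name=logs"
                + "&collections=logs_2013_10,logs_2013_11";
            try (Scanner s = new Scanner(new URL(url).openStream(), "UTF-8")) {
                System.out.println(s.useDelimiter("\\A").next()); // status response
            }
        }
    }

Re-issuing CREATEALIAS with a new collection list repoints the alias in one step, which is the pivot of the “dance” Mark describes: create the next time slice, shift the aliases, drop the oldest collection.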

Apache Lucene and Solr 4.5.1 (bugfix)

Filed under: Lucene,Solr — Patrick Durusau @ 8:04 am

Apache Lucene and Solr 4.5.1

From the post:

Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and Apache Solr search server, numbered 4.5.1. This is a minor bugfix release.

Apache Lucene 4.5.1 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.5.1 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Release note for Apache Lucene 4.5.1 can be found at: http://wiki.apache.org/lucene-java/ReleaseNote451, Solr release notes can be found at: http://wiki.apache.org/solr/ReleaseNote451.

Without a “tech surge” no less.

October 11, 2013

Free Text and Spatial Search…

Filed under: Lucene,Searching,Spatial Index — Patrick Durusau @ 3:08 pm

Free Text and Spatial Search with Spatial4J and Lucene Spatial by Steven Citron-Pousty.

From the post:

Hey there, Shifters. One of my talks at FOSS4G 2013 covered Lucene Spatial. Today’s post is going to follow up on my post about creating Lucene indices by adding spatial capabilities to the index. In the end you will have a full example of how to create a fast and full-featured full-text spatial search on any documents you want to use.

How to add spatial to your Lucene index

In the last post I covered how to create a Lucene index, so in this post I will just cover how to add spatial. The first things you need to understand are the two pieces of how spatial is handled by Lucene. A lot of this work is done by Dave Smiley. He gave a great presentation on all this technology at Lucene/Solr Revolution 2013. If you really want to dig in deep, I suggest you watch his hour-and-fifteen-minute video: my blog post is more the Too Long Didn’t Listen (TL;DL) version.

  • Spatial4J: This Java library provides geospatial shapes, distance calculations, and importing and exporting of shapes. It is Apache-licensed so it can be used with other ASF projects. Lucene Spatial uses Spatial4J to create the spatial objects that get indexed along with the documents. It will also be used when calculating distances in a query or when we want to convert between distance units. Spatial4J is able to handle real-world, on-a-sphere coordinates (what comes out of a GPS unit) and projected coordinates (any 2D map) for both shapes and distances.

Short aside: The oldest Java-based spatial library is JTS, which is used in many other open source Java geospatial projects. Spatial4J uses JTS under the hood if you want to work with Polygon shapes. Unfortunately, until recently it was LGPL and so could not be included in Lucene. JTS has announced its intention to move to a BSD-type license, which should allow Spatial4J and JTS to start working together for more Java spatial goodness for all. One of the beauties of FOSS is the ability to see development discussions happen in the open.

  • Lucene Spatial: After many different and custom iterations, there is now a spatial module built right into Lucene as a standard library. It is new with the 4.x releases of Lucene. What Lucene Spatial does is provide the indexing and search strategies for Spatial4J shapes stored in a Lucene index. It has SpatialStrategy as the base class to define the signature that any spatial strategy must fulfill. You then use the same strategy for the index writing and reading.

Today I will show the code to use Spatial4J with Lucene Spatial to add a spatially indexed field to your Lucene index.

Pay special attention to the changes that made it possible for Spatial4J and JTS to work together.

Cooperation between projects makes the resulting whole stronger.

Some office projects need to have that realization.
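
For a taste of the code Steven walks through, here is a compressed sketch of indexing and filtering on a point with Spatial4J and Lucene spatial. The field name, coordinates and tree depth are arbitrary choices on my part.

    import com.spatial4j.core.context.SpatialContext;
    import com.spatial4j.core.distance.DistanceUtils;
    import com.spatial4j.core.shape.Point;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.spatial.SpatialStrategy;
    import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
    import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
    import org.apache.lucene.spatial.query.SpatialArgs;
    import org.apache.lucene.spatial.query.SpatialOperation;

    public class SpatialSketch {
        public static void main(String[] args) {
            // Spatial4J supplies shapes and distance math; Lucene spatial supplies
            // the strategy that turns shapes into indexed fields and queries.
            SpatialContext ctx = SpatialContext.GEO;
            SpatialStrategy strategy = new RecursivePrefixTreeStrategy(
                new GeohashPrefixTree(ctx, 11), "location");

            // Indexing: a point becomes one or more indexable fields.
            Document doc = new Document();
            Point pt = ctx.makePoint(-77.03, 38.90); // lon, lat
            for (Field f : strategy.createIndexableFields(pt)) {
                doc.add(f);
            }
            // ... add doc with an IndexWriter as usual ...

            // Searching: everything within 50 km of the same point.
            SpatialArgs spatialArgs = new SpatialArgs(SpatialOperation.Intersects,
                ctx.makeCircle(-77.03, 38.90,
                    DistanceUtils.dist2Degrees(50, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
            Filter filter = strategy.makeFilter(spatialArgs);
            // Pass "filter" to IndexSearcher.search(query, filter, n) alongside a
            // free-text query: that is the "free text + spatial" combination.
        }
    }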

October 10, 2013

Apache Lucene: Then and Now

Filed under: Lucene,Solr,SolrCloud — Patrick Durusau @ 3:06 pm

Apache Lucene: Then and Now by Doug Cutting.

From the description at Washington DC Hadoop Users Group:

Doug Cutting originally wrote Lucene in 1997-8. It joined the Apache Software Foundation’s Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has merged into the Lucene project itself and Mahout, Nutch, and Tika have moved to become independent top-level projects. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

In today’s discussion, Doug will share background on the impetus and creation of Lucene. He will talk about the evolution of the project and explain what the core technology has enabled today. Doug will also share his thoughts on what the future holds for Lucene and Solr.

Interesting walk down history lane with the creator of Lucene, Doug Cutting.
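
The document-and-fields abstraction the description mentions is still the heart of the API. A minimal indexing sketch, with paths and field names of my choosing:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexOneDoc {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/index"));
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
                new StandardAnalyzer(Version.LUCENE_45));
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                // Lucene never sees the PDF/HTML/Word file itself, only the text
                // you extract from it; that keeps the API format-independent.
                Document doc = new Document();
                doc.add(new StringField("path", "/docs/report.pdf", Field.Store.YES));
                doc.add(new TextField("body", "text extracted from the document...",
                    Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }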

October 7, 2013

Webinar: Turbo-Charging Solr

Filed under: Entity Resolution,Lucene,LucidWorks,Relevance,Solr — Patrick Durusau @ 10:40 am

Turbo-charge your Solr instance with Entity Recognition, Business Rules and a Relevancy Workbench by Yann Yu.

Date: Thursday, October 17, 2013
Time: 10:00am Pacific Time

From the post:

LucidWorks has three new modules available in the Solr Marketplace that run on top of your existing Solr or LucidWorks Search instance. Join us for an overview of each module and learn how implementing one, two or all three will turbo-charge your Solr instance.

  • Business Rules Engine: Out-of-the-box integration with Drools, the popular open-source business rules engine, is now available for Solr and LucidWorks Search. With the LucidWorks Business Rules module, developers can write complex rules using declarative syntax with very little programming. Data can be modified, cleaned and enriched through multiple permutations and combinations.
  • Relevancy Workbench: Experiment with different search parameters to understand the impact of these changes on search results. With intuitive, color-coded, side-by-side comparisons of results for different sets of parameters, users can quickly tune their application to produce the results they need. The Relevancy Workbench encourages experimentation with a visual “before and after” view of the results of parameter changes.
  • Entity Recognition: Enhance search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

All of these modules will be of interest to topic mappers who are processing bulk data.

October 6, 2013

Apache Solr 4.5 documentation

Filed under: Lucene,Solr — Patrick Durusau @ 4:22 pm

Apache Solr 4.5 documentation

From the post:

Apache Solr PMC announced that the newest version of official Apache Solr documentation for Solr 4.5 (more about that version) is now available. The PDF file with documentation is available at: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/.

If Apache Solr 4.5 was welcome news, this is even more so!

I am doing a lot of proofing of drafts (not by me) this week. Always refreshing to have alternative reading material that doesn’t make me wince.

That’s unfair. To the Apache Solr Reference Manual, that is.

It is way better than simply not making me wince.

I am sure I will find things I would state differently, but I feel confident I won’t encounter the writing errors we have been warned against since grade school.

I won’t go into the details as someone might mistake description for recommendation. 😉

Enjoy the Apache Solr 4.5 documentation!

October 5, 2013

Apache Lucene 4.5 and Apache Solr™ 4.5 available

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

From: Apache Lucene News:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.5 and Apache Solr 4.5.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • Added support for missing values to DocValues fields through AtomicReader.getDocsWithField.
  • Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues, supporting missing values and with most datastructures residing off-heap.
  • New in-memory DocIdSet implementations which are especially better than FixedBitSet on small sets: WAH8DocIdSet, PFORDeltaDocIdSet and EliasFanoDocIdSet.
  • CachingWrapperFilter now caches filters with WAH8DocIdSet by default, which has the same memory usage as FixedBitSet in the worst case but is smaller and faster on small sets.
  • TokenStreams now set the position increment in end(), so we can handle trailing holes.
  • IndexWriter no longer clones the given IndexWriterConfig.

Lucene 4.5 also includes numerous optimizations and bugfixes.

Highlights of the Solr release include:

  • Custom sharding support, including the ability to shard by field.
  • DocValue improvements: single-valued fields no longer require a default value, allowing dynamicFields to contain doc values, as well as sortMissingFirst and sortMissingLast on docValue fields.
  • Ability to store solr.xml in ZooKeeper.
  • Multithreaded faceting.
  • CloudSolrServer can now route updates directly to the appropriate shard leader.

Solr 4.5 also includes numerous optimizations and bugfixes.

Excellent!

October 3, 2013

Dublin Lucene Revolution 2013 Sessions

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:45 pm

Dublin Lucene Revolution 2013 Sessions

Just a sampling to whet your appetite, with many more entries at the intermediate and introductory levels.

Of all of the listed sessions, which ones will set your sights on Dublin?

Reminder: Training: November 4-5, Conference: November 6-7

October 1, 2013

Elasticsearch internals: an overview

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:50 pm

Elasticsearch internals: an overview by Njal Karevoll.

From the post:

This article gives an overview of the Elasticsearch internals. I will present a 10,000 foot view of the different modules that Elasticsearch is composed of and how we can extend or replace built-in functionality using plugins.

Using Freemind, Njal has created maps of the namespaces and modules of ElasticSearch for your exploration.

The full module view reminds me of SGML productions, except less complicated.

September 30, 2013

Lucene now has an in-memory terms dictionary…

Filed under: Indexing,Lucene — Patrick Durusau @ 7:05 pm

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code by Mike McCandless.

From the post:

Last year, Han Jiang’s Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.

Lucene continues to improve, rapidly!
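
The terms dictionary you get is determined by the postings format, so trying Han’s work amounts to a per-field codec override. A sketch, assuming the “FST41” format name from the lucene-codecs module (the ord-capable variant is registered separately); check CHANGES.txt for your version:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene45.Lucene45Codec;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    public class FstTermsConfig {
        public static IndexWriterConfig withFstTermsDictionary() {
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
                new StandardAnalyzer(Version.LUCENE_45));
            iwc.setCodec(new Lucene45Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // "FST41" is assumed here; use "FSTOrd41" if you need
                    // TermsEnum.ord() support, per Mike's description.
                    return PostingsFormat.forName("FST41");
                }
            });
            return iwc;
        }
    }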

September 28, 2013

Language support and linguistics

Filed under: Language,Lucene,Solr — Patrick Durusau @ 7:31 pm

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system by Gaute Lambertsen and Christian Moen.

Slides from Lucene Revolution May, 2013.

Good overview of language support and linguistics in both Lucene and Solr.

Fewer language examples at the beginning would shorten the slide deck from its current count of one hundred and fifty-one (151) slides without impairing its message.

Still, if you are unfamiliar with language support in Lucene and Solr, the extra examples don’t hurt anything.

September 24, 2013

Three exciting Lucene features in one day

Filed under: Lucene,Search Engines — Patrick Durusau @ 4:21 pm

Three exciting Lucene features in one day by Mike McCandless.

From the post:

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the “long tail” of possible suggestions based on the 1 or 2 previous tokens.

By anybody’s count, that was an extraordinary day!

Drop by Mike’s post for the details.
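
To make the first feature concrete, here is a minimal sketch of an expression-based sort; the popularity field is my invention:

    import org.apache.lucene.expressions.Expression;
    import org.apache.lucene.expressions.SimpleBindings;
    import org.apache.lucene.expressions.js.JavascriptCompiler;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    public class ExpressionSort {
        public static Sort popularityBoostedSort() throws Exception {
            // A dynamic sort "field" computed from the relevance score and a
            // numeric field, using the builtin JavaScript-like syntax.
            Expression expr = JavascriptCompiler.compile("_score + sqrt(popularity)");

            SimpleBindings bindings = new SimpleBindings();
            bindings.add(new SortField("_score", SortField.Type.SCORE));
            bindings.add(new SortField("popularity", SortField.Type.INT));

            // true = sort descending by the computed value.
            return new Sort(expr.getSortField(bindings, true));
        }
    }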

September 3, 2013

IDH HBase & Lucene Integration

Filed under: HBase,IDH HBase,Lucene — Patrick Durusau @ 7:00 pm

IDH HBase & Lucene Integration by Ritu Kama.

From the post:

HBase is a non-relational, column-oriented database that runs on top of the Hadoop Distributed File System (HDFS). HBase’s tables contain rows and columns. Each table has an element defined as a Primary Key, which is used for all Get/Put/Scan/Delete operations on those tables. To some extent this can be a shortcoming, because one may want to search within, say, a given column.

The IDH Integration with Lucene

The Intel® Distribution for Apache Hadoop* (IDH) solves this problem by incorporating native features that permit straightforward integration with Lucene. Lucene is a search library that acts upon documents containing data fields and their values. The IDH-to-Lucene integration leverages the HBase Observer and Endpoint concepts, and therein lies the flexibility to access the HBase data with Lucene searches more robustly.

The Observers can be likened to triggers in RDBMSs, while the Endpoints share some conceptual similarity to stored procedures. The mapping of HBase records and Lucene documents is done by a convenience class called IndexMetadata. The HBase Observer monitors data updates to the HBase table and builds indexes synchronously. The indexes are stored in multiple shards, with each shard tied to a region. The HBase Endpoint dispatches search requests from the client to those regions.

When entering data into an HBase table you’ll need to create an HBase-Lucene mapping using the IndexMetadata class. During the insertion, text in the columns that are mapped gets broken into indexes and stored in the Lucene index file. This process of creating the Lucene index is done automatically by the IDH implementation. Once the Lucene index is created, you can search on any keyword. The implementation searches for the word in the Lucene index and retrieves the row IDs of the target word. Then, using those keys, you can directly access the relevant rows in the database.

IDH’s HBase-Lucene integration extends HBase’s capability and provides many advantages:

  1. Search not only by row key but also by values.
  2. Use multiple query types such as Starts, Ends, Contains, Range, etc.
  3. Ranking scores for the search are also available.

(…)

Interested yet?

See Ritu’s post for sample code and configuration procedures.

Definitely one for the short list on downloads to make.

August 28, 2013

Building a distributed search system

Filed under: Distributed Computing,Hadoop,Lucene,Search Engines — Patrick Durusau @ 2:13 pm

Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.

From the preface:

This work analyses the problem posed by the so-called Big Data scenario, which can be defined as the technological challenge of managing and administering quantities of information of global dimension, on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), with an exponential growth rate. We’ll explore a technological and algorithmic approach to handling amounts of data that exceed the computational limits of a traditional architecture based on real-time request processing: in particular, we’ll analyze a single open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.

We’ll also describe how to distribute a cluster of commodity servers to create a virtual file system, and how to use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation is a web-based application which offers the user a unified search interface against a collection of technical papers. The aim is to demonstrate that a performant search system can be obtained by pre-processing the data using the Map and Reduce paradigm, in order to obtain real-time response that is independent of the underlying amount of data. Finally, we’ll compare this solution to different approaches based on clustering or NoSQL technologies, with the aim of describing the characteristics of the concrete scenarios which suggest the adoption of those technologies.

Fairly complete (75 pages) report on a project indexing academic papers with Lucene and Hadoop.

I would like to see a treatment of the voiced demand for “real-time processing” versus the actual need for “real-time processing.”

When I started using research tools, indexes like the Readers’ Guide to Periodical Literature ran at a minimum two (2) weeks behind popular journals.

Academic indexes lagged that far behind, if not a good bit longer.

The timeliness of indexing journal articles is now nearly simultaneous with publication.

Has the quality of our research improved due to faster access?

I can imagine use cases, drug interactions for example, the discovery of which should be streamed out as soon as practical.

But drug interactions are not the average case.

It would be very helpful to see research on what factors favor “real-time” solutions and which are quite sufficient with “non-real-time” solutions.

August 24, 2013

Agenda for Lucene/Solr Revolution EU! [Closes September 9, 2013]

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 6:34 pm

Help Us Set the Agenda for Lucene/Solr Revolution EU! by Laura Whalen.

From the post:

Thanks to all of you who submitted an abstract for the Lucene/Solr Revolution EU 2013 conference in Dublin. We had an overwhelming response to the Call for Papers, and narrowing the topics from the many great submissions was a difficult task for the Conference Committee. Now we need your help in making the final selections!

Vote now! Community voting will close September 9, 2013.

The Lucene/Solr Revolution free voting system allows you to vote on your favorite topics. The sessions that receive the highest number of votes will be automatically added to the Lucene/Solr Revolution EU 2013 agenda. The remaining sessions will be selected by a committee of industry experts who will take into account the community’s votes as well as their own expertise in the area. Click here to start voting for your favorites.

Your chance to influence the Lucene/Solr Revolution agenda for Dublin! (November 4-7)

PS: As of August 24, 2013, about 11:33 UTC, I was getting a server error from the voting link. Maybe overload of voters?

August 22, 2013

You complete me

Filed under: AutoSuggestion,ElasticSearch,Interface Research/Design,Lucene — Patrick Durusau @ 2:03 pm

You complete me by Alexander Reelsen.

From the post:

Effective search is not just about returning relevant results when a user types in a search phrase, it’s also about helping your user to choose the best search phrases. Elasticsearch already has did-you-mean functionality which can correct the user’s spelling after they have searched. Now, we are adding the completion suggester which can make suggestions while-you-type. Giving the user the right search phrase before they have issued their first search makes for happier users and reduced load on your servers.

Warning: The completion suggester Alexander describes may “change/break in future releases.”

Two features that made me read the post were: readability and custom ordering.

Under readability, the example walks you through returning one output for several search completions.

Suggestions don’t have to be presented in TF/IDF relevance order. A weight assigned to the target of a completion controls the ordering of suggestions.

The post covers several other features and if you are using or considering using Elasticsearch, it is a good read.
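
For reference, here is a minimal sketch of requesting completions from the Java client, assuming a field already mapped as type “completion” and indexed with inputs, an output and a weight as the post describes. The names are mine, and per the warning above, the details may change.

    import org.elasticsearch.action.suggest.SuggestResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;

    public class CompleteMe {
        public static void suggest(Client client) {
            // Ask for up to five completions of the prefix "m" against the
            // (assumed) "name_suggest" completion field.
            SuggestResponse response = client.prepareSuggest("hotels")
                .addSuggestion(new CompletionSuggestionBuilder("hotel-suggest")
                    .field("name_suggest")
                    .text("m")
                    .size(5))
                .execute().actionGet();
            System.out.println(response.getSuggest());
        }
    }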

duplitector

Filed under: Duplicates,ElasticSearch,Lucene — Patrick Durusau @ 1:17 pm

duplitector by Paweł Rychlik.

From the webpage:

duplitector

A duplicate data detector engine based on Elasticsearch. It’s been successfully used as a proof of concept, piloting a full-blown enterprise solution.

Context

In certain systems we have to deal with lots of low-quality data, containing typos, malformed or missing fields, erroneous bits of information, sometimes coming from different sources, like careless humans, faulty sensors, multiple external data providers, etc. Such datasets often contain vast numbers of duplicate or similar entries. If this is the case, these systems might struggle to deal with such unnatural, often unforeseen, conditions. It might, in turn, affect the quality of service delivered by the system.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of various sorts of organizations (e.g. NPO databases). Still, it’s small and generic enough, so that it can be easily adjusted to handle other data schemes or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from IRS, partially intentionally modified, partially made up), so that it’s convenient to test the algorithm’s pieces.

Paweł also points to this article by Andrei Zmievski: Duplicates Detection with ElasticSearch. Andrei merges tags for locations based on their proximity to particular coordinates.

I am looking forward to the use of indexing engines for deduplication of data in situ, as it were. That is, without transforming the data into some other format for processing.

August 21, 2013

SuggestStopFilter carefully removes stop words for suggesters

Filed under: Lucene,Search Engines — Patrick Durusau @ 6:07 pm

SuggestStopFilter carefully removes stop words for suggesters by Michael McCandless.

Michael has tamed the overly “aggressive” StopFilter with SuggestStopFilter.

From the post:

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter except that when the token is the very last token, it checks the offset for that token, and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won’t change it. This way a query “a” can find “apple”, but a query “a ” (with a trailing space) will find nothing because the “a” will be removed.

I’ve pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

Have you noticed how quickly improvements for Lucene and Solr emerge?
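
If you want to wire the new filter into an analyzer, here is a minimal sketch; the stop set and analyzer structure are my choices, not Mike’s:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;
    import org.apache.lucene.util.Version;

    public class SuggestAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_45, reader);
            // Unlike StopFilter, SuggestStopFilter preserves a trailing "a" so
            // the suggester can still complete it to "apple".
            TokenStream stream = new SuggestStopFilter(tokenizer,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            return new TokenStreamComponents(tokenizer, stream);
        }
    }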

August 1, 2013

Open Source Search FTW!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Open Source Search FTW! by Grant Ingersoll.

Abstract:

Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we’ll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.

If you aren’t already studying search engines, perhaps these slides will convince you to do so.

When you think about it, search precedes all other computer processing.

July 30, 2013

Lucene 4 Performance Tuning

Filed under: Indexing,Lucene,Performance,Searching — Patrick Durusau @ 6:47 pm

From the description:

Apache Lucene has undergone a major overhaul, changing many of its key characteristics dramatically. New features and modifications allow for new as well as fundamentally different ways of tuning the engine for best performance.

Tuning performance is essential for almost every Lucene-based application these days; search and performance are almost synonyms. Knowing the details of the underlying software provides the basic tools to get the best out of your application, and knowing the limitations can save you and your company a massive amount of time and money. This talk tries to explain design decisions made in Lucene 4 compared to older versions and provides technical details on how those implementations and design decisions can help to improve the performance of your application. The talk will mainly focus on core features like realtime and batch indexing, filter and query performance, and highlighting and custom scoring.

The talk will contain a lot of technical details that require a basic understanding of Lucene, data structures and algorithms. You don’t need to be an expert to attend, but be prepared for a deep dive into Lucene. Attendees don’t need to be direct Lucene users; the fundamentals provided in this talk are also essential for Apache Solr or elasticsearch users.

If you want to catch some of the highlights of Lucene 4, this is the presentation for you!

It will be hard not to go dig deeper in a number of areas.

The new codec features were particularly impressive!

July 26, 2013

Lucene/Solr Revolution EU 2013 – Reminder

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 11:34 am

Lucene/Solr Revolution EU 2013 – Reminder

The deadline for submitting an abstract is August 2, 2013.

Key Dates:

June 3, 2013: CFP opens
August 2, 2013: CFP closes
August 12, 2013: Community voting begins
September 1, 2013: Community voting ends
September 22, 2013: All speakers notified of submission status

Top Five Reasons to Attend (according to conference organizers):

  • Learn:  Meet, socialize, collaborate, and network with fellow Lucene/Solr enthusiasts.
  • Innovate:  From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
  • Connect: The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new search apps.
  • Enjoy:  We’ve scheduled fun into the conference! Networking breaks, Stump-the-Chump, Lightning talks and a big conference party!
  • Save:  Take advantage of packaged deals on accelerated two-day training workshops, coupled with conference sessions on real-world implementations presented by Solr/Lucene experts.

Let’s be honest. The real reason to attend is Dublin, Ireland in early November. (On average, 22 rainy days in November.) 😉

Take an umbrella, extra sweater or coat and enjoy!

July 24, 2013

Apache Lucene 4.4 and Apache Solr™ 4.4 available

Filed under: Lucene,Solr — Patrick Durusau @ 3:54 pm

Lucene: http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.txt

Solr: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr CHANGES.txt

If you follow Lucene/Solr you have probably already heard the news.

There are nineteen (19) new features in Lucene and twenty (20) in Solr, so don’t neglect the release notes.

Spend some time with both releases. I don’t think you will be disappointed.
