April 26, 2012
LucidWorks 2.1
There are times, not very often, when picking only a few features to report would be unfair to a product.
This is one of those times.
I have reproduced the description of LucidWorks 2.1 as it appears on the Lucid Imagination site:
LucidWorks 2.1 new features list:
Enhancement areas and their key benefits:

Includes the latest Lucene/Solr 4.0
- Near Real Time
- Fault Tolerance and High Availability
- Data Durability
- Centralized Configuration
- Elasticity

Business Rules
- Integrate your business processes and rules with the user search experience
- Examples: landing pages, targeted search results per user, etc.
- Framework to integrate with your BRMS (Business Rules Management System)
- OOB integration with leading open source BRMS – Drools

Upgrade and Migrations
- Lucid can help upgrade customers from Solr 3.x to 4.0 or older Solr versions to LucidWorks 2.1
- Upgrades for existing LucidWorks customers on previous versions of LucidWorks to LucidWorks 2.1

Enhanced Connector Framework
- Easily build integrations to index data from any application or data source
- Framework supports REST API driven integration, generates a dynamic configuration UI, and allows admins to schedule the new connectors
- Connectors available to crawl large amounts of HDFS data, integrate Twitter updates into the index, and a CMIS connector to support CMS systems like Alfresco, etc.

Efficient Crawl of Large Web Content
- OOB integration for Nutch (open source)
- Helps crawl web-scale data into your index

REST API and UI Enhancements
- Supports memory and cache settings, and schemaless configuration using dynamic fields from the UI
- Subject Matter Experts can create Best Bets for an improved search experience
Key features and benefits of LucidWorks search platform
- Streamlined search configuration, optimization and operations: Well-organized UI makes Solr innovation easier to consume, better adapting to constant change.
- Enterprise-grade, business-critical manageability: Includes tools for infrastructure administration, monitoring and reporting so your search application can thrive within a well-defined, well-managed operational environment; includes upgradability across successive releases. We can help migrate Solr installations to LucidWorks 2.1.
- Broad-based content acquisition: Access big data and enterprise content faster and more securely with built-in support for Hadoop and Amazon S3, along with SharePoint and traditional online content types – plus a new open connector framework to customize access to other data sources.
- Versatile access and data security: Flexible, resilient built-in security simplifies getting search connected right to the right data and content.
- Advanced search experience enhancements: Powerful, innovative search capabilities deliver faster, better, more useful results for a richer user experience; easily integrates into your application and infrastructure; REST API automates and integrates search as a service with your application.
- Open source power and innovation: Complete, supported release of Lucene/Solr 4.0, including latest innovations in Near Real Time search, distributed indexing and more versatile field faceting over and above Apache Lucene/Solr 3.x; all the flexibility of open source, packaged for business-critical development, maintenance and deployment.
- Cost-effective commercial-grade expertise and global 24×7 support: A range of annual support subscriptions including bundled services, consulting, training and certification from the world’s leading experts in Lucene/Solr open source.
Comments Off on LucidWorks 2.1
April 25, 2012
An open source replacement for the dtSearch closed source search engine
From the webpage:
We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).
The preservation/reuse of stored queries is a testament to the configurable nature of Lucene software.
How far can the query preservation/reuse capabilities of Lucene be extended?
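The positional-information requirement the post mentions maps naturally onto Lucene's span queries. A minimal sketch against the Lucene 3.x API (the field and term names here are invented for illustration, not taken from the project described):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PositionDump {
    // Print every position where "merger" occurs in the "body" field,
    // the kind of per-article position data a monitoring app would display.
    static void dumpPositions(IndexReader reader) throws Exception {
        SpanTermQuery query = new SpanTermQuery(new Term("body", "merger"));
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            System.out.printf("doc=%d start=%d end=%d%n",
                spans.doc(), spans.start(), spans.end());
        }
    }
}
```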
Comments Off on Replacing dtSearch
April 20, 2012
On Schemas and Lucene
Chris Male writes:
One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.
To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists, or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the optimal strategies for faceting and grouping different fields can be chosen based on field metadata retrieved from its schema.
Consequently as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going ask you a lot of questions and I hope you can clarify this issue for me.
What follows is a deeply thoughtful examination of the pros and cons of schemas for Lucene and/or their role in Solr.
If you are using Lucene, take the time to review Chris’s questions and voice your own questions or concerns.
The Lucene you improve will be your own.
If you are interested in either Lucene or Solr, now would be a good time to speak up.
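To make the gap concrete, here is a minimal sketch of what the application has to "know" on its own in schemaless Lucene (3.x API, with an invented field name); in Solr the schema records that the field is numeric, so the right query type can be chosen for you:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

// Indexing: the application itself must remember that "price" is numeric...
Document doc = new Document();
doc.add(new NumericField("price").setIntValue(42));

// ...and at query time the right query type has to be hard-coded; nothing in the
// index stops a caller from issuing a plain text range over "price" instead.
Query priceQuery = NumericRangeQuery.newIntRange("price", 10, 100, true, true);
```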
Comments Off on On Schemas and Lucene
April 16, 2012
Lucene Revolution 2012
Best advertising for the conference:
Presentations/videos from Lucene Revolution 2011.
Agenda for Lucene Revolution 2012.
Boston, May 7 – 10, The Royal Sonesta
The ad department thought otherwise:
Top 5 Reasons You Need To Attend Lucene Revolution!
- Learn from the Best
Meet, socialize, collaborate, ask questions and network with fellow Lucene / Solr enthusiasts. A large contingent of the project committers will be in Boston to discuss your questions in real-time.
- Innovate with Search
From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
- Get connected in the community
The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new killer search apps — and this is the place to meet the people doing it.
- Fun…
We’ve scheduled in adequate time for fun at the conference! Networking breaks, Stump-the-Chump, and a big conference party at the Boston Museum of Science!
- A Bargain
Save money with packaged deals on accelerated two-day, hands-on training workshops, coupled with conference sessions on real-world implementations from Solr/Lucene experts throughout the world.
I’m not traveling, so I’m depending on your blogs and tweets to capture the conference!
Comments Off on Lucene Revolution 2012
April 14, 2012
Faceting & result grouping by Martijn van Groningen
From the post:
Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents together with a common property and places these documents under a group. These groups are used as the hits in the search result. Usually result grouping and faceting are used together and a lot of times the results get misunderstood.
The main reason is that when using grouping, people expect that a hit is represented by a group. Faceting isn’t aware of groups, and thus the computed counts represent documents and not groups. This difference in behaviour can be very confusing. A lot of questions on the Solr user mailing list are about this exact confusion.
In the case that result grouping is used with faceting users expect grouped facet counts. What does this mean? This means that when counting the number of matches for a specific field value the grouped faceting should check whether the group a document belongs to isn’t already counted before. This is best illustrated with some example documents.
Examples follow that make the distinction between groups and facets in Lucene and Solr clear. Not to mention specific suggestions on configuration of your service.
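For Solr users the distinction shows up directly in the request parameters. A hedged SolrJ sketch (parameter availability depends on your Solr version, and group.truncate only approximates the true grouped facet counts the post discusses; the core and field names are invented):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

SolrQuery q = new SolrQuery("ipod");
q.set("group", true);              // collapse hits into groups
q.set("group.field", "product_id");
q.set("facet", true);              // facet counts still count documents...
q.set("facet.field", "color");
q.set("group.truncate", true);     // ...unless you ask Solr to count per group instead
QueryResponse rsp = solr.query(q);
```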
Comments Off on Faceting & result grouping
April 13, 2012
Lucene Core 3.6.0 and Solr 3.6.0 Available
You weren’t seriously planning on doing spring cleaning this weekend, were you?
Thanks to the Lucene/Solr release, which you naturally have to evaluate before Monday, spring cleaning has been pushed off another week.
Hopefully something big will drop in the Hadoop ecosystem this coming week, or perhaps from one of the graph databases. I will keep an eye out.
The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.0 and Apache Solr 3.6.0
Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
Highlights of the Lucene release include:
- In addition to Java 5 and Java 6, this release has now full Java 7 support (minimum JDK 7u1 required).
- TypeTokenFilter filters tokens based on their TypeAttribute.
- Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters that could lead to exceptions during highlighting.
- Added phonetic encoders: Metaphone, Soundex, Caverphone, Beider-Morse, etc.
- CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
- Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation.
- Static index pruning (Carmel pruning) removes postings with low within-document term frequency.
- QueryParser now interprets ‘*’ as an open end for range queries.
- FieldValueFilter excludes documents missing the specified field.
- CheckIndex and IndexUpgrader allow you to specify the specific FSDirectory implementation to use with the new -dir-impl command-line option.
- FSTs can now do reverse lookup (by output) in certain cases and can be packed to reduce their size. There is now a method to retrieve top N shortest paths from a start node in an FST.
- New WFSTCompletionLookup suggester supports finer-grained ranking for suggestions.
- FST based suggesters now use an offline (disk-based) sort, instead of in-memory sort, when pre-sorting the suggestions.
- ToChildBlockJoinQuery joins in the opposite direction (parent down to child documents).
- New query-time joining is more flexible (but less performant) than index-time joins.
- Added HTMLStripCharFilter to strip HTML markup.
- Security fix: Better prevention of virtual machine SIGSEGVs when using MMapDirectory: Code using cloned IndexInputs of already closed indexes could possibly crash the VM, allowing DoS attacks to your application.
- Many bug fixes.
Highlights of the Solr release include:
- New SolrJ client connector using Apache Http Components http client (SOLR-2020)
- Many analyzer factories are now ‘multi term query aware’ allowing for things like field type aware lowercasing when building prefix & wildcard queries. (SOLR-2438)
- New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056)
- Range Faceting (Dates & Numbers) is now supported in distributed search (SOLR-1709)
- HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690)
- StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
- New LFU Cache option for use in Solr’s internal caches. (SOLR-2906)
- Memory performance improvements to all FST based suggesters (SOLR-2888)
- New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714)
- New options for configuring the amount of concurrency used in distributed searches (SOLR-3221)
- Many bug fixes.
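One of the small but handy Lucene 3.6 items is easy to try right away: the classic QueryParser now accepts ‘*’ as an open end in range queries. A quick sketch (the field name is invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

QueryParser parser = new QueryParser(Version.LUCENE_36, "body",
    new StandardAnalyzer(Version.LUCENE_36));
// Everything from 20120101 onward, with no upper bound.
Query q = parser.parse("published:[20120101 TO *]");
```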
Comments Off on Lucene Core 3.6.0 and Solr 3.6.0 Available
April 1, 2012
Stupid Solr tricks: Introduction (SST #0)
Bill Dueber writes:
Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.
Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.
Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.
So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.
Definitely a series to watch, or to contribute to, or better yet, to start for your software package of choice!
Comments Off on Stupid Solr tricks: Introduction (SST #0)
March 27, 2012
Result Grouping Made Easier
From the post:
Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in searches. However, these Collectors have many constructor arguments, so using grouping in pure Lucene apps can become quite cumbersome. The example below illustrates this.
(code omitted)
In the above example basic grouping with caching is used and also the group count is retrieved. As you can see there is quite a lot of coding involved. Recently a grouping convenience utility has been added to the Lucene grouping module to alleviate this problem. As the code example below illustrates, using the GroupingSearch utility is much easier than interacting with actual grouping collectors.
Normally the document count is returned as the hit count. However, in the situation where groups rather than documents are being used as hits, the document count will not work with pagination. For this reason the group count can be used to get correct pagination. The group count is the number of unique groups matching the query and can, in that case, be used as the hit count since the individual hits are groups.
There are really two lessons here.
The first lesson is that if you need the GroupingSearch utility, use it.
Second is that Lucene is evolving rapidly enough that if you are a regular user, you need to be monitoring developments and releases carefully.
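A rough sketch of what the convenience utility looks like in the Lucene 4.0 grouping module (method names were still settling at the time, and the field name here is invented, so treat this as illustrative):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

GroupingSearch groupingSearch = new GroupingSearch("author"); // group hits by the "author" field
groupingSearch.setCachingInMB(4.0, true);  // cache collected docs so the second pass is cheap
groupingSearch.setAllGroups(true);         // also compute the total group count for pagination
TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 10);
Integer totalGroupCount = result.totalGroupCount; // number of unique groups matching the query
```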
Comments Off on Result Grouping Made Easier
March 25, 2012
Lucene Full Text Indexing with Neo4j by Romiko Derbynew.
From the post:
I spent some time working on full text search for Neo4j. The basic goals were as follows.
- Control the pointers of the index
- Full Text Search
- All operations are done via Rest
- Can create an index when creating a node
- Can update an index
- Can check if an index exists
- When bootstrapping Neo4j in the cloud run Index checks
- Query Index using full text search lucene query language.
Download:
This is based on Neo4jClient: http://nuget.org/List/Packages/Neo4jClient
Source Code at: http://hg.readify.net/neo4jclient/
Introduction
So with the above objectives, I decided to go with Manual Indexing. The main reason here is that I can put an index pointing to node A based on values in node B.
Imagine the following.
You have Node A with a list:
Surname, FirstName and MiddleName. However Node A also has a relationship to Node B which has other names, perhaps Display Names, Avatar Names and AKA’s.
So with manual indexing, you can have all the above entries for names in Node A and Node B point to Node A only. (emphasis added)
Not quite merging but it is an interesting take on creating a single point of reference.
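The post works over REST with the .NET Neo4jClient; purely for illustration, here is roughly the same manual-indexing idea using Neo4j's embedded Java API (the index, key and property values are my own, not Romiko's):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.helpers.collection.MapUtil;

// Inside a transaction: a full-text manual index whose entries we control explicitly.
static Node indexBothNodesAtA(GraphDatabaseService graphDb, Node nodeA) {
    Index<Node> names = graphDb.index().forNodes("names",
        MapUtil.stringMap("provider", "lucene", "type", "fulltext"));

    // Values taken from Node A and from the related Node B, but every entry points at Node A.
    names.add(nodeA, "name", "Smith");        // surname stored on Node A
    names.add(nodeA, "name", "Agent Smith");  // display/AKA name stored on Node B

    // A Lucene full-text query over the manual index returns Node A either way.
    return names.query("name", "agent*").getSingle();
}
```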
BTW, search for Neo4j while you are at Romiko’s blog. Several very interesting posts and I am sure more are forthcoming.
Comments Off on Lucene Full Text Indexing with Neo4j
March 23, 2012
Challenges in maintaining a high performance search engine written in Java
You will find this on the homepage, or you may have to search for it. I was logged in when I accessed it.
Very much worth your time for a high level overview of issues that Lucene will face sooner rather than later.
After reviewing, think about it, make serious suggestions and if possible, contributions to the future of Lucene.
Just off the cuff, I would really like to see Lucene become a search engine framework with a default data structure that allows either extension or replacement by other data structures. Some data structures may have higher performance costs than others, but if that is what your requirements call for, they can hardly be wrong. Yes? A “fast” search engine that doesn’t meet your requirements is no prize.
Comments Off on Challenges in maintaining a high performance search engine written in Java
March 19, 2012
Document Frequency Limited MultiTermQuerys
From the post:
If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.
Querying this kind of data is tricky, since only matching the traditional spelling of a word can lead to many valid results being missed. One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys, such as FuzzyQuery or WildcardQuery. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.
You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.
Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.
Not to give too much away, but you will learn how to tune a fuzzy match of terms. (To account for misspellings, for example.)
This is a very good site and blog for search issues.
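One simple, hand-rolled way to apply such a document-frequency cutoff with the Lucene 3.x API (a sketch of the general idea, not necessarily the rewrite-method approach the post describes; field name and threshold are invented):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyTermEnum;
import org.apache.lucene.search.TermQuery;

// Expand "hotel" to its fuzzy variations, but keep only variations that occur in
// at least 50 documents, dropping rare misspellings to protect performance.
static BooleanQuery commonVariations(IndexReader reader) throws Exception {
    FuzzyTermEnum termEnum = new FuzzyTermEnum(reader, new Term("body", "hotel"));
    BooleanQuery query = new BooleanQuery();
    try {
        do {
            Term t = termEnum.term();
            if (t != null && termEnum.docFreq() >= 50) {
                query.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
            }
        } while (termEnum.next());
    } finally {
        termEnum.close();
    }
    return query;
}
```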
Comments Off on Document Frequency Limited MultiTermQuerys
March 14, 2012
New index statistics in Lucene 4.0
Mike McCandless writes:
In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.
Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
Mike uses a simple example to illustrate the statistics available in Lucene 4.0.
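For a taste of what becomes available, here is a hedged sketch against the 4.0 trunk API as it stood at the time (names may have shifted since, and the field and term are invented):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;

static void printStats(Directory dir) throws Exception {
    DirectoryReader reader = DirectoryReader.open(dir);
    Terms terms = MultiFields.getTerms(reader, "body");
    int docCount = terms.getDocCount();                   // docs with at least one "body" term
    long sumDocFreq = terms.getSumDocFreq();              // total postings in "body"
    long sumTotalTermFreq = terms.getSumTotalTermFreq();  // total term occurrences in "body"
    long ttf = reader.totalTermFreq(new Term("body", "lucene")); // occurrences of one term
    int df = reader.docFreq(new Term("body", "lucene"));         // docs containing that term
    System.out.println(docCount + " " + sumDocFreq + " " + sumTotalTermFreq + " " + ttf + " " + df);
    reader.close();
}
```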
Comments Off on New index statistics in Lucene 4.0
March 7, 2012
Integrating Lucene with HBase by Boris Lublinsky and Mike Segel.
You have to get to the conclusion for the punch line:
The simple implementation described in this paper fully supports all of the Lucene functionality, as validated by many unit tests from both Lucene core and contrib modules. It can be used as a foundation for building a very scalable search implementation, leveraging the inherent scalability of HBase and its fully symmetric design, which allows adding any number of processes serving HBase data. It also avoids the need to close an open Lucene index reader to incorporate newly indexed data, which becomes automatically available to users with a possible delay controlled by the cache time-to-live parameter. In the next article we will show how to extend this implementation to incorporate geospatial search support.
Put why your article is important in the introduction as well.
The second article does better:
Implementing Lucene Spatial Support
In our previous article [1], we discussed how to integrate Lucene with HBase for improved scalability and availability. In this article I will show how to extend this implementation with spatial support.
The Lucene spatial contribution package [2, 3, 4, 5] provides powerful support for spatial search, but is limited to finding the closest point. In reality spatial search often has significantly more requirements: for example, which points belong to a given shape (circle, bounding box, polygon), which shapes intersect with a given shape, and so on. The solution presented in this article addresses all of the above problems.
Comments Off on Integrating Lucene with HBase
March 6, 2012
Using your Lucene index as input to your Mahout job – Part I
From the post:
This blog shows you how to use an upcoming Mahout feature, the lucene2seq program or https://issues.apache.org/jira/browse/MAHOUT-944. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.
Access to original text can help with improving clustering results. See the blog post for details.
March 5, 2012
Bad vs Good Search Experience by Emir Dizdarevic.
From the post:
The Problem
This article will show how a bad search solution can be improved. We will demonstrate how to build an enterprise search solution relatively easy using Apache Lucene/SOLR.
We took a local ad site as an example of a bad search experience.
We crawled the ad site with Apache Nutch, using a couple of home grown plugins to fetch only the data we want and not the whole site. Stay tuned for a separate article on this topic.
‘BAD’ search is based on real search results from the ad site, i.e. how the website search currently works. ‘GOOD’ search is based on the same data but indexed with Apache Lucene/Solr (inverted index).
BAD Search: We assume that it’s based on exact match criteria or something similar to a ‘%like%’ database statement. To simulate this behavior we used a content field tokenized by whitespace and lowercased, and used phrase queries every time. This is the closest we could get to the existing ad site search solution, and even this bad, it performed better than the actual site search.
An excellent post in part because of the detailed example but also to show that improving search results is an iterative process.
Enjoy!
Comments Off on Bad vs Good Search Experience
March 1, 2012
Transactional Lucene
Mike McCandless writes:
Many users don’t appreciate the transactional semantics of Lucene’s APIs and how this can be useful in search applications. For starters, Lucene implements ACID properties:
If you have some very strong coffee and time to play with your experimental setup of Lucene, this is a post to get lost in.
When I read a post like this it sparks one idea and then another and pretty soon most of the afternoon is gone.
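The ACID angle is easy to poke at yourself. A minimal sketch with the 3.x IndexWriter (field names are invented; prepareCommit is also there if you need two-phase commit with another resource):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

static void demo(Directory dir) throws Exception {
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.commit();        // changes become visible to newly opened readers

    writer.addDocument(doc);
    writer.rollback();      // discards everything since the last commit and closes the writer
}
```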
Comments Off on Transactional Lucene
February 9, 2012
Lucene-3759: Support joining in a distributed environment.
From the description:
Add two more methods in JoinUtil to support joining in a distributed manner.
- Method to retrieve all from values.
- Method to create a TermsQuery based on a set of from terms.
With these two methods distributed joining can be supported following these steps:
- Retrieve from values from each shard
- Merge the retrieved from values.
- Create a TermsQuery based on the merged from terms and send this query to all shards.
Topic maps that have been split into shards could have values that would trigger merging if present in a single shard.
This appears to be a way to address that issue.
Time spent with Lucene is time well spent.
Comments Off on Lucene-3759: Support joining in a distributed environment
February 7, 2012
8 Best Open Source Search Engines built on top of Lucene
By my count I get five (5) based on Lucene. See what you think.
Lucene base:
- Apache Solr
- Compass
- Constellio
- Elastic Search
- Katta
No Lucene base:
- Bobo Search
- Index Tank
- Summa
Post has short summaries about the search engines and links to their sites.
Do you think the terminology around search engines is as confused as around NoSQL databases?
Any cross-terminology comparisons you would recommend to CIO’s or even users?
Lucene – Solr (new website)
Must be something in the air that is leading to this rash of new websites. 😉
No complaints about having them, better design is always appreciated.
If you haven’t contributed to an Apache project lately, take this opportunity to check out Lucene, Solr or one of the related projects.
Use the software, make comments, find bugs, contribute fixes for bugs, documentation, etc.
You and the community will be richer for it.
Comments Off on Lucene – Solr (new website)
February 6, 2012
Uwe Says: is your Reader atomic? by Uwe Schindler.
From the blog:
Since Day 1 Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality; from the IndexWriter perspective this was desirable, but when reading the index this caused several problems in the past. In reality a Lucene index isn’t a single index, even while logically treated as such. The latest developments in Lucene trunk try to expose reality for type-safety and performance, but before I go into details about Composite, Atomic and DirectoryReaders let me go back in time a bit.
If you don’t mind looking deep into the heart of indexing in Lucene, this is a post for you. Problems, both solved and remaining, are discussed. This could be your opportunity to contribute to the Lucene community.
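For a feel of the distinction, here is a small sketch against the 4.0 trunk API of the time (class names were still in flux, so treat this as illustrative): a DirectoryReader is a composite, and per-segment work happens on its atomic leaves.

```java
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

static void walkLeaves(Directory dir) throws Exception {
    DirectoryReader reader = DirectoryReader.open(dir); // composite: one reader per segment underneath
    for (AtomicReaderContext ctx : reader.leaves()) {
        AtomicReader leaf = ctx.reader();               // atomic: terms, postings, norms live here
        System.out.println("segment docs: " + leaf.maxDoc());
    }
    reader.close();
}
```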
Comments Off on Uwe Says: is your Reader atomic?
February 2, 2012
Query time joining in Lucene
From the post:
Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.
Let’s say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.
Joins based upon matching terms in different indexes.
Work is not finished yet so now would be the time to contribute your experiences or opinions.
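Roughly what the call looks like in the join module that landed in 4.0 (field names are made up, and the exact signature shifted while the work was in progress):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

static TopDocs commentsForArticles(IndexSearcher articleSearcher,
                                   IndexSearcher commentSearcher) throws Exception {
    // Select the articles of interest...
    Query articleQuery = new TermQuery(new Term("title", "lucene"));
    // ...gather their "id" values, and turn them into a query against the
    // "articleId" field that each comment document carries.
    Query joinQuery = JoinUtil.createJoinQuery(
        "id",            // "from" field on article documents
        false,           // a single value per article document
        "articleId",     // "to" field on comment documents
        articleQuery,
        articleSearcher,
        ScoreMode.None);
    return commentSearcher.search(joinQuery, 10);
}
```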
Comments Off on Query time joining in Lucene
January 25, 2012
Berlin Buzzwords 2012
Important Dates (all dates in GMT +2)
Submission deadline: March 11th 2012, 23:59 MEZ
Notification of accepted speakers: April 6th, 2012, MEZ
Publication of final schedule: April 13th, 2012
Conference: June 4/5. 2012
The call:
Call for Submission Berlin Buzzwords 2012 – Search, Store, Scale — June 4 / 5. 2012
The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:
- IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
- NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
- Large Data Processing – Hadoop itself, MapReduce, Cascading or Pig and relatives
Related topics not explicitly listed above are more than welcome. We are looking for presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.
…(moved dates to top)…
High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.
Here is your chance to experience summer in Berlin (Berlin Buzzwords 2012) and in Montreal (Balisage).
Seriously, both conferences are very strong and worth your attention.
Comments Off on Berlin Buzzwords 2012
January 23, 2012
Solr and Lucene Reference Guide updated for v3.5
From the post:
The free Solr Reference Guide published by Lucid Imagination has been updated to 3.5 – the current release version of Solr and Lucene. The changes weren’t major, but here are the key changes:
- Support for the Hunspell stemmer
- The new langid UpdateProcessor
- Numeric types now support sortMissingFirst/Last
- New parameter hl.q for use with highlighting
- Field types supported by the StatsComponent now includes date and string fields
Almost 400 pages of rainy winter day reading.
OK, so you need a taste for that sort of thing. 😉
Comments Off on Solr and Lucene Reference Guide updated for v3.5
January 20, 2012
Simon says: Single Byte Norms are Dead!
From the post:
Apache Lucene turned 10 last year with a limitation that bugged many, many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written down to the index. Lucene trades some precision loss for space on disk, and eventually in memory, since norms are loaded into memory per field upon first access.
In lots of cases this precision loss is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing, you end up writing your own auxiliary data structure or “forking” Lucene for your app and messing with the source.
The upcoming version of Lucene has already added support for a lot more scoring models, such as BM25 and language models.
The abstractions added to Lucene to implement those models already open the door for applications that either want to roll their own “awesome” scoring model or modify the low-level scorer implementations. Yet, norms are still one byte!
Don’t worry! The post has a happy ending!
Read on if you want to be on the cutting edge of Lucene work.
Thanks Lucene Team!
Comments Off on Simon says: Single Byte Norms are Dead!
January 14, 2012
ToChildBlockJoinQuery in Lucene.
Mike McCandless writes:
In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.
This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.
This will be included in Lucene 3.6.0 and 4.0.
Comments Off on ToChildBlockJoinQuery in Lucene
January 9, 2012
Searching relational content with Lucene’s BlockJoinQuery
Mike McCandless writes:
Lucene’s 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.
Most search engines can’t directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
Mike covers how to index relational content with Lucene 3.4.0 as well as the current limitations on that relational indexing. Current work is projected to resolve some of those limitations.
This feature will be immediately useful in a number of contexts.
Even more promising is the development of thinking about indexing as more than term -> document. Both sides of that operator need more granularity.
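The index-time side of this is worth seeing once: parent and child documents are added as a single block, and the join query bridges them at search time. A sketch against the 3.4 contrib API (field names are invented; the query class was later renamed ToParentBlockJoinQuery):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BlockJoinQuery;

static Query indexAndJoin(IndexWriter writer) throws Exception {
    // Index one company (parent) and its job listings (children) as a single block;
    // the parent must be the last document in the block.
    List<Document> block = new ArrayList<Document>();
    Document job = new Document();
    job.add(new Field("skill", "java", Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(job);
    Document company = new Document();
    company.add(new Field("docType", "company", Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(company);
    writer.addDocuments(block);

    // At search time, match children and join up to their parent documents.
    Filter parents = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("docType", "company"))));
    Query childQuery = new TermQuery(new Term("skill", "java"));
    return new BlockJoinQuery(childQuery, parents, BlockJoinQuery.ScoreMode.Avg);
}
```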
Comments Off on Searching relational content with Lucene’s BlockJoinQuery
January 6, 2012
Katta – Lucene & more in the cloud
From the webpage:
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.
- Makes serving large or high load indices easy
- Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
- Replicate shards on different servers for performance and fault-tolerance
- Supports pluggable network topologies
- Master fail-over
- Fast, lightweight, easy to integrate
- Plays well with Hadoop clusters
- Apache Version 2 License
Now that the “new” has worn off of your holiday presents, ;-), something to play with over the weekend.
Comments Off on Katta – Lucene & more in the cloud
January 4, 2012
Hadoop for Archiving Email – Part 2 by Sunil Sitaula.
From the post:
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.
January 1, 2012
Optimizing Findability in Lucene and Solr
From the post:
To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.
Table of Contents
Introduction
Planning for Findability
Knowing your Content
Knowing your Users
Garbage In, Garbage Out
Analyzing your Analysis
Stemming In Greater Detail
Query Techniques for Better Search
Navigation Hints
Final Thoughts
Resources
by Grant Ingersoll
You know a post is going to be long when it starts off with a table of contents. Fortunately, in this case it is also very good, by one of the principal architects of Lucene, Grant Ingersoll.
A good start on developing findability skills but as the post points out, a lot of it will depend on your knowledge of what “findability” means to your users. Only you can answer that question.
Comments Off on Optimizing Findability in Lucene and Solr
Gora Graduates! (Incubator location)
Over Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!
Congratulations to all involved.
Oh, the project:
What is Gora?
Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.
Why Gora?
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built in Apache Hadoop support.
The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.
- Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
- Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
- Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
- Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading
- MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
Comments Off on Gora Graduates!