April 26, 2012
LucidWorks 2.1
There are times, not very often, when picking only a few features to report would be unfair to a product.
This is one of those times.
I have reproduced the description of LucidWorks 2.1 as it appears on the Lucid Imagination site:
LucidWorks 2.1 new features list:
Enhancement areas and their key benefits:

Includes the latest Lucene/Solr 4.0
- Near Real Time
- Fault Tolerance and High Availability
- Data Durability
- Centralized Configuration
- Elasticity

Business Rules
- Integrate your business processes and rules with the user search experience
- Examples: landing pages, targeted search results per user, etc.
- Framework to integrate with your BRMS (Business Rules Management System)
- OOB integration with leading open source BRMS – Drools

Upgrade and Migrations
- Lucid can help upgrade customers from Solr 3.x to 4.0 or older Solr versions to LucidWorks 2.1
- Upgrades for existing LucidWorks customers on previous versions of LucidWorks to LucidWorks 2.1

Enhanced Connector Framework
- Easily build integrations to index data from any application or data source
- Framework supports REST API driven integration, generates a dynamic configuration UI, and allows admins to schedule the new connectors
- Connectors available to crawl large amounts of HDFS data, integrate Twitter updates into the index, and a CMIS connector to support CMS systems like Alfresco, etc.

Efficient Crawl of Large Web Content
- OOB integration for Nutch (open source)
- Helps crawl web-scale data into your index

REST API and UI Enhancements
- Supports memory and cache settings, and schemaless configuration using dynamic fields from the UI
- Subject Matter Experts can create Best Bets for an improved search experience
Key features and benefits of LucidWorks search platform
- Streamlined search configuration, optimization and operations: Well-organized UI makes Solr innovation easier to consume, better adapting to constant change.
- Enterprise-grade, business-critical manageability: Includes tools for infrastructure administration, monitoring and reporting so your search application can thrive within a well-defined, well-managed operational environment; includes upgradability across successive releases. We can help migrate Solr installations to LucidWorks 2.1.
- Broad-based content acquisition: Access big data and enterprise content faster and more securely with built-in support for Hadoop and Amazon S3, along with SharePoint and traditional online content types – plus a new open connector framework to customize access to other data sources.
- Versatile access and data security: Flexible, resilient built-in security simplifies getting search connected right to the right data and content.
- Advanced search experience enhancements: Powerful, innovative search capabilities deliver faster, better, more useful results for a richer user experience; easily integrates into your application and infrastructure; REST API automates and integrates search as a service with your application.
- Open source power and innovation: Complete, supported release of Lucene/Solr 4.0, including latest innovations in Near Real Time search, distributed indexing and more versatile field faceting over and above Apache Lucene/Solr 3.x; all the flexibility of open source, packaged for business-critical development, maintenance and deployment.
- Cost-effective commercial-grade expertise and global 24×7 support: A range of annual support subscriptions including bundled services, consulting, training and certification from the world’s leading experts in Lucene/Solr open source.
Comments Off on LucidWorks 2.1
April 25, 2012
An open source replacement for the dtSearch closed source search engine
From the webpage:
We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).
The preservation/reuse of stored queries is a testament to the configurable nature of Lucene software.
How far can the query preservation/reuse capabilities of Lucene be extended?
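The positional-information requirement the post mentions maps naturally onto Lucene's span queries. A minimal sketch against the Lucene 3.x API (the field and term names here are invented for illustration, not taken from the project described):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class PositionDump {
    // Print every position where "merger" occurs in the "body" field,
    // the kind of per-article position data a monitoring app would display.
    static void dumpPositions(IndexReader reader) throws Exception {
        SpanTermQuery query = new SpanTermQuery(new Term("body", "merger"));
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            System.out.printf("doc=%d start=%d end=%d%n",
                spans.doc(), spans.start(), spans.end());
        }
    }
}
```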
Comments Off on Replacing dtSearch
April 20, 2012
On Schemas and Lucene
Chris Male writes:
One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.
To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists, or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the optimal strategies for faceting and grouping different fields can be chosen based on field metadata retrieved from its schema.
Consequently as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going ask you a lot of questions and I hope you can clarify this issue for me.
What follows is a deeply thoughtful examination of the pros and cons of schemas for Lucene and/or their role in Solr.
If you are using Lucene, take the time to review Chris’s questions and voice your own questions or concerns.
The Lucene you improve will be your own.
If you are interested in either Lucene or Solr, now would be a good time to speak up.
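To make the gap concrete, here is a minimal sketch of what the application has to "know" on its own in schemaless Lucene (3.x API, with an invented field name); in Solr the schema records that the field is numeric, so the right query type can be chosen for you:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

// Indexing: the application itself must remember that "price" is numeric...
Document doc = new Document();
doc.add(new NumericField("price").setIntValue(42));

// ...and at query time the right query type has to be hard-coded; nothing in the
// index stops a caller from issuing a plain text range over "price" instead.
Query priceQuery = NumericRangeQuery.newIntRange("price", 10, 100, true, true);
```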
Comments Off on On Schemas and Lucene
April 16, 2012
Lucene Revolution 2012
Best advertising for the conference:
Presentations/videos from Lucene Revolution 2011.
Agenda for Lucene Revolution 2012.
Boston, May 7 – 10, The Royal Sonesta
The ad department thought otherwise:
Top 5 Reasons You Need To Attend Lucene Revolution!
- Learn from the Best
Meet, socialize, collaborate, ask questions and network with fellow Lucene / Solr enthusiasts. A large contingent of the project committers will be in Boston to discuss your questions in real-time.
- Innovate with Search
From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
- Get connected in the community
The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new killer search apps — and this is the place to meet the people doing it.
- Fun…
We’ve scheduled in adequate time for fun at the conference! Networking breaks, Stump-the-Chump, and a big conference party at the Boston Museum of Science!
- A Bargain
Save money with packaged deals on accelerated two-day, hands-on training workshops, coupled with conference sessions on real-world implementations from Solr/Lucene experts throughout the world.
I’m not traveling, so I’m depending on your blogs and tweets to capture the conference!
Comments Off on Lucene Revolution 2012
April 14, 2012
Faceting & result grouping by Martijn van Groningen
From the post:
Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents together with a common property and places these documents under a group. These groups are used as the hits in the search result. Usually result grouping and faceting are used together and a lot of times the results get misunderstood.
The main reason is that when using grouping, people expect that a hit is represented by a group. Faceting isn’t aware of groups, and thus the computed counts represent documents and not groups. This difference in behaviour can be very confusing. A lot of questions on the Solr user mailing list are about this exact confusion.
In the case that result grouping is used with faceting users expect grouped facet counts. What does this mean? This means that when counting the number of matches for a specific field value the grouped faceting should check whether the group a document belongs to isn’t already counted before. This is best illustrated with some example documents.
Examples follow that make the distinction between groups and facets in Lucene and Solr clear. Not to mention specific suggestions on configuration of your service.
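For Solr users the distinction shows up directly in the request parameters. A hedged SolrJ sketch (parameter availability depends on your Solr version, and group.truncate only approximates the true grouped facet counts the post discusses; the core and field names are invented):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

SolrQuery q = new SolrQuery("ipod");
q.set("group", true);              // collapse hits into groups
q.set("group.field", "product_id");
q.set("facet", true);              // facet counts still count documents...
q.set("facet.field", "color");
q.set("group.truncate", true);     // ...unless you ask Solr to count per group instead
QueryResponse rsp = solr.query(q);
```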
Comments Off on Faceting & result grouping
April 13, 2012
Lucene Core 3.6.0 and Solr 3.6.0 Available
You weren’t seriously planning on doing spring cleaning this weekend, were you?
Thanks to the Lucene/Solr release, which you naturally have to evaluate before Monday, spring cleaning has been pushed off another week.
Hopefully something big will drop in the Hadoop ecosystem this coming week, or perhaps from one of the graph databases. I will keep an eye out.
The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.0 and Apache Solr 3.6.0
Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
Highlights of the Lucene release include:
- In addition to Java 5 and Java 6, this release has now full Java 7 support (minimum JDK 7u1 required).
- TypeTokenFilter filters tokens based on their TypeAttribute.
- Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters that could lead to exceptions during highlighting.
- Added phonetic encoders: Metaphone, Soundex, Caverphone, Beider-Morse, etc.
- CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
- Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation.
- Static index pruning (Carmel pruning) removes postings with low within-document term frequency.
- QueryParser now interprets ‘*’ as an open end for range queries.
- FieldValueFilter excludes documents missing the specified field.
- CheckIndex and IndexUpgrader allow you to specify the specific FSDirectory implementation to use with the new -dir-impl command-line option.
- FSTs can now do reverse lookup (by output) in certain cases and can be packed to reduce their size. There is now a method to retrieve top N shortest paths from a start node in an FST.
- New WFSTCompletionLookup suggester supports finer-grained ranking for suggestions.
- FST based suggesters now use an offline (disk-based) sort, instead of in-memory sort, when pre-sorting the suggestions.
- ToChildBlockJoinQuery joins in the opposite direction (parent down to child documents).
- New query-time joining is more flexible (but less performant) than index-time joins.
- Added HTMLStripCharFilter to strip HTML markup.
- Security fix: Better prevention of virtual machine SIGSEGVs when using MMapDirectory: Code using cloned IndexInputs of already closed indexes could possibly crash the VM, allowing DoS attacks to your application.
- Many bug fixes.
Highlights of the Solr release include:
- New SolrJ client connector using Apache Http Components http client (SOLR-2020)
- Many analyzer factories are now ‘multi term query aware’ allowing for things like field type aware lowercasing when building prefix & wildcard queries. (SOLR-2438)
- New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056)
- Range Faceting (Dates & Numbers) is now supported in distributed search (SOLR-1709)
- HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690)
- StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
- New LFU Cache option for use in Solr’s internal caches. (SOLR-2906)
- Memory performance improvements to all FST based suggesters (SOLR-2888)
- New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714)
- New options for configuring the amount of concurrency used in distributed searches (SOLR-3221)
- Many bug fixes.
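One of the small but handy Lucene 3.6 items is easy to try right away: the classic QueryParser now accepts ‘*’ as an open end in range queries. A quick sketch (the field name is invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

QueryParser parser = new QueryParser(Version.LUCENE_36, "body",
    new StandardAnalyzer(Version.LUCENE_36));
// Everything from 20120101 onward, with no upper bound.
Query q = parser.parse("published:[20120101 TO *]");
```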
Comments Off on Lucene Core 3.6.0 and Solr 3.6.0 Available
April 1, 2012
Stupid Solr tricks: Introduction (SST #0)
Bill Dueber writes:
Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.
Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.
Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.
So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.
Definitely a series to watch, or to contribute to, or better yet, to start for your software package of choice!
Comments Off on Stupid Solr tricks: Introduction (SST #0)
March 27, 2012
Result Grouping Made Easier
From the post:
Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in searches. However, these Collectors have many constructor arguments, so using grouping in pure Lucene apps can become quite cumbersome. The example below illustrates this.
(code omitted)
In the above example basic grouping with caching is used and also the group count is retrieved. As you can see there is quite a lot of coding involved. Recently a grouping convenience utility has been added to the Lucene grouping module to alleviate this problem. As the code example below illustrates, using the GroupingSearch utility is much easier than interacting with actual grouping collectors.
Normally the document count is returned as the hit count. However, in the situation where groups rather than documents are being used as hits, the document count will not work with pagination. For this reason the group count can be used to get correct pagination. The group count is the number of unique groups matching the query and can, in that case, be used as the hit count since the individual hits are groups.
There are really two lessons here.
The first lesson is that if you need the GroupingSearch utility, use it.
Second is that Lucene is evolving rapidly enough that if you are a regular user, you need to be monitoring developments and releases carefully.
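A rough sketch of what the convenience utility looks like in the Lucene 4.0 grouping module (method names were still settling at the time, and the field name here is invented, so treat this as illustrative):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

GroupingSearch groupingSearch = new GroupingSearch("author"); // group hits by the "author" field
groupingSearch.setCachingInMB(4.0, true);  // cache collected docs so the second pass is cheap
groupingSearch.setAllGroups(true);         // also compute the total group count for pagination
TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 10);
Integer totalGroupCount = result.totalGroupCount; // number of unique groups matching the query
```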
Comments Off on Result Grouping Made Easier
March 25, 2012
Lucene Full Text Indexing with Neo4j by Romiko Derbynew.
From the post:
I spent some time working on full text search for Neo4j. The basic goals were as follows.
- Control the pointers of the index
- Full Text Search
- All operations are done via Rest
- Can create an index when creating a node
- Can update an index
- Can check if an index exists
- When bootstrapping Neo4j in the cloud run Index checks
- Query Index using full text search lucene query language.
Download:
This is based on Neo4jClient: http://nuget.org/List/Packages/Neo4jClient
Source Code at: http://hg.readify.net/neo4jclient/
Introduction
So with the above objectives, I decided to go with Manual Indexing. The main reason here is that I can put an index pointing to node A based on values in node B.
Imagine the following.
You have Node A with a list:
Surname, FirstName and MiddleName. However Node A also has a relationship to Node B which has other names, perhaps Display Names, Avatar Names and AKA’s.
So with manual indexing, you can have all the above entries for names in Node A and Node B point to Node A only. (emphasis added)
Not quite merging but it is an interesting take on creating a single point of reference.
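The post works over REST with the .NET Neo4jClient; purely for illustration, here is roughly the same manual-indexing idea using Neo4j's embedded Java API (the index, key and property values are my own, not Romiko's):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.helpers.collection.MapUtil;

// Inside a transaction: a full-text manual index whose entries we control explicitly.
static Node indexBothNodesAtA(GraphDatabaseService graphDb, Node nodeA) {
    Index<Node> names = graphDb.index().forNodes("names",
        MapUtil.stringMap("provider", "lucene", "type", "fulltext"));

    // Values taken from Node A and from the related Node B, but every entry points at Node A.
    names.add(nodeA, "name", "Smith");        // surname stored on Node A
    names.add(nodeA, "name", "Agent Smith");  // display/AKA name stored on Node B

    // A Lucene full-text query over the manual index returns Node A either way.
    return names.query("name", "agent*").getSingle();
}
```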
BTW, search for Neo4j while you are at Romiko’s blog. Several very interesting posts and I am sure more are forthcoming.
Comments Off on Lucene Full Text Indexing with Neo4j
March 23, 2012
Challenges in maintaining a high performance search engine written in Java
You will find this on the homepage, or you may have to search for it. I was logged in when I accessed it.
Very much worth your time for a high level overview of issues that Lucene will face sooner rather than later.
After reviewing, think about it, make serious suggestions and if possible, contributions to the future of Lucene.
Just off the cuff, I would really like to see Lucene become a search engine framework with a default data structure that allows either extension or replacement by other data structures. Some data structures may have higher performance costs than others, but if that is what your requirements call for, they can hardly be wrong. Yes? A “fast” search engine that doesn’t meet your requirements is no prize.
Comments Off on Challenges in maintaining a high performance search engine written in Java
March 19, 2012
Document Frequency Limited MultiTermQuerys
From the post:
If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.
Querying this kind of data is tricky, since only matching the traditional spelling of a word can lead to many valid results being missed. One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys, such as FuzzyQuery or WildcardQuery. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.
You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.
Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.
Not to give too much away, but you will learn how to tune a fuzzy match of terms. (To account for misspellings, for example.)
This is a very good site and blog for search issues.
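One simple, hand-rolled way to apply such a document-frequency cutoff with the Lucene 3.x API (a sketch of the general idea, not necessarily the rewrite-method approach the post describes; field name and threshold are invented):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyTermEnum;
import org.apache.lucene.search.TermQuery;

// Expand "hotel" to its fuzzy variations, but keep only variations that occur in
// at least 50 documents, dropping rare misspellings to protect performance.
static BooleanQuery commonVariations(IndexReader reader) throws Exception {
    FuzzyTermEnum termEnum = new FuzzyTermEnum(reader, new Term("body", "hotel"));
    BooleanQuery query = new BooleanQuery();
    try {
        do {
            Term t = termEnum.term();
            if (t != null && termEnum.docFreq() >= 50) {
                query.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
            }
        } while (termEnum.next());
    } finally {
        termEnum.close();
    }
    return query;
}
```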
Comments Off on Document Frequency Limited MultiTermQuerys
March 14, 2012
New index statistics in Lucene 4.0
Mike McCandless writes:
In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.
Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
Mike uses a simple example to illustrate the statistics available in Lucene 4.0.
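For a taste of what becomes available, here is a hedged sketch against the 4.0 trunk API as it stood at the time (names may have shifted since, and the field and term are invented):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;

static void printStats(Directory dir) throws Exception {
    DirectoryReader reader = DirectoryReader.open(dir);
    Terms terms = MultiFields.getTerms(reader, "body");
    int docCount = terms.getDocCount();                   // docs with at least one "body" term
    long sumDocFreq = terms.getSumDocFreq();              // total postings in "body"
    long sumTotalTermFreq = terms.getSumTotalTermFreq();  // total term occurrences in "body"
    long ttf = reader.totalTermFreq(new Term("body", "lucene")); // occurrences of one term
    int df = reader.docFreq(new Term("body", "lucene"));         // docs containing that term
    System.out.println(docCount + " " + sumDocFreq + " " + sumTotalTermFreq + " " + ttf + " " + df);
    reader.close();
}
```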
Comments Off on New index statistics in Lucene 4.0
March 7, 2012
Integrating Lucene with HBase by Boris Lublinsky and Mike Segel.
You have to get to the conclusion for the punch line:
The simple implementation described in this paper fully supports all of the Lucene functionality, as validated by many unit tests from both Lucene core and contrib modules. It can be used as a foundation for building a very scalable search implementation, leveraging the inherent scalability of HBase and its fully symmetric design, which allows adding any number of processes serving HBase data. It also avoids the need to close an open Lucene index reader to incorporate newly indexed data, which becomes automatically available to users with a possible delay controlled by the cache time-to-live parameter. In the next article we will show how to extend this implementation to incorporate geospatial search support.
Put why your article is important in the introduction as well.
The second article does better:
Implementing Lucene Spatial Support
In our previous article [1], we discussed how to integrate Lucene with HBase for improved scalability and availability. In this article I will show how to extend this implementation with spatial support.
The Lucene spatial contribution package [2, 3, 4, 5] provides powerful support for spatial search, but is limited to finding the closest point. In reality spatial search often has significantly more requirements: for example, which points belong to a given shape (circle, bounding box, polygon), which shapes intersect with a given shape, and so on. The solution presented in this article addresses all of the above problems.
Comments Off on Integrating Lucene with HBase
March 6, 2012
Using your Lucene index as input to your Mahout job – Part I
From the post:
This blog shows you how to use an upcoming Mahout feature, the lucene2seq program or https://issues.apache.org/jira/browse/MAHOUT-944. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.
Access to original text can help with improving clustering results. See the blog post for details.
March 5, 2012
Bad vs Good Search Experience by Emir Dizdarevic.
From the post:
The Problem
This article will show how a bad search solution can be improved. We will demonstrate how to build an enterprise search solution relatively easy using Apache Lucene/SOLR.
We took a local ad site as an example of a bad search experience.
We crawled the ad site with Apache Nutch, using a couple of home grown plugins to fetch only the data we want and not the whole site. Stay tuned for a separate article on this topic.
‘BAD’ search is based on real search results from the ad site, i.e. how the website search currently works. ‘GOOD’ search is based on the same data but indexed with Apache Lucene/Solr (inverted index).
BAD Search: We assume that it’s based on exact match criteria or something similar to a ‘%like%’ database statement. To simulate this behavior we used a content field tokenized by whitespace and lowercased, and used phrase queries every time. This is the closest we could get to the existing ad site search solution, and even this bad, it performed better than the actual site search.
An excellent post in part because of the detailed example but also to show that improving search results is an iterative process.
Enjoy!
Comments Off on Bad vs Good Search Experience
March 1, 2012
Transactional Lucene
Mike McCandless writes:
Many users don’t appreciate the transactional semantics of Lucene’s APIs and how this can be useful in search applications. For starters, Lucene implements ACID properties:
If you have some very strong coffee and time to play with your experimental setup of Lucene, this is a post to get lost in.
When I read a post like this it sparks one idea and then another and pretty soon most of the afternoon is gone.
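The ACID angle is easy to poke at yourself. A minimal sketch with the 3.x IndexWriter (field names are invented; prepareCommit is also there if you need two-phase commit with another resource):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

static void demo(Directory dir) throws Exception {
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.commit();        // changes become visible to newly opened readers

    writer.addDocument(doc);
    writer.rollback();      // discards everything since the last commit and closes the writer
}
```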
Comments Off on Transactional Lucene
February 9, 2012
Lucene-3759: Support joining in a distributed environment.
From the description:
Add two more methods in JoinUtil to support joining in a distributed manner.
- Method to retrieve all from values.
- Method to create a TermsQuery based on a set of from terms.
With these two methods distributed joining can be supported following these steps:
- Retrieve from values from each shard
- Merge the retrieved from values.
- Create a TermsQuery based on the merged from terms and send this query to all shards.
Topic maps that have been split into shards could have values that would trigger merging if present in a single shard.
This appears to be a way to address that issue.
Time spent with Lucene is time well spent.
Comments Off on Lucene-3759: Support joining in a distributed environment
February 7, 2012
8 Best Open Source Search Engines built on top of Lucene
By my count I get five (5) based on Lucene. See what you think.
Lucene base:
- Apache Solr
- Compass
- Constellio
- Elastic Search
- Katta
No Lucene base:
- Bobo Search
- Index Tank
- Summa
Post has short summaries about the search engines and links to their sites.
Do you think the terminology around search engines is as confused as around NoSQL databases?
Any cross-terminology comparisons you would recommend to CIO’s or even users?
Lucene – Solr (new website)
Must be something in the air that is leading to this rash of new websites. 😉
No complaints about having them, better design is always appreciated.
If you haven’t contributed to an Apache project lately, take this opportunity to check out Lucene, Solr or one of the related projects.
Use the software, make comments, find bugs, contribute fixes for bugs, documentation, etc.
You and the community will be richer for it.
Comments Off on Lucene – Solr (new website)
February 6, 2012
Uwe Says: is your Reader atomic? by Uwe Schindler.
From the blog:
Since Day 1 Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality; from the IndexWriter perspective this was desirable, but when reading the index this caused several problems in the past. In reality a Lucene index isn’t a single index, even while logically treated as such. The latest developments in Lucene trunk try to expose reality for type-safety and performance, but before I go into details about Composite, Atomic and DirectoryReaders let me go back in time a bit.
If you don’t mind looking deep into the heart of indexing in Lucene, this is a post for you. Problems, both solved and remaining, are discussed. This could be your opportunity to contribute to the Lucene community.
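For a feel of the distinction, here is a small sketch against the 4.0 trunk API of the time (class names were still in flux, so treat this as illustrative): a DirectoryReader is a composite, and per-segment work happens on its atomic leaves.

```java
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

static void walkLeaves(Directory dir) throws Exception {
    DirectoryReader reader = DirectoryReader.open(dir); // composite: one reader per segment underneath
    for (AtomicReaderContext ctx : reader.leaves()) {
        AtomicReader leaf = ctx.reader();               // atomic: terms, postings, norms live here
        System.out.println("segment docs: " + leaf.maxDoc());
    }
    reader.close();
}
```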
Comments Off on Uwe Says: is your Reader atomic?
February 2, 2012
Query time joining in Lucene
From the post:
Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.
Let’s say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.
Joins based upon matching terms in different indexes.
Work is not finished yet so now would be the time to contribute your experiences or opinions.
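Roughly what the call looks like in the join module that landed in 4.0 (field names are made up, and the exact signature shifted while the work was in progress):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

static TopDocs commentsForArticles(IndexSearcher articleSearcher,
                                   IndexSearcher commentSearcher) throws Exception {
    // Select the articles of interest...
    Query articleQuery = new TermQuery(new Term("title", "lucene"));
    // ...gather their "id" values, and turn them into a query against the
    // "articleId" field that each comment document carries.
    Query joinQuery = JoinUtil.createJoinQuery(
        "id",            // "from" field on article documents
        false,           // a single value per article document
        "articleId",     // "to" field on comment documents
        articleQuery,
        articleSearcher,
        ScoreMode.None);
    return commentSearcher.search(joinQuery, 10);
}
```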
Comments Off on Query time joining in Lucene
January 25, 2012
Berlin Buzzwords 2012
Important Dates (all dates in GMT +2)
Submission deadline: March 11th 2012, 23:59 MEZ
Notification of accepted speakers: April 6th, 2012, MEZ
Publication of final schedule: April 13th, 2012
Conference: June 4/5. 2012
The call:
Call for Submission Berlin Buzzwords 2012 – Search, Store, Scale — June 4 / 5. 2012
The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:
- IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
- NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
- Large Data Processing – Hadoop itself, MapReduce, Cascading or Pig and relatives
Related topics not explicitly listed above are more than welcome. We are looking for presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.
…(moved dates to top)…
High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.
Here is your chance to experience summer in Berlin (Berlin Buzzwords 2012) and in Montreal (Balisage).
Seriously, both conferences are very strong and worth your attention.
Comments Off on Berlin Buzzwords 2012
January 23, 2012
Solr and Lucene Reference Guide updated for v3.5
From the post:
The free Solr Reference Guide published by Lucid Imagination has been updated to 3.5 – the current release version of Solr and Lucene. The changes weren’t major, but here are the key changes:
- Support for the Hunspell stemmer
- The new langid UpdateProcessor
- Numeric types now support sortMissingFirst/Last
- New parameter hl.q for use with highlighting
- Field types supported by the StatsComponent now includes date and string fields
Almost 400 pages of rainy winter day reading.
OK, so you need a taste for that sort of thing. 😉
Comments Off on Solr and Lucene Reference Guide updated for v3.5
January 20, 2012
Simon says: Single Byte Norms are Dead!
From the post:
Apache Lucene turned 10 last year with a limitation that bugged many, many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written down to the index. Lucene trades some precision loss for space on disk, and eventually in memory, since norms are loaded into memory per field upon first access.
In lots of cases this precision loss is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing, you end up writing your own auxiliary data structure or “forking” Lucene for your app and messing with the source.
The upcoming version of Lucene has already added support for a lot more scoring models, such as BM25 and language models.
The abstractions added to Lucene to implement those models already open the door for applications that either want to roll their own “awesome” scoring model or modify the low-level scorer implementations. Yet, norms are still one byte!
Don’t worry! The post has a happy ending!
Read on if you want to be on the cutting edge of Lucene work.
Thanks Lucene Team!
Comments Off on Simon says: Single Byte Norms are Dead!
January 14, 2012
ToChildBlockJoinQuery in Lucene.
Mike McCandless writes:
In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.
This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.
This will be included in Lucene 3.6.0 and 4.0.
Comments Off on ToChildBlockJoinQuery in Lucene
January 9, 2012
Searching relational content with Lucene’s BlockJoinQuery
Mike McCandless writes:
Lucene’s 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.
Most search engines can’t directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
Mike covers how to index relational content with Lucene 3.4.0 as well as the current limitations on that relational indexing. Current work is projected to resolve some of those limitations.
This feature will be immediately useful in a number of contexts.
Even more promising is the development of thinking about indexing as more than term -> document. Both sides of that operator need more granularity.
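The index-time side of this is worth seeing once: parent and child documents are added as a single block, and the join query bridges them at search time. A sketch against the 3.4 contrib API (field names are invented; the query class was later renamed ToParentBlockJoinQuery):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BlockJoinQuery;

static Query indexAndJoin(IndexWriter writer) throws Exception {
    // Index one company (parent) and its job listings (children) as a single block;
    // the parent must be the last document in the block.
    List<Document> block = new ArrayList<Document>();
    Document job = new Document();
    job.add(new Field("skill", "java", Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(job);
    Document company = new Document();
    company.add(new Field("docType", "company", Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(company);
    writer.addDocuments(block);

    // At search time, match children and join up to their parent documents.
    Filter parents = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("docType", "company"))));
    Query childQuery = new TermQuery(new Term("skill", "java"));
    return new BlockJoinQuery(childQuery, parents, BlockJoinQuery.ScoreMode.Avg);
}
```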
Comments Off on Searching relational content with Lucene’s BlockJoinQuery
January 6, 2012
Katta – Lucene & more in the cloud
From the webpage:
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.
- Makes serving large or high load indices easy
- Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
- Replicate shards on different servers for performance and fault-tolerance
- Supports pluggable network topologies
- Master fail-over
- Fast, lightweight, easy to integrate
- Plays well with Hadoop clusters
- Apache Version 2 License
Now that the “new” has worn off of your holiday presents, ;-), something to play with over the weekend.
Comments Off on Katta – Lucene & more in the cloud
January 4, 2012
Hadoop for Archiving Email – Part 2 by Sunil Sitaula.
From the post:
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.
January 1, 2012
Optimizing Findability in Lucene and Solr
From the post:
To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.
Table of Contents
Introduction
Planning for Findability
Knowing your Content
Knowing your Users
Garbage In, Garbage Out
Analyzing your Analysis
Stemming In Greater Detail
Query Techniques for Better Search
Navigation Hints
Final Thoughts
Resources
by Grant Ingersoll
You know a post is going to be long when it starts off with a table of contents. Fortunately, in this case it is also very good, by one of the principal architects of Lucene, Grant Ingersoll.
A good start on developing findability skills but as the post points out, a lot of it will depend on your knowledge of what “findability” means to your users. Only you can answer that question.
Comments Off on Optimizing Findability in Lucene and Solr
Gora Graduates! (Incubator location)
Over Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!
Congratulations to all involved.
Oh, the project:
What is Gora?
Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.
Why Gora?
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built in Apache Hadoop support.
The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.
- Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
- Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
- Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
- Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading
- MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
Comments Off on Gora Graduates!