Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 26, 2012

Using information retrieval technology for a corpus analysis platform

Filed under: Corpora,Corpus Linguistics,Information Retrieval,Lucene,MapReduce — Patrick Durusau @ 3:57 pm

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.

Abstract:

This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Keep in mind that for topic maps, the annotation layers and the annotations themselves may be subjects for some purposes.

September 25, 2012

Battle of the Giants: Apache Solr 4.0 vs ElasticSearch

Filed under: ElasticSearch,Lucene,SolrCloud — Patrick Durusau @ 1:28 pm

Battle of the Giants: Apache Solr 4.0 vs ElasticSearch

From the post:

Apache Solr 4.0 release is imminent and we have a heavily anticipated Solr vs. ElasticSearch blog post series going on. What better time to share that our Rafał Kuć will be giving a talk titled Battle of the giants: Apache Solr 4.0 vs ElasticSearch at the upcoming ApacheCon/Lucene EuroCon in Germany this November.

Abstract:

In this talk the audience will be able to hear about how the long awaited Apache Solr 4.0 (aka SolrCloud) compares to the second search engine built on top of Apache Lucene – ElasticSearch. From understanding the architectural differences and behavior in situations like split-brain, to cluster recovery. From distributed indexing and document distribution control, to handling multiple shards and replicas in a single cluster. During the talk, we will also compare the most used and anticipated features such as faceting, document grouping and so on. At the end we will talk about performance differences, cluster monitoring and troubleshooting.

ApacheCon Europe 2012
Rhein-Neckar-Arena, Sinsheim, Germany
5–8 November 2012

Email, tweet, publicize ApacheCon Europe 2012!

Blog especially! Your posts are a pale imitation of being there, but those of us unable to attend benefit from them!

Search Hadoop with Search-Hadoop.com

Filed under: Hadoop,Lucene — Patrick Durusau @ 10:46 am

Search Hadoop with Search-Hadoop.com by Russell Jurney.

As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search lucene!

Class! Class! Pay attention now.

These are examples of value-added services.

Both of these are going on my browser tool bar. How about you?

August 26, 2012

Index your blog using tags and lucene.net

Filed under: .Net,Lucene — Patrick Durusau @ 4:56 am

Index your blog using tags and lucene.net by Ricci Gian Maria.

From the post:

In the last part of my series on Lucene I showed how simple it is to add tags to documents for basic tag-based categorization; now it is time to explain how you can automate this process and how to use some advanced characteristics of Lucene. First of all I wrote a specialized analyzer called TagSnowballAnalyzer, based on the standard SnowballAnalyzer plus a series of keywords associated with various tags; here is how I constructed it.

There is various code around the net on how to add synonyms with weights, as described in this Stack Overflow question, and the standard Java Lucene codebase has a SynonymTokenFilter, but this example shows how simple it is to write a Filter that adds tags as synonyms of related words. First of all, the filter is initialized with a dictionary of keywords and Tags, where Tag is a simple helper class that stores the tag string and its relative weight; it also has a ConvertToToken() method that returns the tag enclosed by | (pipe) characters. The pipe character is used to explicitly mark tags in the token stream: any word enclosed by pipes is, by convention, a tag.

Not the answer for every situation involving synonymy (as in “same subject,” i.e., topic maps) but certainly a useful one.
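The post’s code is C#/Lucene.Net, but the idea carries straight over to Java Lucene. Here is a minimal sketch of my own (illustrative names, not Ricci’s code) of a filter that injects a pipe-wrapped tag at the same position as a matching keyword, so the tag behaves like an indexed synonym:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits a |tag| token at the same position as any keyword found in the dictionary.
public final class TagAsSynonymFilter extends TokenFilter {

  private final Map<String, String> keywordToTag; // e.g. "analyzer" -> "lucene"
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private String pendingTag; // tag waiting to be emitted at the current position

  public TagAsSynonymFilter(TokenStream input, Map<String, String> keywordToTag) {
    super(input);
    this.keywordToTag = keywordToTag;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTag != null) {
      // Emit the tag, stacked on the keyword it came from (position increment 0).
      termAtt.setEmpty().append('|').append(pendingTag).append('|');
      posIncAtt.setPositionIncrement(0);
      pendingTag = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String tag = keywordToTag.get(termAtt.toString());
    if (tag != null) {
      pendingTag = tag; // emit it on the next call
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTag = null;
  }
}

Because the tag is emitted with a position increment of zero, phrase queries over the original text still work, while a query for |lucene| finds every document tagged with it.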

August 17, 2012

Lucene.Net becomes top-level project at Apache

Filed under: .Net,C#,Lucene — Patrick Durusau @ 2:41 pm

Lucene.Net becomes top-level project at Apache

From the post:

Lucene.Net, the port of the Lucene search engine library to C# and .NET, has left the Apache incubator and is now a top-level project. The announcement on the project’s blog says that the Apache board voted unanimously to accept the graduation resolution. The vote confirms that Lucene.Net is healthy and that the development and governance of the project follows the tenets of the “Apache way”. The developers will now be moving the project’s resources from the current incubator site to the main apache.org site.

Various flavors of MS Windows account for 80% of all operating systems.

What is the target for your next topic map app? (With or without Lucene.Net.)

August 16, 2012

Proximity Operators [LucidWorks]

Filed under: Lucene,LucidWorks,Query Language — Patrick Durusau @ 7:31 pm

Proximity Operators

From the webpage:

A proximity query searches for terms that are either near each other or occur in a specified order in a document rather than simply whether they occur in a document or not.

You will use some of these operators more than others but having a bookmark to the documentation will prove to be useful.
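As a quick reminder of what the most common proximity operator looks like from the Lucene side, here is a small sketch using the classic query parser (the field name and slop value are only examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ProximityQueryDemo {
  public static void main(String[] args) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_36, "body",
        new StandardAnalyzer(Version.LUCENE_36));

    // The two terms must occur within 5 positions of each other.
    Query sloppy = parser.parse("\"apache lucene\"~5");

    // An exact phrase is just a proximity query with a slop of 0.
    Query exact = parser.parse("\"apache lucene\"");

    System.out.println(sloppy); // body:"apache lucene"~5
    System.out.println(exact);  // body:"apache lucene"
  }
}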

August 14, 2012

Lucene Core 4.0-BETA and Solr 4.0-BETA Available

Filed under: Lucene,Solr — Patrick Durusau @ 6:10 pm

I stopped by the Lucene site to check for upgrades to find:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.0-BETA and Apache Solr 4.0-BETA

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html
and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Highlights of the Lucene release include:

  • IndexWriter.tryDeleteDocument can sometimes delete by document ID, for higher performance in some applications.
  • New experimental postings formats: BloomFilteringPostingsFormat uses a bloom filter to sometimes avoid disk seeks when looking up terms, DirectPostingsFormat holds all postings as simple byte[] and int[] for very fast performance at the cost of very high RAM consumption.
  • CJK analysis improvements: JapaneseIterationMarkCharFilter normalizes Japanese iteration marks, added unigram+bigram support to CJKBigramFilter.
  • Improvements to Scorer navigation API (Scorer.getChildren) to support all queries, useful for determining which portions of the query matched.
  • Analysis improvements: factories for creating Tokenizer, TokenFilter, and CharFilter have been moved from Solr to Lucene’s analysis module, less memory overhead for StandardTokenizer and Snowball filters.

  • Improved highlighting for multi-valued fields.
  • Various other API changes, optimizations and bug fixes.

Highlights of the Solr release include:

  • Added a Collection management API for Solr Cloud.
  • Solr Admin UI now clearly displays failures related to initializing SolrCores.
  • Updatable documents can create a document if it doesn’t already exist, or you can require that the document already exists.
  • Full delete-by-query support for Solr Cloud.
  • Default to NRTCachingDirectory for improved near-realtime performance.
  • Improved Solrj client performance with Solr Cloud: updates are only sent to leaders by default.
  • Various other API changes, optimizations and bug fixes.

August 13, 2012

Lucid Imagination becomes LucidWorks [Man Bites Dog Story]

Filed under: Lucene,LucidWorks,Solr — Patrick Durusau @ 3:27 pm

Lucid Imagination becomes LucidWorks

Soft news except for the note about the soon to appear SearchHub.org (September, 2012).

And the company listened to users referring to it as LucidWorks and decided to change its name from Lucid Imagination to LucidWorks.

Sort of a man-bites-dog story, don’t you think?

Hurray for LucidWorks!

Makes me curious about the SearchHub.org site. Likely to listen to users there as well.

August 8, 2012

Lucene Eurocon / ApacheCon Europe

Filed under: Lucene,LucidWorks — Patrick Durusau @ 1:48 pm

Lucene Eurocon / ApacheCon Europe November 5-8 | Sinsheim, Germany

From a post I got today from Lucid Imagination:

Lucid Imagination and the Apache Foundation have agreed to co-locate Lucid’s Apache Lucene EuroCon with ApacheCon Europe being held this November 5-8 in Sinsheim, Germany. Lucene EuroCon at ApacheCon Europe will cover the breadth and depth of search innovation and application. The dedicated track will bring together Apache Lucene/Solr committers and technologists from around the world to offer compelling presentations that share future directions for the project and technical implementation experiences. Topic examples include channeling the flood of structured and unstructured data into faster, more cost-effective Lucene/Solr search applications that span a host of sectors and industries.

Some of the most talented Lucene/Solr developers gather each year at Apache Lucene EuroCon to share best practices and create next-generation search applications. Coupling Apache Lucene EuroCon with this year’s ApacheCon Europe offers a great benefit to the community at large. The combined attendees benefit from expert trainings and in-depth sessions, real-world case studies, excellent networking and the opportunity to connect with the industry’s leading minds.

Call For Papers Deadline is August 13

The Call for Papers for ApacheCon has been extended to August 13, 2012, and can be found on the ApacheCon website. As always, proceeds from Apache Lucene EuroCon benefit The Apache Software Foundation. We encourage all Lucene/Solr committers and developers who have a technical story to tell to submit an abstract. Apache Lucene/Solr has a rich community of developers. Supporting ApacheCon Europe by submitting your abstract and sharing your story is important for maintaining this important and thriving community.

Just so you don’t think this is a search only event, papers are welcome on:

  • Apache Daily – Tools frameworks and components used on a daily basis
  • ApacheEE – Java enterprise projects
  • Big Data – Cassandra, Hadoop, HBase, Hive, Kafka, Mahout, Pig, Whirr, ZooKeeper and friends
  • Camel in Action – All things Apache Camel, from their problems to their solutions
  • Cloud – Cloud-related applications of a broad range of Apache projects
  • Linked Data – (need a concise caption for this track)
  • Lucene, SOLR and Friends – Learn about important web search technologies from the experts
  • Modular Java Applications – Using Felix, ACE, Karaf, Aries and Sling to deploy modular Java applications to public and private cloud environments
  • NoSQL Database – Use cases and recent developments in Cassandra, HBase, CouchDB and Accumulo
  • OFBiz – The Apache Enterprise Automation project
  • Open Office – Open Office and the Apache Content Ecosystem
  • Web Infrastructure – HTTPD, TomCat and Traffic Server, the heart of many Internet projects

Submissions are welcome from any developer or user of Apache projects. First-time speakers are just as welcome as experienced ones, and we will do our best to make sure that speakers get all the help they need to give a great presentation.

August 1, 2012

Indexes in RAM?

Filed under: Indexing,Lucene,Zing JVM — Patrick Durusau @ 6:35 am

The Mike McCandless post: Lucene index in RAM with Azul’s Zing JVM will help make your case for putting your index in RAM!

From the post:

Google’s entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene’s RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven’t been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe’s excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven’t been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let’s test it! But first, a quick digression on the importance of measuring search response time of all requests.

There are obvious speed advantages to holding indexes in RAM.

Curious, is RAM just a quick disk? Or do we need to think about data structures/access differently with RAM? Pointers?
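For anyone who wants to compare the two approaches from the post, a minimal sketch against the Lucene 3.x API (the index path is hypothetical):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamVsMmapDemo {
  public static void main(String[] args) throws Exception {
    File indexDir = new File("/path/to/index"); // hypothetical path

    // Recommended: memory-map the index and let the OS keep the hot parts in RAM.
    Directory mmapped = new MMapDirectory(indexDir);

    // The "obvious" approach from the post: copy the whole index onto the Java heap.
    // Fine for small indices or tests, heavy on the garbage collector otherwise.
    Directory onHeap = new RAMDirectory(mmapped);

    IndexSearcher searcher = new IndexSearcher(IndexReader.open(mmapped));
    // ... run queries here; swap in onHeap to compare GC behavior ...
    searcher.close();
  }
}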

July 31, 2012

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Filed under: Indexing,Lucene,UIMA — Patrick Durusau @ 4:41 pm

Running a UIMA Analysis Engine in a Lucene Analyzer Chain by Sujit Pal.

From the post:

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.

[Graphic omitted]

As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, ie, the entire text is read from the Reader and analyzed by the UIMA AE, then individual tokens returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer’s state between calls to incrementToken() (found out about it by grokking through the Synonym filter example in the LIA2 book).

After (UIMA) analysis, the annotated tokens are marked as Keyword, and any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.

The second of two posts brought to my attention by Jack Park.

Part of my continuing interest in indexing. In part because we know that indexing scales. Seriously scales.
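For anyone who has not used AttributeSource.State before, the buffering pattern Sujit describes looks roughly like the sketch below. A whitespace splitter stands in for the UIMA analysis engine, and all names are mine, not his:

import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.AttributeSource;

// Buffering (non-streaming) tokenizer: analyze the whole input up front, capture
// each token as an AttributeSource.State, then replay one state per incrementToken().
public final class BufferingTokenizerSketch extends Tokenizer {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private Iterator<AttributeSource.State> tokens;

  public BufferingTokenizerSketch(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (tokens == null) {
      tokens = analyze(readFully(input)).iterator(); // the "UIMA" step happens here
    }
    if (!tokens.hasNext()) {
      return false;
    }
    restoreState(tokens.next()); // replay one buffered token
    return true;
  }

  // Stand-in for the UIMA AE: split on whitespace and capture each token's state.
  private List<AttributeSource.State> analyze(String text) {
    List<AttributeSource.State> states = new LinkedList<AttributeSource.State>();
    int pos = 0;
    for (String word : text.split("\\s+")) {
      if (word.length() == 0) {
        continue;
      }
      int start = text.indexOf(word, pos);
      clearAttributes();
      termAtt.setEmpty().append(word);
      offsetAtt.setOffset(correctOffset(start), correctOffset(start + word.length()));
      states.add(captureState()); // snapshot of all attributes for later replay
      pos = start + word.length();
    }
    return states;
  }

  private static String readFully(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[8192];
    for (int n; (n = reader.read(buf)) != -1; ) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }

  @Override
  public void reset(Reader newInput) throws IOException {
    super.reset(newInput);
    tokens = null;
  }
}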

UIMA Analysis Engine for Keyword Recognition and Transformation

Filed under: Indexing,Lucene,UIMA — Patrick Durusau @ 4:39 pm

UIMA Analysis Engine for Keyword Recognition and Transformation by Sujit Pal.

From the post:

You have probably noticed that I’ve been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So a couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them to form a sequence of primitive AEs that detected keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators – the PatternAnnotator and DictionaryAnnotator – that do the processing for my primitive AEs listed below. Obviously, more can be added (and will, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive

The first of two posts that I missed from last year, recently brought to my attention by Jack Park.

The ability to annotate, implying, among other things, the ability to create synonym annotations for keywords.
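For readers who have not written a UIMA annotator, the pattern-based case is only a few lines. A hedged sketch: KeywordAnnotation stands in for the JCasGen-generated type declared in the AE’s type system descriptor, and the regular expression is purely illustrative.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

// Pattern-based keyword annotator in the spirit of the post's PatternAnnotator.
public class PatternKeywordAnnotator extends JCasAnnotator_ImplBase {

  // In a real AE this would come from a configuration parameter or pattern file.
  private static final Pattern KEYWORDS =
      Pattern.compile("\\b(?:Lucene|Solr|UIMA)\\b", Pattern.CASE_INSENSITIVE);

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Matcher m = KEYWORDS.matcher(text);
    while (m.find()) {
      // Mark the matched span so downstream analyzers can exempt it from stemming.
      KeywordAnnotation ann = new KeywordAnnotation(jcas, m.start(), m.end());
      ann.addToIndexes();
    }
  }
}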

July 29, 2012

Building a new Lucene postings format

Filed under: Indexing,Lucene — Patrick Durusau @ 10:08 am

Building a new Lucene postings format by Mike McCandless.

From the post:

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene’s index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

Current testing of formats requires the entire format be specified, which means errors are hard to diagnose.

Mike addresses that problem by creating a layered testing mechanism.

Great stuff!

PS: I think it will also be useful as an educational tool. Changing defined formats and testing as changes are made.

July 25, 2012

Using Luke the Lucene Index Browser to develop Search Queries

Filed under: Lucene,Luke — Patrick Durusau @ 3:27 pm

Using Luke the Lucene Index Browser to develop Search Queries

From the post:

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Which also covers:

  • Downloading, running Luke
  • Exploring Document Indexing
  • Exploring Search
  • Using the Lucene XML Query Parser

Nothing surprising, but a well-written introduction to Luke.

July 22, 2012

Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download

Filed under: Lucene,Solr — Patrick Durusau @ 6:34 pm

Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download

More good news!

I am very excited to announce the availability of Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation now supports both RankingAlgorithm and Lucene. Realtime NRT is a high performance and more granular NRT implementation compared to soft commit. The update performance is about 70,000 documents/sec*. You can also scale up to 2 billion documents* in a single core, and query a half-billion-document index in ms**.

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean queries and is compatible with the new Lucene 4.0-ALPHA api.

You can get more information about Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 Realtime performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org

Please download and give the new version a try.

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

Apache Lucene 3.6.1 and Apache Solr 3.6.1 available

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 6:21 pm

Lucene/Solr news on 22 July 2012:

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.1 and Apache Solr 3.6.1.

This release is a bug fix release for version 3.6.0. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-3x-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-3x-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Lucene 3.6.1 Release Highlights:

  • The concurrency of MMapIndexInput.clone() was improved, which caused a performance regression in comparison to Lucene 3.5.0.
  • MappingCharFilter was fixed to return correct final token positions.
  • QueryParser now supports +/- operators with any amount of whitespace.
  • DisjunctionMaxScorer now implements visitSubScorers().
  • Changed the visibility of Scorer#visitSubScorers() to public, otherwise it’s impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
  • Various analyzer bugs were fixed: Kuromoji to not produce invalid token graph due to UNK with punctuation being decompounded, invalid position length in SynonymFilter, loading of Hunspell dictionaries that use aliasing, be consistent with closing streams when loading Hunspell affix files.
  • Various bugs in FST components were fixed: Offline sorter minimum buffer size, integer overflow in sorter, FSTCompletionLookup missed to close its sorter.
  • Fixed a synchronization bug in handling taxonomies in facet module.
  • Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and subSequence off-by-one, TieredMergePolicy returned wrong-scaled floor segment setting.

Solr 3.6.1 Release Highlights:

  • The concurrency of MMapDirectory was improved, which caused a performance regression in comparison to Solr 3.5.0. This affected users with 64-bit platforms (Linux, Solaris, Windows) or those explicitly using MMapDirectoryFactory.
  • ReplicationHandler “maxNumberOfBackups” was fixed to work if backups are triggered on commit.
  • Charset problems were fixed with HttpSolrServer, caused by an upgrade to a new Commons HttpClient version in 3.6.0.
  • Grouping was fixed to return correct count when not all shards are queried in the second pass. Solr no longer throws Exception when using result grouping with main=true and using wt=javabin.
  • Config file replication was made less error prone.
  • Data Import Handler threading fixes.
  • Various minor bugs were fixed.

What a nice way to start the week!

Thanks to the Lucene PMC!

July 17, 2012

elasticsearch. The Company

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 3:45 pm

elasticsearch. The Company

ElasticSearch needs no introduction to readers of this blog or really anyone active in the search “space.”

It was encouraging to hear that, after years of building an increasingly useful product, ElasticSearch has matured into a company.

With all the warm fuzzies that support contracts and such bring.

Sounds like they will demonstrate that the open source and commercial worlds aren’t, you know, incompatible.

It helps that they have a good product in which they have confidence and not a product that their PR/Sales department is pushing as a “good” product. The fear of someone “finding out” would make you really defensive in the latter case.

Looking forward to good fortune for ElasticSearch, its founders and anyone who wants to follow a similar model.

July 6, 2012

Lucene Tutorial updated for Lucene 3.6

Filed under: Lucene — Patrick Durusau @ 4:15 pm

Lucene Tutorial updated for Lucene 3.6

From LingPipe:

The current Apache Lucene Java version is 3.6, released in April of 2012. We’ve updated the Lucene 3 tutorial and the accompanying source code to bring it in line with the current API so that it doesn’t use any deprecated methods and my, there are a lot of them. Bob blogged about this tutorial back in February 2011, shortly after Lucene Java rolled over to version 3.0.

Like other 3.x minor releases, Lucene 3.6 introduces performance enhancements, bug fixes, new analyzers, and changes that bring the Lucene API in line with Solr. In addition, Lucene 3.6 anticipates Lucene 4, billed as “the next major backwards-incompatible release.”

Excellent news! Although you will need to hurry to read it: Lucene/Solr 4.0 is just around the corner!

July 3, 2012

Lucene 4.0.0 alpha, at long last!

Filed under: Lucene,Solr — Patrick Durusau @ 5:33 pm

Lucene 4.0.0 alpha, at long last! by Mike McCandless.

Grabbing enough of the post to make you crazy until you read it in full (there’s lots more):

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:

  • All deprecated APIs as of 3.6.0 have been removed.
  • Pre-3.0 indices are no longer supported.
  • MIGRATE.txt describes how to update your application code.
  • The index format won’t change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.

There are some fun core codecs:

  • Lucene40 is the default codec.
  • Lucene3x (read-only) reads any index written with Lucene 3.x.
  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)
  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
  • AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.

If you create your own Codec it’s easy to confirm all of Lucene/Solr’s tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.

….

A good thing that tomorrow is a holiday in the U.S. 😉
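If you want to see the pluggable codec machinery for yourself, the quickest experiment is swapping in SimpleText and opening the index files in a text editor. A minimal sketch, assuming the lucene-codecs jar is on the classpath and using a throwaway index path:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleTextCodecDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/tmp/simpletext-index")); // throwaway path

    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    // Swap the whole on-disk format: SimpleText writes human-readable files.
    // Great for learning and debugging, awful for production.
    iwc.setCodec(new SimpleTextCodec());

    IndexWriter writer = new IndexWriter(dir, iwc);
    Document doc = new Document();
    doc.add(new TextField("body", "postings you can read in a text editor", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();
    dir.close();
  }
}

The same setCodec() hook is where a custom Codec, or one that picks a different PostingsFormat per field, plugs in.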

June 7, 2012

Reducing Software Highway Friction

Filed under: Hadoop,Lucene,LucidWorks,Solr — Patrick Durusau @ 2:20 pm

Lucid Imagination Search Product Offered in Windows Azure Marketplace

From the post:

Ease of use and flexibility are two key business drivers that are fueling the rapid adoption of cloud computing. The ability to disconnect an application from its supporting architecture provides a new level of business agility that has never before been possible. To ease the move towards this new realm of computing, integrated platforms have begun to emerge that make cloud computing easier to adopt and leverage.

Lucid Imagination, a trusted name in Search, Discovery and Analytics, today announced that its LucidWorks Cloud product has been selected by Microsoft Corp. to be offered as a Search-as-a-Service product in Microsoft’s Windows Azure Marketplace. LucidWorks Cloud is a full cloud service version of its LucidWorks Enterprise platform. LucidWorks Cloud delivers full open source Apache Lucene/Solr community innovation with support and maintenance from the world’s leading experts in open source search. An extensible platform architected for developers, LucidWorks Cloud is the only Solr distribution that provides security, abstraction and pre-built connectors for essential enterprise data sources – along with dramatic ease of use advantages in a well-tested, integrated and documented package.

Example use cases for LucidWorks Cloud include Search-as-a-Service for websites, embedding search into SaaS product offerings, and prototyping and developing cloud-based search-enabled applications in general.

…..

Highlights of LucidWorks Cloud Search-as-a-Service

  • Sign up for a plan and start building your search application in minutes
  • Well-organized UI makes Apache Lucene/Solr innovation easier to consume and more adaptable to constant change
  • Create multiple search collections and manage them independently
  • Configure index and query settings, fields, stop words, synonyms for each collection
  • Built-in support for Hadoop, Microsoft SharePoint and traditional online content types
  • An open connector framework is available to customize access to other data sources
  • REST API automates and integrates search as a service with an application
  • Well-instrumented dashboard for infrastructure administration, monitoring and reporting
  • Monitored 24×7 by Lucid Development Operations, ensuring minimum downtime

Source: PR Newswire (http://s.tt/1dzre)

I find this deeply encouraging.

It is a step towards a diverse but reduced friction software highway.

The user community is not well served by uniform models for data, software or UIs.

The user community can be well served by a reduced friction software highway as they move data from application to application.

Microsoft has taken a large step towards a reduced friction software highway today. And it is appreciated!

Lucene Revolution 2012 – Slides/Videos

Filed under: Conferences,Lucene,Mahout,Solr,SolrCloud — Patrick Durusau @ 2:16 pm

Lucene Revolution 2012 – Slides/Videos

The slides and videos from Lucene Revolution 2012 are up!

Now you don’t have to search for old re-runs on Hulu to watch during lunch!

June 4, 2012

Different ways to make auto suggestions with Solr

Filed under: AutoSuggestion,Lucene,LucidWorks,Solr — Patrick Durusau @ 4:30 pm

Different ways to make auto suggestions with Solr

From the post:

Nowadays almost every website has a full text search box as well as the auto suggestion feature in order to help users find what they are looking for by typing the least possible number of characters. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typos. That’s a meaningful example which contains multi-term suggestions depending on the most popular queries, combined with spelling correction.

Starts with seven (7) questions you should ask yourself about auto-suggestions and then covers four methods for implementing them in Solr.

You can have the typical word completion seen in most search engines or you can be more imaginative, using custom dictionaries.

May 27, 2012

The Seven Deadly Sins of Solr

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 3:51 pm

The Seven Deadly Sins of Solr by Jay Hill.

From the post:

Working at Lucid Imagination gives me the opportunity to analyze and evaluate a great many instances of Solr implementations, running in some of the largest Fortune 500 companies as well as some of the smallest start-ups. This experience has enabled me to identify many common mistakes and pitfalls that occur, either when starting out with a new Solr implementation, or by not keeping up with the latest improvements and changes. Thanks to my colleague Simon Rosenthal for suggesting the title, and to Simon, Lance Norskog, and Tom Hill for helpful input and suggestions. So, without further ado… the Seven Deadly Sins of Solr.

Not recent and to some degree Solr specific.

You will encounter one or more of these “sins” with every IT solution, including topic maps.

This should be fancy printed, laminated and handed out as swag.

May 21, 2012

Solr 4 preview: SolrCloud, NoSQL, and more

Filed under: Lucene,NoSQL,Solr,SolrCloud — Patrick Durusau @ 10:32 am

Solr 4 preview: SolrCloud, NoSQL, and more

From the post:

The first alpha release of Solr 4 is quickly approaching, bringing powerful new features to enhance existing Solr powered applications, as well as enabling new applications by further blurring the lines between full-text search and NoSQL.

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. Distributed indexing with no single points of failure has been designed from the ground up for near real-time (NRT), and NoSQL features such as realtime-get, optimistic locking, and durable updates.

We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project that is immune to issues like split-brain syndrome that tend to plague other hand-rolled solutions. ZooKeeper holds the Solr configuration, and contains the cluster meta-data such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.

When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.

Run, don’t walk, to learn about the new features for Solr 4.

You won’t be disappointed.

Interested to see the “…blurring [of] the lines between full-text search and NoSQL.”

Would be even more interested to see the “…blurring of indexing and data/data formats.”

That is to say that data, along with its format, is always indexed in digital media.

So why can’t I see the data as a table, as a graph, as a …., depending upon my requirements?

No ETL, JVD – Just View Differently.

Suspect I will have to wait a while for that, but in the meantime, enjoy Solr 4 alpha.

May 19, 2012

Developing Your Own Solr Filter

Filed under: Lucene,Solr — Patrick Durusau @ 7:45 pm

Developing Your Own Solr Filter

Rafał Kuć writes:

Sometimes Lucene and Solr’s out-of-the-box functionality is not enough. When such a time comes, we need to extend what Lucene and Solr give us and create our own plugin. In today’s post I’ll try to show how to develop a custom filter and use it in Solr.

Assumptions

Let’s assume that we need a filter that would allow us to reverse every word we have in a given field. So, if the input is “solr.pl” the output would be “lp.rlos”. It’s not the hardest example, but for the purpose of this entry it will be enough. One more thing – I decided to omit describing how to set up your IDE, how to compile your code, build the jar and stuff like that. We will only focus on the code.

Template for creating your own Solr filter.
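The heart of such a template is quite small. A minimal sketch of the reversing filter (my illustration, not Rafał’s code):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Emits every token with its characters in reverse order ("solr.pl" -> "lp.rlos").
public final class ReverseWordsFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public ReverseWordsFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    char[] buffer = termAtt.buffer();
    int len = termAtt.length();
    // Reverse the term text in place.
    for (int i = 0, j = len - 1; i < j; i++, j--) {
      char tmp = buffer[i];
      buffer[i] = buffer[j];
      buffer[j] = tmp;
    }
    return true;
  }
}

To plug it into Solr you also write a small factory whose create(TokenStream) method returns this filter (BaseTokenFilterFactory in Solr 3.x, TokenFilterFactory after the factories moved into Lucene’s analysis module for 4.0) and reference that factory class in the field type’s analyzer chain in schema.xml.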

I persist in thinking that as “big data” arrives, the potential for ETL is going to decline. Where will you put your “big data” while processing it?

Much more likely to index “big data” in place and perform operations on the indexes to extract a subset of your “big data.”

So in terms of matching up data from different schemas or formats, what sort of filters will you be using?

May 16, 2012

Lucene-1622

Filed under: Indexing,Lucene,Synonymy — Patrick Durusau @ 9:32 am

Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622

From the description:

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):

  • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
  • there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
  • if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.

I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.

This remains an open issue as of 16 May 2012.

It is also an important open issue.

Think about it.

As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.

Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.

The treatment of synonyms (synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms for both indexing and search, synonym expansion, a whole range of synonym features and capabilities) needs to “man up” to take on “big data.”

May 14, 2012

Finite State Automata in Lucene

Filed under: Finite State Automata,Lucene — Patrick Durusau @ 6:12 pm

Finite State Automata in Lucene by Mike McCandless

From the post:

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to “save” your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

I must confess to disappointment when I read at slide 23 that “multi-token synonyms mess up graph.”

Particularly since I suspect that not only do synonyms need to be “multi-token” but “multi-dimensional” as well.

Lucene conference touches many areas of growth in search

Filed under: BigData,Lucene,LucidWorks,Solr — Patrick Durusau @ 8:35 am

Lucene conference touches many areas of growth in search by Andy Oram.

From the post:

With a modern search engine and smart planning, web sites can provide visitors with a better search experience than Google. For instance, Google may well turn up interesting results if you search for a certain kind of shirt, but a well-designed clothing site can also pull up related trousers, skirts, and accessories. It’s not Google’s job to understand the intricate interrelationships of data on a particular web property, but the site’s own team can constantly tune searches to reflect what the site has to offer and what its visitors uniquely need.

Hence the importance of search engines like Solr, based on the Lucene library. Both are open source Apache projects, maintained by Lucid Imagination, a company founded to commercialize the underlying technology. I attended parts of Lucid Imagination’s conference this week, Lucene Revolution, and found Lucene evolving in the ways much of the computer industry is headed.

Andy’s summary of the conference will make you wonder two things:

  1. Why weren’t you at the Lucene Revolution conference this year?
  2. Where are the videos from Lucene Revolution 2012?

I won’t ever be able to answer #1 but will post an answer to #2 as soon as it is available.

May 13, 2012

Dark Data

Filed under: BigData,Lucene,LucidWorks,Solr — Patrick Durusau @ 6:37 pm

Lucid Imagination Combines Search, Analytics and Big Data to Tackle the Problem of Dark Data

This post was too well written to break up as quotes/excerpts. I am re-posting it in full.

Organizations today have little to no idea how much lost opportunity is hidden in the vast amounts of data they’ve collected and stored.  They have entered the age of total data overload driven by the sheer amount of unstructured information, also called “dark” data, which is contained in their stored audio files, text messages, e-mail repositories, log files, transaction applications, and various other content stores.  And this dark data is continuing to grow, far outpacing the ability of the organization to track, manage and make sense of it.

Lucid Imagination, a developer of search, discovery and analytics software based on Apache Lucene and Apache Solr technology, today unveiled LucidWorks Big Data. LucidWorks Big Data is the industry’s first fully integrated development stack that combines the power of multiple open source projects including Hadoop, Mahout, R and Lucene/Solr to provide search, machine learning, recommendation engines and analytics for structured and unstructured content in one complete solution available in the cloud.

Tweet This: Lucid Imagination combines #search, analytics and #BigData in complete stack. Beta now open http://ow.ly/aMHef

With LucidWorks Big Data, Lucid Imagination equips technologists and business users with the ability to initially pilot Big Data projects utilizing technologies such as Apache Lucene/Solr, Mahout and Hadoop, in a cloud sandbox. Once satisfied, the project can remain in the cloud, be moved on premise or executed within a hybrid configuration.  This means they can avoid the staggering overhead costs and long lead times associated with infrastructure and application development lifecycles prior to placing their Big Data solution into production.

The product is now available in beta. To sign up for inclusion in the beta program, visit http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data.

Dark Data Problem Is Real

How big is the problem of dark data? The total amount of digital data in the world will reach 2.7 zettabytes in 2012, a 48 percent increase from 2011.* 90 percent of this data will be unstructured or “dark” data. Worldwide, 7.5 quintillion bytes of data, enough to fill over 100,000 Libraries of Congress, get generated every day. Conversely, that deep volume of data can serve to help predict the weather, uncover consumer buying patterns or even ease traffic problems – if discovered and analyzed proactively.

“We see a strong opportunity for search to play a key role in the future of data management and analytics,” said Matthew Aslett, research manager, data management and analytics, 451 Research. “Lucid’s Big Data offering, and its combination of large-scale data storage in Hadoop with Lucene/Solr-based indexing and machine-learning capabilities, provides a platform for developing new applications to tackle emerging data management challenges.”

LucidWorks Big Data

Data analytics has traditionally been the domain of business intelligence technologies. Most of these tools, however, have been designed to handle structured data such as SQL, and cannot easily tap into the broad range of data types that can be used in a Big Data application. With the announcement of LucidWorks Big Data, organizations will be able to utilize a single platform for their Big Data search, discovery and analytics needs. LucidWorks Big Data is the only complete platform that:

  • Combines the real time, ad hoc data accessibility of LucidWorks (Lucene/Solr) with compute and storage capabilities of Hadoop
  • Delivers commonly used analytic capabilities along with Mahout’s proven, scalable machine learning algorithms for deeper insight into both content and users
  • Tackles data, both big and small with ease, seamlessly scaling while minimizing the impact of provisioning Hadoop, LucidWorks and other components
  • Supplies a single, coherent, secure and well documented REST API for both application integration and administration
  • Offers fault tolerance with data safety baked in
  • Provides choice and flexibility, via on premise, cloud hosted or hybrid deployment solutions
  • Is tested, integrated and fully supported by the world’s leading experts in open source search.
  • Includes powerful tools for configuration, deployment, content acquisition, security, and search experience that is packaged in a convenient, well-organized application

Lucid Imagination’s Open Search Platform uncovers real-time insights from any enterprise data, whether structured in databases, unstructured in formats such as emails or social channels, or semi-structured from sources such as websites.  The company’s rich portfolio of enterprise-grade solutions is based on the same proven open source Apache Lucene/Solr technology that powers many of the world’s largest e-commerce sites. Lucid Imagination’s on-premise and cloud platforms are quicker to deploy, cost less than competing products and are more easily tailored to specific needs than business intelligence solutions because they leverage innovation from the open source community.  

“We’re allowing a broad set of enterprises to test and implement data discovery and analysis projects that have historically been the province of large multinationals with large data centers. Cloud computing and LucidWorks Big Data finally level the field,” said Paul Doscher, CEO of Lucid Imagination. “Large companies, meanwhile, can use our Big Data stack to reduce the time and cost associated with evaluating and ultimately implementing big data search, discovery and analysis. It’s their data – now they can actually benefit from it.”

April 30, 2012

Lucene’s TokenStreams are actually graphs!

Filed under: Graphs,Lucene,Neo4j — Patrick Durusau @ 3:17 pm

Lucene’s TokenStreams are actually graphs!

Mike McCandless starts:

Lucene’s TokenStream class produces the sequence of tokens to be indexed for a document’s fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.

He continues to illustrate the creation of graphs using SynonymFilter and discusses other aspects of graph creation from tokenstreams.

Including where the production of graphs needs to be added and issues in the current implementation.

If you see any of the Neo4j folks at the graph meetup in Chicago later today, you might want to mention Mike’s post to them.
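If you want to watch the graph-like behavior yourself, walking a TokenStream and printing its attributes takes only a few lines. A sketch against the 3.6 API; a synonym injected by SynonymFilter shows up as a token with a position increment of zero:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamWalker {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("Lucene's TokenStreams are graphs"));

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // A position increment of 0 means "stacked on the previous token", which is
      // how SynonymFilter injects synonyms into the stream.
      System.out.printf("%-12s offsets=%d-%d posInc=%d%n",
          term.toString(), offset.startOffset(), offset.endOffset(),
          posInc.getPositionIncrement());
    }
    ts.end();
    ts.close();
  }
}

Mike’s post goes further, adding a PositionLengthAttribute so a token can also span more than one position, which is what turns the stream into a genuine graph rather than a simple chain.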
