Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 30, 2011

Solr Reference Guide 3.4

Filed under: Lucene,Solr — Patrick Durusau @ 6:04 pm

Solr Reference Guide 3.4

From the post:

The material as presented assumes that you’re familiar with some basic search concepts and that you can read XML; it does not assume that you are a Java programmer, although knowledge of Java is helpful when working directly with Lucene or when developing custom extensions to a Lucene/Solr installation.

Key topics covered in the Reference Guide include:

  • Getting Started: Installing Solr and getting it running for the first time.
  • Using the Solr Admin Web Interface: How to use the built-in UI.
  • Documents, Fields, and Schema Design: Designing the index for optimal retrieval.
  • Understanding Analyzers, Tokenizers, and Filters: Setting up Solr to handle your content.
  • Indexing and Basic Data Operations: Indexing your content.
  • Searching: Ways to improve the search experience for your users.
  • The Well Configured Solr Instance: Optimal settings to keep the system running smoothly.
  • Managing Solr: Web containers, logging and backups.
  • Scaling and Distribution: Best practices for increasing system capacity.
  • Client APIs: Clients that can be used to provide search interfaces for users.

The guide is available online or as a download.

BTW, have you seen any books on Solr that you like? The reviews I have seen don’t look promising.

Why Not AND, OR, And NOT?

Filed under: Lucene,Query Language,Solr — Patrick Durusau @ 6:04 pm

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Required reading if you want to understand how the “Boolean Operators” work in Lucene/Solr, and a superior alternative.
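
To see the flavor of the difference, here is a minimal Java sketch (the field name “text” and the example queries are mine, not from the post) that feeds both forms to Lucene’s classic QueryParser. Printing the parsed queries shows how the parser actually grouped the “Boolean” version, which is where most of the surprises come from.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PrefixOperatorDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser =
            new QueryParser(Version.LUCENE_34, "text", new StandardAnalyzer(Version.LUCENE_34));

        // "Boolean" form: AND/OR/NOT precedence is a frequent source of surprises.
        Query boolStyle = parser.parse("lucene AND solr OR nutch NOT hadoop");

        // Prefix form: each clause states explicitly whether it is required (+),
        // optional (no prefix), or prohibited (-).
        Query prefixStyle = parser.parse("+lucene solr nutch -hadoop");

        System.out.println(boolStyle);    // shows how the clauses were actually grouped
        System.out.println(prefixStyle);
    }
}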

LucidWorks Enterprise 2.0.1 Release

Filed under: Lucene,LucidWorks,Solr — Patrick Durusau @ 6:01 pm

LucidWorks Enterprise 2.0.1 Release

From the post:

LucidWorks Enterprise 2.0.1 is an interim bug-fix release. We have resolved a couple of critical bugs and LDAP integration issues. The list of issues resolved with this update is available here.

December 28, 2011

Luke – The Lucene Index Toolbox v. 3.5.0

Filed under: Lucene,Luke — Patrick Durusau @ 9:33 pm

Luke – The Lucene Index Toolbox v. 3.5.0

Andrzej Bialecki writes:

I’m happy to announce the release of Luke – The Lucene Index Toolbox, version 3.5.0. This release includes Lucene 3.5.0 libraries, and you can download it from:

http://code.google.com/p/luke

Changes in version 3.5.0 (released on 2011.12.28):
* Update to Lucene 3.5.0 and fix some deprecated API usage.
* Issue 49 : fix faulty logic that prevented opening indexes in
read-only mode (MarkHarwood).
* Issue 43 : fix left-over references to Field (merkertr).
* Issue 42 : Luke should indicate if a field is a numeric field (merkertr).

Enjoy!

PS. Merry Christmas and a happy New Year to you all! 🙂

About Luke (from its homepage):

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…

The current stable release of Luke is 3.5.0, and it includes Lucene 3.5.0 and Hadoop 0.20.2. Luke 1.0.1 (using Lucene 3.0.1), 0.9.9.1 (based on Lucene 2.9.1), and other versions are also available – please see the Downloads section.

Luke releases are numbered the same as the version of Lucene libraries that they use (plus a minor number in case of bugfix releases).

Below is a screenshot of the application showing the Overview section, which displays the details of the index format and some overall statistics.

[Screenshot: the Luke Overview tab]

December 22, 2011

Lucene & Solr Year 2011 in Review

Filed under: Lucene,Solr — Patrick Durusau @ 7:38 pm

Lucene & Solr Year 2011 in Review

An excellent review of the developments in Lucene and Solr for 2011.

“Big data” may be the buzzword for 2012, but Lucene and Solr are part of the buzz-saw tool kit (along with SQL and NoSQL databases) for taming “big data.”

If you have the time, you would be well advised to at least monitor the user lists, if not the developer lists, for both projects.

December 21, 2011

Reusable TokenStreams

Filed under: Lucene,Text Analytics — Patrick Durusau @ 7:21 pm

Reusable TokenStreams by Chris Male.

Abstract:

This white paper covers how Lucene’s text analysis system works today, providing an understanding of what a TokenStream is, what the differences between Analyzers, TokenFilters and Tokenizers are, and how reuse impacts the design and implementation of each of these components.

Useful treatment of Lucene’s text analysis features. Those are still developing and more changes are promised (but left rather vague) for the future.

One covered feature of particular interest is the ability to associate geographic location data with terms deemed to represent locations.

Occurs to me that such a feature could also be used to annotate terms during text analysis to associate subject identifiers with those terms.

An application doesn’t have to “understand” that terms have different meanings so long as it can distinguish one from another based on annotations. (Or map them together despite different identifiers.)
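
To make the terminology concrete, here is a minimal Lucene 3.x sketch (the field name and sample text are illustrative) that walks the TokenStream an Analyzer produces. A custom TokenFilter that attached subject identifiers would expose them as just another attribute on the stream.

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamWalk {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("Luke is a handy Lucene index toolbox"));

        // Attributes are the per-token "views" a TokenStream exposes; a custom
        // TokenFilter could add its own attribute (e.g. a subject identifier)
        // in exactly the same way.
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString() + " [" + offset.startOffset()
                    + "," + offset.endOffset() + "]");
        }
        ts.end();
        ts.close();
    }
}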

December 20, 2011

Lucene today, tomorrow and beyond

Filed under: Lucene — Patrick Durusau @ 8:26 pm

Lucene today, tomorrow and beyond

Presentation by Simon Willnauer, mostly about what Lucene doesn’t do or do well today. Suggestions for possible evolution of Lucene but the direction depends on the community. Exciting times look like they are going to continue!

November 26, 2011

Lucene/Solr 3.5 Release Imminent!

Filed under: Lucene,Solr — Patrick Durusau @ 8:04 pm

The mirrors are being updated for the release of Lucene/Solr 3.5!

Expect the formal announcement any time now.

My mirror sites show directory creation dates of 25 Nov. 2011.

For Lucene, your nearest download site.

For Solr, your nearest download site.

November 14, 2011

SearcherLifetimeManager prevents a broken search user experience

Filed under: Interface Research/Design,Lucene,Searching — Patrick Durusau @ 7:16 pm

SearcherLifetimeManager prevents a broken search user experience

From the post:

In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn’t change them very often.

But these days it’s just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.

Lucene’s near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.

But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?

Wrong!

Read the post to see why that’s wrong (it involves getting different searchers for the same search), but also consider your topic map.

Does it have the same issue?

A C-Suite user queries your topic map and gets one answer. Several minutes later, a non-C-Suite user runs the same query and gets an updated answer, one that isn’t consistent with the information given to the C-Suite user. Obviously the non-C-Suite user is wrong, as is your software, should push come to shove.

How do you avoid a “broken search user experience” with your topic map? Or do you just hope information isn’t updated often enough for anyone to notice?
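
For reference, a minimal Java sketch of the pattern the post describes, pinning follow-on requests to the point-in-time searcher the user saw first. The class and method names are mine; SearcherLifetimeManager and SearcherManager are the Lucene 3.5 classes the post refers to.

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherLifetimeManager;
import org.apache.lucene.search.SearcherManager;

public class SessionSafeSearch {
    private final SearcherManager searcherMgr;               // hands out the *latest* searcher
    private final SearcherLifetimeManager lifetimeMgr = new SearcherLifetimeManager();

    public SessionSafeSearch(SearcherManager searcherMgr) {
        this.searcherMgr = searcherMgr;
    }

    /** First search in a session: remember which point-in-time view the user saw. */
    public long firstSearch() throws IOException {
        IndexSearcher s = searcherMgr.acquire();
        try {
            // ... run the query, render page 1 ...
            return lifetimeMgr.record(s);   // token travels back to the client with the results
        } finally {
            searcherMgr.release(s);
        }
    }

    /** Follow-on request (next page, drill-down): reuse the same view if it is still alive. */
    public void followOnSearch(long token) throws IOException {
        IndexSearcher s = lifetimeMgr.acquire(token);
        boolean fromLifetime = (s != null);
        if (!fromLifetime) {
            s = searcherMgr.acquire();      // pruned; fall back to the freshest searcher
        }
        try {
            // ... run the refined query against a consistent snapshot ...
        } finally {
            if (fromLifetime) {
                lifetimeMgr.release(s);
            } else {
                searcherMgr.release(s);
            }
        }
    }
}

The manager also provides a prune method for retiring searchers past a given age, so stale views do not pile up; see the post for the trade-offs.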

Bet You Didn’t Know Lucene Can…

Filed under: Lucene — Patrick Durusau @ 7:15 pm

Bet You Didn’t Know Lucene Can… by Grant Ingersoll.

Grant’s slides from ApacheCon 2011.

Imaginative uses of Lucene for non-search(?) purposes. Depends on how you define search.

November 10, 2011

Importing data from another Solr

Filed under: Lucene,Solr — Patrick Durusau @ 6:48 pm

Importing data from another Solr

Luca Cavanna writes:

The Data Import Handler is a popular method to import data into a Solr instance. It provides out of the box integration with databases, xml sources, e-mails and documents. A Solr instance often has multiple sources and the process to import data is usually expensive in terms of time and resources. Meanwhile, if you make some schema changes you will probably find you need to reindex all your data; the same happens with indexes when you want to upgrade to a Solr version without backward compatibility. We can call it “re-index bottleneck”: once you’ve done the first data import involving all your external sources, you will never want to do it the same way again, especially on large indexes and complex systems.

Retrieving stored fields from a running Solr

An easier solution is to query your existing Solr, retrieve all its stored fields, and reindex them on a new instance. Everyone can write their own script to achieve this, but wouldn’t it be useful to have functionality like this out of the box inside Solr? That is the reason the SOLR-1499 issue was created about two years ago. The idea was to have a new EntityProcessor which retrieves data from another Solr instance using SolrJ. Recently, effort has been put into getting this feature committed to Solr’s dataimport contrib module. Bugs have been fixed and test coverage has been increased. Hopefully this feature will be released with Solr 3.5.

A look ahead to the next release of Solr!
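
Until the SolrEntityProcessor route ships, the “write your own script” approach the post mentions looks roughly like this SolrJ sketch. The URLs, batch size and *:* query are placeholders, and only fields marked stored="true" in the source schema can be copied this way.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class SolrToSolrCopy {
    public static void main(String[] args) throws Exception {
        // URLs are placeholders for the old and new Solr instances.
        SolrServer source = new CommonsHttpSolrServer("http://old-solr:8983/solr");
        SolrServer target = new CommonsHttpSolrServer("http://new-solr:8983/solr");

        int rows = 500;
        for (int start = 0; ; start += rows) {
            QueryResponse rsp = source.query(new SolrQuery("*:*").setStart(start).setRows(rows));
            SolrDocumentList page = rsp.getResults();
            if (page.isEmpty()) break;

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : page) {
                SolrInputDocument in = new SolrInputDocument();
                for (String field : doc.getFieldNames()) {
                    in.addField(field, doc.getFieldValue(field));   // only stored fields survive
                }
                batch.add(in);
            }
            target.add(batch);
        }
        target.commit();
    }
}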

November 8, 2011

Apache Lucene Eurocon 2011 – Presentations

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 7:45 pm

Apache Lucene Eurocon 2011 – Presentations

From the website:

Apache Lucene Eurocon 2011, held in Barcelona between October 17-20, was a huge success. The conference was packed with technical sessions, developer content, user case studies, panels, and networking opportunities. Lucene Revolution featured the thought leaders building and deploying Lucene/Solr open source search technology. Compelling speakers and unmatched networking opportunities created a unique community of practice and experience, so you too can unlock the power, versatility and cost-effective capabilities of search across industries, data, and applications.

If you missed the chance to attend the Apache Lucene Eurocon or any part of it, you can still get your hands on what the speakers delivered. We have posted most of the presentations below for download and review, along with videos of select speakers (as available).

Many thanks to Lucid Imagination for the conference and for making these conference materials available. It is not like being there, but it does extend the conversations to include those who were not present.

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Filed under: Hadoop,Lucene,LucidWorks,Mahout,Solr,Topic Maps — Patrick Durusau @ 7:44 pm

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Slides

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon? 😉

On slide 4 (I said to review the slides), Grant presents four overlapping areas, starting with Documents: Models, Feature Selection; Content Relationships: Page Rank, etc., Organization; Queries: Phrases, NLP; User Interaction: Clicks, Ratings/Reviews, Learning to Rank, Social Graph; and the intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant says that mining that intersection is a loop composed of: Search -> Discovery -> Analytics -> (back to Search). All of these involve processing data that has been collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like; you may find interesting data, or you may not.

My point being that there is no way to capture the user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson, so that the next user who searches for LBJ would find the information contributed by a prior user. Such as distinguishing the application of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is in just the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

Imagine a colleague has already found a bug in program code, but you have to start at the beginning of the program and work your way to it yourself. Good use of your time? To reset with every user? That is what happens with search: nearly a complete reset. (Not complete because of page rank, etc., but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*”Clicks” are not insights. Could be an unsteady hand, DTs, etc.

November 7, 2011

Using Lucene and Cascalog for Fast Text Processing at Scale

Filed under: Cascalog,Clojure,LingPipe,Lucene,Natural Language Processing,OpenNLP,Stanford NLP — Patrick Durusau @ 7:29 pm

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop, written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

The world of text exploration just gets better all the time!

November 4, 2011

Near-real-time readers with Lucene’s SearcherManager and NRTManager

Filed under: Indexing,Lucene,Software — Patrick Durusau @ 6:11 pm

Near-real-time readers with Lucene’s SearcherManager and NRTManager

From the post:

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that’s actively changing the index (i.e., it’s in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice “best of both worlds” blend between the traditional immediate and eventual consistency models.

Getting into the hardcore parts of Lucene!

Understanding Lucene (or a similar indexing engine) is critical to both mining data as well as delivery of topic map based information to users.
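
The heart of the NRT story fits in a small sketch (Lucene 3.x API; the RAMDirectory and field are illustrative): the reader opened from the IndexWriter sees changes before any commit, which is exactly the decoupling of visibility from durability described above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtReaderDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // NRT reader: sees the uncommitted document without writer.commit().
        // Visibility (reopen) and durability (commit) become separate decisions;
        // SearcherManager/NRTManager wrap this pattern to make sharing across
        // threads safe.
        IndexReader nrtReader = IndexReader.open(writer, true);
        IndexSearcher searcher = new IndexSearcher(nrtReader);
        System.out.println("visible docs: " + nrtReader.numDocs());   // 1, before any commit

        searcher.close();
        nrtReader.close();
        writer.close();
    }
}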

October 26, 2011

Computing Document Similarity using Lucene Term Vectors

Filed under: Lucene,Similarity,Vectors — Patrick Durusau @ 6:58 pm

Computing Document Similarity using Lucene Term Vectors

From the post:

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

Nice walk through on document vectors.

Plus a reminder that “document” similarity can only take you so far. Once you find a relevant document, you still have to search for the subject of interest. Not to mention that you view that subject absent its relationship to other subjects, etc.
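
A bare-bones sketch of the idea (Lucene 3.x term vector API; it assumes the “content” field was indexed with termVectors="true", as in the post): cosine similarity over raw term frequencies pulled straight from the index.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

    /** Cosine similarity between two docs' "content" term vectors (raw term frequencies). */
    public static double cosine(IndexReader reader, int docA, int docB) throws Exception {
        Map<String, Integer> a = toMap(reader.getTermFreqVector(docA, "content"));
        Map<String, Integer> b = toMap(reader.getTermFreqVector(docB, "content"));

        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer tfB = b.get(e.getKey());
            if (tfB != null) dot += (double) e.getValue() * tfB;
        }
        for (int tf : b.values()) normB += (double) tf * tf;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static Map<String, Integer> toMap(TermFreqVector tv) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        if (tv == null) return m;      // field had no term vector stored
        String[] terms = tv.getTerms();
        int[] freqs = tv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) m.put(terms[i], freqs[i]);
        return m;
    }
}

One way to move from raw frequencies to TF-IDF weighting, as the post does, is to scale each frequency by an inverse document frequency derived from reader.docFreq() for that term.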

October 23, 2011

How to create and search a Lucene.Net index…

Filed under: .Net,C#,Lucene — Patrick Durusau @ 7:21 pm

How to create and search a Lucene.Net index in 4 simple steps using C#, Step 1

From the post:

As mentioned in a previous blog, using Lucene.Net to create and search an index was quick and easy. Here I will show you in these 4 steps how to do it.

  • Create an index
  • Build the query
  • Perform the search
  • Display the results

Before we get started I wanted to mention that Lucene.Net was originally designed for Java. Because of this I think the creators used some classes in Lucene that already exist in the .Net framework. Therefore, we need to use the entire path to the classes and methods instead of using a directive to shorten it for us.

Useful for anyone exploring topic maps as a native MS Windows application.
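
For comparison, here are the same four steps in Java Lucene (not the post’s C# code; a minimal 3.x sketch with made-up field names):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FourSteps {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        Directory dir = new RAMDirectory();

        // 1. Create an index
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34, analyzer));
        Document doc = new Document();
        doc.add(new Field("title", "Lucene in four steps", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // 2. Build the query
        Query query = new QueryParser(Version.LUCENE_34, "title", analyzer).parse("lucene");

        // 3. Perform the search
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        TopDocs hits = searcher.search(query, 10);

        // 4. Display the results
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("title") + "  score=" + sd.score);
        }
        searcher.close();
    }
}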

Lucene Search Programming

Filed under: Lucene — Patrick Durusau @ 7:21 pm

Lucene Search Programming

Nothing startling but a good review of Lucene based searching with examples.

Recommended for .Net programmers.

October 15, 2011

Alternatives to full text queries

Filed under: Lucene,Solr,Sphinx,Xapian — Patrick Durusau @ 4:27 pm

Alternatives to full text queries (part 1) and Alternatives to full text queries (part 2) by Fernando Doglio.

Useful pair of posts but I found the title misleading.

From the post:

Another point of interest to consider is that though, in the long run, all four solutions provide very similar services, they do it a bit differently, since they can be categorized into two groups:

  • Full text search servers: They provide a finished solution, ready for the developers to install and interact with. You don’t have to integrate them into your application; you only have to interact with them. In here we have Solr and Sphinx.
  • Full text search APIs: They provide the functionalities needed by the developer, but at a lower level. You’ll need to integrate these APIs into your application, instead of just consuming its services through a standard interface (like what happens with the servers). In here, we have the Lucene Project and the Xapian project.

But neither option is an “alternative” to “full text queries.” Alternatives to “full text queries” would include LCSH or MeSH or similar systems.

Useful posts, as I said, but the area is cloudy enough without inventing unhelpful distinctions.

October 10, 2011

Integrating Zend Framework Lucene with your Cake Application

Filed under: Lucene,PHP — Patrick Durusau @ 6:19 pm

Integrating Zend Framework Lucene with your Cake Application

From the post:

This is a short tutorial that teaches you how to integrate Zend Framework’s Lucene implementation (100% PHP) to your application. It requires your server to have PHP5 installed, since ZF only runs on PHP5, and is likely to be deprecated very soon.

Another search implementation guide.

Curious (for my students): could I take the results of a search at one site and combine them with the results from another site? If that were your task, what questions would you ask? Is using the same search engine enough? If not, what more would you need to know? Is there anything you would like to do as part of the combining process? Assume that you have free access to any needed data.

October 7, 2011

Optimizing Findability in Lucene and Solr

Filed under: Lucene,Solr — Patrick Durusau @ 6:18 pm

Optimizing Findability in Lucene and Solr

From the post:

To paraphrase an age-old question about trees falling in the woods: “If content lives in your application and you can’t find it, does it still exist?” In this article, we explore how to make your content findable by presenting tips and techniques for discovering what is important in your content and how to leverage it in the Lucene Stack.

I would ask:

“If content is available on the WWW and you can’t find it, does it still exist?”

Unlike the tree example, I think that has a fairly clear answer: No.

It can’t influence your decisions, it can’t shape your policies, it can’t form the basis for new ideas or products, or to help you avoid costly mistakes. That sounds like it doesn’t exist to me.

The post is fairly detailed but well worth the effort. Enjoy!

October 4, 2011

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

Filed under: Lucene,Mahout,Recommendation — Patrick Durusau @ 7:57 pm

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

From the webpage:

This is the first post in a four part series about a wine rating and recommendation Web application, named VinWiki, built using open source technology. The purpose of this series is to document key design and implementation decisions, which may be of interest to anyone wanting to build an intelligent Web application using Java technologies. The end result will not be a 100% functioning Web application, but will have enough functionality to prove the concepts.

I thought about Lars Marius and his expertise at beer evaluation when I saw this series. Not that Lars would need it but it looks like the sort of thing you could build to recommend things you know something about, and like. Whatever that may be. 😉

October 1, 2011

The Getty Search Gateway

Filed under: Lucene,Museums,Solr — Patrick Durusau @ 8:28 pm

The Getty Search Gateway at all things cataloged

Interesting review of the new search capabilities at the Getty. Covers their use of Solr and some of its more interesting capabilities. Searches across collections and other information sources.

After reading the post and using the site, what would you do differently with a topic map? In particular?

September 29, 2011

Hadoop for Archiving Email

Filed under: Hadoop,Lucene,Solr — Patrick Durusau @ 6:35 pm

Hadoop for Archiving Email by Sunil Sitaula.

When I saw the title of this post I started wondering if the NSA was having trouble with archiving all my email. 😉

From the post:

This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

The post concludes:

In this post I have described the conversion of email files into sequence files and storing them using HDFS, and I have looked at how to search through them to output results. Given the “simply add a node” scalability feature of Hadoop, it is very straightforward to add more storage as well as search capacity. Furthermore, given that Hadoop clusters are built using commodity hardware, that the software itself is open source, and that the framework makes it simple to implement specific use cases, the overall solution is very cost effective compared to a number of existing software products that provide similar capabilities. The search portion of the solution, however, is very rudimentary. In part 2, I will look at using Lucene/Solr for indexing and searching in a more standard and robust way.

Read part one and get ready for part 2!

And start thinking about what indexing/search capabilities you are going to want.
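
The conversion step the post describes boils down to writing one key/value pair per message into a Hadoop SequenceFile. A sketch of that step (the paths, the Text/Text key and value choice, and reading messages as plain text are my assumptions, not the post’s code):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class EmailToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // resolves to HDFS if core-site.xml says so

        // One key/value pair per message: key = file name, value = raw message text.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/archive/mail-2011.seq"), Text.class, Text.class);
        try {
            File[] messages = new File("/var/mail/export").listFiles();
            if (messages != null) {
                for (File eml : messages) {
                    String body = new String(Files.readAllBytes(eml.toPath()), "UTF-8");
                    writer.append(new Text(eml.getName()), new Text(body));
                }
            }
        } finally {
            writer.close();
        }
    }
}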


Update: Hadoop for Archiving Email – Part 2

September 27, 2011

LucidWorks 2.0, the search platform for Apache Solr/Lucene (stolen post)

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 6:48 pm

LucidWorks 2.0, the search platform for Apache Solr/Lucene by David M. Fishman.

Apologies to David because I stole his entire post, with links to the Lucid site. Could not figure out what to leave out so I included it all.

If you’re a search application developer or architect, if you’ve got big data on your hands or on the brain, or if you’ve got big plans for Apache Lucene/Solr, this announcement is for you.

Today marks the 2.0 release of LucidWorks, the search platform that accelerates and simplifies development of highly accurate, scalable, and cost-effective search applications. We’ve bottled the best of Apache Lucene/Solr, including key innovations from the 4.x branch, in a commercial-grade package that’s designed for the rigors of production search application deployment.

Killer search applications are popping up everywhere, and it’s no surprise. On the one hand, big data technologies are disrupting old barriers of speed, structure, cost and addressability of data storage; on the other, the new frontier of query-driven analytics is shifting from old-school reporting to instant, unlimited reach into mixed data structures, driven by users. (There are places these converge: 7 years of data in Facebook combine content with user context, creating a whole new way to look at life as we know it on line.)

Or, to put it a little less breathlessly: Search is now the UI for Big Data. LucidWorks 2.0 is the only distribution of Apache Solr/Lucene that lets you:

  • Build killer business-critical search apps more quickly and easily
  • Streamline search setup and optimization for more reliable operations
  • Access big data and enterprise content faster and more securely
  • Scale to billions without spending millions

If you surf through our website, you’ll find info on features and benefits, screenshots, a detailed technical overview, and access to product documentation. But that’s all talk. Download LucidWorks Enterprise 2.0, or apply for a spot in the Private Beta for LucidWorks Cloud, and take it for a spin.

They say imitation is the sincerest form of flattery. Maybe that will make David feel better!

Seriously, this is an important milestone, both for today and for what is yet to come in the search arena.

September 26, 2011

Lucene and Solr’s CheckIndex to the Rescue!

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 7:03 pm

Lucene and Solr’s CheckIndex to the Rescue! by Rafał Kuć.

From the post:

While using Lucene and Solr we are used to very high reliability. However, there may come a day when Solr informs us that our index is corrupted, and we need to do something about it. Is the only way to repair the index to restore it from a backup or to do a full reindex? No – there is hope in the form of the CheckIndex tool.

What is CheckIndex?

CheckIndex is a tool available in the Lucene library which allows you to check the index files and create new segments that do not contain problematic entries. This means that this tool, with little loss of data, is able to repair a broken index and thus save us from having to restore the index from a backup (if we have one, of course) or from doing a full reindex of all documents that were stored in Solr.

The question at the end of the article about when the last backup was run isn’t meant to be funny.

When I was training to be a NetWare sysadmin, more than a little while ago, one of the manuals advised that the #1 reason for sysadmins being fired was failure to maintain proper backups. I suspect that is probably still the case. Or at least I hope it is. There really is no excuse for failing to maintain proper backups.
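
CheckIndex can be run from the command line (the org.apache.lucene.index.CheckIndex main class, with -fix to actually rewrite the segments file) or driven from Java. A hedged sketch against the Lucene 3.x API:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RepairIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));   // path to the broken index

        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);                     // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex();

        if (status.clean) {
            System.out.println("Index is healthy, nothing to do.");
        } else {
            // WARNING: fixing drops the documents in the broken segments for good.
            // Only run this after taking whatever backup you still can.
            checker.fixIndex(status);
            System.out.println("Wrote new segments file; corrupted documents were lost.");
        }
        dir.close();
    }
}

Either way, the repair means losing the documents in the corrupted segments, which is why the backup question matters.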

September 20, 2011

Estimating Memory and Storage for Lucene/Solr

Filed under: Lucene,Solr — Patrick Durusau @ 7:52 pm

Estimating Memory and Storage for Lucene/Solr

This is very cool!

Grant Ingersoll has put together an Excel spreadsheet to enable modeling of memory and disk space based on the formula in Lucene in Action (2nd ed.) with caveats for its use.

Starting a Search Application

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 7:52 pm

Starting a Search Application

A useful whitepaper by Marc Krellenstein, CTO at Lucid Imagination.

I am interested in your reaction to Marc’s listing of the use cases for full-text searching:

Full-text search is good at a variety of information requests that can be hard to satisfy with other technologies. These include:

  • Finding the most relevant information about a specific topic, or an answer to a particular question,
  • Locating a specific document or content item, and
  • Exploring information in a general area, or even browsing the collection of documents or other content as a whole (this is often supported by clustering; see below).

For my class, write a one-page reaction to each of these points (that’s 3 pages total), and say what “other” technologies you might use.

For class discussion, it would be nice if you could offer an example of either full-text searching or “other” technologies meeting these requests.

Testing/exploring Marc’s “information requests:”

Two teams.

Team One has a set of the Great Books of the Western World and uses the Syntopicon to answer information requests.

Team Two has access to a full-text version of Great Books of the Western World to answer information requests.

The class, including the teams, creates questions that are sent to me privately, and I will prepare the final list of questions to be put to the teams. Questions are given to both teams at the same time, and the first team with the correct answer (it must include a citation to the Great Books) wins.

I am open to suggestions for prizes.

In the class following the contest, we will discuss why some questions worked better with full-text search and why some worked better with the Syntopicon. It will give you insight into the choices you will have to make when creating a topic map.

BTW, the requirements section of Marc’s paper will help you in designing any information system. If you don’t know what is expected and can’t test for it, you are unlikely to satisfy anyone’s needs.

September 16, 2011

What’s new in Apache Solr 3.4(?)

Filed under: Lucene,Solr — Patrick Durusau @ 6:42 pm

What’s new in Apache Solr 3.4: New Programmer’s Guide now available

From the post:

Yesterday’s announcement of the release of Solr 3.4 brings with it a host of welcome improvements that make search-related applications more powerful, faster, and easier to build. We’ve put together a new Programmer’s Guide to Open Source Search: What’s New in Apache Solr / Lucene 3.4 with details on what this new release holds for you, both in terms of what’s under the hood, new usability and user experience features, and new search capabilities:

This paper covers innovations including:

  • New search capabilities such as query support, function queries, analysis, input and output formats.
  • Performance improvements such as index segment management and distributed support for spellchecking.
  • New search application development options such as better range faceting and a new Velocity-driven search UI, plus spatial search and using Apache UIMA.
  • What to expect in Solr 4

Be sure to check out the annotators that link to services such as OpenCalais (page 20 of the whitepaper). They won’t be perfect, but they will certainly do well enough (with your assistance) to be useful.

September 15, 2011

Lucene and Solr 3.4.0 Released

Filed under: Lucene,Solr — Patrick Durusau @ 7:52 pm

Lucene and Solr 3.4.0 Released

Erik Hatcher writes:

There are several juicy additions, but also a critical bug fix. It is recommended that all 3.x-using applications upgrade to 3.4 as soon as possible. Here’s the scoop on this fixed bug:

* Fixed a major bug (LUCENE-3418) whereby a Lucene index could
  easily become corrupted if the OS or computer crashed or lost
  power.

Lucene 3.4.0 includes: a new faceting module (contrib/facet) for computing facet counts (both hierarchical and non-hierarchical) at search time (LUCENE-3079); a new join module (contrib/join), enabling indexing and searching of nested (parent/child) documents (LUCENE-3171); the ability to index documents with term frequencies included but without positions (LUCENE-2048) – previously omitTermFreqAndPositions always omitted both; and a few other improvements.

Solr 3.4.0 includes: Lucene 3.4.0, which fixes the serious bug mentioned above; a new XsltUpdateRequestHandler allows posting XML that’s transformed by a provided XSLT into a valid Solr document (SOLR-2630); field grouping/collapsing: post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per-group (SOLR-2665); The query cache and filter cache can now be disabled per request (SOLR-2429); Improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233); various fixes for multi-threaded DataImportHandler; and a few other improvements.

Here are the links for more information and download access: Lucene 3.4.0 and Solr 3.4.0
