September 15, 2011
The State of Solandra – Summer 2011
From SemanText:
A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend. Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra. Let’s see what Jake and Solandra are up to these days.
What is the current status of Solandra in terms of features and stability?
Solandra has gone through a few iterations. First as Lucandra, which partitioned data by terms and used Thrift to communicate with Cassandra. This worked for a few big use cases, mainly managing an index per user, and garnered a number of adopters. But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query.
Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra. The core idea of Solandra is to use Cassandra as a foundation for scaling Solr. It achieves this by embedding Solr in the Cassandra runtime and using the Cassandra routing layer to auto-shard an index across the ring (by document). This means good random distribution of data for writes (using Cassandra’s RandomPartitioner) and good search performance, since individual shards can be searched in parallel across nodes (using Solr distributed search). Cassandra is responsible for sharding, replication, failover and compaction. The end user now gets a single scalable component for search, without changing APIs, which will scale in the background for them. Since search functionality is performed by Solr, it will support anything Solr does.
I gave a talk recently on Solandra and how it works: http://blip.tv/datastax/scaling-solr-with-cassandra-5491642
…more follows, worth your attention.
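Since the pitch is scaling Solr without changing any APIs, a plain SolrJ client ought to work against a Solandra node unchanged. A minimal sketch; the endpoint URL and index name are my own guesses, not from the interview:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolandraSketch {
    public static void main(String[] args) throws Exception {
        // Plain SolrJ pointed at a Solandra node; no client-side sharding,
        // since Cassandra's routing layer distributes documents across the ring.
        CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solandra/books");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Scaling Solr with Cassandra");
        solr.add(doc);
        solr.commit();

        QueryResponse rsp = solr.query(new SolrQuery("title:cassandra"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
```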
Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search
Just to tempt you to read the rest of the post:
As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful. The individual pieces of Solr Cloud functionality that are being worked on are as follows:
- Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
- As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing, in ZooKeeper, the information about which node is the leader of which shard, in order to notify all interested parties. The development is pretty active here and initial patches already exist.
- At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
- Another feature Solr Cloud will have is automatic splitting and migrating of indices. The idea is that when some shard’s index becomes too large, or the shard itself starts having bad query response times, we should be able to split parts of that index and migrate them (or merge them) with indices on other (less loaded) nodes. Again, the work on this hasn’t started yet. Once this is implemented, one will be able to split and move/merge indices using the Solr Core Admin as described in SOLR-2593.
- To achieve more efficiency in search and gain control over exactly where each document gets indexed, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only those shards known to hold the target documents, making queries more efficient and faster. This, along with the above-mentioned shard distribution policy, is akin to the routing functionality in ElasticSearch.
Isn’t that an amazing level of activity? I get tired just reading about it. 😉 Now if it can just be applied as cleverly as it has been written.
BTW, here is Part 1 if you are interested.
September 14, 2011
Flexible ranking in Lucene 4
Robert Muir writes:
Over the summer I served as a Google Summer of Code mentor for David Nemeskey, PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework.
These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.
Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene: whether you’ve been frustrated with tuning TF/IDF weights and find an alternative model works better for your case, found it difficult to integrate custom logic that your application needs, or just want to experiment.
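If you are brave enough to run trunk, switching models is essentially a one-liner on the searcher. A sketch of the idea; the similarities package is real, but trunk APIs were still moving at the time, so treat the method names as approximate:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BM25Sketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/index"));
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        // Swap the classic TF/IDF scoring for BM25; boosts, slops and
        // explanations keep working as before.
        searcher.setSimilarity(new BM25Similarity());
        // ... run term, phrase or span queries as usual ...
    }
}
```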
The wiki page for this project has a pointer to the search engine in A “Terrier” For Your Tool Box?.
I count a baker’s dozen or so new features described in this post.
September 10, 2011
SearchWorkings
From the About Us page:
SearchWorkings.org was created by a bunch of really passionate search technology professionals who realised that the world (read: other search professionals) doesn’t have a single point of contact or comprehensive resource where they can learn and talk about all the exciting new developments in the wonderful world of open source search solutions. These professionals all work at JTeam, a leading supplier of high-quality custom-built applications and end-to-end solutions provider, and moreover a market leader when it comes to search solutions.
A wide variety of materials: whitepapers, articles, forums (Lucene, Solr, ElasticSearch, Mahout), training videos, news, and blogs.
You do have to register/join (free) to get access to the good stuff.
September 9, 2011
Hibernate Search with Lucene
From the post:
This post is in continuation of my last post – http://blogs.globallogic.com/introduction-to-lucene – in which I gave a brief introduction to Lucene.
There are many Web applications out there that provide access to data stored in a relational database, but what’s the easiest way to enable users to search through that data and find what they need? There are a number of query types that RDBMSs in general do not support without vendor extensions:
- Fuzzy queries, in which “fuzzy” and “wuzzy” are considered matches
- Word stemming queries, which consider “take,” “took,” and “taken” to be identical
- Sound-like queries, which consider “cat” and “kat” to be identical
- Synonym queries, which consider “jump,” “hop,” and “leap” to be identical
- Queries on binary BLOB data types, such as PDF documents, Microsoft Word or Excel documents, or HTML and XML documents
Hibernate Search brings the power of full text search engines to the persistence domain model by combining Hibernate Core with the capabilities of the Apache Lucene™ search engine. Even though Hibernate Search is using Apache Lucene™ under the hood you can always fallback to the native Lucene APIs if the need arises.
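To make the fuzzy-query case concrete, here is a minimal sketch using Hibernate Search’s query DSL. The Book entity, field name, and open Session are my own illustration, not from the post:

```java
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.lucene.search.Query;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.query.dsl.QueryBuilder;

@Entity
@Indexed
class Book {
    @Id Long id;
    @Field String title; // analyzed and indexed by Lucene behind the scenes
}

class FuzzySearchSketch {
    @SuppressWarnings("unchecked")
    static List<Book> fuzzyByTitle(Session session, String term) {
        FullTextSession fts = Search.getFullTextSession(session);
        QueryBuilder qb = fts.getSearchFactory()
                .buildQueryBuilder().forEntity(Book.class).get();
        // A fuzzy keyword query: "wuzzy" will match documents containing "fuzzy"
        Query luceneQuery = qb.keyword().fuzzy()
                .onField("title").matching(term).createQuery();
        return fts.createFullTextQuery(luceneQuery, Book.class).list();
    }
}
```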
These posts were written against Hibernate 3.4.1. Just so you know, Hibernate 4.0.0 Alpha2 is out (8 September 2011).
Introduces the basics of Hibernate Search.
August 28, 2011
Road To A Distributed Search Engine by Shay Banon.
If you are looking for a crash course on the construction details of Elasticsearch, you are in the right place.
My only quibble, and this is common to all really good presentations (this is one of those), is that there isn’t a transcript to go along with it. There is so much information that I will have to watch it more than once to take it all in.
If you watch the presentation, do pay attention so you are not like the person who suggested that Solr and Elasticsearch were similar. 😉
August 23, 2011
Lucene 4.0: The Revolution by Simon Willnauer.
From the post:
The near-limitless innovative potential of a thriving open source community often has to be tempered by the need for a steady roadmap with version compatibility. As a result, once the decision to break backward compatibility in Lucene 4.0 had been made, it opened the floodgates on a host of step changes, which, together, will deliver a product whose performance is unrecognisable from previous 3.x releases.
One of the most significant changes in Lucene 4.0 is the full switch to using bytes (UTF8) in place of text strings for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the ‘term dictionary’, used as a core part of the index, can now be loaded up to 30 times faster; it uses 10% of the memory; and search speeds are increased by removing the need for string conversion.
This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is ‘flexible indexing’. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields.
The performance enhancements through flexible indexing are highly case specific. However, flexible indexing introduces an entirely new dimension to the Lucene project. New indexing codecs can be developed and existing ones updated without the need for hard-coding within Lucene. There is no longer any need for project-level compromise on the best general-purpose index formats and data structures. A new field of specialised codec development can take place independently from development of the Lucene kernel.
Looks like the time to be learning new features of Lucene 4.0 is now!
Flexible indexing! That sounds very cool.
August 9, 2011
Solr Powered ISFDB
The first in a series of posts on Solr and the ISFDB. (Try Solr-ISFDB for all the posts.)
ISFDB = Internet Speculative Fiction Database.
A bit over 650,000 documents when this series started last January, so we aren’t talking “big data,” but it’s a fun data set. And the lessons learned here will stand us in good stead with much larger data sets.
I haven’t read all the posts yet but did notice comments about modeling relationships. As I work through the posts, I will see how close (or not) that modeling comes to a topic maps approach.
Working through something like this won’t hurt in terms of preparing for Lucene/Solr certification either. I haven’t decided on that, but until we have a topic map certification it seems worth considering.
Modifying a Lucene Snowball Stemmer
From the post:
This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to getting a Lucene development environment running, then a Solr environment, and lastly to creating your own Snowball stemmer. Read on if that seems interesting. The recipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.
When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer, a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc.), to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball, which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.
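Before diving in, it helps to see the stock stemmer at work. A minimal sketch with the Lucene 3.x analysis API; the version constant and the sample Norwegian words are mine:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StemSketch {
    public static void main(String[] args) throws Exception {
        // SnowballAnalyzer takes the name of a Snowball-generated stemmer
        SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_33, "Norwegian");
        TokenStream ts = analyzer.tokenStream("text", new StringReader("hestene hester"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints the stemmed forms
        }
        ts.close();
    }
}
```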
Really advanced users will want to check out Snowball’s namesake, SNOBOL, a powerful pattern matching language that was invented in 1962. (That’s not a typo, 1962.) See SNOBOL4.org for more information and resources.
The post outlines how to change the default stemmer for Lucene and Solr to improve its stemming of Norwegian words. Useful in case you want to write/improve a stemmer for another language.
August 8, 2011
Creating an Elasticsearch Plugin
From the post:
Elasticsearch is a great search engine built on top of Apache Lucene. We came across the need to add new functionality and did not want to fork Elasticsearch for this. Luckily Elasticsearch comes with a plugin framework. We already leverage this framework to use the Apache Thrift transport. There was no documentation on how to create a plugin, so after digging around in the code a little we were able to create our own plugin.
Here is a tutorial on creating a plugin and installing it into Elasticsearch.
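For a taste of what the tutorial covers, a plugin skeleton for the Elasticsearch of that era looks roughly like the sketch below; the package, class, and plugin names are hypothetical:

```java
package org.example;

import org.elasticsearch.plugins.AbstractPlugin;

// Registered by shipping an es-plugin.properties file at the root of the
// plugin jar containing: plugin=org.example.MyPlugin
public class MyPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "my-plugin"; // short name shown at node startup
    }

    @Override
    public String description() {
        return "Example plugin skeleton";
    }
}
```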
Just in case you are using Elasticsearch and need to extend it.
July 27, 2011
Neo4j: Super-Nodes and Indexed Relationships by Aleksa Vukotic.
From the post:
As part of my recent work for Open Credo, I have been involved in the project that was using Neo4J Graph database for application storage.
Neo4J is one of the first graph databases to appear on the global market. Being open source, in addition to its power and simplicity in supporting the graph data model, it represents a good choice for a production-ready graph database.
However, there has been one area I have struggled to get good-enough performance from Neo4j recently – super nodes.
Super nodes are nodes with dense relationships (100K or more), which quickly become bottlenecks in graph traversal algorithms when using Neo4J. I have tried many different approaches to get around this problem, but the introduction of auto-indexing in Neo4j 1.4 gave me an idea that I had success with. The approach I took is to fetch the relationships of the super nodes using Lucene indexes, instead of using the standard Neo4j APIs. In this entry I’ll share what I managed to achieve and how.
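The core trick reads roughly like the sketch below, against Neo4j 1.4’s index API; the index name, key, and value are made up for illustration:

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.RelationshipIndex;

public class SuperNodeLookup {
    // Fetch only the relationships of interest from the Lucene relationship
    // index, instead of iterating over all 100K+ relationships on the node.
    static IndexHits<Relationship> follows(GraphDatabaseService db, Node superNode) {
        RelationshipIndex idx = db.index().forRelationships("follows");
        // key/value narrow the match; superNode constrains the start node
        return idx.get("type", "FOLLOWS", superNode, null);
    }
}
```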
This looks very promising. Particularly the retrieval of only the relationships of interest for traversal. To me that suggests that we can keep indexes of relationships that may not be frequently consulted. I wonder if that means a facility to “expose” more or fewer relationships as the situation requires?
July 23, 2011
Lucene.net is back on track by Simone Chiaretta
From the post:
More than 6 months ago I blogged about Lucene.net starting its path toward extinction. Soon after that, due to the “stubbornness” of the main committer, a few forks appeared, the biggest of which was Lucere.net by Troy Howard.
At the end of the year, despite the main committer’s promise to comply with the Apache board’s requests himself, nothing happened and Lucene.net came really close to being shut down. But luckily, the same Troy Howard who had forked Lucene.net a few months before decided, together with a bunch of other volunteers, to resubmit the documents required by the Apache Board for starting a new project in the Apache Incubator; by the beginning of February the new proposal was voted through by the Board and the project re-entered the incubator.
If you are interested in search engines and have .Net skills (or want to acquire them), this would be a good place to start.
July 21, 2011
Oracle, Sun Burned, and Solr Exposure
From the post:
Frankly we wondered when Oracle would move off the dime in faceted search. “Faceted search”, in my lingo, is showing users categories. You can fancy up the explanation, but a person looking for a subject may hit a dead end. The “facet” angle displays links to possibly related content. If you want to educate me, use the comments section for this blog, please.
We are always looking for a solution to our clients’ Oracle “findability” woes. It’s not just relevance. Think performance. Query and snack is the operative mode for at least one of our technical baby geese. Well, Oracle is a bit of a red herring. The company is not looking for a solution to SES11g functionality. Lucid Imagination, a company offering enterprise-grade enterprise search solutions, is.
If “findability” is an issue at Oracle, I would be willing to bet that subject identity is as well. Rumor has it that they have paying customers.
July 1, 2011
Lucene 3.3 Announcement
Lucene Features:
- The spellchecker module now includes suggest/auto-complete functionality, with three implementations: Jaspell, Ternary Trie, and Finite State.
- Support for merging results from multiple shards, for both “normal” search results (TopDocs.merge) as well as grouped results using the grouping module (SearchGroup.merge, TopGroups.merge).
- An optimized implementation of KStem, a less aggressive stemmer for English.
- Single-pass grouping implementation based on block document indexing.
- Improvements to MMapDirectory (now also the default implementation returned by FSDirectory.open on 64-bit Linux).
- NRTManager simplifies handling near-real-time search with multiple search threads, allowing the application to control which indexing changes must be visible to which search requests.
- TwoPhaseCommitTool facilitates performing a multi-resource two-phased commit, including IndexWriter.
- The default merge policy, TieredMergePolicy, has a new method (set/getReclaimDeletesWeight) to control how aggressively it targets segments with deletions, and is now more aggressive than before by default.
- PKIndexSplitter tool splits an index by a mid-point term.
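The set/getReclaimDeletesWeight knob mentioned above is easy to picture. A minimal sketch against the 3.3 API; the weight value is an arbitrary example:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MergePolicySketch {
    public static void main(String[] args) throws Exception {
        TieredMergePolicy mp = new TieredMergePolicy();
        // Higher weight = segments carrying many deletions get merged (and
        // their deleted documents reclaimed) more aggressively.
        mp.setReclaimDeletesWeight(3.0);

        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
        cfg.setMergePolicy(mp);
        IndexWriter writer = new IndexWriter(dir, cfg);
        writer.close();
    }
}
```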
Solr 3.3 Announcement
Solr Features:
- Grouping / Field Collapsing
- A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.
- KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.
- Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.
- Important bugfixes, including a fix for extremely high RAM usage in spellchecking.
- Bugfixes and improvements from Apache Lucene 3.3
June 30, 2011
Faceting Module for Lucene!
Reading the log for this issue is an education on how open source projects proceed at their best.
Oh, and it’s worth reading for the faceting aspects you may want to include in a topic map or other application as well.
June 24, 2011
How to use Scala and Lucene to create a basic search application
From the post:
How to use Scala and Lucene to create a basic search application. One of the powerful benefits of Scala is that it has full access to any Java libraries; giving you a tremendous number of available resources and technology. This example doesn’t tap into the full power of Lucene, but highlights how easy it is to incorporate Java libraries into a Scala project.
This example is based off a Twitter analysis app I’ve been noodling on, in which I am utilizing Lucene. The code below takes a list of tweets from a text file and creates an index that you can search and extract info from.
A nice way to become familiar with both Scala and Lucene.
June 21, 2011
Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0 by Simon Willnauer, Apache Lucene PMC.
Abstract:
Lucene 4.0 is on its way to delivering a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues, aka Column Stride Fields, is one of the “next generation” features. DocValues enable Lucene to efficiently store and retrieve type-safe Document & Value pairs in a column stride fashion, either entirely memory-resident with random access or disk-resident and iterator-based, without the need to un-invert fields. Its final goal is to provide independently updatable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene’s Codec API for full extensibility.
Excellent video!
June 16, 2011
Apache Lucene EuroCon Barcelona
From the webpage:
Apache Lucene EuroCon 2011 is the largest conference for the European Apache Lucene/Solr open source search community. Now in its second year, Apache Lucene Eurocon provides an unparalleled opportunity for European search application developers, thought leaders and market makers to connect and network with their peers and get on board with the technology that’s changing the shape of search: Apache Lucene/Solr.
The conference, taking place in cosmopolitan Barcelona, features a wide range of hands-on technical sessions, spanning the breadth and depth of use cases — plus a complete set of technical training workshops. You will hear from the foremost experts on open source search technology, committers and developers practiced in the art and science of search. When you’re at Apache Lucene Eurocon, you can…
Even with feel-me-up security measures at the airport, a trip to Barcelona would be worthwhile anytime. Add a Lucene conference to boot, and who could refuse?
Seriously take advantage of this opportunity to travel this year. Next year, a U.S. presidential election year, will see rumors of security alerts, security alerts, FBI informant sponsored terror plots and the like, which will make travel more difficult.
June 10, 2011
Lucene Revolution 2011
Materials from Lucene Revolution 2011 are now online.
I must admit that Searching The United States Code with Solr/Lucene caught my eye first. 😉
That presentation and the others are worth a close reading!
June 6, 2011
Apache Lucene 3.2 / Solr 3.2 released!
From the website:
Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/
Highlights of the Lucene release include:
- A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field
- A new IndexUpgrader tool fully converts an old index to the current format.
- A new Directory implementation, NRTCachingDirectory, caches small segments in RAM, to reduce the I/O load for applications with fast NRT reopen rates.
- A new Collector implementation, CachingCollector, is able to gather search hits (document IDs and optionally also scores) and then replay them. This is useful for Collectors that require two or more passes to produce results.
- Index a document block using IndexWriter’s new addDocuments or updateDocuments methods. These experimental APIs ensure that the block of documents will forever remain contiguous in the index, enabling interesting future features like grouping and joins.
- A new default merge policy, TieredMergePolicy, which is more efficient due to being able to merge non-contiguous segments. See http://s.apache.org/merging for details.
- NumericField is now returned correctly when you load a stored document (previously you received a normal Field back, with the numeric value converted to a string).
- Deleted terms are now applied during flushing to the newly flushed segment, which is more efficient than having to later initialize a reader for that segment.
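Of these, NRTCachingDirectory is the easiest to try since it just wraps an existing Directory. A minimal sketch; the path and size thresholds are example values:

```java
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class NRTCacheSketch {
    public static void main(String[] args) throws Exception {
        // Segments smaller than maxMergeSizeMB are cached in RAM (up to
        // maxCachedMB total), cutting disk I/O during frequent NRT reopens.
        Directory base = FSDirectory.open(new File("/tmp/index"));
        Directory dir = new NRTCachingDirectory(base, 5.0, 60.0);
        // ... hand 'dir' to an IndexWriter and reopen readers as usual ...
        dir.close();
    }
}
```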
Highlights of the Solr release include:
- Ability to specify overwrite and commitWithin as request parameters when using the JSON update format.
- TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
- DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString.
- Improvements to the UIMA and Carrot2 integrations.
- Highlighting performance improvements.
- A test-framework jar for easy testing of Solr extensions.
- Bugfixes and improvements from Apache Lucene 3.2.
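The TermQParserPlugin item is handy for feeding raw facet values back as filter queries without worrying about escaping. A sketch via SolrJ; the server URL, field name, and value are illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TermFilterSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("ipod");
        // {!term} bypasses query-parser analysis, so the raw term is
        // matched exactly as returned by faceting or the terms component.
        q.addFilterQuery("{!term f=manu_exact}Belkin, Inc.");
        System.out.println(solr.query(q).getResults().getNumFound());
    }
}
```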
June 3, 2011
IBM InfoSphere BigInsights
Two items stand out in the usual laundry list of “easy administration” and “IBM supports open source” claims:
The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.
….
Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components, such as Hadoop, Lucene, Hive, Pig, ZooKeeper, HBase, and Avro, to name a few.
I guess it must include a “few” things since the 64-bit Linux download is 398 MB.
Just pointing out its availability. More commentary to follow.
May 20, 2011
SIREn: Efficient semi-structured Information Retrieval for Lucene
From the announcement:
Efficient, large scale handling of semi-structured data (including RDF) is increasingly an important issue to many web and enterprise information reuse scenarios.
Querying graph structured data (RDF) is commonly achieved using specific solutions, called triplestores, typically based on DBMS backends. In Sindice we however needed something much more scalable than DBMS and with the desirable features of the typical Web Search engines: top-k query processing, real time updates, full text search, distributed indexes over shards, etc.
While Lucene has long offered these capabilities, it was not natively intended for large semi-structured document collections (or documents with very different schemas). For this reason we developed SIREn – Semantic Information Retrieval Engine – a Lucene plugin that overcomes these shortcomings and efficiently indexes and queries RDF, as well as any textual document with an arbitrary amount of metadata fields.
Given its general applicability, we are delighted to release SIREn under the Apache 2.0 open source license. We hope businesses will find SIREn useful in implementing solutions upon the Web of Data.
You can start by looking at the features, review the performance benchmarks, learn more by reading the short tutorial, and then download and try SIREn yourself.
This looks very cool!
Its tuple-processing capabilities in particular!
May 19, 2011
Duke 0.1 Release
Lars Marius Garshol on Duke 0.1:
Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.
Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.
The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.
If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.
I will look around for sample data files.
May 15, 2011
Luke 3.1
Luke is a development and diagnostic tool for use with Lucene.
Luke is now being numbered consistently with Lucene.
See my prior blog post on Luke.
April 27, 2011
Solr Result Grouping / Field Collapsing Improvements by Yonik Seeley
From the post:
I previously introduced Solr’s Result Grouping, also called Field Collapsing, that limits the number of documents shown for each “group”, normally defined as the unique values in a field or function query.
Since then, there have been a number of bug fixes, performance improvements, and feature enhancements. You’ll need a recent nightly build of Solr 4.0-dev, or the newly released LucidWorks Enterprise v1.6, our commercial version of Solr.
A short but useful article on new grouping capabilities in Solr.
What you do with results once they are grouped, which could include “merging,” is up to you.
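If you want to poke at grouping without reading the whole post, the request parameters are simple. A sketch with SolrJ against a 4.0-dev build; the field name is illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class GroupingSketch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("memory");
        q.set("group", true);               // enable result grouping
        q.set("group.field", "manu_exact"); // one group per unique field value
        q.set("group.limit", 3);            // top 3 documents per group
        System.out.println(solr.query(q));
    }
}
```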
Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr, posted by Mitchell Pronschinske.
From the post:
There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users. This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.
Excellent presentation if you need to make the case for Lucene/Solr, function by function.
April 20, 2011
Adopting Apache Hadoop in the Federal Government
Background:
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month— not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Thoughts on how to relate topic maps to technologies that already have their foot in the door?
April 15, 2011
Lucene Revolution
May 23 – 24, 2011 Training
May 25 – 26, 2011 Conference
San Francisco Airport Hyatt Regency
From the website:
Lucene Revolution 2011 is the largest conference dedicated to open source search. Lucene Revolution 2011 brings together developers, thought leaders, and market makers who understand that the search technology they’ve been looking for has arrived. This is an event that should not be missed by anyone who is using, or considering, Apache Lucene/Solr or LucidWorks Enterprise for their search applications.
You will get a chance to hear from a wide range of speakers, from the foremost experts on open source search technology to a broad cross-section of users that have implemented Lucene, Solr, or LucidWorks Enterprise to improve search application performance, scalability, flexibility, and relevance, while lowering their costs. The two-day conference agenda is packed with technical sessions, developer content, user case studies, panels, and networking opportunities. You will learn new ways to develop, deploy, and enhance search applications using Lucene/Solr — and LucidWorks Enterprise.
Preceding the conference there are two days of intensive hands-on training on Solr, Lucene, and LucidWorks Enterprise on May 23 and 24. Whether you are new to Lucene/Solr, want to brush up on your skills, or want to get new insights, tips & tricks, you will get the knowledge you need to be successful.
This could be very cool.
April 14, 2011
Deduplication
Lars Marius Garshol’s slides from an internal Bouvet conference on deduplication of data.
And, DUplicate KillEr, DUKE.
As Lars points out, people have been here before.
I am not sure I share Lars’ assessment of the current state of record linkage software.
Consider for example, FRIL – Fine-Grained Record Integration and Linkage Tool, which is described as:
FRIL is a FREE open source tool that enables fast and easy record linkage. The tool extends traditional record linkage tools with a richer set of parameters. Users may systematically and iteratively explore the optimal combination of parameter values to enhance linking performance and accuracy.
Key features of FRIL include:
- Rich set of user-tunable parameters
- Advanced features of schema/data reconciliation
- User-tunable search methods (e.g. sorted neighborhood method, blocking method, nested loop join)
- Transparent support for multi-core systems
- Support for parameters configuration
- Dynamic analysis of parameters
- And many, many more…
I haven’t used FRIL but do note that it has documentation, videos, etc. for user instruction.
I have reservations about record linkage in general, but those are concerns about re-use of semantic mappings and not record linkage per se.