Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 27, 2013

Apache Lucene and Solr 4.6.0!

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 11:37 am

Apache Lucene and Solr 4.6.0 are out!

From the announcement:

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html.

Both releases contain a number of bug fixes.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

While it is fair to say that “both releases contain a number of bug fixes,” I think that gives the wrong impression.

The Lucene 4.6.0 release has 23 new features versus 5 bug fixes, and Solr 4.6.0 has 17 new features versus 14 bug fixes. Closer, but 40 new features total versus 19 bug fixes sounds good to me! 😉

Just to whet your appetite for looking at the detailed change lists:

LUCENE-5294 Suggester Dictionary implementation that takes expressions as term weights

From the description:

It could be an extension of the existing DocumentDictionary (which takes terms, weights and (optionally) payloads from the stored documents in the index). The only exception being that instead of taking the weights for the terms from the specified weight fields, it could compute the weights using a user-defined expression that uses one or more NumericDocValuesField from the document.

Example:
let the document have

  • product_id
  • product_name
  • product_popularity
  • product_profit

Then this implementation could be used with an expression of “0.2*product_popularity + 0.8*product_profit” to determine the weights of the terms for the corresponding documents (optionally along with a payload (product_id))
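The proposed weighting is simple enough to sketch. Below is a minimal Python illustration of the idea (the actual feature would use Lucene's expression module and NumericDocValuesField; the field names follow the example above, and everything else is invented for illustration):

```python
# Sketch of expression-based suggester weights: each stored document's
# numeric fields are combined by a user-supplied expression to produce
# the term weight, with product_id carried along as the payload.

def suggestion_weight(doc, expression=lambda d: 0.2 * d["product_popularity"]
                                              + 0.8 * d["product_profit"]):
    """Compute the suggester weight for one stored document."""
    return expression(doc)

docs = [
    {"product_id": "p1", "product_name": "string trimmer",
     "product_popularity": 90.0, "product_profit": 10.0},
    {"product_id": "p2", "product_name": "lawn mower",
     "product_popularity": 50.0, "product_profit": 60.0},
]

# Build (term, weight, payload) entries, as a DocumentDictionary-like
# source would hand them to the suggester.
entries = [(d["product_name"], suggestion_weight(d), d["product_id"])
           for d in docs]
```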

You may remember I pointed out Mike McCandless’ blog post on this issue.

SOLR-5374 Support user configured doc-centric versioning rules

From the description:

The existing optimistic concurrency features of Solr can be very handy for ensuring that you are only updating/replacing the version of the doc you think you are updating/replacing, w/o the risk of someone else adding/removing the doc in the mean time – but I’ve recently encountered some situations where I really wanted to be able to let the client specify an arbitrary version, on a per document basis, (ie: generated by an external system, or perhaps a timestamp of when a file was last modified) and ensure that the corresponding document update was processed only if the “new” version is greater than the “old” version – w/o needing to check exactly which version is currently in Solr. (ie: If a client wants to index version 101 of a doc, that update should fail if version 102 is already in the index, but succeed if the currently indexed version is 99 – w/o the client needing to ask Solr what the current version is)

November 25, 2013

How-to: Index and Search Data with Hue’s Search App

Filed under: Hue,Indexing,Interface Research/Design,Solr — Patrick Durusau @ 4:32 pm

How-to: Index and Search Data with Hue’s Search App

From the post:

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Don’t be discouraged by the speed of the presenter in the video.

I suspect he is more than “familiar” with Hue, Solr and the Yelp dataset. 😉

Like all great “how-to” guides, this one delivers a very positive outcome.

A positive outcome with minimal effort may be essential reinforcement for new technologies.

November 24, 2013

Multi-term Synonym Mapping in Solr

Filed under: Solr,Synonymy — Patrick Durusau @ 2:08 pm

Why is Multi-term synonym mapping so hard in Solr? by John Berryman.

From the post:

There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:

  • Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
  • Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
  • Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that he also finds documents that happen to only mention “myocardial infarction”.

One would hope that working with synonyms should be as simple as tossing a set of synonyms into the synonyms.txt file and just having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = televisions), synonyms really are just that straightforward. Unfortunately, especially as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes, there are workarounds. And sometimes, for now at least, you’ll just have to make do with what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, we’ll walk through some of the pain points, and then we’ll propose possible resolutions.

John does a great review of basic synonym mapping in Solr as a prelude to illustrating the difficulty with multi-term synonyms.

His example case is the mapping:

spider man ==> spiderman

“Obvious” solutions fail but John does conclude with a pointer to one solution to the issue.
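To see why the “obvious” solutions fail, here is a toy Python sketch of multi-term synonym analysis. The gotcha John describes is that the query parser splits the query on whitespace before analysis, so the synonym filter sees each word alone and a rule spanning two tokens can never fire:

```python
# Toy multi-term synonym filter: replace a matching run of tokens with
# its synonym. This is a simplification, not Lucene's SynonymFilter.

synonyms = {("spider", "man"): ["spiderman"]}

def analyze(tokens):
    """Apply multi-term synonyms to a token stream (greedy, simplified)."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in synonyms:
            out.extend(synonyms[tuple(tokens[i:i + 2])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Analyzing the whole phrase works:
analyze(["spider", "man"])                    # -> ["spiderman"]

# But a query parser that pre-splits on whitespace analyzes each word
# in isolation, so the two-token rule can never match:
[analyze([w]) for w in "spider man".split()]  # -> [["spider"], ["man"]]
```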

Recommended for a deeper understanding of Solr’s handling of synonymy.

While reading John’s post it occurred to me to check with Wikipedia on disambiguation of the term “spider.”

  • Comics – 17
  • Other publications – 5
  • Culinary – 3
  • Film and television – 10
  • Games and sports – 10
  • Land vehicles – 4
  • Mathematics – 1
  • Music – 16
  • People – 7
  • Technology – 14
  • Other uses – 7

I count eighty-eight (88) distinct “spiders” (counting spider as “an air-breathing eight-legged animal,” of which there are 44,032 species identified as of June 23, 2013).

John suggests a parsing solution for the multi-term synonym problem in Solr, but however “spider” is parsed, there remains ambiguity.

An 88-fold ambiguity (at minimum).

At least for Solr and other search engines.

Not so much for us as human readers.

A human reader is not limited to “spider” in deciding which of 88 possible spiders is the correct one and/or the appropriate synonyms to use.

Each “spider” is seen in a “context” and a human reader will attribute (perhaps not consciously) characteristics to a particular “spider” in order to identify it.

If we record characteristics for each “spider,” then distinguishing and matching spiders to synonyms (also with characteristics) becomes a task of:

  1. Deciding which characteristic(s) to require for identification/synonymy.
  2. Fashioning rules for identification/synonymy.
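Those two tasks can be sketched in a few lines. A toy Python illustration, with invented characteristics, of requiring a characteristic before treating two “spider” occurrences as the same subject:

```python
# Task 1: decide which characteristics are required for identity.
# Task 2: a rule that two occurrences co-refer only when every required
# characteristic is present and matches. All names are illustrative.

required = {"order"}

def same_subject(a, b):
    """Identity rule over recorded characteristics."""
    return all(a.get(k) is not None and a.get(k) == b.get(k)
               for k in required)

arachnid = {"label": "spider", "order": "Araneae"}
comic    = {"label": "spider", "order": None, "genre": "comics"}
wolf     = {"label": "wolf spider", "order": "Araneae"}

same_subject(arachnid, wolf)    # True: both record order Araneae
same_subject(arachnid, comic)   # False: characteristic missing
```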

Much can be said about those two tasks but for now, I will leave you with a practical example of their application.

Assume that you are indexing some portion of web space and you encounter The World Spider Catalog, Version 14.0.

We know that every one of the 136 instances of “spider” at that site has the characteristic of order Araneae. How you wish to associate that with every instance of “spider” or other names from the spider database is an implementation issue.

However, knowing “order Araneae” allows us to reliably distinguish all the instances of “spider” at this resource from other instances of “spider” that lack that characteristic.

Just as importantly, we only have to perform that task once, rather than relying on our users to perform it over and over again.

The weakness of current indexing is that it harvests only the surface text and not the rich semantic soil in which it grows.

November 20, 2013

Dublin Lucene Revolution 2013 (videos/slides)

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 7:46 pm

Dublin Lucene Revolution 2013 (slides/presentations)

I had confidence that LuceneRevolution wouldn’t abandon non-football fans in the U.S. over Thanksgiving or Black Friday!

My faith has been vindicated!

I’ll create a sorted list of the presentations by author and title, to post here tomorrow.

In the meantime, I wanted to relieve your worry about endless hours of sports or shopping next week. 😉

R and Solr Integration…

Filed under: R,Solr — Patrick Durusau @ 5:41 pm

R and Solr Integration Using Solr’s REST APIs by Jitender Aswani.

From the post:

Solr is the most popular, fast and reliable open source enterprise search platform from the Apache Lucene project. Among many other features, we love its powerful full-text search, hit highlighting, faceted search, and near real-time indexing. Solr powers the search and navigation features of many of the world’s largest internet sites. Solr, written in Java, uses the Lucene Java search library for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language including R.

We invested a significant amount of time integrating our R-based data-management platform with Solr using an HTTP/JSON-based REST interface. This integration allowed us to index millions of data-sets in Solr in real time as these data-sets get processed by R. It took us a few days to stabilize and optimize this approach and we are very proud to share this approach and source code with you. The full source code can be found and downloaded from datadolph.in’s git repository.

The script has R functions for:

  • querying Solr and returning matching docs
  • posting a document to solr (taking a list and converting it to JSON before posting it)
  • deleting all indexes, deleting indexes for a certain document type and for a certain category within document type
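The post's functions are written in R, but the shape of the integration is plain from Solr's REST/JSON interface. Here is a comparable sketch of the query half in Python; the host, core name and field names are illustrative, and only Solr's standard JSON response format is assumed:

```python
# Sketch of querying Solr over its REST API: build the /select URL,
# then pull the matching documents out of the JSON response. A real
# integration would fetch the URL; here we parse a canned reply.

import json
from urllib.parse import urlencode

def solr_query_url(base, q, rows=10):
    params = urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"

def matching_docs(response_body):
    """Extract the matching documents from a Solr JSON response."""
    return json.loads(response_body)["response"]["docs"]

url = solr_query_url("http://localhost:8983/solr/datasets",
                     "category:finance")
canned = '{"response": {"numFound": 1, "docs": [{"id": "ds-1"}]}}'
matching_docs(canned)   # -> [{"id": "ds-1"}]
```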

Integration across systems is the lifeblood of enterprise IT systems.

I was extolling the virtues of reaching across silos earlier today.

A silo may provide comfort but it doesn’t offer much room for growth.

Or to put it another way, semantic integration doesn’t have one path, one process or one technology.

Once you’re past that, the rest is a question of requirements, resources and understanding identity in your domain (and/or across domains).

November 18, 2013

Solr Query Parsing

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

Solr Query Parsing by Erik Hatcher.

From the description:

Interpreting what the user meant and what they ideally would like to find is tricky business. This talk will cover useful tips and tricks to better leverage and extend Solr’s analysis and query parsing capabilities to more richly parse and interpret user queries.

It may just be me, but does it seem like Solr presentations hit the ground assuming you have a background in the subject at hand?

I won’t name names or topics, but when presentations start off with the same basics as any number of other talks, it’s hard to stay interested.

That’s not the case here, even judging just from the slides of Erik’s presentation!

Highly recommended!

November 17, 2013

Spelling isn’t a subject…

Filed under: Lucene,Solr — Patrick Durusau @ 8:39 pm

Have you seen Alec Baldwin’s teacher commercial?

A student suggests spelling as a subject and Alec responds: “Spelling isn’t a subject, spell-check, that’s a program, right?”

In Spellchecking in Trovit by Xavier Sanchez Loro, you will find that spell-check is more than a “program.”

Especially in a multi-language environment where the goal isn’t just correct spelling but delivery of relevant information to users.

From the post:

This post aims to explain the implementation and use case for spellchecking in the Trovit search engine that we will be presenting at the Lucene/Solr Revolution EU 2013 [1]. Trovit [2] is a classified ads search engine supporting several different sites, one for each country and vertical. Our search engine supports multiple indexes in multiple languages, each with several millions of indexed ads. Those indexes are segmented in several different sites depending on the type of ads (homes, cars, rentals, products, jobs and deals). We have developed a multi-language spellchecking system using SOLR [3] and Lucene [4] in order to help our users to better find the desired ads and to avoid the dreaded 0 results as much as possible (obviously, whilst still reporting back relevant information to the user). As such, our goal is not pure orthographic correction, but also to suggest correct searches for a certain site.

Our approach: Contextual Spellchecking

One key element in the spellchecking process is choosing the right dictionary, one with a relevant vocabulary for the type of information included in each site. Our approach is specializing the dictionaries based on user’s search context. Our search contexts are composed of country (with a default language) and vertical (determining the type of ads and vocabulary). Each site’s document corpus has a limited vocabulary, reduced to the type of information, language and terms included in each site’s ads. Using a more generalized approach is not suitable for our needs, since a unique vocabulary for each language (regardless of the vertical) is not as precise as specialized vocabularies for each language and vertical. We have observed drastic differences in the type of terms included in the indexes and the semantics of each vertical. Terms that are relevant in one context are meaningless in another one (e.g. “chalet” is not a relevant word in cars vertical, but is a highly relevant word for homes vertical). As such, Trovit’s spellchecking implementation exhibits very different vocabularies for each site, even when supporting the same language.
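The contextual approach is straightforward to sketch: keep one vocabulary per (country, vertical) site and suggest only from that vocabulary. A toy Python illustration with invented site keys and vocabularies (the matching here is stdlib edit-distance, not Trovit's actual spellchecker):

```python
# Contextual spellchecking sketch: the dictionary is chosen by the
# user's search context, so "chalet" is suggestible on a homes site
# but not on a cars site, even in the same language.

from difflib import get_close_matches

dictionaries = {
    ("es", "homes"): ["chalet", "apartamento", "terraza"],
    ("es", "cars"):  ["sedan", "diesel", "descapotable"],
}

def suggest(term, country, vertical):
    vocab = dictionaries.get((country, vertical), [])
    return get_close_matches(term, vocab, n=1)

suggest("chalett", "es", "homes")  # -> ["chalet"]
suggest("chalett", "es", "cars")   # -> [] ("chalet" is meaningless here)
```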

I like the emphasis on “contextual” spellchecking.

Sounds a lot like “contextual” subject recognition.

Yes?

Walking through this post in detail is an excellent exercise!

November 15, 2013

Shrinking the Haystack with Solr and NLP

Filed under: BigData,Natural Language Processing,Solr — Patrick Durusau @ 8:39 pm

Shrinking the Haystack with Solr and NLP by Wes Caldwell.

A very high level view of using Solr and NLP to shrink a data haystack, but a useful one nonetheless.

If you think of this in the context of Chuck Hollis’ “modest data,” you begin to realize that the inputs may be “big data” but to be useful to a human analyst, it needs to be pared down to “modest data.”

Or even further to “actionable data.”

There’s an interesting contrast: Big data vs. Actionable data.

Ask your analysts whether they prefer five terabytes of raw data or five pages of actionable data.

Adjust your deliverables accordingly.

November 12, 2013

Using Solr to Search and Analyze Logs

Filed under: Hadoop,Log Analysis,logstash,Lucene,Solr — Patrick Durusau @ 4:07 pm

Using Solr to Search and Analyze Logs by Radu Gheorghe.

From the description:

Since we’ve added Solr output for Logstash, indexing logs via Logstash has become a possibility. But what if you are not using (only) Logstash? Are there other ways you can index logs in Solr? Oh yeah, there are! The following slides are from the Lucene Revolution conference that just took place in Dublin, where we talked about indexing and searching logs with Solr.

Slides only, but a very good set of slides.

Radu’s post reminds me I overlooked logs in the Hadoop ecosystem when describing semantic diversity (Hadoop Ecosystem Configuration Woes?).

Or for that matter, how do you link up the logs with particular configuration or job settings?

Emails to the support desk and sticky notes don’t seem equal to the occasion.

November 5, 2013

Email Indexing Using Cloudera Search and HBase

Filed under: Cloudera,HBase,Solr — Patrick Durusau @ 6:38 pm

Email Indexing Using Cloudera Search and HBase by Jeff Shmain.

From the post:

In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)

Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:

  • Is HBase an optimal storage medium for the given use case?
  • Is the data already ingested into HBase?
  • Is there any access pattern that will require the files to be stored in a format other than HFiles?
  • If HBase is not currently running, will there be enough hardware resources to bring it up?

There are two ways to configure Cloudera Search to index documents stored in HBase: to alter the configuration files directly and start Lily HBase Indexer manually or as a service, or to configure everything using Cloudera Manager. This post will focus on the latter, because it is by far the easiest way to enable Search on HBase — or any other service on CDH, for that matter.

This rocks!

Including the reminder to fit the solution to your requirements, not the other way around.

The phrase “…near real time…” reminds me that HBase can operate in “…near real time…” but no analyst using HBase can.

Think about it. A search result comes back, the analyst reads it, perhaps compares it to their memory of other results and/or looks for other results to make the comparison. Then the analyst has to decide what if anything the results mean in a particular context and then communicate those results to others or take action based on those results.

That doesn’t sound even close to “…near real time…” to me.

You?

October 25, 2013

Collection Aliasing:…

Filed under: BigData,Cloudera,Lucene,Solr — Patrick Durusau @ 7:29 pm

Collection Aliasing: Near Real-Time Search for Really Big Data by Mark Miller.

From the post:

The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.

Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.

We wanted to make sure that one simple scaling strategy that has been commonly used in the past for large amounts of time-series data would be fairly simple to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on — data that is continuously being generated and is easily associated with a current timestamp.

One of the keys to this strategy is a feature that Cloudera recently contributed to Solr: collection aliasing. The approach involves using collection aliases to juggle collections in a very scalable little “dance.” The architecture has some limitations, but for the right use cases, it’s an extremely scalable option. I also think there are some areas of the dance that we can still add value to, but you can already do quite a bit with the current functionality.
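The "dance" itself can be sketched abstractly: create a collection per time period, re-point a read alias at the recent ones, and drop old collections cheaply. A toy Python sketch of that bookkeeping (Solr's Collections API does the real work; this only illustrates the idea):

```python
# Collection aliasing sketch for time-series data: the "logs_recent"
# alias always covers the newest collections, so queries never change
# even as collections roll over.

collections = ["logs_2013_10", "logs_2013_11"]
aliases = {}

def create_alias(name, members):
    # Stands in for Solr's CREATEALIAS call, which maps an alias
    # name to a list of collections.
    aliases[name] = list(members)

def roll(new_collection, keep=2):
    """Add the next period's collection and re-point the read alias."""
    collections.append(new_collection)
    create_alias("logs_recent", collections[-keep:])

roll("logs_2013_12")
aliases["logs_recent"]   # -> ["logs_2013_11", "logs_2013_12"]
```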

A great post if you have really big data. 😉

Seriously, it is a great post and introduction to collection aliases.

On the other hand, I do wonder what routine Abbott and Costello would do with the variations on big, bigger, really big, etc., data.

Suggestions welcome!

From Text to Truth:…

Filed under: Facets,Solr — Patrick Durusau @ 6:50 pm

From Text to Truth: Real-World Facets for Multilingual Search by Benson Margulies.

Description:

Solr’s ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn’t want a “George Bush” facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for “George W. Bush” or even “乔治·沃克·布什” (a Chinese translation) that are limited to just one string. We’ll explore the benefits and challenges of empowering Solr users with real-world facets.

One of the better conference presentations I have seen in quite some time.

This is likely to change your mind about how you think about facets. Or at least how to construct them.

If you think of facets as the decoration you see at ecommerce sites, think again.

Enjoy!

Apache Lucene and Solr 4.5.1 (bugfix)

Filed under: Lucene,Solr — Patrick Durusau @ 8:04 am

Apache Lucene and Solr 4.5.1

From the post:

Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and the Apache Solr search server, numbered 4.5.1. This is a minor bugfix release.

Apache Lucene 4.5.1 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.5.1 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Release note for Apache Lucene 4.5.1 can be found at: http://wiki.apache.org/lucene-java/ReleaseNote451, Solr release notes can be found at: http://wiki.apache.org/solr/ReleaseNote451.

Without a “tech surge” no less.

October 22, 2013

Solr-RA – Solr With RankingAlgorithm WARNING!

Filed under: Solr — Patrick Durusau @ 7:15 pm

Solr-RA – Solr With RankingAlgorithm WARNING!

Google has become very aggressive with search completions, and while I was trying to search for “solr,” Google completed it to Solr-RA.

On the homepage you will see:

Downloads (it is free) (emphasis added)

But there is also a rather prominent Software Agreement.

I can blow by EULAs with the best of them but I think you may want to read this license before you hit download. Seriously.

I am not going to give you legal advice on what I think it says or does not say.

All I want to do is give you a heads up to read the license for yourself. Then decide on your course of action.

I have no problems with commercial software.

But I do prefer to see “trial” or “limited license,” or some other notice of pending obligation on my part.

Please forward this about.

October 20, 2013

Crawl Anywhere

Filed under: Search Engines,Search Interface,Solr,Webcrawler — Patrick Durusau @ 5:59 pm

Crawl Anywhere 4.0.0-release-candidate available

From the Overview:

What is Crawl Anywhere?

Crawl Anywhere allows you to build vertical search engines. Crawl Anywhere includes:

  • a Web Crawler with a powerful Web user interface
  • a document processing pipeline
  • a Solr indexer
  • a full featured and customizable search application

You can see a typical use of all the components in this diagram.

Why was Crawl Anywhere created?

Crawl Anywhere was originally developed to index 5,400 web sites (more than 10,000,000 pages) in Apache Solr for the Hurisearch search engine: http://www.hurisearch.org/. During this project, various crawlers were evaluated (heritrix, nutch, …) but one key feature was missing: a user-friendly web interface to manage the Web sites to be crawled, with their specific crawl rules. Mainly for this reason, we decided to develop our own Web crawler. Why did we choose the name "Crawl Anywhere"? The name may appear a little overstated, but crawling any source type (Web, database, CMS, …) is a real objective, and Crawl Anywhere was designed to make it easy to implement new source connectors.

Can you create a better search corpus for some domain X than Google?

Less noise and trash?

More high quality content?

Cross referencing? (Not more like this but meaningful cross-references.)

There is only one way to find out!

Crawl Anywhere will help you with the technical side of creating a search corpus.

What it won’t help with is developing the strategy to build and maintain such a corpus.

Interested in how you go beyond creating a subject specific list of resources?

A list that leaves a reader to sort through the chaff. Time and time again.

Pointers, suggestions, comments?

October 10, 2013

Apache Lucene: Then and Now

Filed under: Lucene,Solr,SolrCloud — Patrick Durusau @ 3:06 pm

Apache Lucene: Then and Now by Doug Cutting.

From the description at Washington DC Hadoop Users Group:

Doug Cutting originally wrote Lucene in 1997-8. It joined the Apache Software Foundation’s Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has merged into the Lucene project itself and Mahout, Nutch, and Tika have moved to become independent top-level projects. While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene’s logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.

In today’s discussion, Doug will share background on the impetus and creation of Lucene. He will talk about the evolution of the project and explain what the core technology has enabled today. Doug will also share his thoughts on what the future holds for Lucene and Solr.

Interesting walk down history lane with the creator of Lucene, Doug Cutting.
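The "document containing fields of text" idea at Lucene's core can be illustrated with a toy inverted index in Python. This is a sketch of the model, not Lucene's actual data structures: as long as text can be extracted from a format, it can be indexed the same way.

```python
# Toy inverted index over documents with named fields of text:
# (field, term) -> set of document ids containing that term.

from collections import defaultdict

index = defaultdict(set)

def add_document(doc_id, fields):
    for field, text in fields.items():
        for term in text.lower().split():
            index[(field, term)].add(doc_id)

add_document(1, {"title": "Lucene in Action", "body": "full text search"})
add_document(2, {"title": "Solr cookbook", "body": "faceted search"})

sorted(index[("body", "search")])   # -> [1, 2]
sorted(index[("title", "lucene")])  # -> [1]
```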

October 9, 2013

Global Biodiversity Information Facility

Filed under: Biodiversity,Biology,PostgreSQL,Solr — Patrick Durusau @ 7:10 pm

Global Biodiversity Information Facility

Some stats:

417,165,184 occurrences

1,426,888 species

11,976 data sets

578 data publishers

What lies at the technical heart of this beast?

Would you believe a PostgreSQL database and an embedded Apache Solr index?

Start with the Summary of the GBIF infrastructure. The details on PostgreSQL and Solr are under the Registry tab.

BTW, the system recognizes multiple identification systems and more are to be added.

Need to read more of the documents on that part of the system.

October 8, 2013

Quepid [Topic Map Tuning?]

Filed under: Recommendation,Searching,Solr — Patrick Durusau @ 4:36 pm

Measure and Improve Search Quality with Quepid by Doug Turnbull.

From the post:

Let’s face it, returning good search results means making money. To this end, we’re often hired to tune search to ensure that search results are as close as possible to the intent of a user’s search query. Matching users’ intent to results, what we call “relevancy,” is what gets us up in the morning. It’s what drives us to think hard about the dark mysteries of tuning Solr or machine-learning topics such as recommendation-based product search.

While we can do amazing feats of wizardry to make individual improvements, it’s impossible with today’s tools to do much more than prove that one problem has been solved. Search engines rank results based on a single set of rules. This single set of rules is in charge of how all searches are ranked. It’s very likely that even as we solve one problem by modifying those rules, we create another problem — or dozens of them, perhaps far more devastating than the original problem we solved.

Quepid is our instant search quality testing product. Born out of our years of experience tuning search, Quepid has become our go-to tool for relevancy problems. Built around the idea of Test Driven Relevancy, Quepid allows the search developer to collaborate with product and content experts to

  1. Identify, store, and execute important queries
  2. Provide statistics/rankings that measure the quality of a search query
  3. Tune search relevancy
  4. Immediately visualize the impact of tuning on queries
  5. Rinse & Repeat Instantly

The result is a tool that empowers search developers to experiment with the impact of changes across the search experience and prove to their bosses that nothing broke. Confident that data will prove or disprove their ideas instantly, developers are even freer to experiment than they might ever have before.
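The test-driven-relevancy loop is easy to sketch: store judged queries, score each tuning pass, and compare. A toy Python illustration using plain precision@k (Quepid's own metrics and storage are surely richer; all names here are invented):

```python
# Test-driven relevancy sketch: each stored case pairs an important
# query with its judged relevant documents; score_engine grades a
# search function against every case.

cases = {
    "weed whacker": {"relevant": {"p-trimmer-1", "p-trimmer-2"}},
}

def precision_at_k(results, relevant, k=3):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def score_engine(search_fn, k=3):
    return {q: precision_at_k(search_fn(q), c["relevant"], k)
            for q, c in cases.items()}

# Two tuning passes of a pretend engine, scored the same way:
before = score_engine(lambda q: ["p-mower-9", "p-trimmer-1", "p-hose-4"])
after  = score_engine(lambda q: ["p-trimmer-1", "p-trimmer-2", "p-hose-4"])
# before: 1 of the top 3 relevant; after: 2 of 3. The change helped.
```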

Any thoughts on automating a similar cycle to test the adding of subjects to a topic map?

Or adding subject identifiers that would trigger additional merging?

Or just reporting the merging over and above what was already present?

Search-Aware Product Recommendation in Solr (Users vs. Experts?)

Filed under: Interface Research/Design,Recommendation,Searching,Solr — Patrick Durusau @ 10:43 am

Search-Aware Product Recommendation in Solr by John Berryman.

From the post:

Building upon earlier work with semantic search, OpenSource Connections is excited to unveil exciting new possibilities with Solr-based product recommendation. With this technology, it is now possible to serve user-specific, search-aware product recommendations directly from Solr.

In this post, we will review a simple Search-Aware Recommendation using an online grocery service as an example of e-commerce product recommendation. In this example I have built up a basic keyword search over the product catalog. We’ve also added two fields to Solr: purchasedByTheseUsers and recommendToTheseUsers. Both fields contain lists of userIds. Recall that each document in the index corresponds to a product. Thus the purchasedByTheseUsers field literally lists all of the users who have purchased said product. The next field, recommendToTheseUsers, is the special sauce. This field lists all users who might want to purchase the corresponding product. We have extracted this field using a process called collaborative filtering, which is described in my previous post, Semantic Search With Solr And Python Numpy. With collaborative filtering, we make product recommendation by mathematically identifying similar users (based on products purchased) and then providing recommendations based upon the items that these users have purchased.

Now that the background has been established, let’s look at the results. Here we search for 3 different products using two different, randomly-selected users who we will refer to as Wendy and Dave. For each product: We first perform a raw search to gather a base understanding about how the search performs against user queries. We then search for the intersection of these search results and the products recommended to Wendy. Finally we also search for the intersection of these search results and the products recommended to Dave.
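The intersection John describes can be sketched without Solr: run the keyword search, then keep only the products whose recommendToTheseUsers list contains the current user (in Solr this would be the keyword query plus a filter query on that field). A toy Python illustration with invented products and users:

```python
# Search-aware recommendation sketch: intersect keyword matches with
# the per-product set of users the collaborative filter recommends to.

catalog = {
    "granola-1": {"name": "honey granola", "recommend_to": {"wendy"}},
    "granola-2": {"name": "plain granola", "recommend_to": {"dave"}},
    "cereal-1":  {"name": "corn cereal",   "recommend_to": {"wendy", "dave"}},
}

def search(keyword):
    """Raw keyword search over product names."""
    return {pid for pid, p in catalog.items() if keyword in p["name"]}

def search_for_user(keyword, user):
    """Keyword matches restricted to products recommended to `user`."""
    return {pid for pid in search(keyword)
            if user in catalog[pid]["recommend_to"]}

search("granola")                    # -> {"granola-1", "granola-2"}
search_for_user("granola", "wendy")  # -> {"granola-1"}
```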

BTW, don’t miss the invitation to be an alpha tester for Solr Search-Aware Product Recommendation at the end of John’s post.

Reading John’s post, it occurred to me that as an alternative to mining other users’ choices, you could have an expert develop the recommendations.

Much like we use experts to develop library classification systems.

But we don’t, do we?

Isn’t that interesting?

I suspect we don’t use experts for product recommendations because we know that shopping choices depend on a similarity between consumers.

We may not know what the precise nature of the similarity may be, but it is sufficient that we can establish its existence in the aggregate and sell more products based upon it.

Shouldn’t the same be true for finding information or data?

If similar (in some possibly unknown way) consumers of information find information in similar ways, why don’t we organize information based on similar patterns of finding?

How an “expert” finds information may be more “precise” or “accurate,” but if a user doesn’t follow that path, the user doesn’t find the information.

A great path that doesn’t help users find information is like a great road with sidewalks, a bike path, crosswalks, and good signage that goes nowhere.

How do you incorporate user paths in your topic map application?

October 7, 2013

Webinar: Turbo-Charging Solr

Filed under: Entity Resolution,Lucene,LucidWorks,Relevance,Solr — Patrick Durusau @ 10:40 am

Turbo-charge your Solr instance with Entity Recognition, Business Rules and a Relevancy Workbench by Yann Yu.

Date: Thursday, October 17, 2013
Time: 10:00am Pacific Time

From the post:

LucidWorks has three new modules available in the Solr Marketplace that run on top of your existing Solr or LucidWorks Search instance. Join us for an overview of each module and learn how implementing one, two or all three will turbo-charge your Solr instance.

  • Business Rules Engine: Out of the box integration with Drools, the popular open-source business rules engine is now available for Solr and LucidWorks Search. With the LucidWorks Business Rules module, developers can write complex rules using declarative syntax with very little programming. Data can be modified, cleaned and enriched through multiple permutations and combinations.
  • Relevancy Workbench: Experiment with different search parameters to understand the impact of these changes to search results. With intuitive, color-coded and side-by-side comparisons of results for different sets of parameters, users can quickly tune their application to produce the results they need. The Relevancy Workbench encourages experimentation with a visual “before and after” view of the results of parameter changes.
  • Entity Recognition: Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc.

All of these modules will be of interest to topic mappers who are processing bulk data.

October 6, 2013

Apache Solr 4.5 documentation

Filed under: Lucene,Solr — Patrick Durusau @ 4:22 pm

Apache Solr 4.5 documentation

From the post:

Apache Solr PMC announced that the newest version of official Apache Solr documentation for Solr 4.5 (more about that version) is now available. The PDF file with documentation is available at: https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/.

If Apache Solr 4.5 was welcome news, this is even more so!

I am doing a lot of proofing of drafts (not by me) this week. Always refreshing to have alternative reading material that doesn’t make me wince.

That’s unfair. To the Apache Solr Reference Manual.

It is way better than simply not making me wince.

I am sure I will find things I would state differently, but I feel confident I won’t encounter the writing errors we have been encouraged to avoid since grade school.

I won’t go into the details as someone might mistake description for recommendation. 😉

Enjoy the Apache Solr 4.5 documentation!

October 5, 2013

Apache Lucene 4.5 and Apache SolrTM 4.5 available

Filed under: Lucene,Solr — Patrick Durusau @ 7:07 pm

From: Apache Lucene News:

The Lucene PMC is pleased to announce the availability
of Apache Lucene 4.5 and Apache Solr 4.5.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • Added support for missing values to DocValues fields through AtomicReader.getDocsWithField.
  • Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues, supporting missing values and with most data structures residing off-heap.
  • New in-memory DocIdSet implementations which are especially better than FixedBitSet on small sets: WAH8DocIdSet, PFORDeltaDocIdSet and EliasFanoDocIdSet.
  • CachingWrapperFilter now caches filters with WAH8DocIdSet by default, which has the same memory usage as FixedBitSet in the worst case but is smaller and faster on small sets.
  • TokenStreams now set the position increment in end(), so we can handle trailing holes.
  • IndexWriter no longer clones the given IndexWriterConfig.

Lucene 4.5 also includes numerous optimizations and bugfixes.

Highlights of the Solr release include:

  • Custom sharding support, including the ability to shard by field.
  • DocValue improvements: single valued fields no longer require a default value, allowing dynamicFields to contain doc values, as well as sortMissingFirst and sortMissingLast on docValue fields.
  • Ability to store solr.xml in ZooKeeper.
  • Multithreaded faceting.
  • CloudSolrServer can now route updates directly to the appropriate shard leader.

Solr 4.5 also includes numerous optimizations and bugfixes.

Excellent!

October 3, 2013

Easy k-NN Document Classification with Solr and Python

Filed under: K-Nearest-Neighbors,Python,Solr — Patrick Durusau @ 7:02 pm

Easy k-NN Document Classification with Solr and Python by John Berryman.

From the post:

You’ve got a problem: You have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However you are fortunate enough to have several thousand documents that have already been tagged. So why not…

Build a k-Nearest Neighbors Classifier!

The concept of a k-NN document classifier is actually quite simple. Basically, given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have the same tag as that which was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any opensource products that are really good at similarity-based document retrieval? That’s right, Solr! Basically, given a new input document, all we have to do is scoop out the “statistically interesting” terms, submit a search composed of these terms, and count the tags that come back. And it even turns out that Solr takes care of identifying the “statistically interesting” terms. All we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts “Goldilocks” terms – those terms that are not too long, not too short, not too common, and not too rare… they’re all just right.

I don’t know how timely John’s post is for you but it is very timely for me. 😉

I was being asked yesterday about devising a rough cut over a body of texts.

Looking forward to putting this approach through its paces.
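The voting step John describes — take the k most similar documents, tally their tags, pick the winner — is only a few lines once MoreLikeThis has returned the neighbors. A minimal sketch, assuming the MLT request has already been made and returned a list of (tags, score) pairs (the data here is invented for illustration):

```python
from collections import Counter

# k-NN voting over MoreLikeThis results. In practice `neighbors` would
# come from Solr's /mlt handler (tags and scores pulled from the
# response); here it is hand-built sample data.

def knn_classify(neighbors, k=10):
    """Return the most common tag among the top-k most similar docs,
    or None if there are no neighbors."""
    votes = Counter()
    for tags, _score in sorted(neighbors, key=lambda n: -n[1])[:k]:
        votes.update(tags)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

neighbors = [
    (["solr", "search"], 4.2),  # most similar tagged document
    (["solr"], 3.9),
    (["python"], 1.1),          # excluded when k=2
]
print(knn_classify(neighbors, k=2))  # solr
```

Weighting votes by the MLT similarity score, rather than counting each neighbor equally, is a common refinement.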

Dublin Lucene Revolution 2013 Sessions

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:45 pm

Dublin Lucene Revolution 2013 Sessions

Just a sampling to whet your appetite:

With many more entries in the intermediate and introductory levels.

Of all of the listed sessions, which ones will set your sights on Dublin?

Reminder: Training: November 4-5, Conference: November 6-7

September 30, 2013

Classifying Non-Patent Literature…

Filed under: Classification,Natural Language Processing,Patents,Solr — Patrick Durusau @ 6:29 pm

Classifying Non-Patent Literature To Aid In Prior Art Searches by John Berryman.

From the post:

Before a patent can be granted, it must be proven beyond a reasonable doubt that the innovation outlined by the patent application is indeed novel. Similarly, when defending one’s own intellectual property against a non-practicing entity (NPE – also known as a patent troll) one often attempts to prove that the patent held by the accuser is invalid by showing that relevant prior art already exists and that their patent is actually not that novel.

Finding Prior Art

So where does one get ahold of pertinent prior art? The most obvious place to look is in the text of earlier patents grants. If you can identify a set of reasonably related grants that covers the claims of the patent in question, then the patent may not be valid. In fact, if you are considering the validity of a patent application, then reviewing existing patents is certainly the first approach you should take. However, if you’re using this route to identify prior art for a patent held by an NPE, then you may be fighting an uphill battle. Consider that a very bright patent examiner has already taken this approach, and after an in-depth examination process, having found no relevant prior art, the patent office granted the very patent that you seek to invalidate.

But there is hope. For a patent to be granted, it must not only be novel among the roughly 10 million US patents that currently exist, but it must also be novel among all published media prior to the application date – so called non-patent literature (NPL). This includes conference proceedings, academic articles, weblogs, or even YouTube videos. And if anyone – including the applicant themselves – publicly discloses information critical to their patent’s claims, then the patent may be rendered invalid. As a corollary, if you are looking to invalidate a patent, then looking for prior art in non-patent literature is a good idea! While tools are available to systematically search through patent grants, it is much more difficult to search through NPL. And if the patent in question truly is not novel, then evidence must surely exist – if only you knew where to look.

More suggestions than solutions but good suggestions, such as these, are hard to come by.

John suggests using existing patents and their classifications as a learning set to classify non-patent literature.

Interesting but patent language is highly stylized and quite unlike the descriptions you encounter in non-patent literature.

It would be an interesting experiment to take some subset of patents and their classifications along with a set of non-patent literature, known to describe the same “inventions” covered by the patents.
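The experiment could start very simply: treat each patent classification code as a class, build a term centroid per class from patent text, and see how well non-patent documents land in the right class despite the stylized patent language. A toy sketch, with invented codes and texts (a real run would use patent abstracts and NPL known to describe the same inventions):

```python
import math
from collections import Counter, defaultdict

# Nearest-centroid classification of non-patent literature using
# patents as the training set. Codes and texts below are made up
# for illustration.

def bow(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled_docs):
    """labeled_docs: [(text, class_code)] -> {code: centroid Counter}."""
    centroids = defaultdict(Counter)
    for text, code in labeled_docs:
        centroids[code].update(bow(text))
    return centroids

def classify(text, centroids):
    vec = bow(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

patents = [
    ("antenna array beam forming apparatus", "H01Q"),
    ("method of culturing stem cells in vitro", "C12N"),
]
centroids = train_centroids(patents)
print(classify("we describe beam forming with a phased antenna array",
               centroids))  # H01Q
```

The interesting question the experiment would answer is how far the vocabulary gap between patent claims and ordinary technical prose degrades this — which is exactly where a shared learning set of matched patent/NPL pairs would help.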

Suggestions for subject areas?

September 28, 2013

Language support and linguistics

Filed under: Language,Lucene,Solr — Patrick Durusau @ 7:31 pm

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system by Gaute Lambertsen and Christian Moen.

Slides from Lucene Revolution May, 2013.

Good overview of language support and linguistics in both Lucene and Solr.

A few fewer language examples at the beginning would shorten the slide deck from its current count of one hundred and fifty-one (151) slides without impairing its message.

Still, if you are unfamiliar with language support in Lucene and Solr, the extra examples don’t hurt anything.

September 10, 2013

Migrating [AMA] search to Solr

Filed under: Searching,Solr — Patrick Durusau @ 3:59 am

Migrating American Medical Association’s search to Solr by Doug Turnbull.

Read the entire post but one particular point is important to me:

Research journal users value recent publications very highly. Users want to see recent research, not just documents that score well due to how frequently search terms occur in a document. If you were a doctor, would you rather see brain cancer research that occurred this decade or in the early 20th century?

I call this out because it is one of my favorite peeves about Google search results.

Even if generalized date parsing is too hard, Google should know when it first encountered a particular resource.

At the very least a listing by “age” of a link should be trivially possible.

How important is recent information to your users?
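For Solr specifically, the recency preference Doug describes is usually expressed as a boost function rather than a filter, so older documents are demoted but not excluded. A sketch, assuming a date field named publish_date (the field name and query are illustrative):

```
q=brain+cancer&defType=edismax
  &bf=recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)
```

Here recip(x,m,a,b) computes a/(m*x+b), and 3.16e-11 is roughly 1 over the number of milliseconds in a year, so a document one year old receives about half the boost of one published today. Rounding NOW to the hour (NOW/HOUR) keeps the function cacheable between requests.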

September 8, 2013

Postgres and Full Text Indexes

Filed under: Indexing,PostgreSQL,Solr — Patrick Durusau @ 4:06 pm

After reading Jeff Larson’s account of his text mining adventures in ProPublica’s Jeff Larson on the NSA Crypto Story, I encountered a triplet of posts from Gary Sieling on Postgres and full text indexes.

In order of appearance:

Fixing Issues Where Postgres Optimizer Ignores Full Text Indexes

GIN vs GiST For Faceted Search with Postgres Full Text Indexes

Querying Multiple Postgres Full-Text Indexes

If Postgres and full text indexing are project requirements, these are must read posts.

Gary does note in the middle post that Solr with default options (no tuning) outperforms Postgres.

Solr would have been the better option for Jeff Larson when compared to Postgres.

But the difference in that case is a contrast between structured data and “dumpster data.”

It appears that the hurly-burly race to enable “connecting the dots” post-9/11 was a response to findings like this one:

Structural barriers to performing joint intelligence work. National intelligence is still organized around the collection disciplines of the home agencies, not the joint mission. The importance of integrated, all-source analysis cannot be overstated. Without it, it is not possible to “connect the dots.” No one component holds all the relevant information.

Yep, #1 with a bullet problem.

Response? From the Manning and Snowden leaks, one can only guess that “dumpster data” is the preferred solution.

By “dumpster data” I mean that data from different sources, agencies, etc., are simply dumped into a large data store.

No wonder the NSA runs 600,000 queries a day, or about 20 million queries a month. That is a lot of data dumpster diving.

Secrecy may be hiding that data from the public, but poor planning is hiding it from the NSA.

September 1, 2013

Notes on DIH Architecture: Solr’s Data Import Handler

Filed under: Searching,Solr — Patrick Durusau @ 6:43 pm

Notes on DIH Architecture: Solr’s Data Import Handler by Mark Bennett.

From the post:

What the world really needs are some awesome examples of extending DIH (Solr DataImportHandler), beyond the classes and unit tests that ship with Solr. That’s a tall order given DIH’s complexity, and sadly this post ain’t it either! After doing a lot of searches online, I don’t think anybody’s written an “Extending DIH Guide” yet – everybody still points to the Solr wiki, quick start, FAQ, source code and unit tests.

However, in this post, I will review a few concepts to keep in mind. And who knows, maybe in a future post I’ll have some concrete code.

When I make notes, I highlight the things that are different from what I’d expect and why, so I’m going to start with that. Sure DIH has an XML config where you tell it about your database or filesystem or RSS feed, and map those things into your Solr schema, so no surprise there. But the layering of that configuration really surprised me. (and turns out there’s good reasons for it)

If you aspire to be Solr proficient, print this article and work through it.

It will be time well spent.
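For readers who have never seen the layered configuration Mark describes, this is the general shape of a minimal data-config.xml — the driver, table, and column names here are illustrative, not from Mark's article:

```xml
<!-- Minimal DIH configuration: one JDBC data source, one entity
     mapping database columns to Solr fields. Names are examples. -->
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/db" user="solr"/>
  <document>
    <entity name="article" query="SELECT id, title, body FROM articles">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="text"/>
    </entity>
  </document>
</dataConfig>
```

The layering Mark highlights shows up as soon as you nest a second entity inside the first (for example, tags per article): the inner entity's query runs once per row of the outer one.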

August 25, 2013

Better synonym handling in Solr

Filed under: Solr,Synonymy — Patrick Durusau @ 7:05 pm

Better synonym handling in Solr by Nolan Lawson.

A very deep dive into synonym handling in Solr, along with a proposed fix.
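To see the kind of problem at stake, consider a synonyms.txt along these lines (my example, not Nolan's):

```
# synonyms.txt
dog, canine
spider man, spiderman
```

The single-word mapping behaves at query time, but a query for spider man is split into the tokens spider and man by the query parser before the synonym filter ever sees it, so the multi-word mapping never fires. The usual workaround is to apply synonyms at index time instead, which brings its own costs (reindexing to change synonyms, inflated term frequencies) — the territory Nolan's post and fix explore.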

The problems Nolan uncovers are now in a JIRA issue, SOLR-4381.

And Nolan has a Github repository with his proposed fix.

The Solr JIRA lists the issue as still “open.”

Start with the post and then go onward to the JIRA issue and Github repository. I say that because Nolan does a great job detailing the issue he discovered and his proposed solution.

I can think of several other improvements to synonym handling in Solr.

Such as allowing specification of tokens and required values in other fields for synonyms. (An indexing analog to scope.)

Or even allowing Solr queries in a synonym table.

Not to mention making Solr synonym tables by default indexed.

Just to name a few.

« Newer PostsOlder Posts »

Powered by WordPress