Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 21, 2014

Your own search engine…

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:46 pm

Your own search engine (based on Apache Solr open-source enterprise-search)

From the webpage:

Tools for easier searching with free software on your own server

  • search in many documents, images and files
    • full text search with powerful search operators
    • in many different formats (text, word, openoffice, PDF, sheets, csv, doc, images, jpg, video and many more)
    • get an overview via exploratory search and comfortable, powerful navigation with faceted search (easy-to-use interactive filters)
  • analyze documents (preview, extracted text, wordlists and visualizations with wordclouds and trend charts)
  • structure your research, investigation, navigation, metadata or notes (semantic wiki for tagging documents, annotations and structured notes)
  • OCR: automatic text recognition for images and graphical content or scans inside PDF, i.e. for scanned or photographed documents

Do you think this would be a way to pull back the curtain on search a bit? To show people that even results like we see from Google require more than casual effort?

I ask because Jeni Tennison tweeted earlier today:

#TDC14 @emckean “search is the hammer that makes us think everything is a nail that can be searched for”

Is a common misunderstanding of search making “improved” finding methods a difficult sell?

Not that I have a lot of faith or interest in educating potential purchasers. Finding a way to use the misunderstanding seems like a better marketing strategy to me.

Suggestions?

Apache Lucene/Solr 4.8.1 (Bug Fixes)

Filed under: Lucene,Solr — Patrick Durusau @ 1:23 pm

From the Lucene News:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.8.1 and Apache Solr 4.8.1.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Both releases contain a number of bug fixes.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

It’s upgrade time again!

May 17, 2014

Building a Recipe Search Site…

Filed under: ElasticSearch,Lucene,Search Engines,Solr — Patrick Durusau @ 4:32 pm

Building a Recipe Search Site with Angular and Elasticsearch by Adam Bard.

From the post:

Have you ever wanted to build a search feature into an application? In the old days, you might have found yourself wrangling with Solr, or building your own search service on top of Lucene — if you were lucky. But, since 2010, there’s been an easier way: Elasticsearch.

Elasticsearch is an open-source storage engine built on Lucene. It’s more than a search engine; it’s a true document store, albeit one emphasizing search performance over consistency or durability. This means that, for many applications, you can use Elasticsearch as your entire backend. Applications such as…

Think of this as a snapshot of the capabilities of most search solutions.

Which makes this a great baseline for answering the question: What does your app do that Elasticsearch + Angular cannot?

That’s a serious question.

Responses that don’t count include:

  1. My app is written in the Linear B programming language.
  2. My app uses a Post-Pre-NOSQL DB engine.
  3. My app will bring freedom and health to the WWW.
  4. (insert your reason)

You can say all those things if you like, but the convincing point for users is going to be exceeding their expectations about current solutions.

Do the best you can with Elasticsearch and Angular and use that as your baseline for comparison.
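If you want a feel for how little code that baseline requires, here is a minimal sketch against the Elasticsearch HTTP API. The index name, field names and the localhost endpoint are my assumptions (and the type-in-the-URL style matches 2014-era Elasticsearch), not anything from Adam's post:

    import json
    import requests

    ES = "http://localhost:9200"   # assumed local Elasticsearch instance
    INDEX = "recipes"              # illustrative index name

    # Index a couple of recipe documents (Elasticsearch creates the index on demand).
    recipes = [
        {"title": "Chicken Gumbo", "ingredients": ["chicken", "okra", "rice"]},
        {"title": "Vegetable Stir Fry", "ingredients": ["broccoli", "carrot", "rice"]},
    ]
    for i, doc in enumerate(recipes):
        requests.put(f"{ES}/{INDEX}/recipe/{i}", data=json.dumps(doc))

    requests.post(f"{ES}/{INDEX}/_refresh")  # make the documents searchable immediately

    # Full-text search across title and ingredients.
    query = {"query": {"multi_match": {"query": "rice", "fields": ["title", "ingredients"]}}}
    hits = requests.post(f"{ES}/{INDEX}/_search", data=json.dumps(query)).json()
    for hit in hits["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["title"])

That, plus an Angular view over the JSON it returns, is roughly the whole surface area your app has to beat.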

May 13, 2014

Choosing a fast unique identifier (UUID) for Lucene

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:44 am

Choosing a fast unique identifier (UUID) for Lucene by Michael McCandless.

From the post:

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java’s UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.
….

Excellent tips for creating identifiers for Lucene! Complete with tests and an explanation for the possible choices.
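As a rough illustration of the locality argument (not Michael's benchmark, just a sketch), compare the prefixes of random version 4 UUIDs with a crude timestamp-plus-counter scheme:

    import time
    import uuid
    import itertools

    # Version 4 UUIDs: every id is random, so consecutive ids share no prefix and
    # land all over Lucene's sorted term dictionary.
    print("random v4:")
    for _ in range(3):
        print("  ", uuid.uuid4())

    # A crude, Flake-like alternative: timestamp prefix plus a per-process counter.
    counter = itertools.count()
    def sequential_id():
        return "%016x-%08x" % (int(time.time() * 1000), next(counter))

    print("timestamp + counter:")
    for _ in range(3):
        print("  ", sequential_id())

Consecutive ids in the second scheme share a long common prefix, which is exactly the property that keeps Lucene's term lookups cheap.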

Enjoy!

May 10, 2014

Parameterizing Queries in Solr and Elasticsearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:19 pm

Parameterizing Queries in Solr and Elasticsearch by Rafał Kuć.

From the post:

We all know how good it is to have abstraction layers in software we create. We tend to abstract implementation from the method contracts using interfaces, we use n-tier architectures so that we can abstract and divide different system layers from each other. This is very good – when we change one piece, we don’t need to touch the other parts that only knew about method contracts, API’s, etc. Why not do the same with search queries? Can we even do that in Elasticsearch and Solr? We can and I’ll show you how to do that.

The problem

Imagine that we have a query, a complicated one, with boosts, sorts, facets and so on. However, in most cases the query is pretty static when it comes to its structure, and the only things that change are one of the filters in the query (actually a filter value) and the query entered by the user. I guess such a situation rings a bell for anyone who has developed a search application. Of course, we can include the whole query in the application itself and reuse it. But in that case, changes to boosts, for example, require us to redeploy the application or a configuration file. And if more than a single application uses the same query, then we need to change them all.

What if we could make the change on the search server side only and let application pass the necessary data only? That would be nice, but it requires us to do some work on the search server side.

For the purpose of the blog post, let’s assume that we want to have a query that:

  • searches for documents with terms entered by the user,
  • limits the searches to a given category,
  • displays facet results for the price ranges

This is a simple example, so that the queries are easy to understand. So, in a perfect world we would only need to provide the user query and the category identifier to the search engine.

It is encouraging to see someone give solutions to the same search problem from Solr and Elasticsearch perspectives.

Not to mention that I think you will find this very useful.
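On the Solr side, one common way to get that abstraction is to bury the static parts of the query (boosts, facets, a filter template) in a request handler in solrconfig.xml and let applications send only the dynamic values. The handler name, field names and parameter name below are illustrative, not taken from Rafał's post:

    import requests

    SOLR = "http://localhost:8983/solr/products"   # assumed core name

    # The request handler "/category-search" is assumed to be defined in solrconfig.xml
    # with the static parts of the query, e.g. (illustrative only):
    #   defType=edismax, qf=name^5 description, facet=true, facet.range=price,
    #   and an invariant filter  fq={!term f=category v=$cat}
    # so the application only supplies the user's query and the category identifier.
    params = {
        "q": "laptop bag",     # text entered by the user
        "cat": "accessories",  # dereferenced by the fq template on the server side
        "wt": "json",
    }
    resp = requests.get(f"{SOLR}/category-search", params=params).json()
    print(resp["response"]["numFound"], "matching documents")

Change the boosts or facets in the handler and every client picks up the new behavior without a deploy, which is the whole point of the post.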

May 7, 2014

New in Solr 4.8: Document Expiration

Filed under: Search Engines,Solr,Topic Maps — Patrick Durusau @ 7:07 pm

New in Solr 4.8: Document Expiration

From the post:

Lucene & Solr 4.8 were released last week and you can download Solr 4.8 from the Apache mirror network. Today I’d like to introduce you to a small but powerful feature I worked on for 4.8: Document Expiration.

The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:

  • Periodically delete documents from the index based on an expiration field
  • Computing expiration field values for documents from a “time to live” (TTL)

Assuming you are using a topic maps solution that presents topics as merged, this could be an interesting feature to emulate.

After all, if you are listing ticket sale outlets for concerts in a music topic map, good maintenance suggests those occurrences should go away after the concert has occurred.

Or, if you need the legacy information for some purpose, at least not present it as currently available. Perhaps change its occurrence type?

Would you actually delete topics or add an “internal” occurrence so they would not participate in future presentations of merged topics?
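If you want to experiment with the feature itself, indexing a document with a time-to-live is a single extra field once the expiration update chain is configured. A hedged sketch, where the chain name, core name and field names are my assumptions rather than anything from the post:

    import json
    import requests

    SOLR = "http://localhost:8983/solr/concerts"   # assumed core

    # Assumes solrconfig.xml defines an update chain (here called "expire") containing
    # DocExpirationUpdateProcessorFactory, configured to read a "_ttl_" field and to
    # delete expired documents on a periodic basis.
    doc = {
        "id": "ticket-outlet-42",
        "concert": "Example Hall, 2014-06-01",
        "_ttl_": "+30DAYS",    # expire this occurrence a month from now
    }
    requests.post(
        f"{SOLR}/update",
        params={"commit": "true", "update.chain": "expire"},
        data=json.dumps([doc]),
        headers={"Content-Type": "application/json"},
    )

For the topic map case, you could just as easily have the chain compute the expiration date and then flip an occurrence type instead of deleting, which is the question above.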

May 2, 2014

Apache Solr 4.8 Documentation

Filed under: Search Engines,Solr — Patrick Durusau @ 7:22 pm

Apache Solr 4.8 Reference Guide (pdf)

Apache Solr 4.8.0 Documentation

From the documentation page:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

This is the official documentation for Apache Solr 4.8.0.

I haven’t had good experiences with either the “official” Solr documentation or commercial publications on the same.

Not that any of it in particular was wrong so much as it was incomplete. Not that any of it was short. 😉

Perhaps it was more of an organizational problem than anything else.

I will be using the documentation on a regular basis for a while so I will start contributing suggestions as issues arise.

Curious to know if your experience with the Solr documentation has been the same? Different?

April 28, 2014

Apache Lucene/Solr 4.8.0 Available!

Filed under: Lucene,Solr — Patrick Durusau @ 7:55 pm

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.8.0 and Apache Solr 4.8.0.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Both releases now require Java 7 or greater (recommended is Oracle Java 7 or OpenJDK 7, minimum update 55; earlier versions have known JVM bugs affecting Lucene and Solr). In addition, both are fully compatible with Java 8.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Highlights of the Lucene release include:

  • All index files now store end-to-end checksums, which are now validated during merging and reading. This ensures that corruptions caused by any bit-flipping hardware problems or bugs in the JVM can be detected earlier. For full detection be sure to enable all checksums during merging (it’s disabled by default).
  • Lucene has a new Rescorer/QueryRescorer API to perform second-pass rescoring or reranking of search results using more expensive scoring functions after first-pass hit collection.
  • AnalyzingInfixSuggester now supports near-real-time autosuggest.
  • Simplified impact-sorted postings (using SortingMergePolicy and EarlyTerminatingCollector) to use Lucene’s Sort class to express the sort order.
  • Bulk scoring and normal iterator-based scoring were separated, so some queries can do bulk scoring more effectively.
  • Switched to MurmurHash3 to hash terms during indexing.
  • IndexWriter now supports updating of binary doc value fields.
  • HunspellStemFilter now uses 10 to 100x less RAM. It also loads all known OpenOffice dictionaries without error.
  • Lucene now also fsyncs the directory metadata on commits, if the operating system and file system allow it (Linux, MacOSX are known to work).
  • Lucene now uses Java 7 file system functions under the hood, so index files can be deleted on Windows, even when readers are still open.
  • A serious bug in NativeFSLockFactory was fixed, which could allow multiple IndexWriters to acquire the same lock. The lock file is no longer deleted from the index directory even when the lock is not held.

Highlights of the Solr release include:

  • <fields> and <types> tags have been deprecated from schema.xml. There is no longer any reason to keep them in the schema file, they may be safely removed. This allows intermixing of <fieldType>, <field> and <copyField> definitions if desired.
  • The new {!complexphrase} query parser supports wildcards, ORs etc. inside Phrase Queries.
  • New Collections API CLUSTERSTATUS action reports the status of collections, shards, and replicas, and also lists collection aliases and cluster properties.
  • Added managed synonym and stopword filter factories, which enable synonym and stopword lists to be dynamically managed via REST API.
  • JSON updates now support nested child documents, enabling {!child} and {!parent} block join queries.
  • Added ExpandComponent to expand results collapsed by the CollapsingQParserPlugin, as well as the parent/child relationship of nested child documents.
  • Long-running Collections API tasks can now be executed asynchronously; the new REQUESTSTATUS action provides status.
  • Added a hl.qparser parameter to allow you to define a query parser for hl.q highlight queries.
  • In Solr single-node mode, cores can now be created using named configsets.
  • New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the “TTL” expression, as well as automatically deleting expired documents on a periodic basis.

All exciting additions, except that today I finished configuring Tomcat7/Solr/Nutch, where Solr = 4.7.2.

Sigh, well, I suppose that was just a trial run. 😉
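When the upgrade does happen, the new {!complexphrase} parser from the highlights above is easy to try from any HTTP client. A rough sketch, with the core and field names being mine rather than anything from the release notes:

    import requests

    SOLR = "http://localhost:8983/solr/collection1"   # assumed core

    # {!complexphrase} allows wildcards and ORs *inside* a phrase query,
    # e.g. titles where a word starting with "lucen" occurs right after "apache".
    params = {
        "q": '{!complexphrase inOrder=true}title:"apache lucen*"',
        "wt": "json",
    }
    resp = requests.get(f"{SOLR}/select", params=params).json()
    print(resp["response"]["numFound"], "documents matched")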

April 26, 2014

Solr 4.8 Features

Filed under: Lucene,Solr — Patrick Durusau @ 6:57 pm

Solr 4.8 Features by Yonik Seeley.

Yonik reviews the coming new features for Solr 4.8:

  • Complex Phrase Queries
  • Indexing Child Documents in JSON
  • Expand Component
  • Named Config Sets
  • Stopwords and Synonyms REST API

Do you think traditional publishing models work well for open source projects that evolve as rapidly as Solr?

I first saw this in a tweet by Martin Grotzke.

April 25, 2014

Solr In Action – Bug/Feature

Filed under: Solr — Patrick Durusau @ 7:40 pm

Solr In Action has recently appeared from Manning. I bought it on MEAP and am working from the latest version.

There is a bug/feature that you should be aware of if you are using the source code for Solr in Action.

The data-config.xml file (solrpedia/conf/data-config.xml) has the line:

url="solrpedia.xml"

This works if, and only if, you are running under Jetty, which resolves the path relative to the solrpedia core.

However, if you are running Solr under Tomcat7, the indexing fails with the following log message:

Could not find file: solrpedia.xml (resolved to: /var/lib/tomcat/./solrpedia.xml)

If you change:

url="solrpedia.xml"

in solrpedia/conf/data-config.xml to:

url="/usr/share/solr/example/solrpedia.xml"

it works like a charm.

Good to know before I started on a much larger data import. 😉
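Since the failure only surfaces in the log, it is also worth polling the DataImportHandler status programmatically before trusting a big import to run unattended. A small sketch; the Tomcat port and core name follow the book's example, everything else is assumed:

    import time
    import requests

    DIH = "http://localhost:8080/solr/solrpedia/dataimport"   # Tomcat port, solrpedia core

    # Kick off a full import and poll the status instead of trusting it blindly.
    requests.get(DIH, params={"command": "full-import", "wt": "json"})

    while True:
        status = requests.get(DIH, params={"command": "status", "wt": "json"}).json()
        if status.get("status") != "busy":
            break
        time.sleep(2)

    # statusMessages includes counts such as "Total Documents Processed" and,
    # on a bad path, a failure entry pointing you back at the log.
    print(status.get("status"))
    print(status.get("statusMessages", {}))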

April 19, 2014

Apache Lucene/Solr 4.7.2

Filed under: Lucene,Solr — Patrick Durusau @ 6:23 pm

Apache Lucene 4.7.2

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Fixes potential index corruption, LUCENE-5574.

Apache Solr 4.7.2

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

Given the possibility of index corruption, I would not treat this as an optional upgrade.

April 17, 2014

Cloudera Live (beta)

Filed under: Cloudera,Hadoop,Hive,Impala,Oozie,Solr,Spark — Patrick Durusau @ 4:57 pm

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?

Enjoy!

April 2, 2014

Apache Lucene/Solr 4.7.1

Filed under: Lucene,Solr — Patrick Durusau @ 3:16 pm

Apache Lucene 4.7.1

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene Changes.txt

Apache Solr 4.7.1

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr Changes.txt

Fixes include a bad memory leak (SOLR-5875), so upgrading is advised.

Hortonworks Data Platform 2.1

Filed under: Apache Ambari,Falcon,Hadoop,Hadoop YARN,Hive,Hortonworks,Knox Gateway,Solr,Storm,Tez — Patrick Durusau @ 2:49 pm

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM is available now, with full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

March 29, 2014

Installing Apache Solr 4.7 multicore…

Filed under: Search Engines,Solr — Patrick Durusau @ 6:28 pm

Installing Apache Solr 4.7 multicore on Ubuntu 12.04 and Tomcat7

From the post:

I will show you how to install the Apache Solr search engine under the Tomcat7 servlet container on Ubuntu 12.04.4 LTS (Precise Pangolin), to be used later with Drupal 7. In this writeup I'm going to discuss only the installation and setup of the Apache Solr server. Specific Drupal configuration and/or Drupal-side configuration will be discussed in a future writeup.

Nothing you don’t already know but a nice checklist for the installation.

I’m glad I found it because I am writing a VM script to auto-install Solr as part of a VM distribution.

Manually I do ok but am likely to forget something the script needs explicitly.

March 18, 2014

Automatic bulk OCR and full-text search…

Filed under: ElasticSearch,Search Engines,Solr,Topic Maps — Patrick Durusau @ 8:48 pm

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

From the post:

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization’s budget.

Such efficiencies are great for our goals of preserving history and making it available, but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involve searching for text, and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging, but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn’t considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work I’ve been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:

If this weren’t impressive enough, Chris has a number of research ideas, including:

the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like Github for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.

More grist for a topic map mill!
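For a rough feel of the kind of pipeline Chris describes, here is a minimal sketch that OCRs a directory of page images with the tesseract command line and posts the text to Solr. The core name and field names are placeholders, and a real collection needs the error handling and metadata merging his post covers:

    import json
    import pathlib
    import subprocess
    import requests

    SOLR = "http://localhost:8983/solr/collection1/update"   # assumed core
    IMAGES = pathlib.Path("scans")                            # directory of page images

    docs = []
    for image in sorted(IMAGES.glob("*.png")):
        # tesseract writes <basename>.txt next to the output base we give it
        out_base = image.with_suffix("")
        subprocess.run(["tesseract", str(image), str(out_base)], check=True)
        text = out_base.with_suffix(".txt").read_text(encoding="utf-8")
        docs.append({"id": image.name, "ocr_text": text})

    # Send the OCR text to Solr for full-text indexing.
    requests.post(
        SOLR,
        params={"commit": "true"},
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    )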

PS: Should you ever come across a treasure trove of not widely available documents, please replicate them to as many public repositories as possible.

Traditional news outlets protect people in leak situations who knew they were playing in the street. Why they merit more protection than the average person is a mystery to me. Let’s protect the average people first and the players last.

February 26, 2014

Secrets of Cloudera Support:…

Filed under: Cloudera,Hadoop,MapReduce,Solr — Patrick Durusau @ 3:50 pm

Secrets of Cloudera Support: Inside Our Own Enterprise Data Hub by Adam Warrington.

From the post:

Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.

In this post, I’m happy to write about a new feature in CSI, which we call Monocle Stack Trace.

Stack Trace Exploration with Search

Hadoop log messages and the stack traces in those logs are critical information in many of the support cases Cloudera handles. We find that our customer operation engineers (COEs) will regularly search for stack traces they find referenced in support cases to try to determine where else that stack trace has shown up, and in what context it would occur. This could be in the many sources we were already indexing as part of Monocle Search in CSI: Apache JIRAs, Apache mailing lists, internal Cloudera JIRAs, internal Cloudera mailing lists, support cases, Knowledge Base articles, Cloudera Community Forums, and the customer diagnostic bundles we get from Cloudera Manager.

It turns out that doing routine document searches for stack traces doesn’t always yield the best results. Stack traces are relatively long compared to normal search terms, so search indexes won’t always return the relevant results in the order you would expect. It’s also hard for a user to churn through the search results to figure out whether the stack trace was actually an exact match in the document, and thus how relevant it actually is.

To solve this problem, we took an approach similar to what Google does when it wants to allow searching over a type that isn’t best suited for normal document search (such as images): we created an independent index and search result page for stack-trace searches. In Monocle Stack Trace, the search results show a list of unique stack traces grouped with every source of data in which the unique stack trace was discovered. Each source can be viewed in-line in the search result page, or the user can go to it directly by following a link.

We also give visual hints as to how the stack trace for which the user searched differs from the stack traces that show up in the search results. A green highlighted line in a search result indicates a matching call stack line. Yellow indicates a call stack line that only differs in line number, something that may indicate the same stack trace on a different version of the source code. A screenshot showing the grouping of sources and visual highlighting is below:

See Adam’s post for the details.

I like the imaginative modification of standard search.

Not all data is the same, and searching it as if it were leaves a lot of useful data unfound.
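None of this is Cloudera's code, but the underlying idea, matching stack traces while ignoring line numbers, is easy to sketch: normalize each frame, index the normalized form, and compare frame by frame to drive the green/yellow highlighting:

    import hashlib
    import re

    FRAME = re.compile(r"at (?P<method>[\w.$]+)\((?P<file>[\w.]+)(?::(?P<line>\d+))?\)")

    def frames(trace: str):
        """Extract (method, file, line) tuples from a Java stack trace."""
        return [(m["method"], m["file"], m["line"]) for m in FRAME.finditer(trace)]

    def normalized_key(trace: str) -> str:
        """Key that ignores line numbers, so the same trace from a different
        source version still groups together."""
        body = "\n".join(f"{m}({f})" for m, f, _ in frames(trace))
        return hashlib.sha1(body.encode("utf-8")).hexdigest()

    def compare(a: str, b: str):
        """Per-frame verdicts, roughly like the highlighting described above."""
        for (ma, fa, la), (mb, fb, lb) in zip(frames(a), frames(b)):
            if (ma, fa, la) == (mb, fb, lb):
                yield "exact"          # green: identical frame
            elif (ma, fa) == (mb, fb):
                yield "line-differs"   # yellow: same call, different line number
            else:
                yield "different"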

Lucene and Solr 4.7

Filed under: Lucene,Solr — Patrick Durusau @ 2:53 pm

Lucene and Solr 4.7

From the post:

Today the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server – 4.7. This is another release from the 4.x branch, bringing new functionality and bugfixes.

Apache Lucene 4.7 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.7 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Release note for Apache Lucene 4.7 can be found at: http://wiki.apache.org/lucene-java/ReleaseNote47, Solr release notes can be found at: http://wiki.apache.org/solr/ReleaseNote47.

Time to upgrade, again!

February 24, 2014

Index and Search Multilingual Documents in Hadoop

Filed under: Hadoop,Lucene,Solr — Patrick Durusau @ 4:27 pm

Index and Search Multilingual Documents in Hadoop by Justin Kestelyn.

From the post:

Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.

Cloudera Search brings full-text, interactive search, and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS and Apache HBase, and other projects in CDH. Because it’s integrated with CDH, Cloudera Search brings the same fault tolerance, scale, visibility, and flexibility of your other Hadoop workloads to search, and allows for a number of indexing, access control, and manageability options.

In this post, you’ll learn how to use Cloudera Search and RBL-JE to index and search documents. Since Cloudera takes care of the plumbing for distributed search and indexing, the only work needed to incorporate Basis Technology’s linguistics is loading the software and configuring your Solr collections.

You may have guessed by the way the introduction is worded that Rosette Base Linguistics isn’t free. I checked at the website but found no pricing information. Not to mention that the coverage looks spotty:

  • Arabic
  • Chinese (simplified)
  • Chinese (traditional)
  • English
  • Japanese
  • Korean
If your multilingual needs fall within one or more of those languages, this may work for you.

On the other hand, for indexing and searching multilingual text, you should compare Solr itself, which has analysis factories for the following languages:

  • Arabic
  • Brazilian Portuguese
  • Bulgarian
  • Catalan
  • Chinese
  • Simplified Chinese
  • CJK
  • Czech
  • Danish
  • Dutch
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hebrew, Lao, Myanmar, Khmer
  • Hindi
  • Indonesian
  • Italian
  • Irish
  • Kuromoji (Japanese)
  • Latvian
  • Norwegian
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Thai
  • Turkish

Source: Solr Wiki.

February 19, 2014

Why Not AND, OR, And NOT?

Filed under: Boolean Operators,Lucene,Searching,Solr — Patrick Durusau @ 3:20 pm

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist without it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator’” they aren’t going to be very happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead, algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile’”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times.)

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

As if you needed more proof that knowing “how” a search system is constructed is as important as knowing the surface syntax.

A great post that gives examples to illustrate each of the issues.

In case you are wondering about the December 28, 2011 date on the post: the BooleanClause.Occur machinery it describes is still with us in Lucene 4.6.1.
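To make the difference concrete, here is a hedged sketch (core and field names are mine) of the same intent expressed three ways against Solr's standard query parser. The last one, alligator mandatory with crocodile as a scoring bonus, has no clean AND/OR/NOT equivalent:

    import requests

    SOLR = "http://localhost:8983/solr/collection1/select"   # assumed core

    def search(q):
        resp = requests.get(SOLR, params={"q": q, "wt": "json"}).json()
        return resp["response"]["numFound"]

    # Pure set logic: both terms required.
    print(search("+body:alligator +body:crocodile"))

    # Either term, scored by TF/IDF rather than treated as a flat set union.
    print(search("body:alligator body:crocodile"))

    # Not expressible with AND/OR/NOT: alligator is mandatory, crocodile only
    # boosts documents that happen to mention both.
    print(search("+body:alligator body:crocodile"))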

February 15, 2014

Easy Hierarchical Faceting and display…

Filed under: Facets,JQuery,Solr — Patrick Durusau @ 1:44 pm

Easy Hierarchical Faceting and display with Solr and jQuery (and a tiny bit of Python) by Grant Ingersoll.

From the post:

Visiting two major clients in two days last week, each presented me with the same question: how do we better leverage hierarchical information like taxonomies, file paths, etc. in their LucidWorks Search (LWS) (and Apache Solr) applications, such that they could display something like the following image in their UI:

[Image: hierarchical facet display]

Since this is pretty straightforward (much of it is captured already on the Solr Wiki) and I have both the client-side and server-side code for this already in a few demos we routinely give here at Lucid, I thought I would write it up as a blog instead of sending each of them a one-off answer. I am going to be showing this work in the context of the LWS Financial Demo, for those who wish to follow along at the code level. We’ll use it to show a little bit of hierarchical faceting that correlates the industry sector of an S&P 500 company with the state and city of the HQ of that company. In your particular use case, you may wish to use it for organizing content in filesystems, websites, taxonomies or pretty much anything that exhibits, as the name implies, hierarchical relationships.

Who knew? Hierarchies are just like graphs! They’re everywhere! 😉

Grant closes by suggesting that analysis capabilities for faceting would be a nice addition to Solr. Are you game?
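If your fields are flat rather than path-encoded, pivot faceting is one way to get the sector/state/city drill-down Grant describes; this may or may not match the demo's approach, and the field names here are invented:

    import requests

    SOLR = "http://localhost:8983/solr/companies/select"   # assumed core and fields

    # Pivot faceting returns the sector -> state -> city hierarchy in one request;
    # the UI then only has to walk the nested counts.
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.pivot": "sector,state,city",
        "wt": "json",
    }
    resp = requests.get(SOLR, params=params).json()
    for sector in resp["facet_counts"]["facet_pivot"]["sector,state,city"]:
        print(sector["value"], sector["count"])
        for state in sector.get("pivot", []):
            print("  ", state["value"], state["count"])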

UpdateRequestProcessor factories in Apache Solr 4.6.1

Filed under: Solr — Patrick Durusau @ 1:28 pm

The full list of UpdateRequestProcessor factories in Apache Solr 4.6.1 by Alexandre Rafalovitch.

From the post:

UpdateRequestProcessor is a mechanism in Solr to change the documents that are being submitted for indexing to Solr. They provide advanced functions such as language identification, duplicate detection, intelligent defaults, external text processing pipelines integration, and – most recently – dynamic schema definition.

UpdateRequestProcessor factories (a.k.a. Update Request Processors or URPs) can be chained and multiple chains can be defined for one Solr collection. A chain is assigned to a request handler with update.chain parameter that can be defined in the configuration file or passed as a part of the URL. See example solrconfig.xml or consult Solr WIKI.

A very useful collection, though it could be improved with one-liners from the JavaDocs.
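Whichever processors you chain together, exercising the chain from a client is just a matter of naming it on the update request. A small sketch, with the chain name and core name purely illustrative:

    import json
    import requests

    SOLR = "http://localhost:8983/solr/collection1/update"   # assumed core

    doc = {"id": "doc-1", "text": "Ein kleines deutsches Dokument."}

    # update.chain selects which UpdateRequestProcessor chain from solrconfig.xml
    # (say, one that runs language identification) handles this batch before it
    # reaches the index.
    requests.post(
        SOLR,
        params={"commit": "true", "update.chain": "langid"},
        data=json.dumps([doc]),
        headers={"Content-Type": "application/json"},
    )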

January 29, 2014

Apache Lucene 4.6.1 and Apache Solr 4.6.1

Filed under: Lucene,Solr — Patrick Durusau @ 3:49 pm

Download: Apache Lucene 4.6.1

Lucene:

Lucene CHANGES.txt

New features include FreeTextSuggester (LUCENE-5214), a new Document Dictionary (LUCENE-5221), and twenty-one (21) others!

Solr:

Download: Apache Solr 4.6.1

Solr CHANGES.txt

New features include support for AnalyzingInfixSuggester (SOLR-5167), a new field type EnumField (SOLR-5084), and fifteen (15) others!

Not that I will know anytime soon but I am curious how well the AnalyzingInfixSuggester would work with Akkadian.

January 25, 2014

Use Cases for Taming Text, 2nd ed.

Filed under: Lucene,Mahout,MALLET,OpenNLP,Solr,Stanford NLP — Patrick Durusau @ 5:31 pm

Use Cases for Taming Text, 2nd ed. by Grant Ingersoll.

From the post:

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in production and are willing to write about it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine learning using Mahout, OpenNLP or MALLET — ideally you are using combinations of 2 or more of these to solve your problems. We are especially interested in large scale use cases in eCommerce, Advertising, social media analytics, fraud, etc.

The writing process is fairly straightforward. A section roughly equates to somewhere between 3 – 10 pages, including diagrams/pictures. After writing, there will be some feedback from editors and us, but otherwise the process is fairly simple.

In order to participate, you must have permission from your company to write on the topic. You would not need to divulge any proprietary information, but we would want enough information for our readers to gain a high-level understanding of your use case. In exchange for your participation, you will have your name and company published on that section of the book as well as in the acknowledgments section. If you have a copy of Lucene in Action or Mahout In Action, it would be similar to the use case sections in those books.

Cool!

I am guessing the second edition isn’t going to take as long as the first. 😉

Couldn’t be in better company as far as co-authors.

See the post for the contact details.

Searching in Solr, Analyzing Results and CJK

Filed under: CJK,Lucene,Solr — Patrick Durusau @ 5:08 pm

Searching in Solr, Analyzing Results and CJK

From the post:

In my recently completed twelve post series on Chinese, Japanese and Korean (CJK) with Solr for Libraries, my primary objective was to make information available to others in an expeditious manner. However, the organization of the topics is far from optimal for readers, and the series is too long for easy skimming for topics of interest. Therefore, I am providing this post as a sort of table of contents into the previous series.
Introduction

In Fall 2013, we rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library “catalog” built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in multiple languages, period.

If you are interested in improving searching, or in improving your methodology when working on searching, these posts provide a great deal of information. Analysis of Solr result relevancy figured heavily in this work, as did testing: relevancy/acceptance/regression testing against a live Solr index, unit testing, and integration testing. In addition, there was testing by humans, which was well managed and produced searches that were turned into automated tests. Many of the blog entries contain useful approaches for debugging Solr relevancy and for test driven development (TDD) of new search behavior.
….

Excellent!

I am sure many of the issues addressed here will be relevant should anyone decide to create a Solr index for the Assyrian Dictionary of the Oriental Institute of the University of Chicago (CAD).

Quite serious. I would be interested, at any rate.

December 20, 2013

Principles of Solr application design

Filed under: Searching,Solr — Patrick Durusau @ 7:35 pm

Principles of Solr application design – part 1 of 2

Principles of Solr application design – part 2 of 2

From part 1:

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! So without further ado here’s the first part:

Over two posts you get thirteen (13) points to check off while building a Solr application.

You won’t find anything startling but it will make a useful checklist.

Solr Cluster

Filed under: LucidWorks,Search Engines,Searching,Solr — Patrick Durusau @ 7:30 pm

Solr Cluster

From the webpage:

Join us weekly for tips and tricks, product updates and Q&A on topics you suggest. Guest appearances from Lucene/Solr committers and PMC members. Send questions to SolrCluster@lucidworks.com

So far:

#1 Entity Recognition

Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc. Entity recognition is usually built using either linguistic grammar-based techniques or statistical models.

#2 On Enterprise and Intranet Search

What use is search to an enterprise? What is the purpose of intranet search? How hard is it to implement? In this episode we speak with LucidWorks consultant Evan Sayer about the benefits of internal search and how to prepare your business data to best take advantage of full-text search.

Well, the lead-in music isn’t Beaker Street, but it’s not that long.

I think the discussion would be easier to follow with a webpage listing common terms and an outline of the topic for the day.

Has real potential so I urge you to listen, send in questions and comments.

December 13, 2013

Implementing a Custom Search Syntax…

Filed under: Lucene,Patents,Solr — Patrick Durusau @ 8:33 pm

Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled by John Berryman.

Description:

In a recent project with the United States Patent and Trademark Office, Opensource Connections was asked to prototype the next generation of patent search – using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process. In this fast paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parser Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr’s QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser. Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.

One part of the task was to re-implement a thirty-year-old query language on modern software. (Ouch!)

Uses parboiled to parse the query syntax.

On parboiled:

parboiled is a mixed Java/Scala library providing for lightweight and easy-to-use, yet powerful and elegant parsing of arbitrary input text based on Parsing expression grammars (PEGs). PEGs are an alternative to context free grammars (CFGs) for formally specifying syntax, they make a good replacement for regular expressions and generally have quite a few advantages over the “traditional” way of building parsers via CFGs. parboiled is released under the Apache License 2.0.

Covers a plugin for the custom query language.

Great presentation, although one where you will want to be following the slides (below the video).

December 10, 2013

Paginated Collections with Ember.js + Solr + Rails

Filed under: Interface Research/Design,Solr — Patrick Durusau @ 5:11 pm

Paginated Collections with Ember.js + Solr + Rails by Eduardo Figarola.

From the post:

This time, I would like to show you how to add a simple pagination helper to your Ember.js application.

For this example, I will be using Rails + Solr for the backend and Ember.js as my frontend framework.

I am doing this with Rails and Solr, but you can do it using other backend frameworks, as long as the JSON response resembles what we have here:
….

I mention this just on the off-chance that you will encounter users requesting pagination.

I’m not sure anything beyond page 1 and page 2 is necessary for most pagination needs.

I remember reading, in a study of query behavior using PubMed, that you had better have a disease that appears in the first two pages of results.

Anywhere beyond the first two pages, well, your family’s best hope is that you have life insurance.

If a client asks for beyond 2 pages of results, I would suggest monitoring search query behavior for say six months.

Just to give them an idea of what beyond page two is really accomplishing.
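For reference, the Solr side of pagination is just the start and rows parameters, which is all the Ember helper ultimately drives. A minimal sketch with an assumed core name:

    import requests

    SOLR = "http://localhost:8983/solr/collection1/select"   # assumed core
    PAGE_SIZE = 10

    def page(q, number):
        """Fetch one page of results; page numbers start at 1."""
        params = {
            "q": q,
            "start": (number - 1) * PAGE_SIZE,
            "rows": PAGE_SIZE,
            "wt": "json",
        }
        return requests.get(SOLR, params=params).json()["response"]

    first = page("alligator", 1)
    print(first["numFound"], "total hits; showing", len(first["docs"]))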

December 9, 2013

Building Client-side Search Applications with Solr

Filed under: Lucene,Search Interface,Searching,Solr — Patrick Durusau @ 7:46 pm

Building Client-side Search Applications with Solr by Daniel Beach.

Description:

Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.

If you need a compelling reason to watch this video, check out:

Global Patent Search Network.

What is the Global Patent Search Network?

As a result of cooperative effort between the United States Patent and Trademark Office (USPTO) and State Intellectual Property Office (SIPO) of the People’s Republic of China, Chinese patent documentation is now available for search and retrieval from the USPTO website via the Global Patent Search Network. This tool will enable the user to search Chinese patent documents in the English or Chinese language. The data available include fulltext Chinese patents and machine translations. Also available are full document images of Chinese patents which are considered the authoritative Chinese patent document. Users can search documents including published applications, granted patents and utility models from 1985 to 2012.

Something over four (4) million patents.

Try the site, then watch the video.

Software mentioned: Spyglass, Ember.js.

