Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 24, 2014

Google Search Appliance and Libraries

Using Google Search Appliance (GSA) to Search Digital Library Collections: A Case Study of the INIS Collection Search by Dobrica Savic.

From the post:

In February 2014, I gave a presentation at the conference on Faster, Smarter and Richer: Reshaping the library catalogue (FSR 2014), which was organized by the Associazione Italiana Biblioteche (AIB) and Biblioteca Apostolica Vaticana in Rome, Italy. My presentation focused on the experience of the International Nuclear Information System (INIS) in using Google Search Appliance (GSA) to search digital library collections at the International Atomic Energy Agency (IAEA). 

Libraries are facing many challenges today. In addition to diminished funding and increased user expectations, the use of classic library catalogues is becoming an additional challenge. Library users require fast and easy access to information resources, regardless of whether the format is paper or electronic. Google Search, with its speed and simplicity, has established a new standard for information retrieval which did not exist with previous generations of library search facilities. Put in a position of David versus Goliath, many small, and even larger libraries, are losing the battle to Google, letting many of its users utilize it rather than library catalogues.

The International Nuclear Information System (INIS)

The International Nuclear Information System (INIS) hosts one of the world's largest collections of published information on the peaceful uses of nuclear science and technology. It offers on-line access to a unique collection of 3.6 million bibliographic records and 483,000 full texts of non-conventional (grey) literature. This large digital library collection suffered from most of the well-known shortcomings of the classic library catalogue. Searching was complex and complicated, it required training in Boolean logic, full-text searching was not an option, and response time was slow. An opportune moment to improve the system came with the retirement of the previous catalogue software and the adoption of Google Search Appliance (GSA) as an organization-wide search engine standard.
….

To be completely honest, my first reaction wasn’t a favorable one.

But even the complete blog post does not do justice to the project in question.

Take a look at the slides, which include screen shots of the new interface, before reaching an opinion.

Take this as a lesson on what your search interface should be offering by default.

There are always other screens you can fill with advanced features.

March 21, 2014

Elasticsearch: The Definitive Guide

Filed under: ElasticSearch,Indexing,Search Engines,Searching — Patrick Durusau @ 5:52 pm

Elasticsearch: The Definitive Guide (Draft)

From the Preface, who should read this book:

This book is for anybody who wants to put their data to work. It doesn’t matter whether you are starting a new project and have the flexibility to design the system from the ground up, or whether you need to give new life to a legacy system. Elasticsearch will help you to solve existing problems and open the way to new features that you haven’t yet considered.

This book is suitable for novices and experienced users alike. We expect you to have some programming background and, although not required, it would help to have used SQL and a relational database. We explain concepts from first principles, helping novices to gain a sure footing in the complex world of search.

The reader with a search background will also benefit from this book. Elasticsearch is a new technology which has some familiar concepts. The more experienced user will gain an understanding of how those concepts have been implemented and how they interact in the context of Elasticsearch. Even in the early chapters, there are nuggets of information that will be useful to the more advanced user.

Finally, maybe you are in DevOps. While the other departments are stuffing data into Elasticsearch as fast as they can, you’re the one charged with stopping their servers from bursting into flames. Elasticsearch scales effortlessly, as long as your users play within the rules. You need to know how to setup a stable cluster before going into production, then be able to recognise the warning signs at 3am in the morning in order to prevent catastrophe. The earlier chapters may be of less interest to you but the last part of the book is essential reading — all you need to know to avoid meltdown.

I fully understand the need, nay, compulsion for an author to say that everyone who is literate needs to read their book. And, if you are not literate, their book is a compelling reason to become literate! 😉

As the author of a book (two editions) and more than one standard, I can assure you an author’s need to reach everyone serves no one very well.

Potential readers range from novices to intermediate users to experts.

A book that targets all three will “waste” space on matter already known to experts but not to novices and/or intermediate users.

At the same time, space in a physical book being limited, some material relevant to the expert will be left out altogether.

I had that experience quite recently when the details of LukeRequestHandler (Solr) were described as:

Reports meta-information about a Solr index, including information about the number of terms, which fields are used, top terms in the index, and distributions of terms across the index. You may also request information on a per-document basis.

That’s it. Out of more than 600 pages of text, that is all the information you will find on LukeRequestHandler.

Fortunately I did find: https://wiki.apache.org/solr/LukeRequestHandler.
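If you find yourself in the same spot, the handler is easy to poke at directly. A minimal sketch, assuming a local Solr install (the core name, port and exact response keys vary by Solr version):

    import requests

    # Hypothetical local Solr core; adjust host, port and core name to your install.
    LUKE_URL = "http://localhost:8983/solr/collection1/admin/luke"

    # Ask for the top 10 terms per field, returned as JSON.
    resp = requests.get(LUKE_URL, params={"numTerms": 10, "wt": "json"})
    info = resp.json()

    # Index-wide statistics plus per-field details (distinct terms, top terms, ...).
    print(info["index"]["numDocs"])
    for field, details in info["fields"].items():
        print(field, details.get("distinct"), details.get("topTerms"))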

I don’t fault the author because several entire books could be written with the material they left out.

That is the hardest part of authoring, knowing what to leave out.

PS: Having said all that, I am looking forward to reading Elasticsearch: The Definitive Guide as it develops.

March 18, 2014

Automatic bulk OCR and full-text search…

Filed under: ElasticSearch,Search Engines,Solr,Topic Maps — Patrick Durusau @ 8:48 pm

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

From the post:

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization’s budget.

Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involves searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn’t considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work I’ve been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:

If this weren’t impressive enough, Chris has a number of research ideas, including:

the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like Github for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expensive of everyone reinventing large parts of this process customized for their particular workflow.

More grist for a topic map mill!
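The pipeline Chris describes is simple enough to sketch in outline. The following is an illustration only, not his code: it assumes the tesseract command-line tool, a directory of page images, and a running Solr instance with made-up core and field names.

    import subprocess
    from pathlib import Path
    import requests

    # Assumed local Solr core; the update path varies slightly between Solr versions.
    SOLR_UPDATE = "http://localhost:8983/solr/collection1/update?commit=true"

    def ocr_page(image_path):
        """Run Tesseract on a scanned page image and return the recognized text."""
        # 'tesseract <image> stdout' writes the OCR result to standard output.
        result = subprocess.run(["tesseract", str(image_path), "stdout"],
                                capture_output=True, text=True, check=True)
        return result.stdout

    docs = []
    for image in Path("scans").glob("*.tif"):      # hypothetical directory of page scans
        docs.append({"id": image.stem,             # field names invented for the example
                     "ocr_text": ocr_page(image)})

    # Index the OCR text alongside whatever metadata you already have.
    requests.post(SOLR_UPDATE, json=docs, headers={"Content-Type": "application/json"})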

PS: Should you ever come across a treasure trove of not widely available documents, please replicate them to as many public repositories as possible.

Traditional news outlets protect people in leak situations who knew they were playing in the street. Why they merit more protection than the average person is a mystery to me. Let’s protect the average people first and the players last.

March 7, 2014

Using Lucene’s search server to search Jira issues

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:02 pm

Using Lucene’s search server to search Jira issues by Michael McCandless.

From the post:

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene’s capabilities, I point them to this application so they can see for themselves.

Recently, I’ve made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I’ve been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene’s current modules in a server context with minimal “thin server” additional source code.

Separately, to test this new Lucene based server, and to complete the “dog food,” I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira’s REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.
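The extraction side of that pipeline is straightforward to sketch against Jira’s REST search endpoint. This is not Michael’s code; the instance URL, JQL and field list below are placeholders:

    import requests

    JIRA_URL = "https://issues.example.org"          # placeholder Jira instance
    JQL = "project = LUCENE ORDER BY updated DESC"   # illustrative query

    def fetch_issues(start_at=0, batch=50):
        """Pull one page of issues from Jira's REST search endpoint."""
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": JQL, "startAt": start_at, "maxResults": batch,
                    "fields": "summary,comment,status"},
        )
        resp.raise_for_status()
        return resp.json()

    start = 0
    while True:
        page = fetch_issues(start)
        for issue in page["issues"]:
            print(issue["key"], issue["fields"]["summary"])
        start += len(page["issues"])
        if start >= page["total"]:
            break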

Of particular interest to me because OASIS is about to start using JIRA 6.2 (the version in use at Apache).

I haven’t looked closely at the documentation for JIRA 6.2.

Thoughts on where it has specific weaknesses that are addressed by Michael’s solution?

February 23, 2014

Common Crawl’s Move to Nutch

Filed under: Nutch,Search Engines,Webcrawler — Patrick Durusau @ 2:30 pm

Common Crawl’s Move to Nutch by Jordan Mendelson.

From the post:

Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Our old crawler was highly tuned to our data center environment where every machine was identical with large amounts of memory, hard drives and fast networking.

We needed something that would allow us to do web-scale crawls of billions of webpages and would work in a cloud environment where we might run on a heterogenous machines with differing amounts of memory, CPU and disk space depending on the price plus VMs that might go up and down and varying levels of networking performance.

Before you hand roll a custom web crawler, you should read this short but useful report on the Common Crawl experience with Nutch.

February 19, 2014

Troubleshooting Elasticsearch searches, for Beginners

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 2:46 pm

Troubleshooting Elasticsearch searches, for Beginners by Alex Brasetvik.

From the post:

Elasticsearch’s recent popularity is in large part due to its ease of use. It’s fairly simple to get going quickly with Elasticsearch, which can be deceptive. Here at Found we’ve noticed some common pitfalls new Elasticsearch users encounter. Consider this article a piece of necessary reading for the new Elasticsearch user; if you don’t know these basic techniques take the time to familiarize yourself with them now, you’ll save yourself a lot of distress.

Specifically, this article will focus on text transformation, more properly known as text analysis, which is where we see a lot of people get tripped up. Having used other databases, the fact that all data is transformed before getting indexed can take some getting used to. Additionally, “schema free” means different things for different systems, a fact that is often confused with Elasticsearch’s “Schema Flexible” design.

When Alex says “beginners” he means beginning developers, so this isn’t a post you can send to users with search troubles.

Sorry!

But if you are trying to debug search results in ElasticSearch as a developer, this is a good place to start.
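The quickest way to see the text transformation Alex is talking about is the _analyze API. A minimal sketch against a local node (newer Elasticsearch versions expect the parameters in a JSON body rather than the query string):

    import requests

    ES = "http://localhost:9200"   # assumed local Elasticsearch node

    # Ask Elasticsearch how the standard analyzer would break up a phrase.
    resp = requests.get(
        f"{ES}/_analyze",
        params={"analyzer": "standard", "text": "Troubleshooting Elasticsearch Searches"},
    )
    for token in resp.json()["tokens"]:
        print(token["token"], token["position"])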

February 18, 2014

ElasticSearch Analyzers – Parts 1 and 2

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 4:16 pm

Andrew Cholakian has written a two part introduction to analyzers in ElasticSearch.

All About Analyzers, Part One

From the introduction:

Choosing the right analyzer for an Elasticsearch query can be as much art as science. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms in an inverted index. If you need a refresher on the basics of inverted indexes and where analysis fits into Elasticsearch in general please see this chapter in Exploring Elasticsearch covering analyzers. In this article we’ll survey various analyzers, each of which showcases a very different approach to parsing text.

Ten tokenizers, thirty-one token filters, and three character filters ship with the Elasticsearch distribution; a truly overwhelming number of options. This number can be increased further still through plugins, making the choices even harder to wrap one’s head around. Combinations of these tokenizers, token filters, and character filters create what’s called an analyzer. There are eight standard analyzers defined, but really, they are simply convenient shortcuts for arranging tokenizers, token filters, and character filters yourself. While reaching an understanding of this multitude of options may sound difficult, becoming reasonably competent in the use of analyzers is merely a matter of time and practice. Once the basic mechanisms behind analysis are understood, these tools are relatively easy to reason about and compose.

All About Analyzers, Part Two (continues part 1).

Very much worth your time if you need a refresher on analyzers for ElasticSearch and/or are approaching them for the first time.

Of course I went hunting for the treatment of synonyms, only to find the standard fare.

Not bad by any means, but a grade school student knows synonyms depend upon any number of factors, yet you would be hard pressed to find that reflected in any search engine.

I suppose you could define synonyms as most engines do and then filter the results to eliminate from a gene search “hits” from Field and Stream, Guns & Ammo, and the like. Although your searchers may be interested in how to trick out an AR-15. 😉
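For reference, the “standard fare” amounts to wiring a synonym token filter into a custom analyzer when the index is created. A sketch, with the index name and synonym list invented for the example:

    import requests

    ES = "http://localhost:9200"    # assumed local node

    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "my_synonyms": {                    # illustrative synonym list
                        "type": "synonym",
                        "synonyms": ["gene, locus", "sequence, read"],
                    }
                },
                "analyzer": {
                    "synonym_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_synonyms"],
                    }
                },
            }
        }
    }

    # Create an index whose analyzer expands the declared synonyms at index time.
    requests.put(f"{ES}/genes", json=settings)

Every document and every query passes through the same expansion, regardless of context, which is exactly the limitation noted above.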

It may be that simple bulk steps are faster than more sophisticated searching. Will have to give that some thought.

February 5, 2014

Hshtags (Search Engine)

Filed under: Hashtags,Search Engines — Patrick Durusau @ 2:34 pm

Hshtags (Search Engine)

At this point you can select to search hashtags on Facebook, Flickr, Instagram, Twitter and Vimeo. Which means, of course, that you have to authorize Hshtags to see your posts, friends, post for you (which I have never understood), etc.

How useful Hshtags will be depends on the subject. I can’t imagine very much content of interest about semantic integration on Facebook, Flickr and Instagram. There could be, no fault of the medium, but it seems unlikely.

For my purposes, searching across both Twitter and Vimeo for “popular” hashtags will be useful (as popular as semantic integration ever gets).

More useful to me would be a search engine that reported tags used by blogs with links back to the blogs using those tags.

That would be really useful in terms of defining communities and using terminology that is widely accepted. Even if it covered just WordPress, Blogger, and other major blogging platforms.

One very nice aspect of Hshtags: the registration is in a large enough font to be easily readable!!! It’s a small thing but deeply appreciated nonetheless.

I first saw this in a tweet from Inge Henriksen.

January 23, 2014

Finding long tail suggestions…

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:26 pm

Finding long tail suggestions using Lucene’s new FreeTextSuggester by Mike McCandless.

From the post:

Lucene’s suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user’s query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one in action at the Lucene/Solr Jira search application (e.g., try “python”) that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix’s search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.
….

Since you are unlikely to be Google, Mike goes on to show how FreeTextSuggester can ride to your rescue!

As always, Mike’s post is a pleasure to read.

January 22, 2014

Build Your Own Custom Lucene Query And Scorer

Filed under: Lucene,Search Engines — Patrick Durusau @ 8:03 pm

Build Your Own Custom Lucene Query And Scorer by Doug Turnbull.

From the post:

Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other.

Well for those extremely specialized cases we can prescribe a little out-patient surgery to your Solr install – building your own Lucene Query.

This Is The Nuclear Option

Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths:

Not for the faint of heart!

On the other hand, Doug’s list of options to try before writing a custom Lucene query and scorer makes a great checklist of tweaking options.

You could stop there and learn a great deal. Or you can opt to continue for what Doug calls “the educational experience.”

Google shows missing search terms

Filed under: Search Engines,Searching — Patrick Durusau @ 7:40 pm

Google shows missing search terms by Karen Blakeman.

From the post:

Several weeks ago I noticed that Google was displaying the terms it had dropped from your search as ‘Missing’. Google started routinely ignoring selected search terms towards the end of 2011 (see http://www.rba.co.uk/wordpress/2011/11/08/dear-google-stop-messing-with-my-search/). Google’s response to the outcry from searchers was to introduce the Verbatim search option. However, there was no way of checking whether all of your terms appeared in a result other than viewing the whole page. Irritating, to say the least, if you found that the top 10 results did not include all of your keywords.

Fast forward to December 2013, and some people started seeing results lists that showed missing keywords as strikethroughs. I saw them for a few days and then, just as I was preparing a blog posting on the feature, they disappeared! I assumed that they were one of Google’s live experiments never to be seen again but it seems they are back. Two people contacted me today to say that they are seeing strikethroughs on missing terms. I ran my test searches again and, yes, I’m seeing them as well.

I ran the original search that prompted my November 2011 article (parrots heron island Caversham UK) and included -site:rba.co.uk in the strategy to exclude my original blog postings. Sure enough, the first two results were missing parrots and had “Missing parrots” underneath their entry in the list.

At least as of today, try: parrots heron island Caversham UK -site:rba.co.uk in Google and you will see the same result.

A welcome development, although more transparency would be better still.

A non-transparent search process isn’t searching. It’s guessing.

January 5, 2014

What “viable search engine competition” really looks like

Filed under: Marketing,Search Analytics,Search Engines,Search Requirements — Patrick Durusau @ 3:56 pm

What “viable search engine competition” really looks like by Alex Clemmer.

From the post:

Hacker News is up in arms again today about the RapGenius fiasco. See RapGenius statement and HN comments. One response article argues that we need more “viable search engine competition” and the HN community largely seems to agree.

In much of the discussion, there is a picaresque notion that the “search engine problem” is really just a product problem, and that if we try really hard to think of good features, we can defeat the giant.

I work at Microsoft. Competing with Google is hard work. I’m going to point out some of the lessons I’ve learned along the way, to help all you spry young entrepreneurs who might want to enter the market.

Alex has six (6) lessons for would-be Google killers:

Lesson 1: The problem is not only hiring smart people, but hiring enough smart people.

Lesson 2: competing on market share is possible; relevance is much harder

Lesson 3: social may pose an existential threat to Google’s style of search

Lesson 4: large companies have access to technology that is often categorically better than OSS state of the art

Lesson 5: large companies are necessarily limited by their previous investments

Lesson 6: large companies have much more data than you, and their approach to search is sophisticated

See Alex’s post for the details under each lesson.

What has always puzzled me is why compete on general search? General search services are “free” save for the cost of a user’s time to mine the results. It is hard to think of a good economic model to compete with “free.” Yes?

If we are talking about medical, legal, technical, engineering search, where services are sold to professionals and the cost is passed onto consumers, that could be a different story. Even there, costs have to be offset by a reasonable expectation of profit against established players in each of those markets.

One strategy would be to supplement or enhance existing search services and pitch that to existing market holders. Another strategy would be to propose highly specialized searching of unique data archives.

Do you think Alex is right in saying “…most traditional search problems have really been investigated thoroughly”?

I don’t, because of the general decline in information retrieval from the 1950s-1960s to date.

If you doubt my observation, pick up a Readers’ Guide to Periodical Literature (hard copy) for 1968 and choose some subject at random. Repeat that exercise with the search engine of your choice, limiting your results to 1968.

Which one gave you more relevant references for 1968, including synonyms? Say in the first 100 entries.

I first saw this in a tweet by Stefano Bertolo.

PS: I concede that the analog book does not have digital hyperlinks to take you to resources but it does have analog links for the same purpose. And it doesn’t have product ads. 😉

January 4, 2014

Writing a full-text search engine using Bloom filters

Filed under: Bloom Filters,Indexing,Search Engines — Patrick Durusau @ 2:32 pm

Writing a full-text search engine using Bloom filters by Stavros Korokithakis.

A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.

How an inverted index works

An inverted index is a data structure that basically maps every word in every document to the ID of the document it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, that is only document with ID 3).

However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there. What static site generators do is generate every required file when generating your site, then making those available for the client to download. Usually, search-engine plugins limit themselves to tags and titles, to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy, use a Bloom filter!

An interesting alternative to indexing a topic map with an inverted index.
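A minimal sketch of the idea (not Stavros’s code): build one small Bloom filter per document when the site is generated, ship the filters to the client, and answer a query by testing every filter for every query term.

    import hashlib

    class BloomFilter:
        """A tiny Bloom filter: k hash positions over a fixed-size bit array."""
        def __init__(self, size=256, hashes=3):
            self.size, self.hashes, self.bits = size, hashes, 0

        def _positions(self, word):
            for i in range(self.hashes):
                digest = hashlib.sha1(f"{i}:{word}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, word):
            for pos in self._positions(word):
                self.bits |= 1 << pos

        def __contains__(self, word):
            return all(self.bits & (1 << pos) for pos in self._positions(word))

    # One filter per document, built ahead of time (e.g. by the static site generator).
    documents = {1: "python raspberry pi tutorial", 3: "python and raspberry recipes"}
    filters = {}
    for doc_id, text in documents.items():
        bf = BloomFilter()
        for word in text.split():
            bf.add(word)
        filters[doc_id] = bf

    # Query: a document matches if every term *might* be present (false positives possible).
    query = ["python", "raspberry"]
    print([doc_id for doc_id, bf in filters.items()
           if all(term in bf for term in query)])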

I mention it in part because of one of the “weaknesses” of Bloom filters for searching:

You can’t weight pages by relevance, since you don’t know how many times a word appears in a page, all you know is whether it appears or not. You may or may not care about this.

Unlike documents, which are more or less relevant due to word occurrences, topic maps cluster information about a subject into separate topics (or proxies if you prefer).

That being the case, one topic/proxy isn’t more “relevant” than another. The question is whether this topic/proxy represents the subject you want.

Or to put it another way, topics/proxies have already been arranged by “relevance” by a topic map author.

If a topic map interface gives you hundreds or thousands of “relevant” topics/proxies, how are you any better off than a more traditional search engine?

If you need to look like you are working, go search any of the social media sites for useful content. It’s there, the difficulty is going to be finding it.

December 27, 2013

Imprecise machines mess with history

Filed under: Precision,Search Engines — Patrick Durusau @ 4:33 pm

Imprecise machines mess with history by Kaiser Fung.

From the post:

The mass media continues to gloss over the imprecision of machines/algorithms.

Here is another example I came across the other day. In conversation, the name Martin Van Buren popped up. I was curious about this eighth President of the United States.

What caught my eye in the following Google search result (right panel) is his height:

See Kaiser’s post for an amusing error about U.S. Presidents, one which has no doubt been echoed in U.S. classrooms.

Kaiser asks how fact-checking machines could be made possible.

I’m not sure we need fact-checking machines as much as we need several canonical sources of information on the WWW.

At one time, there were several world almanacs in print (may still be) and for most routine information, those were authoritative sources.

I don’t know that search engines need fact checkers so much as they need to be less promiscuous. At least in terms of the content that they repeat as fact.

There is a difference between “facts” you index from the New York Times and “facts” from some local historical society.

The source of data was important before the WWW and it continues to be important today.

December 24, 2013

elasticsearch-entity-resolution

Filed under: Duke,ElasticSearch,Entity Resolution,Search Engines,Searching — Patrick Durusau @ 2:17 pm

elasticsearch-entity-resolution

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it as an interactive deduplication engine.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.

A step more likely to be followed with an Apache License as opposed to its current LGPLv3.

December 20, 2013

Solr Cluster

Filed under: LucidWorks,Search Engines,Searching,Solr — Patrick Durusau @ 7:30 pm

Solr Cluster

From the webpage:

Join us weekly for tips and tricks, product updates and Q&A on topics you suggest. Guest appearances from Lucene/Solr committers and PMC members. Send questions to SolrCluster@lucidworks.com

So far:

#1 Entity Recognition

Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc. Entity recognition is usually built using either linguistic grammar-based techniques or statistical models.

#2 On Enterprise and Intranet Search

What use is search to an enterprise? What is the purpose of intranet search? How hard is it to implement? In this episode we speak with LucidWorks consultant Evan Sayer about the benefits of internal search and how to prepare your business data to best take advantage of full-text search.

Well, the lead-in music isn’t Beaker Street, but it’s not that long.

I think the discussion would be easier to follow with a webpage with common terms and an outline of the topic for the day.

Has real potential so I urge you to listen, send in questions and comments.

December 12, 2013

Use your expertise – build a topical search engine

Filed under: Search Engines,Searching — Patrick Durusau @ 7:27 pm

Use your expertise – build a topical search engine

From the post:

Did you know that a topical search engine can help your users find content from more than a single domain? You can use your expertise to provide a delightful user experience targeting a particular topic on the Web.

There are two main types of engines built with Google Custom Search: site search and topical search. While site search is relatively straightforward – it lets you implement a search for a blog or a company website – topical search is an entirely different story.

Topical search engines focus on a particular topic and therefore usually cover a part of the Web that is larger than a single domain. Because of this topical engines need to be carefully fine-tuned to bring the best results to the users.

OK, yes, it is a Google API and run by Google.

That doesn’t trouble me overmuch. My starting assumption is that anything that leaves my subnet is being recorded.

Recorded and sold if there is a buyer for the information.

Doesn’t even have to leave my subnet if they have the right equipment.

Anyway, think of Google’s Custom Search API as another source of data like Common Crawl.

It’s more current than Common Crawl if that is one of your project requirements. And probably easier to use for most folks.

And you can experiment at very low risk to see if your custom search engine is likely to be successful.
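If you want to experiment, the Custom Search JSON API takes only a few lines to query; the API key and search engine id (cx) below are placeholders you obtain from Google:

    import requests

    API_KEY = "YOUR_API_KEY"        # placeholder
    ENGINE_ID = "YOUR_CX_ID"        # placeholder custom search engine id

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": "semantic integration"},
    )
    for item in resp.json().get("items", []):
        print(item["title"], item["link"])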

Whether you want a public or private custom search engine, I am interested in hearing about your experiences.

December 10, 2013

ThisPlusThat.me: [Topic Vectors?]

Filed under: Search Algorithms,Search Engines,Searching,Vectors — Patrick Durusau @ 7:30 pm

ThisPlusThat.me: A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.

From the post:

Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.

That’s where ThisPlusThat.me comes in — a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through it’s paces.

For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.

… (examples omitted)

word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.

Precisely!!!

😉

Doing it with word2vec requires large training sets of data. No doubt a useful venture if you are seeking to discover or document the word vectors in a domain.
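For the curious, the canonical demonstration runs in a few lines with gensim, assuming you already have a pre-trained model on disk (the file name is a placeholder):

    from gensim.models import KeyedVectors

    # Load a pre-trained word2vec model (path is a placeholder).
    vectors = KeyedVectors.load_word2vec_format("word2vec-wikipedia.bin", binary=True)

    # King - Man + Woman should land near Queen.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))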

But what if you wanted to declare vectors for words?

And then run word2vec (or something very similar) across the declared vectors.

Thinking along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that give that word meaning.

Associations are collections of vector sums that identify subjects that take part in an association.

If we do all addressing by vector sums, we lose the need to track and collect system identifiers.

I think this could have legs.

Comments?

PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.

December 6, 2013

Whoosh

Filed under: Python,Search Engines — Patrick Durusau @ 5:18 pm

Whoosh: Python Search Library

From the webpage:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Some of Whoosh’s features include:

  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Production-quality pure Python spell-checker (as far as I know, the only one).

Whoosh might be useful in the following circumstances:

  • Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
  • As a research platform (at least for programmers that find Python easier to read and work with than Java 😉
  • When an easy-to-use Pythonic interface is more important to you than raw speed.
  • If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).

Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.

Learning more

One of the reasons to use Whoosh made me laugh:

When an easy-to-use Pythonic interface is more important to you than raw speed.

When is raw speed less important than anything? 😉

Seriously, experimentation with search promises to be a fruitful area for the foreseeable future.
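If Whoosh tempts you, a minimal session, loosely following the project’s quick-start documentation, looks roughly like this:

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # Define a schema, create an index directory, and add a couple of documents.
    schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(title="First document", path="/a",
                        content="This is the first document we've added!")
    writer.add_document(title="Second document", path="/b",
                        content="The second one is even more interesting!")
    writer.commit()

    # Parse a query against the 'content' field and search.
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("first")
        for hit in searcher.search(query):
            print(hit["title"], hit["path"])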

I first saw this in Nat Torkington’s Four short links: 21 November 2013.

December 2, 2013

ElasticSearch 1.0.0.Beta2 released

Filed under: Aggregation,ElasticSearch,Search Engines — Patrick Durusau @ 4:08 pm

ElasticSearch 1.0.0.Beta2 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta2, the second beta release on the road to 1.0.0 GA. The new features we have planned for 1.0.0 have come together more quickly than we expected, and this beta release is chock full of shiny new toys. Christmas has come early!

We have added:

Please download elasticsearch 1.0.0.Beta2, try it out, break it, figure out what is missing and tell us about it. Our next release will focus on cleaning up inconsistent APIs and usability, plus fixing any bugs that are reported in the new functionality, so your early bug reports are an important part of ensuring that 1.0.0 GA is solid.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

Suggestion: Pay close attention to the documentation on the new aggregation capabilities.

For example:

There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into two main families:

Bucketing: A family of aggregations that build buckets, where each bucket is associated with a key and a document criteria. When the aggregations is executed, the buckets criterias are evaluated on every document in the context and when matches, the document is considered to “fall in” the relevant bucket. By the end of the aggreagation process, we’ll end up with a list of buckets – each one with a set of documents that “belong” to it.

Metric: Aggregations that keep track and compute metrics over a set of documents

The interesting part comes next, since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associated aggregations on the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!

Interesting, yes?
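As a concrete illustration of the nesting (request body only; the index, fields and values are invented): a terms bucket per author, with an average word-count metric computed inside each bucket.

    import requests

    ES = "http://localhost:9200"     # assumed local node

    query = {
        "size": 0,
        "aggs": {
            "by_author": {                       # bucketing aggregation
                "terms": {"field": "author"},
                "aggs": {                        # metric nested inside each bucket
                    "avg_words": {"avg": {"field": "word_count"}}
                },
            }
        },
    }

    resp = requests.post(f"{ES}/articles/_search", json=query)
    for bucket in resp.json()["aggregations"]["by_author"]["buckets"]:
        print(bucket["key"], bucket["doc_count"], bucket["avg_words"]["value"])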

November 29, 2013

OpenSearchServer

Filed under: Search Engines,Searching — Patrick Durusau @ 8:11 pm

OpenSearchServer by Emmanuel Keller.

From the webpage:

OpenSearchServer is a powerful, enterprise-class, search engine program. Using the web user interface, the crawlers (web, file, database, …) and the REST/RESTFul API you will be able to integrate quickly and easily advanced full-text search capabilities in your application. OpenSearchServer runs on Linux/Unix/BSD/Windows.

Search functions

  • Advanced full-text search features
  • Phonetic search
  • Advanced boolean search with query language
  • Clustered results with faceting and collapsing
  • Filter search using sub-requests (including negative filters)
  • Geolocation
  • Spell-checking
  • Relevance customization
  • Search suggestion facility (auto-completion)

Indexation

  • Supports 17 languages
  • Fields schema with analyzers in each language
  • Several filters: n-gram, lemmatization, shingle, stripping diacritic from words,…
  • Automatic language recognition
  • Named entity recognition
  • Word synonyms and expression synonyms
  • Export indexed terms with frequencies
  • Automatic classification

Document supported

  • HTML / XHTML
  • MS Office documents (Word, Excel, Powerpoint, Visio, Publisher)
  • OpenOffice documents
  • Adobe PDF (with OCR)
  • RTF, Plaintext
  • Audio files metadata (wav, mp3, AIFF, Ogg)
  • Torrent files
  • OCR over images

Crawlers

  • The web crawler for internet, extranet and intranet
  • The file systems crawler for local and remote files (NFS, SMB/CIFS, FTP, FTPS, SWIFT)
  • The database crawler for all JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, …)
  • Filter inclusion or exclusion with wildcards
  • Session parameters removal
  • SQL join and linked files support
  • Screenshot capture
  • Sitemap import

General

  • REST API (XML and JSON)
  • SOAP Web Service
  • Monitoring module
  • Index replication
  • Scheduler for management of periodic tasks
  • WordPress plugin and Drupal module

OpenSearchServer is something to consider if your project is GPL v3 compatible.

Even in an enterprise context, you don’t have to be better than Google at searching the entire WWW.

You just have to be better at searching content of interest to a user, project, department, etc.

The difference between your search results and Google’s should be the difference between a breakfast of near-food at McDonald’s and the best home-cooked breakfast you can imagine.

One is a mass-produced product that is the same over the world, the other is customized to your taste.

Which one would you prefer?

November 27, 2013

Apache Lucene and Solr 4.6.0!

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 11:37 am

Apache Lucene and Solr 4.6.0 are out!

From the announcement:

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html.

Both releases contain a number of bug fixes.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

While it is fair to say that “Both releases contain a number of bug fixes,” I think that gives the wrong impression.

The Lucene 4.6.0 release has 23 new features versus 5 bug fixes and Solr 4.6.0 has 17 new features versus 14 bug fixes. Closer, but 40 new features total versus 19 bug fixes sounds good to me! 😉

Just to whet your appetite for looking at the detailed change lists:

LUCENE-5294 Suggester Dictionary implementation that takes expressions as term weights

From the description:

It could be an extension of the existing DocumentDictionary (which takes terms, weights and (optionally) payloads from the stored documents in the index). The only exception being that instead of taking the weights for the terms from the specified weight fields, it could compute the weights using an user-defn expression, that uses one or more NumicDocValuesField from the document.

Example:
let the document have

  • product_id
  • product_name
  • product_popularity
  • product_profit

Then this implementation could be used with an expression of “0.2*product_popularity + 0.8*product_profit” to determine the weights of the terms for the corresponding documents (optionally along with a payload (product_id))

You may remember I pointed out Mike McCandless’ blog post on this issue.

SOLR-5374 Support user configured doc-centric versioning rules

From the description:

The existing optimistic concurrency features of Solr can be very handy for ensuring that you are only updating/replacing the version of the doc you think you are updating/replacing, w/o the risk of someone else adding/removing the doc in the mean time – but I’ve recently encountered some situations where I really wanted to be able to let the client specify an arbitrary version, on a per document basis, (ie: generated by an external system, or perhaps a timestamp of when a file was last modified) and ensure that the corresponding document update was processed only if the “new” version is greater then the “old” version – w/o needing to check exactly which version is currently in Solr. (ie: If a client wants to index version 101 of a doc, that update should fail if version 102 is already in the index, but succeed if the currently indexed version is 99 – w/o the client needing to ask Solr what the current version)

Redesigned percolator

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 11:15 am

Redesigned percolator by Martijn van Groningen.

From the post:

The percolator is essentially search in reverse, which can by confusing initially for many people. This post will help to solve that problem and give more information on the redesigned percolator. We have added a lot more features to it to help users work with percolated documents/queries more easily.

In normal search systems, you store your data as documents and then send your questions as queries. The search results are a list of documents that matched your query.

With the percolator, this is reversed. First, you store the queries and then you send your ‘questions’ as documents. The percolator results are a list of queries that matched the document.

So what can do percolator do for you? The percolator can be used for a number of use cases, but the most common is for alerting and monitoring. By registering queries in Elasticsearch, your data can be monitored in real-time. If data with certain properties is being indexed, the percolator can tell you what queries this data matches.

For example, imagine a user “saving” a search. As new documents are added to the index, documents are percolated against this saved query and the user is alerted when new documents match. The percolator can also be used for data classification and user query feedback.

Even as a beta feature, this sounds interesting.
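Based on the post, the redesigned API boils down to two calls: register a query under the special .percolator type, then percolate a document against the index. A sketch with an invented index and field:

    import requests

    ES = "http://localhost:9200"   # assumed local node

    # 1. Register a saved search: alert me about articles mentioning Elasticsearch.
    requests.put(
        f"{ES}/news/.percolator/elasticsearch-alert",
        json={"query": {"match": {"body": "elasticsearch"}}},
    )

    # 2. Percolate an incoming document: which saved queries does it match?
    doc = {"doc": {"body": "A newspaper article mentioning Elasticsearch and search."}}
    resp = requests.post(f"{ES}/news/article/_percolate", json=doc)
    print([m["_id"] for m in resp.json().get("matches", [])])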

Another use case could be adhering to a Service Level Agreement (SLA).

You could have tiered search result packages that guarantee the freshness of search results. Near real-time would be more expensive than within six (6) hours or within the next business day. The match to a stored query could be queued up for delivery in accordance with your SLA.

I pay more for faster delivery times from FedEx, UPS, and the US Post Office.

Why shouldn’t faster information cost more than slower information?

True, there are alternative suppliers of information, but then you remind your prospective client of the old truism: you get what you pay for.

That is not contradicted by IT disasters such as HealthCare.gov.

The government hired contractors that are hard to distinguish from their agency counterparts and who are interested in “butts in seats” and not any useful results.

In that sense, the government literally got what it paid for. Had it wanted a useful healthcare IT project, it would not have put government drones in charge of the project.

November 20, 2013

Relevancy 301 – The Graduate Level Course

Filed under: Relevance,Search Algorithms,Search Engines — Patrick Durusau @ 7:58 pm

Relevancy 301 – The Graduate Level Course by Paul Nelson.

From the post:

So, I was going to write an article entitled “Relevancy 101”, but that seemed too shallow for what has become a major area of academic research. And so here we are with a Graduate-Level Course. Grab your book-bag, some Cheetos and a Mountain Dew, and let’s kick back and talk search engine relevancy.

I have blogged about relevancy before (see “What does ‘relevant’ mean?)”, but that was a more philosophical discussion of relevancy. The purpose of this blog is to go in-depth into the different types of relevancy, how they’re computed, and what they’re good for. I’ll do my best to avoid math, but no guarantees.

A very good introduction to measures of “relevancy,” most of which are no longer used.

Pay particular attention to Paul’s remarks about the weaknesses of inverse document frequency (IDF).
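A toy illustration of the behavior Paul flags: IDF rewards a term simply for being rare, whether or not it tells you anything about the document (the corpus is obviously invented):

    import math

    # Four tiny "documents"; 'serendipity' is rare, 'search' is common.
    docs = [
        "search engines rank documents",
        "search relevancy is hard",
        "users type search queries",
        "serendipity helps search",
    ]

    def idf(term, corpus):
        """Classic inverse document frequency: log(N / df)."""
        df = sum(1 for d in corpus if term in d.split())
        return math.log(len(corpus) / df)

    for term in ("search", "serendipity"):
        print(term, round(idf(term, docs), 3))

    # 'serendipity' scores far higher than 'search' even though it says little
    # about what the document is actually about -- rarity is not relevance.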

Before Paul posts part 2, how do you determine the relevance of documents?

Exercise:

Pick a subject covered by a journal or magazine, one with twelve issues each year, and review a year’s worth of issues for “relevant” articles.

Assuming the journal is available electronically, does its search engine suggest the articles you found “relevant”?

If it doesn’t, can you determine why it recommended different articles?

…Scorers, Collectors and Custom Queries

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:30 pm

Lucene Search Essentials: Scorers, Collectors and Custom Queries by Mikhail Khludnev.

From the description:

My team is building next generation eCommerce search platform for major an online retailer with quite challenging business requirements. Turns out, default Lucene toolbox doesn’t ideally fit for those challenges. Thus, the team had to hack deep into Lucene core to achieve our goals. We accumulated quite a deep understanding of Lucene search internals and want to share our experience. We will start with an API overview, and then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls and common performance problems.

Don’t be frightened of the slide count at 179!

Multiple slides are used with single illustrations to demonstrate small changes.

Having said that, this is a “close to the metal” type presentation.

Worth your time but read along carefully.

Don’t miss the extremely fine index on slide 18.

Follow http://www.lib.rochester.edu/index.cfm?PAGE=489 for images of pages that go with the index. This copy of Fasciculus Temporum dates from 1480.

November 14, 2013

DeleteDuplicates based on crawlDB only [Nutch-656]

Filed under: Nutch,Search Engines — Patrick Durusau @ 5:37 pm

DeleteDuplicates based on crawlDB only [Nutch-656]

As of today, Nutch, well, the nightly build after tonight, will have the ability to delete duplicate URLs.

Step in the right direction!

Now if only duplicates could be declared on more than duplicate URLs, and relationships maintained across deletions. 😉

November 6, 2013

elasticsearch 1.0.0.beta1 released

Filed under: ElasticSearch,Lucene,Search Engines,Searching — Patrick Durusau @ 8:04 pm

elasticsearch 1.0.0.beta1 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback is essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning “Elasticsearch”.

Percolation has been supported by Elasticsearch for a long time. In the current implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have the more percolation you can do.

After reading the news release I understand why Twitter traffic on the elasticsearch release surged today. 😉

A new major feature with each beta release? That should attract some attention.

Not to mention “distributed percolation.”

Getting closer to a result being the “result” at X time on the system clock.

October 24, 2013

The Gap Between Documents and Answers

Filed under: Search Behavior,Search Engines,Searching,Semantic Search — Patrick Durusau @ 1:49 pm

I mentioned the webinar: Driving Knowledge-Worker Performance with Precision Search Results a few days ago in Findability As Value Proposition.

There was one nugget (among many) in the webinar that I want to capture before I lose sight of how important it is to topic maps and semantic technologies in general.

Dan Taylor (Earley and Associates) was presenting a maturation diagram for knowledge technologies.

See the presentation for the details, but what struck me was that on the left side (starting point) there were documents. On the right side (the goal) were answers.

Think about that for a moment.

When you search in Google or any other search engine, what do you get back? Pointers to documents, presentations, videos, etc.

What task remains? Digging out answers from those documents, presentations, videos.

A mature knowledge technology goes beyond what an average user is searching for (the Google model) and returns information based on a specific user for a particular domain, that is, an answer.

For the average user there may be no better option than to drop them off in the neighborhood of a correct answer. Or what may be a correct answer to the average user. No guarantees that you will find it.

The examples in the webinar are in specific domains where user queries can be modeled accurately enough to formulate answers (not documents) in response to queries.

Reminds me of TaxMap. You?

If you want to do a side by side comparison, try USC: Title 26 – Internal Revenue Code from the Legal Information Institute (Cornell).

Don’t get me wrong, the Cornell materials are great but they reflect the U.S. Code, nothing more or less. That is to say the text you find there isn’t engineered to provide answers. 😉

I will update this post with the webinar address as soon as it appears.

October 20, 2013

Crawl Anywhere

Filed under: Search Engines,Search Interface,Solr,Webcrawler — Patrick Durusau @ 5:59 pm

Crawl Anywhere 4.0.0-release-candidate available

From the Overview:

What is Crawl Anywhere?

Crawl Anywhere allows you to build vertical search engines. Crawl Anywhere includes:

  • a Web Crawler with a powerful Web user interface
  • a document processing pipeline
  • a Solr indexer
  • a full featured and customizable search application

You can see the diagram of a typical use of all components in this diagram.

Why was Crawl Anywhere created?

Crawl Anywhere was originally developed to index in Apache Solr 5400 web sites (more than 10.000.000 pages) for the Hurisearch search engine: http://www.hurisearch.org/. During this project, various crawlers were evaluated (heritrix, nutch, …) but one key feature was missing : a user friendly web interface to manage Web sites to be crawled with their specific crawl rules. Mainly for this raison, we decided to develop our own Web crawler. Why did we choose the name "Crawl Anywhere" ? This name may appear a little over stated, but crawl any source types (Web, database, CMS, …) is a real objective and Crawl Anywhere was designed in order to easily implement new source connectors.

Can you create a better search corpus for some domain X than Google?

Less noise and trash?

More high quality content?

Cross referencing? (Not more like this but meaningful cross-references.)

There is only one way to find out!

Crawl Anywhere will help you with the technical side of creating a search corpus.

What it won’t help with is developing the strategy to build and maintain such a corpus.

Interested in how you go beyond creating a subject specific list of resources?

A list that leaves a reader to sort through the chaff. Time and time again.

Pointers, suggestions, comments?

October 1, 2013

Elasticsearch internals: an overview

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:50 pm

Elasticsearch internals: an overview by Njal Karevoll.

From the post:

This article gives an overview of the Elasticsearch internals. I will present a 10,000 foot view of the different modules that Elasticsearch is composed of and how we can extend or replace built-in functionality using plugins.

Using Freemind, Njal has created maps of the namespaces and modules of ElasticSearch for your exploration.

The full module view reminds me of SGML productions, except less complicated.
