Archive for the ‘Search Engines’ Category

How Impoverished is the “current world of search?”

Wednesday, May 8th, 2013

Internet Content Is Looking for You

From the post:

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the search text for which a user is seeking. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

(…)

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches is context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person who has subject matter knowledge beyond that of the average reader and so is able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

FindZebra

Thursday, May 2nd, 2013

FindZebra

From the about page:

FindZebra is a specialised search engine supporting medical professionals in diagnosing difficult patient cases. Rare diseases are especially difficult to diagnose and this online medical search engines comes in support of medical personnel looking for diagnostic hypotheses. With a simple and consistent interface across all devices, it can be easily used as an aid tool at the time and place where medical decisions are made. The retrieved information is collected from reputable sources across the internet storing public medical articles on rare and genetic diseases.

A search engine with: WARNING! This is a research project to be used only by medical professionals.

To avoid overwhelming researchers with search result “noise,” FindZebra deliberately restricts the content it indexes.

That altering the inputs is the easiest way to improve outcomes for particular types of searches illustrates the crudeness of current search algorithms.

That seems to be an argument in favor of smaller-than-enterprise search engines, which could roll up into broader search applications.

Of course, with a topic map you could retain the division between departments even as you roll up the content into broader search applications.

Google search:… [GDM]

Sunday, April 21st, 2013

Google search: three bugs to fix with better data science by Vincent Granville.

Vincent outlines three issues with Google search results:

  1. Outdated search results
  2. Wrongly attributed articles
  3. Favoring irrelevant pages

See Vincent’s post for advice on how Google can address these issues. (Might help with a Google interview to tell them how to fix such long-standing problems.)

More practically, how does your TM application rate on the outdated search results?

Do you just dump content on the user to sort out (the Google dump model (GDM)) or are your results a bit more user friendly?

Broccoli: Semantic Full-Text Search at your Fingertips

Friday, April 19th, 2013

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”
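To make the quoted description concrete, here is a minimal sketch of a context list in plain Python. It is not Broccoli's implementation: the toy contexts, the leading ":" used to mark entities, and the fixed three-letter prefix are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy "contexts" (assumed data). Entities are marked with a leading ":".
contexts = [
    ["edible", ":Broccoli", "leaves"],
    ["edible", ":Rhubarb", "stalks"],
    ["poisonous", ":Rhubarb", "leaves"],
]


def is_entity(item):
    return item.startswith(":")


def build_context_lists(contexts, prefix_len=3):
    """Map each word prefix to (context id, position, item) entries.

    For every context holding a word with that prefix, the list gets the
    matching word occurrences plus every entity occurrence in that context.
    """
    index = defaultdict(list)
    for cid, items in enumerate(contexts):
        words = [(pos, w) for pos, w in enumerate(items) if not is_entity(w)]
        entities = [(pos, e) for pos, e in enumerate(items) if is_entity(e)]
        for prefix in sorted({w[:prefix_len] for _, w in words}):
            matching = [(pos, w) for pos, w in words if w.startswith(prefix)]
            for pos, item in sorted(matching + entities):
                index[prefix].append((cid, pos, item))
    return index


def entities_for_prefix(index, prefix):
    """Entities that co-occur with any word starting with the prefix."""
    return sorted({item for _, _, item in index.get(prefix, []) if is_entity(item)})


ctx_index = build_context_lists(contexts)
print(ctx_index["edi"])
print(entities_for_prefix(ctx_index, "poi"))  # [':Rhubarb']
```

An ordinary inverted list for "edi" would hold only the two "edible" entries. The context list also carries the entities that share a context with them, which is what lets a single list scan answer "which entities occur with this word?"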

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

Improving Twitter search with real-time human computation ["semantics supplied"]

Tuesday, April 9th, 2013

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

(3) Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….
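The first step in the quoted overview, noticing that a query has suddenly become popular, can be sketched in a few lines of plain Python. This is not Twitter's Storm topology. The class name, the thresholds and the per-window counts are assumptions chosen only to show the idea of comparing the latest window against a recent baseline.

```python
from collections import defaultdict, deque


class SpikeDetector:
    """Flag queries whose latest window count far exceeds their recent average."""

    def __init__(self, history=24, min_count=50, ratio=5.0):
        self.min_count = min_count  # ignore queries with too little traffic
        self.ratio = ratio          # how many times the baseline counts as a spike
        self.past = defaultdict(lambda: deque(maxlen=history))

    def observe(self, window_counts):
        """Take {query: count} for the window just ended, return spiking queries."""
        spiking = []
        for query, count in window_counts.items():
            past = self.past[query]
            if not past:
                continue  # no baseline yet, so nothing to compare against
            baseline = max(sum(past) / len(past), 1.0)
            if count >= self.min_count and count > self.ratio * baseline:
                spiking.append(query)
        # Record this window for every known query, including silent ones.
        for query in set(self.past) | set(window_counts):
            self.past[query].append(window_counts.get(query, 0))
        return sorted(spiking)


detector = SpikeDetector(history=3)
print(detector.observe({"big bird": 0, "weather": 100}))   # []
print(detector.observe({"big bird": 1, "weather": 110}))   # []
print(detector.observe({"big bird": 400, "weather": 95}))  # ['big bird']
```

In the pipeline described in the post, a query flagged this way would then be handed to human judges. Everything after detection is outside this sketch.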

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplied semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between ask an expert (you get his/her semantics) and ask all of us (too many answers to be useful).

More to follow.

ElasticSearch: Text analysis for content enrichment

Saturday, March 30th, 2013

ElasticSearch: Text analysis for content enrichment by Jaibeer Malik.

From the post:

Taking an example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by any search solution plays very big role in it. As a search user, I would prefer some of typical search behavior for my query to automatically return,

  • should look for synonyms matching my query text
  • should match singular and plural words or words sounding similar to enter query text
  • should not allow searching on protected words
  • should allow search for words mixed with numeric or special characters
  • should not allow search on html tags
  • should allow search text based on proximity of the letters and number of matching letters

Enriching the content here would be to add above search capabilities to your content while indexing and searching for the content.

I thought the “…look for synonyms matching my query text…” might get your attention. ;-)

Not quite a topic map because there isn’t any curation of the search results, saving the next searcher time and effort.

But in order to create and maintain a topic map, you are going to need expansion of your queries by synonyms.

You will take the results of those expanded queries and fashion them into a topic map.
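As a rough illustration of query expansion by synonyms, here is a small sketch in plain Python. It does not use ElasticSearch or its analyzers. The synonym table, the tag stripping and the naive plural rule are assumptions standing in for a real text analysis chain.

```python
import re

# Hypothetical synonym table; a real one would be curated for the domain.
SYNONYMS = {
    "phone": {"telephone", "mobile"},
    "tv": {"television"},
}


def tokenize(text):
    """Strip HTML tags, lowercase, and split into alphanumeric tokens."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def singular(token):
    """Very naive plural folding: drop one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def expand_query(query):
    """Return one sorted group of alternatives per query term."""
    groups = []
    for token in tokenize(query):
        base = singular(token)
        groups.append(sorted({base} | SYNONYMS.get(base, set())))
    return groups


def matches(query, document):
    """A document matches if it contains at least one term from every group."""
    doc_terms = {singular(t) for t in tokenize(document)}
    return all(doc_terms & set(group) for group in expand_query(query))


print(expand_query("<b>Phones</b> reviews"))
# [['mobile', 'phone', 'telephone'], ['review']]
print(matches("phone reviews", "Review of a new <i>telephone</i>"))  # True
print(matches("tv", "radio listings"))                               # False
```

The expansion widens what the machine harvests. Deciding which of those hits belong together is still the curation step.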

Think of it this way:

Machines can rapidly harvest, even sort content at your direction.

What they can’t do is curate the results of their harvesting.

That requires a secret ingredient.

That would be you.

I first saw this at DZone.

Build a search engine in 20 minutes or less

Thursday, March 28th, 2013

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.

Enjoy!

PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.
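For readers who want the idea without following the link, here is a minimal vector space sketch in Python. It is not Ben Ogorek's code. The three toy documents and the particular TF-IDF weighting (raw term count times log(N/df) + 1) are assumptions chosen for brevity.

```python
import math
from collections import Counter

docs = [
    "the bank approved the loan",
    "the river bank flooded",
    "a loan from the credit union",
]


def tokenize(text):
    return text.lower().split()


def tfidf(counts, idf):
    """Weight raw term counts by inverse document frequency."""
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / n) + 1.0 for term, n in df.items()}
    vectors = [tfidf(Counter(tokens), idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, idf, vectors):
    """Return documents with a non-zero cosine score, best first."""
    q = tfidf(Counter(tokenize(query)), idf)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [docs[i] for score, i in ranked if score > 0]


idf, vectors = build_index(docs)
print(search("river bank", docs, idf, vectors))
# ['the river bank flooded', 'the bank approved the loan']
print(search("money lender", docs, idf, vectors))
# []
```

The second query makes the point of the PS: "money lender" is plainly related to the loan documents, but it shares no word occurrences with them, so the vector space model returns nothing.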

URL Search Tool!

Wednesday, March 6th, 2013

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it take to run your jobs since you can now run them across only on the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!

How Search Works [Promoting Your Interface]

Tuesday, March 5th, 2013

How Search Works (Google)

Clever graphics and I rather liked the:

By the way, in the **** seconds you’ve been on this page, approximately *********** searches were performed.

Not that you want that sort of tracking if your topic map interface only gets two or three “hits” a day but in an enterprise context…, might be worth thinking about.

Evidence of the popularity of your topic map interface with the troops.

I first saw this in a tweet by Christian Jansen.

dtsearch Tutorial Videos

Tuesday, February 19th, 2013

Tutorials for the dtsearch engine have been posted to ediscovery TV.

In five parts:

Part 1

Part 2

Part 3

Part 4

Part 5

I skipped over the intro videos only to find:

Not being able to “select all” in Excel doesn’t increase my confidence in the presentation. (part 3)

The copying of files that are “responsive” to a search request is convenient but not all that impressive. (part 4)

User isn’t familiar with basic operations in dtsearch, such as files not copied. Does finally appear. (part 5)

Disappointing because I remember dtsearch from years ago and it was (and still is) an impressive bit of work.

Suggestion: Don’t judge dtsearch by these videos.

I started to suggest you download all the brochures/white papers you will find at: http://www.dtsearch.com/contact.html

There is a helpful “Download All: PDF Portfolio” link. Except that it doesn’t work in Chrome at least. Keeps giving me a Download Adobe Acrobat 10 download window. Even after I install Adobe Acrobat 10.

Here’s a general hint for vendors: Don’t try to help. You will get it wrong. If you want to give users access to files, great, but let viewing/use be on their watch.

So, download the brochures/white papers individually until dtsearch recovers from another self-inflicted marketing wound.

Then grab a 30-day evaluation copy of the software.

It may or may not fit your needs but you will get a fairer sense of the product than you will from the videos or parts of the dtsearch website.

Maybe that’s the key: They are great search engineers, not so hot at marketing or websites.

I first saw this at dtSearch Harnesses TV Power. Where videos are cited, but not watched.

Solr Unleashed

Friday, February 15th, 2013

Solr Unleashed: A Hands-On Workshop for Building Killer Search Apps (LucidWorks)

From the post:

Having consulted with clients on Lucene and Solr for the better part of a decade, we’ve seen the same mistakes made over and over again: applications built on shaky foundations, stretched to the breaking point. In this two day class, learn from the experts about how to do it right and make sure your apps are rock solid, scalable, and produce relevant results. Also check the course outline.

The course looks great, but if you don’t have the fees, I have reproduced the course outline below.

Using online documentation, mailing lists and other online resources, track the outline and fill it in for yourself.

If you want a real challenge, work through the outline and then build a Solr application around the outline.

To keep your newly acquired skills polished to a fine sheen.

1. The Fundamentals

  • About Solr
  • Installing and running Solr
  • Adding content to Solr
  • Reading a Solr XML response
  • Changing parameters in the URL
  • Using the browse interface

2. Searching

  • Sorting results
  • Query parsers
  • More queries
  • Hardwiring request parameters
  • Adding fields to default search
  • Faceting on fields
  • Range faceting
  • Date range faceting
  • Hierarchical faceting
  • Result grouping

3. Indexing

  • Adding your own content to Solr
  • Deleting data from Solr
  • Building a bookstore search
  • Adding book data
  • Exploring the book data
  • Dedupe updateprocessor

4. Updating your schema

  • Adding fields to the schema
  • Analyzing text

5. Relevance

  • Field weighting
  • Phrase queries
  • Function queries

6. Extended features

  • More-like-this
  • Fuzzier search
  • Sounds-like
  • Geospatial
  • Spell checking
  • Suggestions
  • Highlighting

7. Multilanguage

  • Working with English
  • Working with other languages
  • Non-whitespace languages
  • Identifying languages
  • Language specific sorting

8. SolrCloud

  • Introduction
  • How SolrCloud works
  • Commit strategies
  • ZooKeeper
  • Managing Solr config files

Not the same as the class but will help you ask better questions of LucidWorks experts when you need them.

DuckDuckGo Architecture…

Sunday, February 3rd, 2013

DuckDuckGo Architecture – 1 Million Deep Searches A Day And Growing Interview with Gabriel Weinberg.

From the post:

This is an interview with Gabriel Weinberg, founder of Duck Duck Go and general all around startup guru, on what DDG’s architecture looks like in 2012.

Innovative search engine upstart DuckDuckGo had 30 million searches in February 2012 and averages over 1 million searches a day. It’s being positioned by super investor Fred Wilson as a clean, private, impartial and fast search engine. After talking with Gabriel I like what Fred Wilson said earlier, it seems closer to the heart of the matter: We invested in DuckDuckGo for the Reddit, Hacker News anarchists.
                  
Choosing DuckDuckGo can be thought of as not just a technical choice, but a vote for revolution. In an age when knowing your essence is not about love or friendship, but about more effectively selling you to advertisers, DDG is positioning themselves as the do not track alternative, keepers of the privacy flame. You will still be monetized of course, but in a more civilized and anonymous way.

Pushing privacy is a good way to carve out a competitive niche against Google et al, as by definition they can never compete on privacy. I get that. But what I found most compelling is DDG’s strong vision of a crowdsourced network of plugins giving broader search coverage by tying an army of vertical data suppliers into their search framework. For example, there’s a specialized Lego plugin for searching against a complete Lego database. Use the name of a spice in your search query, for example, and DDG will recognize it and may trigger a deeper search against a highly tuned recipe database. Many different plugins can be triggered on each search and it’s all handled in real-time.

Can’t searching the Open Web provide all this data? No really. This is structured data with semantics. Not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem of course is now that data has become valuable many grown ups don’t want to share anymore.

Being ad supported puts DDG in a tricky position. Targeted ads are more lucrative, but ironically DDG’s do not track policies means they can’t gather targeting data. Yet that’s also a selling point for those interested in privacy. But as search is famously intent driven, DDG’s technology of categorizing queries and matching them against data sources is already a form of high value targeting.

It will be fascinating to see how these forces play out. But for now let’s see how DuckDuckGo implements their search engine magic…

Some topic map centric points from the post:

  • Dream is to appeal to more niche audiences to better serve people who care about a particular topic. For example: lego parts. There’s a database of Lego parts, for example. Pictures of parts and part numbers can be automatically displayed from a search.
  • Some people just use different words for things. Goal is not to rewrite the query, but give suggestions on how to do things better.
  • “phone reviews” for example, will replace phone with telephone. This happens through an NLP component that tries to figure out what phone you meant and if there are any synonyms that should be used in the query.

Those are the ones that caught my eye, there are no doubt others.

Not to mention a long list of DuckDuckGo references at the end of the post.

What place(s) would you suggest to DuckDuckGo where topic maps would make a compelling difference?

Google Guide [Improve Google Searching]

Thursday, January 31st, 2013

Google Guide by Nancy Blachman.

Non-official documentation for Google searching but very nice non-official documentation.

If you want to improve your Google searching, this is a good place to start!

Available in English, Dutch, German, Hebrew and Italian.

Identity In A Context

Wednesday, January 30th, 2013

Jasmine Ashton frames a quote about Julie Lynch, an archivist, saying:

Due to the nature of her work, Lynch is the human equivalent of a search engine. However, she differs in one key aspect:

“Unlike Google, Lynch delivers more than search results, she provides context. That sepia-tinged photograph of the woman in funny-looking clothes on a funny-looking bicycle actually offers a window into the impact bicycles had on women’s independence. An advertisement touting “can build frame houses” demonstrates construction restrictions following the Great Chicago Fire. Surprisingly, high school yearbooks — the collection features past editions from Lane Tech, Amundsen and Lake View High Schools — serve as more than a cautionary tale in the evolution of hairstyles.”

Despite the increase in technology that makes searching information as easy as tapping a touch screen, this article reiterates the importance of having real people to contextualize these documents. (How Librarians Play an Integral Role When Searching for Historical Documents)

Rather than say “contextualize,” I would prefer to say that librarians provide alternative “contexts” for historical documents.

Recognition of a document, or any other subject, takes place in a context. A librarian can offer the user different contexts in which to understand a document.

Doesn’t invalidate the initial context of understanding, simply becomes an alternative one.

Quite different from our search engines, which see only “matches” and no context for those matches.

Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Saturday, January 26th, 2013

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with schema.org. However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hi-jacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

  1. Google develops a service for webmasters to add semantic annotations to their webpages.
  2. Google allows webmasters to use that service at no charge.
  3. Google uses those annotations to improve the search results it provides users (for free).

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:

Hijacking

(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.

You?

I first saw this at: Google’s Structured Data Take Over by Angela Guess.

MongoDB Text Search Tutorial

Thursday, January 17th, 2013

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature providing some examples for query syntax (negation, phrase search)—according to the previous post even more advanced queries should be supported, filtering and projections, multiple text fields indexing, and adding details about the stemming solution used (Snowball).

Alex also has a list of his posts on the text search feature for MongoDB.

Symbolab

Tuesday, January 15th, 2013

Symbolab

Described as:

Symbolab is a search engine for students, mathematicians, scientists and anyone else looking for answers in the mathematical and scientific realm. Other search engines that do equation search use LaTex, the document mark up language for mathematical symbols which is the same as keywords, which unfortunately gives poor results.

Symbolab uses proprietary machine learning algorithms to provide the most relevant search results that are theoretically and semantically similar, rather than visually similar. In other words, it does a semantic search, understanding the query behind the symbols, to get results.

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation. This means scale can come much quicker than other search engines that are limited by language.

From: The guys from The Big Bang Theory will love mathematical search engine Symbolab by Shira Abel. (Includes an interview with Michael Avny, the CEO of Symbolab.)

Limited to web content at the moment but a “scholar” option is in the works. I assume that will extend into academic journals.

Focused now on mathematics, physics and chemistry, but in principle should be extensible to related areas. I am particularly anxious to hear they are indexing CS publications!

Would be really nice if Springer, Elsevier, the AMS and others would permit indexing of their equations.

That presumes publishers would realize that shutting out users not at institutions is a bad marketing plan. With a marginal delivery cost of near zero and sunk costs from publication already fixed, every user a publisher gains at $200/year for their entire collection is $200 they did not have before.

Not to mention the citation and use of their publication, which just drives more people to publish there. A virtuous circle if you will.

The only concern I have is the comment:

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation.

Which is directly contrary to what Michael is quoted as saying in the interview:

You say “Each symbol can mean different things within and across disciplines, order and position of elements matter, priority of features, etc.” Can you give an example of this?

The authors of the Foundations of Rule Learning spent five years attempting to reconcile notations used in rule making. Some symbols had different meanings. They resorted to inventing yet another notation as a solution.

Why the popular press perpetuates the myth of a universal language isn’t clear.

It isn’t useful and in some cases, such as national security, it leads to waste of time and resources on attempts to invent a universal language.

The phrase “myth of a universal language” should be a clue. Universal languages don’t exist. They are myths, by definition.

Anyone who says differently is trying to sell you something. Something that is in their interest and perhaps not yours.

I first saw this at Introducing Symbolab: Search for Equations by Angela Guess.

Advanced Power Searching [January 23, 2013]

Saturday, January 12th, 2013

Advanced Power Searching

From the post:

Advanced Power Searching with Google begins on January 23, 2013!

Register now to sharpen your research skills and strengthen your use of advanced Google search techniques to answer complex questions. Throughout this course you’ll also:

  • Take your search strategies to a new level with sophisticated, independent search challenges.
  • Join a community of Advanced Searchers working together to solve search challenges.
  • Pose questions to Google search experts live in Hangouts and through a course forum.
  • Receive an Advanced Power Searching certificate upon completion.

Not sure if you’re ready for Advanced Power Searching? Brush up on your search skills by visiting the Power Searching with Google course.

Topic maps help keep found information found but you have to find it first. ;-)

Enjoy!

Solr vs ElasticSearch: Part 5 – Management API Capabilities

Friday, January 11th, 2013

Solr vs ElasticSearch: Part 5 – Management API Capabilities by Rafał Kuć.

From the post:

In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregations possibilities. However, till now we have not discussed any of the administration and management options and things you can do on a live cluster without any restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer.

Rafał continues this excellent series on Solr and ElasticSearch and promises there is more to come!

This series sets a high standard for posts comparing search capabilities!

izik Debuts as #1 Free Reference App on iTunes

Wednesday, January 9th, 2013

izik Debuts as #1 Free Reference App on iTunes

From the post:

We launched izik, our search app for tablets, last Friday and are amazed at the responses we’ve received! Thanks to our users, on day one izik was the #1 free reference app on iTunes and #49 free app overall. Yesterday we were mentioned twice in the New York Times, here and here (also in the B1 story in print). We are delighted that there is such a strong desire to see something fresh and new in search, and that our vision with izik is so well received.

The twitterverse has been especially active in spreading the word about izik. We’ve seen a lot of comments about the beautiful design and interface, the useful categories, and most importantly the high quality results that make izik a truly viable choice for searching on tablets.

Just last Monday I remarked: “From the canned video I get the sense that the interface is going to make search different.” (izik: Take Search for a Joy Ride on Your Tablet)

Users with tablets have supplied the input I asked for in that post and it is overwhelmingly in favor of izik.

To paraphrase Ray Charles in the Blues Brothers:

“E-excuse me, uh, I don’t think there’s anything wrong with the action on [search applications].”

There is plenty of “action” left in the search space.

izik is fresh evidence for that proposition.

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Monday, December 10th, 2012

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URL’s and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison but roughly, Nutch 1.x keeps its crawl data in Hadoop file structures and hands indexing to Solr, while Nutch 2.x stores its data through Gora in backends such as HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

Fun with Lucene’s faceted search module

Sunday, December 9th, 2012

Fun with Lucene’s faceted search module by Mike McCandless.

From the post:

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene’s facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice “getting started” examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I’m sure there are more…

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
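To make the idea of hierarchical facet counts concrete, here is a minimal sketch in plain Python. It is not Lucene's facet API: the function name `facet_counts`, the document layout, and the sample data are all assumptions made up for illustration. It only shows the "usual counts" case, where a hit under a deeper facet node also counts toward each of its ancestors.

```python
from collections import Counter


def facet_counts(hits):
    """Count matching documents under every node of their facet paths.

    Hypothetical helper for illustration only, not part of Lucene.
    """
    counts = Counter()
    for doc in hits:
        seen = set()
        for path in doc["facets"]:
            # A hit under Date/2012/12 also counts under Date/2012 and Date.
            for depth in range(1, len(path) + 1):
                seen.add(path[:depth])
        # Each document contributes at most once to any facet node.
        counts.update(seen)
    return counts


# Made-up search hits, each carrying hierarchical facet paths as tuples.
hits = [
    {"id": 1, "facets": [("Date", "2012", "12"), ("Author", "Mike")]},
    {"id": 2, "facets": [("Date", "2012", "11"), ("Author", "Shai")]},
    {"id": 3, "facets": [("Date", "2011", "12"), ("Author", "Mike")]},
]

counts = facet_counts(hits)
print(counts[("Date",)])           # 3
print(counts[("Date", "2012")])    # 2
print(counts[("Author", "Mike")])  # 2
```

A real facet module does far more (sampling, complements, aggregates other than hit count), but the drill-down counts a user sees in a faceted interface boil down to this kind of tally.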

Take some time over the holidays to play with faceted searches in Lucene.

eGIFT: Mining Gene Information from the Literature

Thursday, November 22nd, 2012

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.

Website: http://biotm.cis.udel.edu/eGIFT
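The abstract says iTerms are ranked by comparing how often a term occurs in a gene's literature with how often it occurs in documents about genes in general, but it does not give the formula. The sketch below is therefore only an assumption about the general shape of such a score: it uses a simple difference of document frequencies, and the function name `rank_terms` and the toy abstracts are invented for illustration.

```python
def rank_terms(gene_docs, background_docs):
    """Rank terms by how much more often they appear in gene_docs
    than in background_docs.

    Assumed scoring (difference of document frequencies), not eGIFT's
    actual formula.
    """
    def doc_freq(docs):
        freq = {}
        for doc in docs:
            # Count each term once per document.
            for term in set(doc.lower().split()):
                freq[term] = freq.get(term, 0) + 1
        return {term: n / len(docs) for term, n in freq.items()}

    gene_freq = doc_freq(gene_docs)
    back_freq = doc_freq(background_docs)
    scores = {t: f - back_freq.get(t, 0.0) for t, f in gene_freq.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy stand-ins for Medline abstracts (made up for this example).
gene_docs = [
    "brca1 repairs dna damage",
    "brca1 mutation raises cancer risk",
    "dna repair requires brca1",
]
background_docs = [
    "the gene encodes a protein",
    "dna binding protein",
    "gene expression in cancer cells",
    "protein kinase signalling",
]

ranked = rank_terms(gene_docs, background_docs)
print(ranked[0][0])  # brca1
print(ranked[1][0])  # dna
```

Terms that are common everywhere, such as "cancer" in this toy data, sink toward the bottom, which is the point of comparing against a background collection.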

Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities, discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed, which is where most researchers stop.)

Developing New Ways to Search for Web Images

Thursday, November 22nd, 2012

Developing New Ways to Search for Web Images by Shar Steed.

From the post:

Collections of photos, images, and videos are quickly coming to dominate the content available on the Web. Currently internet search engines rely on the text with which the images are labeled to return matches. But why is only text being used to search visual mediums? These labels can be unreliable, unhelpful and sometimes not available at all.

To solve this problem, scientists at Stanford and Princeton have been working to “create a new generation of visual search technologies.” Dr. Fei-Fei Li, a computer scientist at Stanford, has built the world’s largest visual database, containing more than 14 million labeled objects.

A system called ImageNet applies the data gathered from the database to recognize similar, unlabeled objects with much greater accuracy than past algorithms.

A remarkable amount of material to work with, either via the API or downloading for your own hacking.

Another tool for assisting in the authoring of topic maps (or other content).

Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20

Saturday, November 10th, 2012

Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20 by Rafał Kuć.

A very nice summary (slides) of Rafał’s comparison of the latest releases of Solr and ElasticSearch.

They differ and those differences fit some use cases better than others.

And the winner is: … (well, I won’t spoil the surprise.)

Read the slides.

Unless you are Rafał, you will learn something you didn’t know before.

Information Retrieval and Search Engines [Committers Needed!]

Wednesday, October 10th, 2012

Information Retrieval and Search Engines

A proposal is pending to create a Q&A site for people interested in information retrieval and search engines.

But it needs people to commit to using it and answering questions!

That could be you!

There’s a lot of action left in information retrieval and search engines.

Don’t have to believe me. Have you tried one lately? ;-)

What’s so cool about elasticsearch?

Friday, October 5th, 2012

What’s so cool about elasticsearch? by Luca Cavanna.

From the post:

Whenever there’s a new product out there and you start using it, suggest it to customers or colleagues, you need to be prepared to answer this question: “Why should I use it?”. Well, the answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but then at some point you may need to explain why. I recently had to answer the question, “So what’s so cool about elasticsearch?”, that’s why I thought it might be worthwhile sharing my own answer in this blog.

It’s not a staid comparison piece but a partisan, “this is cool” piece.

You will find it both entertaining and informative. Good weekend reading.

It will give you something to have a strong opinion about (one way or the other) next Monday!

Local Search – How Hard Can It Be? [Unfolding Searches?]

Friday, September 21st, 2012

Local Search – How Hard Can It Be? by Matthew Hurst.

From the post:

This week, Apple got a rude awakening with its initial foray into the world of local search and mapping. The media and user backlash to their iOS upgrade which removes Google as the maps and local search partner and replaces it with their own application (built on licensed data) demonstrates just how important the local scenario is to the mobile space.

While the pundits are reporting various (and sometimes amusing) issues with the data and the search service, it is important to remind ourselves how hard local search can be.

For example, if you search on Google for Key Arena – a major venue in Seattle located in the famous Seattle Center, you will find some severe data quality problems.

See Matthew’s post for the details, but I am mostly interested in his final observation:

One of the ironies of local data conflation is that landmark entities (like stadia, large complex hotels, hospitals, etc.) tend to have lots of data (everyone knows about them) and lots of complexity (the Seattle Center has lots of things within it that can be confused). These factors conspire to make the most visible entities in some ways the entities more prone to problems.

Every library student is (or should be) familiar with the “reference interview.” A patron asks a question (consider this to be the search request, “Key Arena”) and a librarian uses the reference interview to further identify the information being requested.

Contrast that unfolding of the search request, which at any juncture offers different paths to different goals, with the “if you can identify it, you can find it,” approach of most search engines.

Computers have difficulty searching successfully for complex entities such as “Key Arena.” A librarian starting with the same query does not.

Doesn’t that suggest to you that “unfolding” searches may be a better model for computer searching than simple identification?

Not just static facets, but a dynamic presentation of the details most likely to distinguish the subjects searched for by users under similar circumstances.

Sounds like the sort of heuristic knowledge that topic maps could capture quite handily.
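As a rough sketch of what one step of an “unfolding” search might look like, the Python below picks the detail that best separates the candidate subjects and then narrows on the user's answer, much as a reference interview would. Everything here is an assumption for illustration: the function names `best_question` and `narrow`, the entropy heuristic, and the candidate records for “Key Arena” are all made up.

```python
from collections import Counter
from math import log2


def best_question(candidates, attributes):
    """Pick the attribute whose values split the candidates most evenly.

    Hypothetical heuristic: the attribute with the highest entropy over
    the current candidates is the most useful one to ask about next.
    """
    def entropy(attr):
        counts = Counter(c.get(attr) for c in candidates)
        total = len(candidates)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    return max(attributes, key=entropy)


def narrow(candidates, attr, value):
    """Keep only the candidates matching the user's answer."""
    return [c for c in candidates if c.get(attr) == value]


# Invented records that a query for "Key Arena" might match.
candidates = [
    {"name": "KeyArena", "kind": "venue", "city": "Seattle"},
    {"name": "KeyArena Box Office", "kind": "ticket office", "city": "Seattle"},
    {"name": "KeyArena Parking", "kind": "parking", "city": "Seattle"},
    {"name": "Key Arena Cafe", "kind": "restaurant", "city": "Seattle"},
]

question = best_question(candidates, ["kind", "city"])
print(question)  # kind

remaining = narrow(candidates, "kind", "parking")
print([c["name"] for c in remaining])  # ['KeyArena Parking']
```

Asking about the city would be wasted effort here because every candidate shares it. The choice of question depends on the candidates in hand, which is what separates this from a fixed set of facets.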

Blame Google? Different Strategy: Let’s Blame Users! (Not!)

Saturday, September 15th, 2012

Let me quote from A Simple Guide To Understanding The Searcher Experience by Shari Thurow to start this post:

Web searchers have a responsibility to communicate what they want to find. As a website usability professional, I have the opportunity to observe Web searchers in their natural environments. What I find quite interesting is the “Blame Google” mentality.

I remember a question posed to me during World IA Day this past year. An attendee said that Google constantly gets search results wrong. He used a celebrity’s name as an example.

“I wanted to go to this person’s official website,” he said, “but I never got it in the first page of search results. According to you, it was an informational query. I wanted information about this celebrity.”

I paused. “Well,” I said, “why are you blaming Google when it is clear that you did not communicate what you really wanted?”

“What do you mean?” he said, surprised.

“You just said that you wanted information about this celebrity,” I explained. “You can get that information from a variety of websites. But you also said that you wanted to go to X’s official website. Your intent was clearly navigational. Why didn’t you type in [celebrity name] official website? Then you might have seen your desired website at the top of search results.”

The stunned silence at my response was almost deafening. I broke that silence.

“Don’t blame Google or Yahoo or Bing for your insufficient query formulation,” I said to the audience. “Look in the mirror. Maybe the reason for the poor searcher experience is the person in the mirror…not the search engine.”

People need to learn how to search. Search experts need to teach people how to search. Enough said.

What a novel concept! If the search engine/software doesn’t work, must be the user’s fault!

I can save you a trip down the hall to the marketing department. They are going to tell you that is an insane sales strategy. Satisfying to the geeks in your life but otherwise untenable, from a business perspective.

Remember the stats on using Library of Congress subject headings I posted under Subject Headings and the Semantic Web:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

?

That is with decades of teaching people to search both manual and automated systems using Library of Congress classification.

Test Question: I have a product to sell. 60% of all my buyers can’t find it with a search engine. Do I:

  • Teach all users everywhere better search techniques?
  • Develop better search engines/interfaces to compensate for potential buyers’ poor searching?

I suspect the “stunned silence” was an audience with greater marketing skills than the speaker.

Fast integer compression: decoding billions of integers per second

Wednesday, September 12th, 2012

Fast integer compression: decoding billions of integers per second by Daniel Lemire.

At > 2 billion integers per second, you may find there is plenty of action left in your desktop processor!

From the post:

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.
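For readers who have not built one, here is a minimal sketch of the two sources of integer arrays just described: an inverted index that maps each term to a sorted list of document identifiers, and dictionary coding of a column. The function names and the sample data are assumptions for illustration, not code from the post or the paper.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Documents are visited in id order, so each list stays sorted.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def dictionary_encode(column):
    """Replace each distinct column value with a small integer code."""
    codes = {}
    encoded = []
    for value in column:
        # Assign the next unused integer the first time a value is seen.
        code = codes.setdefault(value, len(codes))
        encoded.append(code)
    return codes, encoded


docs = [
    "fast integer compression",
    "integer arrays in search engines",
    "compression of search indexes",
]
index = build_inverted_index(docs)
print(index["integer"])      # [0, 1]
print(index["compression"])  # [0, 2]

codes, encoded = dictionary_encode(["red", "blue", "red", "green"])
print(encoded)  # [0, 1, 0, 2]
```

Either way the result is arrays of integers, which is why compressing them well and decoding them fast matters.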

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.
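To illustrate what a byte-level code looks like, the sketch below delta-encodes a sorted list of document identifiers and then writes each gap as a classic variable-byte integer (7 data bits per byte plus a continuation bit). Note the assumptions: this is the plain textbook varint, not Google's group varint and not the SIMD-BP128 scheme from the post, and the function names are invented for the example.

```python
def delta_encode(sorted_ids):
    """Turn sorted ids into gaps between consecutive values."""
    gaps = []
    prev = 0
    for doc_id in sorted_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the original ids with a running sum."""
    ids = []
    total = 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def varint_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)  # high bit clear: last byte of this integer
    return bytes(out)


def varint_decode(data):
    """Decode a byte string produced by varint_encode."""
    numbers = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current = 0
            shift = 0
    return numbers


doc_ids = [3, 7, 300, 301, 100000]
packed = varint_encode(delta_encode(doc_ids))
print(len(packed))  # 8 bytes, versus 20 bytes as five 32-bit integers
restored = delta_decode(varint_decode(packed))
print(restored == doc_ids)  # True
```

Small gaps fit in a single byte, which is where the savings come from. A pure Python loop like this says nothing about speed, of course. The billions-per-second figures in the post depend on how the decoding maps onto the processor.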

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on a desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU, report top speeds of 1.5 billion integers per second.

As you may expect, we eventually found out that it was entirely feasible to decode billions of integers per second. We designed a new scheme that typically compresses better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billion integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is not patented whereas many other schemes are.

So, how did we do it? Some insights: