Archive for the ‘Search Requirements’ Category

DuckDuckGo Architecture…

Sunday, February 3rd, 2013

DuckDuckGo Architecture – 1 Million Deep Searches A Day And Growing, an interview with Gabriel Weinberg.

From the post:

This is an interview with Gabriel Weinberg, founder of Duck Duck Go and general all around startup guru, on what DDG’s architecture looks like in 2012.

Innovative search engine upstart DuckDuckGo had 30 million searches in February 2012 and averages over 1 million searches a day. It’s being positioned by super investor Fred Wilson as a clean, private, impartial and fast search engine. After talking with Gabriel I like what Fred Wilson said earlier, it seems closer to the heart of the matter: We invested in DuckDuckGo for the Reddit, Hacker News anarchists.
                  
Choosing DuckDuckGo can be thought of as not just a technical choice, but a vote for revolution. In an age when knowing your essence is not about love or friendship, but about more effectively selling you to advertisers, DDG is positioning themselves as the do not track alternative, keepers of the privacy flame. You will still be monetized of course, but in a more civilized and anonymous way.

Pushing privacy is a good way to carve out a competitive niche against Google et al, as by definition they can never compete on privacy. I get that. But what I found most compelling is DDG’s strong vision of a crowdsourced network of plugins giving broader search coverage by tying an army of vertical data suppliers into their search framework. For example, there’s a specialized Lego plugin for searching against a complete Lego database. Use the name of a spice in your search query, for example, and DDG will recognize it and may trigger a deeper search against a highly tuned recipe database. Many different plugins can be triggered on each search and it’s all handled in real-time.

Can’t searching the Open Web provide all this data? Not really. This is structured data with semantics. Not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem of course is now that data has become valuable many grown ups don’t want to share anymore.

Being ad supported puts DDG in a tricky position. Targeted ads are more lucrative, but ironically DDG’s do not track policies means they can’t gather targeting data. Yet that’s also a selling point for those interested in privacy. But as search is famously intent driven, DDG’s technology of categorizing queries and matching them against data sources is already a form of high value targeting.

It will be fascinating to see how these forces play out. But for now let’s see how DuckDuckGo implements their search engine magic…
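The plugin triggering described in the excerpt (a spice name firing a recipe lookup, a Lego query firing a parts lookup) can be sketched roughly as below. This is only an illustration under my own assumptions: the plugin names, the tiny data sets, and the `dispatch` function are hypothetical and are not taken from DuckDuckGo's code.

```python
from typing import Callable, Dict, List, Set

# Hypothetical vertical data sources; contents are illustrative only.
SPICES: Set[str] = {"cumin", "paprika", "turmeric"}
LEGO_PARTS: Dict[str, str] = {"3001": "Brick 2 x 4", "3003": "Brick 2 x 2"}


def recipe_plugin(query: str) -> List[str]:
    """Trigger when any word in the query is a known spice."""
    words = query.lower().split()
    return [f"recipes using {w}" for w in words if w in SPICES]


def lego_plugin(query: str) -> List[str]:
    """Trigger only when the query mentions 'lego' and a known part number."""
    words = query.lower().split()
    if "lego" not in words:
        return []
    return [f"Lego part {w}: {LEGO_PARTS[w]}" for w in words if w in LEGO_PARTS]


# Registry of plugins; every plugin sees every query.
PLUGINS: Dict[str, Callable[[str], List[str]]] = {
    "recipes": recipe_plugin,
    "lego": lego_plugin,
}


def dispatch(query: str) -> Dict[str, List[str]]:
    """Run all plugins and keep only those that returned results."""
    results: Dict[str, List[str]] = {}
    for name, plugin in PLUGINS.items():
        hits = plugin(query)
        if hits:
            results[name] = hits
    return results
```

Several plugins can fire on one query, which is the behavior the post describes; a real system would need ranking and disambiguation on top of this.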

Some topic map centric points from the post:

Dream is to appeal to more niche audiences to better serve people who care about a particular topic. For example: Lego parts. There’s a database of Lego parts; pictures of parts and part numbers can be automatically displayed from a search.

  • Some people just use different words for things. Goal is not to rewrite the query, but give suggestions on how to do things better.
  • “phone reviews” for example, will replace phone with telephone. This happens through an NLP component that tries to figure out what phone you meant and if there are any synonyms that should be used in the query.
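The "suggest, don't rewrite" idea in the two points above might look something like the following. It is a minimal sketch that assumes a hand-built synonym table; the post attributes the real behavior to an NLP component, which this lookup does not attempt to reproduce. The `SYNONYMS` table and `suggest` function are hypothetical names.

```python
from typing import Dict, List

# Hypothetical synonym table; a real system would derive this from NLP.
SYNONYMS: Dict[str, List[str]] = {
    "phone": ["telephone"],
    "tv": ["television"],
}


def suggest(query: str) -> List[str]:
    """Return alternative queries, leaving the user's original query untouched."""
    words = query.lower().split()
    suggestions: List[str] = []
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            # Swap in one synonym at a time so each suggestion stays readable.
            suggestions.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return suggestions
```

Here `suggest("phone reviews")` offers "telephone reviews" alongside the original query instead of silently replacing it.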

Those are the ones that caught my eye; there are no doubt others.

Not to mention a long list of DuckDuckGo references at the end of the post.

What place(s) would you suggest to DuckDuckGo where topic maps would make a compelling difference?

eGIFT: Mining Gene Information from the Literature

Thursday, November 22nd, 2012

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.

Website: http://biotm.cis.udel.edu/eGIFT
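The abstract says only that the iTerm score compares a term's frequency in the gene's literature with its frequency in documents about genes in general; it does not give the formula. As a rough sketch of that kind of comparison, the code below assumes the score is a simple difference of document frequencies. That choice, and the names `doc_frequency` and `rank_terms`, are my assumptions, not eGIFT's actual method.

```python
from typing import List, Set, Tuple


def doc_frequency(term: str, docs: List[Set[str]]) -> float:
    """Fraction of documents (each a set of terms) that contain the term."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if term in d) / len(docs)


def rank_terms(
    gene_docs: List[Set[str]], background_docs: List[Set[str]]
) -> List[Tuple[str, float]]:
    """Rank terms from the gene's documents by how much more often they
    occur there than in the background set (assumed score: a difference)."""
    vocab = set().union(*gene_docs)
    scored = [
        (t, doc_frequency(t, gene_docs) - doc_frequency(t, background_docs))
        for t in vocab
    ]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A term that shows up in every abstract about one gene but rarely elsewhere floats to the top, while terms common to all gene literature sink.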

Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities and discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed, which is where most researchers stop.)

60 Months, Minimal Search Progress

Sunday, January 1st, 2012

60 Months, Minimal Search Progress

Stephen E Arnold revises his August 2005 observation:

The truth is that nothing associated with locating information is cheap, easy or fast.

to read:

The truth is that nothing associated with locating information is accurate, cheap, easy or fast.

Which reminds me of the project triangle, where the choices are cheap, fast, and good, and you can pick any two. [1]

In fact, I created an Euler diagram of the four choices:

I got stuck when it came to adding “easy.”

In part because I don’t know: “easy” for whom? Easy for the search engine user? Easy for the end-user?

If easy for the end-user, is that a continuum? If so, what lies at both ends?

Having a single text box may be “easy” for the end-user but how does that intersect with “accurate?”

Suggestions? Pen is in your hand now.


1. PMI has withdrawn the 50-year-old triangle on the basis that projects have more constraints that interact than just three. On which see: The Death of the Project Management Triangle by Ben Snyder.

Linked Data Paradigm Can Fuel Linked Cities

Thursday, December 29th, 2011

Linked Data Paradigm Can Fuel Linked Cities

The small city of Cluj in Romania, of some half-million inhabitants, is responsible for a 2.5 million triple store, as part of a Recognos-led project to develop a “Linked City” community portal. The project was submitted for this year’s ICT Call – SME initiative on Digital Content and Languages, FP7-ICT-2011-SME-DCL. While it didn’t receive funding from that competition, Recognos semantic web researcher Dia Miron is hopeful of securing help from alternate sources in the coming year to expand the project, including potentially bringing the concept of linked cities to other communities in Romania or elsewhere in Europe.

The idea was to publish information from sources such as local businesses about their services and products, as well as data related to the local government and city events, points of interest and projects, using the Linked Data paradigm, says Miron. Data would also be geolocated. “So we take all the information we can get about a city so that people can exploit it in a uniform manner,” she says.

The first step was to gather the data and publish it in a standard format using RDF and OWL; the next phase, which hasn’t taken place yet (it’s funding-dependent), is to build exportation channels for the data. “First we wanted a simple query engine that will exploit the data, and then we wanted to build a faceted search mechanism for those who don’t know the data structure to exploit and navigate through the data,” she says. “We wanted to make it easier for someone not very acquainted with the models. Then we wanted also to provide some kind of SMS querying because people may not always be at their desks. And also the final query service was an augmented reality application to be used to explore the city or to navigate through the city to points of interest or business locations.”

Local Cluj authorities don’t have the budgets to support the continuation of the project on their own, but Miron says the applications will be very generic and can easily be transferred to support other cities, if they’re interested in helping to support the effort. Other collaborators on the project include Ontotext and STI Innsbruck, as well as the local Cluj council.
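To make the "faceted search mechanism for those who don't know the data structure" concrete, here is a minimal sketch over a handful of subject-predicate-object triples. Everything in it is assumed for illustration: the sample triples, the predicate names, and the functions `facet_counts` and `filter_subjects` are hypothetical, and a real deployment would query an RDF store rather than a Python list.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Illustrative data only; not taken from the Cluj triple store.
TRIPLES: List[Triple] = [
    ("cafe_a", "type", "Cafe"),
    ("cafe_a", "district", "Centru"),
    ("museum_b", "type", "Museum"),
    ("museum_b", "district", "Centru"),
    ("cafe_c", "type", "Cafe"),
    ("cafe_c", "district", "Manastur"),
]


def facet_counts(
    triples: List[Triple], subjects: Optional[Set[str]] = None
) -> Dict[str, Counter]:
    """Count values per predicate, optionally restricted to some subjects.
    These counts are what a facet sidebar would display."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for s, p, o in triples:
        if subjects is None or s in subjects:
            counts[p][o] += 1
    return dict(counts)


def filter_subjects(triples: List[Triple], selected: Dict[str, str]) -> Set[str]:
    """Return subjects matching every selected facet value."""
    by_subject: Dict[str, Dict[str, str]] = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o
    return {
        s
        for s, props in by_subject.items()
        if all(props.get(p) == o for p, o in selected.items())
    }
```

The user never writes a query: they pick "type: Cafe", the subject set narrows, and the remaining facets are recounted over that narrower set.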

I don’t doubt this would be useful information for users, but is this the delivery model that is going to work for them, assuming it is funded? Here or elsewhere?

How hard do users work with searches? See Keyword and Search Engines Statistics to get an idea by country.

Some users can be trained to perform fairly complex searches, but I suspect they are a distinct minority. And the types of searches that need to be performed vary by domain.

For example, earlier today, I was searching for information on “spectral graph theory,” which I suspect has different requirements than searching for 24-hour sushi bars within a given geographic area.

I am not sure how to isolate those different requirements, much less test how close any approach is to satisfying them, but I do think both areas merit serious investigation.