Archive for the ‘Search Requirements’ Category

Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

Wednesday, April 15th, 2015

Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

Is DARPA’s Memex search engine a Google-killer? by Mark Stockley.

A couple of “while you wait” pieces to read while you expect part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

Fox-Brewster has video of a part of the system:

It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.

If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

Entirely manual indexing is already impossible, and you have experienced the shortcomings of page ranking.

Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

Imagine an electronic The Art of Computer Programming with several human experts per chapter, assisted by deep searching, updating the references and text on an ongoing basis, so readers don’t have to weed through all the re-inventions of particular algorithms across numerous computer science and math journals.

Or perhaps a more automated search of news reports, so the earliest or most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value; more often it just adds content.
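That consolidation could be sketched mechanically: group near-duplicate reports, keep the earliest of each group, and note how many later versions exist. A minimal sketch, assuming a naive token-overlap (Jaccard) similarity and an arbitrary threshold, both illustrative choices rather than anything a real news search would ship:

```python
from datetime import date

def tokens(text):
    # Naive tokenization; a real system would use proper NLP.
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

def earliest_of_duplicates(reports, threshold=0.6):
    """Group near-duplicate reports and keep the earliest of each group.

    `reports` is a list of (date, text) pairs; the threshold and the
    similarity measure are illustrative choices, not a standard.
    """
    groups = []
    for report in sorted(reports, key=lambda r: r[0]):
        for group in groups:
            if jaccard(tokens(report[1]), tokens(group[0][1])) >= threshold:
                group.append(report)
                break
        else:
            groups.append([report])
    results = []
    for group in groups:
        first = group[0]
        note = f"There are {len(group) - 1} other, later versions of this story."
        results.append((first, note))
    return results

reports = [
    (date(2015, 4, 15), "darpa memex searches the dark web for crime"),
    (date(2015, 4, 16), "darpa memex searches the dark web for crime and more"),
    (date(2015, 4, 14), "lego parts database launches"),
]
for (d, text), note in earliest_of_duplicates(reports):
    print(d, text, "|", note)
```

A production version would need real near-duplicate detection (shingling, MinHash) and some measure of completeness, not just recency.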

BTW, the focus on the capabilities of the search engine, as opposed to the analysis of its results, is most welcome.

See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

Looking forward to Friday or Monday!

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th

Monday, April 6th, 2015

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th by Steven M Barkan; Barbara Bintliff; Mary Whisner. (ISBN-13: 9781609300562)


This classic textbook has been updated to include the latest methods and resources. Fundamentals of Legal Research provides an authoritative introduction and guide to all aspects of legal research, integrating electronic and print sources. The Tenth Edition includes chapters on the true basics (case reporting, statutes, and so on) as well as more specialized chapters on legislative history, tax law, international law, and the law of the United Kingdom. A new chapter addresses Native American tribal law. Chapters on the research process, legal writing, and citation format help integrate legal research into the larger process of solving legal problems and communicating the solutions. This edition includes an updated glossary of research terms and revised tables and appendixes. Because of its depth and breadth, this text is well suited for advanced legal research classes; it is a book that students will want to retain for future use. Moreover, it has a place on librarians’ and attorneys’ ready reference shelves. Barkan, Bintliff and Whisner’s Assignments to Fundamentals of Legal Research complements the text.

I haven’t seen this volume in hard copy but if you are interested in learning what connections researchers are looking for with search tools, law is a great place to start.

The purpose of legal research isn’t to find the most popular “fact” (Google), or to find every term for a “fact” ever tweeted (Twitter), but to find facts and their relationships to other facts, which flesh out a legal view of a situation in context.

If you think about it, putting legislation, legislative history, court records and decisions, along with non-primary sources online, is barely a start towards making that information “accessible.” A necessary first step but not sufficient for meaningful access.
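The “facts and their relationships to other facts” view lends itself to a graph representation. A minimal sketch, using a handful of real relationships from U.S. civil rights law; the relation names and the dictionary-of-edges structure are my own illustrative choices:

```python
from collections import defaultdict

# Hypothetical fragment of a legal research graph: facts (statutes,
# cases, legislative history) as nodes, typed relationships as edges.
edges = defaultdict(list)

def relate(source, relation, target):
    edges[source].append((relation, target))

relate("42 U.S.C. § 1983", "enacted_as", "Civil Rights Act of 1871")
relate("42 U.S.C. § 1983", "interpreted_by", "Monroe v. Pape")
relate("42 U.S.C. § 1983", "interpreted_by", "Monell v. Dept. of Social Services")
relate("Monell v. Dept. of Social Services", "overrules_in_part", "Monroe v. Pape")

def related(fact):
    # All facts directly connected to `fact`, with the relation type.
    return edges[fact]

for relation, target in related("42 U.S.C. § 1983"):
    print(relation, "->", target)
```

The point is that the researcher navigates typed connections, not a ranked list of pages mentioning the statute.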

What “viable search engine competition” really looks like

Sunday, January 5th, 2014

What “viable search engine competition” really looks like by Alex Clemmer.

From the post:

Hacker News is up in arms again today about the RapGenius fiasco. See RapGenius statement and HN comments. One response article argues that we need more “viable search engine competition” and the HN community largely seems to agree.

In much of the discussion, there is a picturesque notion that the “search engine problem” is really just a product problem, and that if we try really hard to think of good features, we can defeat the giant.

I work at Microsoft. Competing with Google is hard work. I’m going to point out some of the lessons I’ve learned along the way, to help all you spry young entrepreneurs who might want to enter the market.

Alex has six (6) lessons for would-be Google killers:

Lesson 1: The problem is not only hiring smart people, but hiring enough smart people.

Lesson 2: Competing on market share is possible; relevance is much harder.

Lesson 3: Social may pose an existential threat to Google’s style of search.

Lesson 4: Large companies have access to technology that is often categorically better than the OSS state of the art.

Lesson 5: Large companies are necessarily limited by their previous investments.

Lesson 6: Large companies have much more data than you, and their approach to search is sophisticated.

See Alex’s post for the details under each lesson.

What has always puzzled me is why compete on general search at all? General search services are “free,” save for the cost of a user’s time to mine the results. It is hard to think of a good economic model for competing with “free.” Yes?

If we are talking about medical, legal, technical, engineering search, where services are sold to professionals and the cost is passed onto consumers, that could be a different story. Even there, costs have to be offset by a reasonable expectation of profit against established players in each of those markets.

One strategy would be to supplement or enhance existing search services and pitch that to existing market holders. Another strategy would be to propose highly specialized searching of unique data archives.

Do you think Alex is right in saying “…most traditional search problems have really been investigated thoroughly”?

I don’t, because of the general decline in information retrieval performance from the 1950s-1960s to date.

If you doubt my observation, pick up the Readers’ Guide to Periodical Literature (hard copy) for 1968 and choose some subject at random. Repeat that exercise with the search engine of your choice, limiting your results to 1968.

Which one gave you more relevant references for 1968, including synonyms? Say in the first 100 entries.

I first saw this in a tweet by Stefano Bertolo.

PS: I concede that the analog book does not have digital hyperlinks to take you to resources but it does have analog links for the same purpose. And it doesn’t have product ads. 😉

Search …Business Critical in 2014

Friday, December 20th, 2013

Search Continues to Be Business Critical in 2014 by Martin White.

From the post:

I offer two topics that I see becoming increasingly important in 2014. One of these is cross-device search, where a search is initially conducted on a desktop and is continued on a smartphone, and vice-versa. There is a very good paper from Microsoft that sets out some of the issues. The second topic is continuous information seeking, where search tasks are carried out by more than one “searcher,” often in support of collaborative working. The book on this topic by Chirag Shah, a member of staff of Rutgers University, is a very good place to start.

Editor’s Note: Read more of Martin’s thoughts on search in Why All Search Projects Fail.

Gee, let me see, what would more than one searcher need to make their collaborative search results usable by the entire team?

Can you say merging? 😉
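A toy sketch of what that merging might look like: collect several searchers’ annotated results and merge them by URL. The data model, (url, note) pairs, is invented for illustration:

```python
def merge_results(result_sets):
    """Merge search results gathered by several collaborating searchers.

    Each result is a (url, note) pair; results for the same URL are
    merged by collecting every searcher's notes. A real collaborative
    search tool would also need ranking, provenance, and conflict handling.
    """
    merged = {}
    for searcher, results in result_sets.items():
        for url, note in results:
            merged.setdefault(url, []).append(f"{searcher}: {note}")
    return merged

team = {
    "alice": [("http://example.com/a", "covers the basics"),
              ("http://example.com/b", "dense but thorough")],
    "bob":   [("http://example.com/b", "best diagrams"),
              ("http://example.com/c", "outdated?")],
}
combined = merge_results(team)
print(len(combined))                       # 3 unique URLs
print(combined["http://example.com/b"])    # notes from both searchers
```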

Martin has other, equally useful insights in the search space so don’t miss the rest of his post.

But also catch his “Why All Search Projects Fail.” Good reading before you sign a contract with a client.

The Curse of Enterprise Search… [9% Solutions]

Tuesday, August 20th, 2013

The Curse of Enterprise Search and How to Break It by Maish Nichani.

From the post:

The Curse

Got enterprise search? Try answering these questions: Are end users happy? Has decision-making improved? Productivity up? Knowledge getting reused nicely? Your return-on-investment positive? If you’re finding it tough to answer these questions then most probably you’re under the curse of enterprise search.

The curse is cast when you purchase an enterprise search software and believe that it will automagically solve all your problems the moment you switch it on. You believe that the boatload of money you just spent on it justifies the promised magic of instant findability. Sadly, this belief cannot be further from the truth.

Search needs to be designed. Your users and content are unique to your organisation. Search needs to work with your users. It needs to make full use of the type of content you have. Search really needs to be designed.

Don’t believe in the curse? Consider these statistics from the Enterprise Search and Findability Survey 2013 done by Findwise with 101 practitioners working for global companies:

  • Only 9% said it was easy to find the right information within the organisation
  • Only 19% said they were happy with the existing search application in their organisation
  • Only 20% said they had a search strategy in place

Just in case you need some more numbers when pushing your better solution to enterprise search.

I wonder how search customers would react to an application that made it easy to find the right data 20% of the time?

Just leaving room for future versions and enhancements. 😉

Maish isn’t handing out silver bullets but a close read will improve your search application (topic map or not).

DuckDuckGo Architecture…

Sunday, February 3rd, 2013

DuckDuckGo Architecture – 1 Million Deep Searches A Day And Growing Interview with Gabriel Weinberg.

From the post:

This is an interview with Gabriel Weinberg, founder of Duck Duck Go and general all around startup guru, on what DDG’s architecture looks like in 2012.

Innovative search engine upstart DuckDuckGo had 30 million searches in February 2012 and averages over 1 million searches a day. It’s being positioned by super investor Fred Wilson as a clean, private, impartial and fast search engine. After talking with Gabriel, I think what Fred Wilson said earlier seems closer to the heart of the matter: We invested in DuckDuckGo for the Reddit, Hacker News anarchists. Choosing DuckDuckGo can be thought of as not just a technical choice, but a vote for revolution. In an age when knowing your essence is not about love or friendship, but about more effectively selling you to advertisers, DDG is positioning themselves as the do-not-track alternative, keepers of the privacy flame. You will still be monetized of course, but in a more civilized and anonymous way.

Pushing privacy is a good way to carve out a competitive niche against Google et al, as by definition they can never compete on privacy. I get that. But what I found most compelling is DDG’s strong vision of a crowdsourced network of plugins giving broader search coverage by tying an army of vertical data suppliers into their search framework. For example, there’s a specialized Lego plugin for searching against a complete Lego database. Use the name of a spice in your search query, for example, and DDG will recognize it and may trigger a deeper search against a highly tuned recipe database. Many different plugins can be triggered on each search and it’s all handled in real-time.

Can’t searching the Open Web provide all this data? Not really. This is structured data with semantics, not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets, and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem, of course, is that now that data has become valuable, many grown-ups don’t want to share anymore.

Being ad supported puts DDG in a tricky position. Targeted ads are more lucrative, but ironically DDG’s do not track policies means they can’t gather targeting data. Yet that’s also a selling point for those interested in privacy. But as search is famously intent driven, DDG’s technology of categorizing queries and matching them against data sources is already a form of high value targeting.

It will be fascinating to see how these forces play out. But for now let’s see how DuckDuckGo implements their search engine magic…

Some topic map centric points from the post:

Dream is to appeal to more niche audiences to better serve people who care about a particular topic. For example, Lego parts: there’s a database of Lego parts, and pictures of parts and part numbers can be automatically displayed from a search.

  • Some people just use different words for things. Goal is not to rewrite the query, but give suggestions on how to do things better.
  • “phone reviews” for example, will replace phone with telephone. This happens through an NLP component that tries to figure out what phone you meant and if there are any synonyms that should be used in the query.
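That suggestion behavior can be sketched with a toy synonym table; the table and the word-by-word substitution here are illustrative stand-ins for DDG’s actual NLP component:

```python
# Illustrative synonym table; a real system would derive this from data.
SYNONYMS = {
    "phone": ["telephone", "smartphone"],
    "reviews": ["ratings"],
}

def suggest_queries(query):
    """Generate alternative queries by swapping in known synonyms.

    Per the post, the goal is to *suggest* rewrites rather than
    silently rewriting the user's query.
    """
    words = query.lower().split()
    suggestions = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            suggestions.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return suggestions

print(suggest_queries("phone reviews"))
# ['telephone reviews', 'smartphone reviews', 'phone ratings']
```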

Those are the ones that caught my eye, there are no doubt others.

Not to mention a long list of DuckDuckGo references at the end of the post.

What place(s) would you suggest to DuckDuckGo where topic maps would make a compelling difference?

eGIFT: Mining Gene Information from the Literature

Thursday, November 22nd, 2012

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.



With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.


In this paper, we present eGIFT, a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. An additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.


Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.
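eGIFT’s scoring idea, comparing a term’s frequency in the gene’s literature against its frequency in documents about genes in general, can be sketched as a document-frequency ratio. The exact formula below is an illustrative stand-in, not eGIFT’s published one:

```python
def iterm_scores(gene_abstracts, background_abstracts):
    """Rank candidate iTerms for a gene.

    Each abstract is modeled as a set of terms. Scores each term by the
    ratio of its document frequency in the gene's abstracts to its
    document frequency in a background set of gene documents.
    """
    def doc_freq(term, docs):
        return sum(term in doc for doc in docs) / len(docs)

    vocabulary = set()
    for doc in gene_abstracts:
        vocabulary.update(doc)

    scores = {}
    for term in vocabulary:
        fg = doc_freq(term, gene_abstracts)
        bg = doc_freq(term, background_abstracts)
        scores[term] = fg / (bg + 1e-9)  # smooth to avoid division by zero
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

gene_docs = [{"kinase", "phosphorylation", "cell"},
             {"kinase", "signaling", "cell"}]
background = [{"cell", "protein"}, {"cell", "expression"}, {"protein", "dna"}]
ranking = iterm_scores(gene_docs, background)
print(ranking[0][0])  # the gene-specific term ranks first
```

Terms common to all gene literature (“cell”) sink; terms distinctive to this gene’s abstracts rise, which is the intuition behind iTerms.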


Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities and discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed. That is where most researchers stop.)

60 Months, Minimal Search Progress

Sunday, January 1st, 2012

60 Months, Minimal Search Progress

Stephen E Arnold revises his August 2005 observation:

The truth is that nothing associated with locating information is cheap, easy or fast.

to read:

The truth is that nothing associated with locating information is accurate, cheap, easy or fast.

Which reminds me of the project triangle, where the choices are cheap, fast, and good, and you can pick any two.[1]

In fact, I created an Euler diagram of the four choices:

I got stuck when it came to adding “easy.”

In part because I don’t know “easy” for who? Easy for the search engine user? Easy for the end-user?

If easy for the end-user, is that a continuum? If so, what lies at both ends?

Having a single text box may be “easy” for the end-user but how does that intersect with “accurate?”

Suggestions? Pen is in your hand now.

1. PMI has withdrawn the 50-year-old triangle on the basis that projects have more constraints that interact than just three. On which see: The Death of the Project Management Triangle by Ben Synder.

Linked Data Paradigm Can Fuel Linked Cities

Thursday, December 29th, 2011

Linked Data Paradigm Can Fuel Linked Cities

The small city of Cluj in Romania, of some half-million inhabitants, is responsible for a triple store of 2.5 million triples, as part of a Recognos-led project to develop a “Linked City” community portal. The project was submitted for this year’s ICT Call – SME initiative on Digital Content and Languages, FP7-ICT-2011-SME-DCL. While it didn’t receive funding from that competition, Recognos semantic web researcher Dia Miron is hopeful of securing help from alternate sources in the coming year to expand the project, including potentially bringing the concept of linked cities to other communities in Romania or elsewhere in Europe.

The idea was to publish information from sources such as local businesses about their services and products, as well as data related to the local government and city events, points of interest and projects, using the Linked Data paradigm, says Miron. Data would also be geolocated. “So we take all the information we can get about a city so that people can exploit it in a uniform manner,” she says.

The first step was to gather the data and publish it in a standard format using RDF and OWL; the next phase, which hasn’t taken place yet (it’s funding-dependent), is to build exportation channels for the data. “First we wanted a simple query engine that will exploit the data, and then we wanted to build a faceted search mechanism for those who don’t know the data structure to exploit and navigate through the data,” she says. “We wanted to make it easier for someone not very acquainted with the models. Then we wanted also to provide some kind of SMS querying because people may not always be at their desks. And also the final query service was an augmented reality application to be used to explore the city or to navigate through the city to points of interest or business locations.”

Local Cluj authorities don’t have the budgets to support the continuation of the project on their own, but Miron says the applications will be very generic and can easily be transferred to support other cities, if they’re interested in helping to support the effort. Other collaborators on the project include Ontotext and STI Innsbruck, as well as the local Cluj council.
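The querying Miron describes, a simple query engine first, faceted search later, rests on matching triple patterns against a store. A toy in-memory version, with invented entities and predicates, shows the core operation; a real deployment would use an RDF store and SPARQL:

```python
# A toy triple store in the spirit of the Linked City data;
# the predicates and entities here are invented for illustration.
triples = [
    ("ex:CafeCluj", "rdf:type", "ex:Business"),
    ("ex:CafeCluj", "ex:locatedNear", "ex:UnionSquare"),
    ("ex:JazzFest", "rdf:type", "ex:Event"),
    ("ex:JazzFest", "ex:locatedNear", "ex:UnionSquare"),
]

def query(subject=None, predicate=None, obj=None):
    """Match triples against a (subject, predicate, object) pattern,
    where None acts as a wildcard, like a single SPARQL triple pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Everything near Union Square, regardless of type:
for s, p, o in query(predicate="ex:locatedNear", obj="ex:UnionSquare"):
    print(s)
```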

I don’t doubt this information would be useful, but is this the delivery model that is going to work for users, assuming it is funded? Here or elsewhere?

How hard do users work with searches? See Keyword and Search Engines Statistics to get an idea by country.

Some users can be trained to perform fairly complex searches but I suspect that is a distinct minority. And the type of searches that need to be performed vary by domain.

For example, earlier today, I was searching for information on “spectral graph theory,” which I suspect has different requirements than searching for 24-hour sushi bars within a given geographic area.

I am not sure how to isolate those different requirements, much less test how close any approach is to satisfying them, but I do think both areas merit serious investigation.