Archive for the ‘Search Algorithms’ Category

Haystack: The Search Relevance Conference! (Proposals by Jan. 19, 2018) Updated

Friday, December 8th, 2017

Haystack: The Search Relevance Conference!

From the webpage:

Haystack is the conference for improving search relevance. If you’re like us, you work to understand the shiny new tools or dense academic papers out there that promise the moon. Then you puzzle how to apply those insights to your search problem, in your search stack. But the path isn’t always easy, and the promised gains don’t always materialize.

Haystack is the no-holds-barred conference for organizations where search, matching, and relevance really matters to the bottom line. For search managers, developers & data scientists finding ways to innovate, see past the silver bullets, and share what actually has worked well for their unique problems. Please come share and learn!

… (inline form for submission proposals)

Welcome topics include

  • Information Retrieval
  • Learning to Rank
  • Query Understanding
  • Semantic Search
  • Applying NLP to search
  • Personalized Search
  • Search UX Strategy: Perceived relevance, smart snippeting
  • Measuring and testing search against business objectives
  • Nuts & bolts: plugins for Solr, Elasticsearch, Vespa, etc
  • Adjacent topics: recommendation systems, entity matching via search, and other topics

… (emphasis in original)

The first link for the conference I saw was http://mailchi.mp/e609fba68dc6/announcing-haystack-the-search-relevance-conference, which promised topics including:

  • Intent detection

The modest price of $75 covers our costs….

To see a solution to the problem of other minds and to discover their intent, all for $75, is quite a bargain. Especially since the $75 covers breakfast and lunch both days, plus dinner the first day in a beer hall. 😉

Even without solving philosophical problems, sponsorship by OpenSource Connections is enough to recommend this conference without reservation.

My expectation is this conference is going to rock for hard core search geeks!

PS: Ask if videos will be posted. Thanks!

The Iraq Inquiry (Chilcot Report) [4.5x longer than War and Peace]

Wednesday, July 6th, 2016

The Iraq Inquiry

To give a rough sense of the depth of the Chilcot Report, the executive summary runs 150 pages. The report appears in twelve (12) volumes, not including video testimony, witness transcripts, documentary evidence, contributions and the like.

Cory Doctorow reports a Guardian project to crowd source collecting facts from the 2.6 million word report. The Guardian observes the Chilcot report is “…almost four-and-a-half times as long as War and Peace.”

Manual reading of the Chilcot report is doable, but unlikely to yield all of the connections that exist between participants, witnesses, evidence, etc.

How would you go about making the Chilcot report and its supporting evidence more amenable to navigation and analysis?

The Report

The Evidence

Other Material

Unfortunately, sections within volumes were not numbered according to their volume. In other words, volume 2 starts with section 3.3 and ends with 3.5, whereas volume 4 only contains sections beginning with “4.,” while volume 5 starts with section 5 but also contains sections 6.1 and 6.2. Nothing can be done for it but be aware that section numbers don’t correspond to volume numbers.

Aho-Corasick

Sunday, July 19th, 2015

Aho-Corasick – Java implementation of the Aho-Corasick algorithm for efficient string matching.

From the webpage:

Nowadays most free-text searching is based on Lucene-like approaches, where the search text is parsed into its various components. For every keyword a lookup is done to see where it occurs. When looking for a couple of keywords this approach is great. But what about it if you are not looking for just a couple of keywords, but a 100,000 of them? Like, for example, checking against a dictionary?

This is where the Aho-Corasick algorithm shines. Instead of chopping up the search text, it uses all the keywords to build up a construct called a Trie. There are three crucial components to Aho-Corasick:

  • goto
  • fail
  • output

Every character encountered is presented to a state object within the goto structure. If there is a matching state, that will be elevated to the new current state.

However, if there is no matching state, the algorithm will signal a fail and fall back to states with less depth (ie, a match less long) and proceed from there, until it found a matching state, or it has reached the root state.

Whenever a state is reached that matches an entire keyword, it is emitted to an output set which can be read after the entire scan has completed.

The beauty of the algorithm is that it is O(n). No matter how many keywords you have, or how big the search text is, the performance will decline in a linear way.

Some examples you could use the Aho-Corasick algorithm for:

  • looking for certain words in texts in order to URL link or emphasize them
  • adding semantics to plain text
  • checking against a dictionary to see if syntactic errors were made

This library is the Java implementation of the afore-mentioned Aho-Corasick algorithm for efficient string matching. The algorithm is explained in great detail in the white paper written by Aho and Corasick: ftp://163.13.200.222/assistant/bearhero/prog/%A8%E4%A5%A6/ac_bm.pdf

The link to the Aho-Corasick paper timed out on me. Try: Efficient String Matching: An Aid to Bibliographic Search (CiteSeer).

The line that caught my eye was “adding semantics to plain text.

Apologies for the “lite” posting over the past several days. I have just completed a topic maps paper for the Balisage conference in collaboration with Sam Hunting. My suggestion is that you register before all the seats are gone.

Internet Search as a Crap Shoot in a Black Box

Tuesday, March 17th, 2015

The post, Google To Launch New Doorway Page Penalty Algorithm by Barry Schwartz reminded me that Internet search is truly a crap shoot in a black box.

Google has over two hundred (200) factors that are known (or suspected) to play a role in its search algorithms and their ranking of results.

Even if you memorized the 200, if you are searching you don’t know how those factors will impact pages with information you want to see. (Unless you want to drive web traffic, the 200 factors are a curiosity and not much more.)

When you throw your search terms, like dice, in to the Google black box, you don’t know how they will interact with the unknown results of the ranking algorithms.

To make matters worse, yes, worse, the Google algorithms change over time. Some major, some not quite so major. But every change stands a chance to impact any ad hoc process you have adopted for finding information.

A good number of you won’t remember print indexes but one of their attractive features (in hindsight) was that the indexing was uniform, at least within reasonable limits, for decades. If you learned how to effectively use the printed index, you could always find information using that technique, without fear that the familiar results would simply disappear.

Perhaps that is a commercial use case for the Common Crawl data. Imagine a disclosed ranking algorithm that could be exposed to create a custom ranking for a sub-set of the data against which to perform searches. So the ranking against which you are searching is known and can be explored.

It would not have the very latest data but that’s difficult to extract from Google since it apparently tosses the information about when it first encountered a page. Or at the very least doesn’t make it available to users. At least as an option, being able to pick the most recent resources matching a search would be vastly superior to the page-rank orthodoxy at Google.

Not to single Google out too much because I haven’t encountered other search engines that are more transparent. They may exist but I am unaware of them.

A Survey of Monte Carlo Tree Search Methods

Thursday, December 18th, 2014

A Survey of Monte Carlo Tree Search Methods by Cameron Browne, et al.

Abstract:

Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

At almost fifty (50) pages, this review of the state of the art for MCTS research as of 2012, should keep even dedicated readers occupied for several days. The extensive bibliography will enhance your reading experience!

I first saw this in a tweet by Ebenezer Fogus.

Laboratory for Web Algorithmics

Sunday, June 8th, 2014

Laboratory for Web Algorithmics

From the homepage:

The Laboratory for Web Algorithmics (LAW) was established in 2002 at the Dipartimento di Scienze dell’Informazione (now merged in the Computer Science Department) of the Università degli studi di Milano.

The LAW is part of the NADINE FET EU project.

Research at LAW concerns all algorithmic aspects of the study of the web and of social networks. More in detail…

The details include:

  • High-performance web crawling: Including an open source web crawler
  • Compression of web graphs and social networks: compression of web crawling results
  • Analysis of web graphs and social networks: research and algorithms for exploration of web graphs

Deeply impressive project and one with several papers and resources that I will be covering in more detail in future posts.

I first saw this in a tweet by Network Fact.

“Credibility” As “Google Killer”?

Sunday, May 4th, 2014

Nancy Baym tweets: “Nice article on flaws of ”it’s not our fault, it’s the algorithm” logic from Facebook with quotes from @TarletonG” pointing to: Facebook draws fire on ‘related articles’ push.

From the post:

A surprise awaited Facebook users who recently clicked on a link to read a story about Michelle Obama’s encounter with a 10-year-old girl whose father was jobless.

Facebook responded to the click by offering what it called “related articles.” These included one that alleged a Secret Service officer had found the president and his wife having “S*X in Oval Office,” and another that said “Barack has lost all control of Michelle” and was considering divorce.

A Facebook spokeswoman did not try to defend the content, much of which was clearly false, but instead said there was a simple explanation for why such stories are pushed on readers. In a word: algorithms.

The stories, in other words, apparently are selected by Facebook based on mathematical calculations that rely on word association and the popularity of an article. No effort is made to vet or verify the content.

Facebook’s explanation, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. (emphasis added)

Just imagine the hue and outcry had that last line read:

Imaginary Quote Google’s explanation of search results, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. End Imaginary Quote

Is demanding “credibility” of search results the long sought after “Google Killer?”

“Credibility” is closely related to the “search” problem but I think it should be treated separately from search.

In part because the “credibility” question is one that can require multiple searches upon the author of search result content, searches for reviews and comments on search result content, searches of other sources of data on the content in the search result and then a collation of that additional content to make a credibility judgement on the search result content. The procedure isn’t always that elaborate but the main point is that it requires additional searching and evaluation of content to even begin to answer a credibility question.

Not to mention why the information is being sought has a bearing on credibility. If I want to find examples of nutty things said about President Obama to cite, then finding the cases mentioned above is not only relevant (the search question) but also “credible” in the sense that Facebook did not make they up. They are published nutty statements about the current President.

What if a user wanted to search for “coffee and bagels?” The top hit on one popular search engine today is: Coffee Meets Bagel: Free Online Dating Sites, along with numerous other links to information on the first link. Was this relevant to my search? No, but search results aren’t always predictable. They are relevant to someone’s search using “coffee and bagels.”

It is the responsibility of every reader to decide for themselves what is relevant, credible, useful, etc. in terms of content, whether it is hard copy or digital.

Any other solution takes us to Plato‘s Republic, which was great to read about, would not want to live there.

Everything is Editorial:..

Saturday, December 14th, 2013

Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.

From the post:

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything—the document corpus, the index, the query, and the results—were mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

delegated search

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

A delightful piece on search algorithms that touches at the heart of successful search and/or data integration.

Its first three words capture the issue: Everything is Editorial….

Despite the pretensions of scholars, sages and rogues, everything is editorial, there are no universal semantic primitives.

For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.

Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.

See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.

If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.

ThisPlusThat.me: [Topic Vectors?]

Tuesday, December 10th, 2013

ThisPlusThat.me: A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.

From the post:

Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.

That’s where ThisPlusThat.me comes in — a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through it’s paces.

For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.

… (examples omitted)

word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.

Precisely!!!

😉

Doing it with word2vec requires large training sets of data. No doubt a useful venture if you are seeking to discover or document the word vectors in a domain.

But what if you wanted to declare vectors for words?

And then run word2vec (or something very similar) across the declared vectors.

Thinking along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that are dimensions that give that word meaning.

Associations are collections of vector sums that identify subjects that take part in an association.

If we do all addressing by vector sums, we lose the need to track and collect system identifiers.

I think this could have legs.

Comments?

PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.

Relevancy 301 – The Graduate Level Course

Wednesday, November 20th, 2013

Relevancy 301 – The Graduate Level Course by Paul Nelson.

From the post:

So, I was going to write an article entitled “Relevancy 101”, but that seemed too shallow for what has become a major area of academic research. And so here we are with a Graduate-Level Course. Grab your book-bag, some Cheetos and a Mountain Dew, and let’s kick back and talk search engine relevancy.

I have blogged about relevancy before (see “What does ‘relevant’ mean?)”, but that was a more philosophical discussion of relevancy. The purpose of this blog is to go in-depth into the different types of relevancy, how they’re computed, and what they’re good for. I’ll do my best to avoid math, but no guarantees.

A very good introduction to measures of “relevancy,” most of which are no longer used.

Pay particular attention to Paul’s remarks about the weaknesses of inverse document frequency (IDF).

Before Paul posts part 2, how do you determine the relevance of documents?

Exercise:

Pick a subject covered by a journal or magazine, one with twelve issues each year and review a year’s worth of issues for “relevant” articles.

Assuming the journal is available electronically, does the search engine suggest your other “relevant” articles?

If it doesn’t, can you determine why it recommended different articles?

FindZebra

Thursday, May 2nd, 2013

FindZebra

From the about page:

FindZebra is a specialised search engine supporting medical professionals in diagnosing difficult patient cases. Rare diseases are especially difficult to diagnose and this online medical search engines comes in support of medical personnel looking for diagnostic hypotheses. With a simple and consistent interface across all devices, it can be easily used as an aid tool at the time and place where medical decisions are made. The retrieved information is collected from reputable sources across the internet storing public medical articles on rare and genetic diseases.

A search engine with: WARNING! This is a research project to be used only by medical professionals.

To avoid overwhelming researchers with search result “noise,” FindZebra deliberately restricts the content it indexes.

An illustration of the crudeness of current search algorithms that altering the inputs is the easiest way to improve outcomes for particular types of searches.

That seems to be an argument in favor of smaller than enterprise search engines, which could roll-up into broader search applications.

Of course, with a topic map you could retain the division between departments even as you roll-up the content into broader search applications.

Broccoli: Semantic Full-Text Search at your Fingertips

Friday, April 19th, 2013

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the EnglishWikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a pre x contains one index item per occurrence of a word starting with that pre x, just like the inverted list for that pre x would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

Developing New Ways to Search for Web Images

Thursday, November 22nd, 2012

Developing New Ways to Search for Web Images by Shar Steed.

From the post:

Collections of photos, images, and videos are quickly coming to dominate the content available on the Web. Currently internet search engines rely on the text with which the images are labeled to return matches. But why is only text being used to search visual mediums? These labels can be unreliable, unhelpful and sometimes not available at all.

To solve this problem, scientists at Stanford and Princeton have been working to “create a new generation of visual search technologies.” Dr. Fei-Fei Li, a computer scientist at Stanford, has built the world’s largest visual database, containing more than 14 million labeled objects.

A system called ImageNet, applies the data gathered from the database to recognize similar, unlabeled objects with much greater accuracy than past algorithms.

A remarkable amount of material to work with, either via the API or downloading for your own hacking.

Another tool for assisting in the authoring of topic maps (or other content).

Text Mining Methods Applied to Mathematical Texts

Saturday, July 14th, 2012

Text Mining Methods Applied to Mathematical Texts (slides) by Yannis Haralambous, Département Informatique, Télécom Bretagne.

Abstract:

Up to now, flexiform mathematical text has mainly been processed with the intention of formalizing mathematical knowledge so that proof engines can be applied to it. This approach can be compared with the symbolic approach to natural language processing, where methods of logic and knowledge representation are used to analyze linguistic phenomena. In the last two decades, a new approach to natural language processing has emerged, based on statistical methods and, in particular, data mining. This method, called text mining, aims to process large text corpora, in order to detect tendencies, to extract information, to classify documents, etc. In this talk I will present math mining, namely the potential applications of text mining to mathematical texts. After reviewing some existing works heading in that direction, I will formulate and describe several roadmap suggestions for the use and applications of statistical methods to mathematical text processing: (1) using terms instead of words as the basic unit of text processing, (2) using topics instead of subjects (“topics” in the sense of “topic models” in natural language processing, and “subjects” in the sense of various mathematical subject classifications), (3) using and correlating various graphs extracted from mathematical corpora, (4) use paraphrastic redundancy, etc. The purpose of this talk is to give a glimpse on potential applications of the math mining approach on large mathematical corpora, such as arXiv.org.

An invited presentation at CICM 2012.

I know Yannis from a completely different context and may comment on that in another post.

No paper but 50+ slides showing existing text mining tools can deliver useful search results, while waiting for a unified and correct index to all of mathematics. 😉

Varying semantics, as in all human enterprises, is an opportunity for topic map based assistance.

Search Algorithms for Conceptual Graph Databases

Friday, July 13th, 2012

Search Algorithms for Conceptual Graph Databases by Abdurashid Mamadolimov.

Abstract:

We consider a database composed of a set of conceptual graphs. Using conceptual graphs and graph homomorphism it is possible to build a basic query-answering mechanism based on semantic search. Graph homomorphism defines a partial order over conceptual graphs. Since graph homomorphism checking is an NP-Complete problem, the main requirement for database organizing and managing algorithms is to reduce the number of homomorphism checks. Searching is a basic operation for database manipulating problems. We consider the problem of searching for an element in a partially ordered set. The goal is to minimize the number of queries required to find a target element in the worst case. First we analyse conceptual graph database operations. Then we propose a new algorithm for a subclass of lattices. Finally, we suggest a parallel search algorithm for a general poset.

While I have no objection to efficient solutions for particular cases, as a general rule those solutions are valid for some known set of cases.

Here we appear to have an efficient solution for some unknown number of cases. I mention it to keep in mind while watching the search literature on graph databases develop.

Search and Counterfactuals

Thursday, July 5th, 2012

In Let the Children Play, It’s Good for Them! (Smithsonian, July/August 2012) Alison Gopnik writes:

Walk into any preschool and you’ll find toddling superheroes battling imaginary monsters. We take it for granted that young children play and, especially, pretend. Why do they spend so much time in fantasy worlds?

People have suspected that play helps children learn, but until recently there was little research that showed this or explained why it might be true. In my lab at the University of California at Berkeley, we’ve been trying to explain how very young children can learn so much so quickly, and we’ve developed a new scientific approach to children’s learning.

Where does pretending come in? It relates to what philosophers call “counterfactual” thinking, like Einstein wondering what would happen if a train went at the speed of light.

Do our current models for search encourage or discourage counterfactual thinking? Neutral?

There is place for “factual” queries: Has “Chipper” Jones, who plays for the Atlanta Braves, ever hit safely 5 out of 5 times in a game? 😉

But what of counterfactuals?

Do they lead us to new forms of indexing? By re-imagining how searching could be done, if and only if there were a new indexing structure?

Are advances in algorithms largely due to counterfactuals? Where the “factuals” are the world of processing as previously imagined?

We can search for the “factuals,” prior answers approved by authorities, but how does one search for a counterfactual?

Or search for what triggers a counterfactual?

I don’t have even an inkling at an answer or what an answer might look like, but thought it would be worth asking the question.

Theory and Techniques for Synthesizing a Family of Graph Algorithms

Thursday, July 5th, 2012

Theory and Techniques for Synthesizing a Family of Graph Algorithms by Srinivas Nedunuri, William R. Cook, and Douglas R. Smith.

Abstract:

Although Breadth-First Search (BFS) has several advantages over Depth-First Search (DFS) its prohibitive space requirements have meant that algorithm designers often pass it over in favor of DFS. To address this shortcoming, we introduce a theory of Efficient BFS (EBFS) along with a simple recursive program schema for carrying out the search. The theory is based on dominance relations, a long standing technique from the field of search algorithms. We show how the theory can be used to systematically derive solutions to two graph algorithms, namely the Single Source Shortest Path problem and the Minimum Spanning Tree problem. The solutions are found by making small systematic changes to the derivation, revealing the connections between the two problems which are often obscured in textbook presentations of them.

I don’t think it would satisfy the formal derivation requirements of the authors but am curious why the dominance relationships are derived? Wouldn’t it be easier to ask a user at each node, go/no-go as far as dominance relationship? Allowing an interactive application of the search algorithms based upon the users dominance judgement?

I say that because the derivation strategy depends upon the developer’s interpretation of dominance from a field that may be unfamiliar or incorrectly communicated by a user. To be sure the derivation and results may be formally correct but not produce an answer that is “correct” in the view of a user.

Not to mention that once a derivation is formally “correct,” there would be resistance to changing a “correct” derivation.

A interactive, dynamic, symbiotic search experience between users and their search systems is more likely to produce results thought to be “correct” by users. (What other measure of “correctness” comes to mind?)

PS: Srinivas Nedunuri‘s homepage promises a copy of his PhD dissertation, which formed the basis for this paper, soon!

Challenges in maintaining a high performance search engine written in Java

Friday, March 23rd, 2012

Challenges in maintaining a high performance search engine written in Java

You will find this on the homepage or may have to search for it. I was logged in when I accessed it.

Very much worth your time for a high level overview of issues that Lucene will face sooner rather than later.

After reviewing, think about it, make serious suggestions and if possible, contributions to the future of Lucene.

Just off the cuff, I would really like to see Lucene become a search engine framework with a default data structure that admits to either extension or replacement by other data structures. Some data structures may have higher performance costs than others, but if that is what your requirements call for, they can hardly be wrong. Yes? A “fast” search engine that doesn’t meet your requirements is no prize.

We Need To Talk About Binary Search

Wednesday, February 8th, 2012

We Need To Talk About Binary Search.

From the post:

Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky.

— Donald Knuth

Why is binary search so damn hard to get right? Why is it that 90% of programmers are unable to code up a binary search on the spot, even though it’s easily the most intuitive of the standard algorithms?

  • Firstly, binary search has a lot of potential for off-by-one errors. Do you do inclusive bounds or exclusive bounds? What’s your break condition: lo=hi+1, lo=hi, or lo=hi-1? Is the midpoint (lo+hi)/2 or (lo+hi)/2 - 1 or (lo+hi)/2 + 1? And what about the comparison, < or ? Certain combinations of these work, but it’s easy to pick one that doesn’t.
  • Secondly, there are actually two variants of binary search: a lower-bound search and an upper-bound search. Bugs are often caused by a careless programmer accidentally applying a lower-bound search when an upper-bound search was required, or vice versa.
  • Finally, binary search is very easy to underestimate and very hard to debug. You’ll get it working on one case, but when you increase the array size by 1 it’ll stop working; you’ll then fix it for this case, but now it won’t work in the original case!

I want to generalise and nail down the binary search, with the goal of introducing a shift in the way the you perceive it. By the end of this post you should be able to code any variant of binary search without hesitation and with complete confidence. But first, back to the start: here is the binary search you were probably taught…

In order to return a result to a user or for use in some other process, you have to find it first. This post may help you do just that, reliably.

NLM Plus

Monday, December 12th, 2011

NLM Plus

From the webpage:

NLMplus is an award winning Semantic Search Engine and Biomedical Knowledge Base application that showcases a variety of natural language processing tools to provide an improved level of access to the vast collection of biomedical data and services of the National Library of Medicine.

Utilizing its proprietary Web Knowledge Base, WebLib LLC can apply the universal search and semantic technology solutions demonstrated by NLMplus to libraries, businesses, and research organizations in all domains of science and technology and Web applications

Any medical librarians in the audience? Or ones you can forward this post to?

Curious what professional researchers make of NLM Plus? I don’t have the domain expertise to evaluate it.

Thanks!

Common Crawl

Wednesday, November 30th, 2011

Common Crawl

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.

Current crawl is reported to be 5 billion pages. That should keep you hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching as well as processing serious sized data.

Ten recent algorithm changes (Google)

Monday, November 14th, 2011

Ten recent algorithm changes (Google)

From the post:

Today we’re continuing our long-standing series of blog posts to share the methodology and process behind our search ranking, evaluation and algorithmic changes. This summer we published a video that gives a glimpse into our overall process, and today we want to give you a flavor of specific algorithm changes by publishing a highlight list of many of the improvements we’ve made over the past couple weeks.

We’ve published hundreds of blog posts about search over the years on this blog, our Official Google Blog, and even on my personal blog. But we’re always looking for ways to give you even deeper insight into the over 500 changes we make to search in a given year. In that spirit, here’s a list of ten improvements from the past couple weeks:

(skipping the good stuff, go to the post to read it)

If you’re a site owner, before you go wild tuning your anchor text or thinking about your web presence for Icelandic users, please remember that this is only a sampling of the hundreds of changes we make to our search algorithms in a given year, and even these changes may not work precisely as you’d imagine. We’ve decided to publish these descriptions in part because these specific changes are less susceptible to gaming.

I don’t doubt very large vendors (as well as small ones) would try to “game” Google search results.

But, describing search changes in generalities cuts Google off from suggestions from the community that could improve its legitimate search results. I am sure Google staff search all the search conferences and journals for new ideas, which is an indication that Google staff aren’t the only source of good ideas about search.

I don’t know what the mechanism would be like but I think Google should work towards some method to include more interested outsiders in the development of its search algorithms.

I don’t think Google has anything to fear from say Bing taking such a move but it isn’t that hard to imagine a start-up search company in a niche coming up with a viable way to harness an entire community of insight that filters upwards into search algorithms.

Now that would be a fearsome competitor for someone limited only to the “best and the brightest.”

PS: Do go read the ten algorithm changes. See what new ideas they spark in you and vote for Google to continue with the disclosures.

PPS: Is there a general discussion list for search algorithms? I don’t remember one off hand. Lots of specific ones for particular search engines.

BioGene 1.1 – Information Tool for Biological Research for Iphone

Thursday, October 20th, 2011

BioGene 1.1 – Information Tool for Biological Research for Iphone

From the website:

BioGene is an information tool for biological research. Use BioGene to learn about gene function. Enter a gene symbol or gene name, for example “CDK4″ or “cyclin dependent kinase 4″ and BioGene will retrieve its gene function and references into its function (GeneRIF).

The search/match criteria of Biogene is instructive:

Where does BioGene get its data?
BioGene provides primary information from Entrez Gene, a searchable database hosted by the NCBI.
What is a GeneRIF?
A GeneRIF is a functional annotation of a gene described in Entrez Gene. The annotation includes a link to a citation within PubMed which describes the function. Please see GeneRIF for more information.
How does BioGene search Entrez Gene?
BioGene attempts to match a query against a gene name (symbol). If no matching records are found, BioGene applies mild increments in permissiveness until a match is found. For example, if we are searching for the following single-term query, “trk”, BioGene will attempt the following sequence of queries in succession, stopping whenever one or more matching records is returned:

  • search for a gene name (symbol) that matches the exact sequence of characters “trk”
  • search for a gene name (symbol) that starts with the sequence of characters “trk”
  • search for a gene name (symbol) that contains the sequence of characters “trk” within a word
  • perform a free text search that matches the exact sequence of characters “trk”
  • perform a free text search that starts with the sequence of characters “trk”
  • perform a free text search that contains the sequence of characters “trk” within a word

In Entrez Gene parlance, for the following single-term query “trk”, the following sequence of queries is attempted:

  • trk[pref]
  • trk*[pref] OR trk[sym]
  • trk*[sym]
  • *trk*[sym]
  • trk[All Fields]
  • trk*[All Fields]
  • *trk*[All Fields]

If however, we are searching for the following multi-term query, “protein kinase 4”, BioGene will attempt the following sequence of queries in succession, stopping whenever one or more matching records is returned:

  • search for a full gene name that matches the exact sequence of characters “protein kinase 4”
  • perform a free text search that contains every term in the multi-term query “protein kinase 4”
  • perform a free text search that contains one of the terms in the multi-term query “protein kinase 4”

In Entrez Gene parlance, for the following multi-term query “protein kinase 4”, the following sequence of queries is attempted:

  • protein+kinase+4[gene full name]
  • protein[All Fields] OR kinase[All Fields] OR 4[All Fields]

If BioGene detects one or more of the following character sequences within the query:

[   ]   *   AND   OR   NOT

it treats this as an “advanced” query and passes the query directly to Entrez Gene. In this situation, BioGene ignores the organism filter specified in the application settings and expects the user to embed this filter within the query.

BioGene gives you access to subjects by multiple identifiers and at least for a closed set of data, attempts to find a “best” match.

Review Entrez Gene.

What other identifiers suggest themselves as bases of integrating known sources of additional information?

(Only “known” sources can be meaningfully displayed/integrated for the user.)

Search Algorithms with Google Director of Research Peter Norvig

Tuesday, October 18th, 2011

Search Algorithms with Google Director of Research Peter Norvig

From the post:

As you will see in the transcript below, this discussion focused on the use of artificial intelligence algorithms in search. Peter outlines for us the approach used by Google on a number of interesting search problems, and how they view search problems in general. This is fascinating reading for those of you who want to get a deeper understanding of how search is evolving and the technological approaches that are driving it. The types of things that are detailed in this interview include:

  1. The basic approach used to build Google Translate
  2. The process Google uses to test and implement algorithm updates
  3. How voice driven search works
  4. The methodology being used for image recognition
  5. How Google views speed in search
  6. How Google views the goals of search overall

Some of the particularly interesting tidbits include:

  1. Teaching automated translation systems vocabularly and grammar rules is not a viable approach. There are too many exceptions, and language changes and evolved rapidly. Google Translate uses a data driven approach of finding millions of real world translations on the web and learning from them.
  2. Chrome will auto translate foreign language websites for you on the fly (if you want it to).
  3. Google tests tens of thousands of algorithm changes per year, and make one to two actual changes every day
  4. Test is layered, starting with a panel of users comparing current and proposed results, perhaps a spin through the usability lab at Google, and finally with a live test with a small subset of actual Google users.
  5. Google Voice Search relies on 230 billion real world search queries to learn all the different ways that people articulate given words. So people no longer need to train their speech recognition for their own voice, as Google has enough real world examples to make that step unecessary.
  6. Google Image search allows you to drag and drop images onto the search box, and it will try to figure out what it is for you. I show a screen shot of an example of this for you below. I LOVE that feature!
  7. Google is obsessed with speed. As Peter says “you want the answer before you’re done thinking of the question”. Expressed from a productivity perspective, if you don’t have the answer that soon your flow of thought will be interrupted.

Reading the interview it occurred to me that perhaps, just perhaps, that authoring semantic applications, whether Semantic Web or Topic Maps, that we have been overly concerned with “correctness.” More so on the logic side where applications fall on their sides when they encounter outliers but precision is also the enemy of large scale production of topic maps.

What if we took a tack from Google’s use of a data driven approach to find mappings between data structures and the terms in data structures? I know automated techniques have been used for preliminary mapping of schemas before. What I am suggesting that we capture the basis for the mapping, so we can improve or change it.

Although there are more than 70 names for “insurance policy number” in information systems, I suspect that within the domain those stand in relationship to other subjects that would assist in refining a mining of those terms over time. Rather than making mining/mapping a “run it again Sam” type event, capturing that information could improve our odds at other mappings.

Depending on the domain, how accurate does it need to be? Particularly since we can build feedback into the systems so that as users encounter errors, those are corrected and cascade back to other users. Places users don’t visit may be wrong, but if no one visits, what difference does it make?

Very compelling interview and I suggest you read it in full.

Algorithms of the Intelligent Web Review

Monday, October 3rd, 2011

Algorithms of the Intelligent Web Review by Pearlene McKinley

From the post:

I have always had an interest in AI, machine learning, and data mining but I found the introductory books too mathematical and focused mostly on solving academic problems rather than real-world industrial problems. So, I was curious to see what this book was about.

I have read the book front-to-back (twice!) before I write this report. I started reading the electronic version a couple of months ago and read the paper print again over the weekend. This is the best practical book in machine learning that you can buy today — period. All the examples are written in Java and all algorithms are explained in plain English. The writing style is superb! The book was written by one author (Marmanis) while the other one (Babenko) contributed in the source code, so there are no gaps in the narrative; it is engaging, pleasant, and fluent. The author leads the reader from the very introductory concepts to some fairly advanced topics. Some of the topics are covered in the book and some are left as an exercise at the end of each chapter (there is a “To Do” section, which was a wonderful idea!). I did not like some of the figures (they were probably made by the authors not an artist) but this was only a minor aesthetic inconvenience.

The book covers four cornerstones of machine learning and intelligence, i.e. intelligent search, recommendations, clustering, and classification. It also covers a subject that today you can find only in the academic literature, i.e. combination techniques. Combination techniques are very powerful and although the author presents the techniques in the context of classifiers, it is clear that the same can be done for recommendations — as the Bell Korr team did for the Netflix prize.

Wonder if this will be useful in the Stanford AI course that starts next week with more than 130,000 students? Introduction to Artificial Intelligence – Stanford Class

I am going to order a copy, if for no other reason than to evaluate the reviewer’s claim of explanations “in plain English.” I have seen some fairly clever explanations of AI algorithms and would like to see how these stack up.

ElasticSearch: Beyond Full Text Search

Friday, September 30th, 2011

ElasticSearch: Beyond Full Text Search by Karel Minařík.

If you aren’t into hard core searching already, this is a nice introduction to the area. Would like to see the presentation that went with the slides but even the slides alone should be useful.

Ergodic Control and Polyhedral approaches to PageRank Optimization

Monday, September 26th, 2011

Ergodic Control and Polyhedral approaches to PageRank Optimization by Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, Stéphane Gaubert (Submitted on 10 Nov 2010 (v1), last revised 19 Sep 2011 (this version, v2))

Abstract:

We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which in each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a “master” page to which all controlled pages should point. We report numerical results on fragments of the real web graph.

I mention this research to raise several questions:

  1. Does PageRank have a role to play in presentation for topic map systems?
  2. Should PageRank results in topic map systems be used assign subject identifications?
  3. If your answer to #2 is yes, what sort of subjects and how would you design the user choices leading to them?
  4. Are you monitoring user navigations of your topic maps?
  5. Has user navigation of your topic maps affected their revision or design of following maps?
  6. Are the navigations in #5 the same as choices based on search results? (In theory or practice.)
  7. Is there an optimal strategy for linking nodes in a topic map?

Do No Evil (but patent the obvious)

Thursday, August 11th, 2011

On August 2, 2011, the US Patent office issued 7,991,780, Performing multiple related searches:

The Abstract reads:

A first search is performed in response to a received search query. The first search is based at least in part on a first portion of the search query. In the first search, a first set of content items are searched over to identify a first set of search results. Each result in the first set of search results identifies at least one content item of the first set of content items. A second set of content items for performing a second search is determined based at least in part on one or more of the results in the first set of search results. The second set of content items includes content items not included in the first set of search results. A second search is performed, searching over the second set of content items to identify a second set of search results. The second search is based at least in part on a second portion of the search query. Each result in the second set of search results identifies at least one content item of the second set of content items.

I have known about this for several days but had to give myself a “time out” so that my post would not be laced with obscenities and speculations on parentage and personal habits of which I have no personal knowledge.

I must also say that nothing in this post should be construed or taken as legal advice or opinion on the form, substance or history of this patent or any of the patents mentioned in this patent.

What I would like to say is that I remember subset searching in the days of Lexis/Nexis hard wired terminals in law offices (I had one) and that personal experience was twenty-five (25) years ago. The patent uses the magic words web pages and web addresses but I recognize the searches I did so long ago.

Perhaps I can describe integer arithmetic in a patent application using web page(s) and web address(es) terminology. If I can, the first person to front the fees and expenses, I will assign a one-half interest in the patent. Contact me privately and we will arrive at a figure for the application. Just don’t tell Intel. 😉

BTW, topic maps, with merging of information about a subject avoids the need for a secondary search on a sub-set of material in some cases.

Something that is long over-due and uniquely suited for topic maps would be a survey of the IP in the area of search and retrieval. Not in the sense of legal advice but simply putting together all the relevant patents. With a mapping of the various claims.

Open Source Search Engines (comparison)

Wednesday, July 27th, 2011

Open Source Search Engines (comparison)

A comparison of ten (10) open source search engines.

Appears as an appendix to Modern Information Retrieval, second edition.

I probably don’t need yet another IR book.

But the first edition was well written, the second edition website includes teaching slides for all chapters, a nice set of pointers to additional resources, problems and solutions is “under construction” as of 27 July 2011, all of which are things I like to encourage in authors.

OK, I talked myself into it, I am ordering a copy today. 😉

More comments to follow.

Lucene.net is back on track

Saturday, July 23rd, 2011

Lucene.net is back on track by Simone Chiaretta

From the post:

More than 6 months ago I blogged about Lucene.net starting his path toward extinction. Soon after that, due to the “stubbornness” of the main committer, a few forks appeared, the biggest of which was Lucere.net by Troy Howard.

At the end of the year, despite the promises of the main committer of complying to the request of the Apache board by himself, nothing happened and Lucene.net went really close to be being shut down. But luckily, the same Troy Howard that forked Lucene.net a few months before, decided, together with a bunch of other volunteers, to resubmit the documents required by the Apache Board for starting a new project into the Apache Incubator; by the beginning of February the new proposal was voted for by the Board and the project re-entered the incubator.

If you are interested in search engines and have .Net skills (or want to acquire them), this would be a good place to start.