Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 9, 2014

Lucene 4 Essentials for Text Search and Indexing

Filed under: Indexing,Java,Lucene,Searching — Patrick Durusau @ 5:06 pm

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short! 😉

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.

You?

PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.

February 24, 2014

Findability and Exploration:…

Findability and Exploration: the future of search by Stijn Debrouwere.

From the introduction:

The majority of people visiting a news website don’t care about the front page. They might have reached your site from Google while searching for a very specific topic. They might just be wandering around. Or they’re visiting your site because they’re interested in one specific event that you cover. This is big. It changes the way we should think about news websites.

We need ambient findability. We need smart ways of guiding people towards the content they’d like to see — with categorization and search playing complementary roles. And we need smart ways to keep readers on our site, especially if they’re just following a link from Google or Facebook, by prickling their sense of exploration.

Pete Bell recently opined that search is the enemy of information architecture. That’s too bad, because we’re really going to need great search if we’re to beat Wikipedia at its own game: providing readers with timely information about topics they care about.

First, we need to understand a bit more about search. What is search?

A classic (2010) statement of the requirements for a “killer” app. I didn’t say “search” app because search might not be a major aspect of its success. At least if you measure success in terms of user satisfaction after using an app.

A satisfaction that comes from obtaining the content they want to see. How they got there isn’t important to them.

February 21, 2014

Business Information Key Resources

Filed under: BI,Business Intelligence,Research Methods,Searching — Patrick Durusau @ 11:19 am

Business Information Key Resources by Karen Blakeman.

From the post:

On one of my recent workshops I was asked if I used Google as my default search tool, especially when conducting business research. The short answer is “It depends”. The long answer is that it depends on the topic and type of information I am looking for. Yes, I do use Google a lot but if I need to make sure that I have covered as many sources as possible I also use Google alternatives such as Bing, Millionshort, Blekko etc. On the other hand and depending on the type of information I require I may ignore Google and its ilk altogether and go straight to one or more of the specialist websites and databases.

Here are just a few of the free and pay-per-view resources that I use.

Starting points for research are a matter of subject, cost, personal preference, recommendations from others, etc.

What are your favorite starting points for business information?

February 19, 2014

Why Not AND, OR, And NOT?

Filed under: Boolean Operators,Lucene,Searching,Solr — Patrick Durusau @ 3:20 pm

Why Not AND, OR, And NOT?

From the post:

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist without it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator’” they aren’t going to be very happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile’”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into account the TF/IDF scores of the documents for the individual queries, as well as considering which documents match both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators and Crocodiles than in documents that only mention one or the other a great many times.)

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

As if you needed more proof that knowing “how” a search system is constructed is as important as knowing the surface syntax.

A great post that gives examples to illustrate each of the issues.

In case you are wondering about the December 28, 2011 date on the post, the BooleanClause.Occur mechanics it describes are unchanged as of Lucene 4.6.1.
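To see where those prefix operators land inside Lucene, here is a minimal sketch (Lucene 4.x API; the field name and terms are invented for illustration):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PrefixOperatorDemo {
    public static void main(String[] args) {
        // "+alligator crocodile -snake" expressed as BooleanClause.Occur values:
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("body", "alligator")), Occur.MUST);   // +alligator
        q.add(new TermQuery(new Term("body", "crocodile")), Occur.SHOULD); // crocodile
        q.add(new TermQuery(new Term("body", "snake")), Occur.MUST_NOT);   // -snake
        System.out.println(q); // +body:alligator body:crocodile -body:snake
    }
}
```

Note there is no AND or OR anywhere in the structure: each clause carries its own MUST / SHOULD / MUST_NOT flag, which is exactly why the prefix syntax is the more faithful one.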

Troubleshooting Elasticsearch searches, for Beginners

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 2:46 pm

Troubleshooting Elasticsearch searches, for Beginners by Alex Brasetvik.

From the post:

Elasticsearch’s recent popularity is in large part due to its ease of use. It’s fairly simple to get going quickly with Elasticsearch, which can be deceptive. Here at Found we’ve noticed some common pitfalls new Elasticsearch users encounter. Consider this article a piece of necessary reading for the new Elasticsearch user; if you don’t know these basic techniques take the time to familiarize yourself with them now, you’ll save yourself a lot of distress.

Specifically, this article will focus on text transformation, more properly known as text analysis, which is where we see a lot of people get tripped up. Having used other databases, the fact that all data is transformed before getting indexed can take some getting used to. Additionally, “schema free” means different things for different systems, a fact that is often confused with Elasticsearch’s “Schema Flexible” design.

When Alex says “beginners” he means beginning developers, so this isn’t a post you can send to users with search troubles.

Sorry!

But if you are trying to debug search results in ElasticSearch as a developer, this is a good place to start.
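To see the transformation step for yourself, print what an analyzer actually emits. A sketch against Lucene 4.x, which Elasticsearch builds on; the input text is invented:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzeDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        // "The" is a stopword, case is folded, the hyphen splits the compound.
        TokenStream ts = analyzer.tokenStream("body", new StringReader("The QUICK Brown-Foxes!"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term); // prints: quick, brown, foxes
        }
        ts.end();
        ts.close();
        analyzer.close();
    }
}
```

If the terms you search for are not the terms the analyzer produced, you get zero hits, which is the pitfall the article is warning about.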

February 18, 2014

ElasticSearch Analyzers – Parts 1 and 2

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 4:16 pm

Andrew Cholakian has written a two part introduction to analyzers in ElasticSearch.

All About Analyzers, Part One

From the introduction:

Choosing the right analyzer for an Elasticsearch query can be as much art as science. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms in an inverted index. If you need a refresher on the basics of inverted indexes and where analysis fits into Elasticsearch in general please see this chapter in Exploring Elasticsearch covering analyzers. In this article we’ll survey various analyzers, each of which showcases a very different approach to parsing text.

Ten tokenizers, thirty-one token filters, and three character filters ship with the Elasticsearch distribution; a truly overwhelming number of options. This number can be increased further still through plugins, making the choices even harder to wrap one’s head around. Combinations of these tokenizers, token filters, and character filters create what’s called an analyzer. There are eight standard analyzers defined, but really, they are simply convenient shortcuts for arranging tokenizers, token filters, and character filters yourself. While reaching an understanding of this multitude of options may sound difficult, becoming reasonably competent in the use of analyzers is merely a matter of time and practice. Once the basic mechanisms behind analysis are understood, these tools are relatively easy to reason about and compose.

All About Analyzers, Part Two (continues part 1).

Very much worth your time if you need a refresher on analyzers for ElasticSearch and/or are approaching them for the first time.

Of course I went hunting for the treatment of synonyms, only to find the standard fare.

Not bad by any means, but a grade school student knows that synonyms depend upon any number of factors, yet you would be hard pressed to find that reflected in any search engine.

I suppose you could define synonyms as most engines do and then filter the results of a gene search to eliminate “hits” from Field and Stream, Guns & Ammo, and the like. Although your searchers may be interested in how to trick out an AR-15. 😉

It may be that simple bulk steps are faster than more sophisticated searching. Will have to give that some thought.
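For reference, the “standard fare” at the Lucene layer underneath Elasticsearch looks roughly like this: a sketch (Lucene 4.x API; the gene/dna pairing is an invented example) of an analyzer wired to a static synonym table. The table is context-blind, which is exactly the complaint above:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class SynonymAnalyzer extends Analyzer {
    private final SynonymMap synonyms;

    public SynonymAnalyzer() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Static, context-free mapping: every "gene" is also indexed as "dna".
        builder.add(new CharsRef("gene"), new CharsRef("dna"), true);
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_46, source);
        stream = new SynonymFilter(stream, synonyms, true);
        return new TokenStreamComponents(source, stream);
    }
}
```

Whether “gene” should expand to “dna” in a Field and Stream article is a question this machinery cannot even ask.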

Better Search = Better Results

Filed under: Bing,Programming,Searching — Patrick Durusau @ 11:29 am

Bing Code Search Makes Developers More Productive by Rob Knies.

The problem:

Software developers routinely rely on the Internet to find and reuse code samples that pertain to their current projects. Sites such as the Microsoft Developer Network (MSDN) and StackOverflow provide a rich collection of code samples to address many of the needs programmers face.

The process for doing so, though, is not particularly streamlined. The developer has to exit the programming environment, switch to a browser, enter a search query, sift through the search results for useful code snippets, copy and paste a promising snippet back into the programming environment, and adapt the pasted snippet to the programming context at hand.

It works, but it’s not optimal.

The Solution:


The result of all this collaboration is a free add-in, which became available for download on Feb. 17, that makes it easier for .NET developers to search for and reuse code samples from across the coding community. The news about Bing Code Search also appears on the Bing and Visual Studio blogs.

The Payoff:

A recent study indicated that Bing Code Search provides to programmers a time improvement of more than 60 percent, compared with the browser-search-copy-and-paste scenario. (emphasis added)

Whether you use category theory with your spreadsheets or not, a 60 percent time improvement on code searching for your developers is impressive!

Your next goal should be 60 percent re-use of the code they find. 😉

PS: This is the type of metric semantic integration software needs to demonstrate. Take some concrete or even routine task that is familiar, time consuming, and/or hard to get good search results for. Save time and/or produce markedly better results.

February 16, 2014

SearchReSearch

Filed under: Search Behavior,Searching — Patrick Durusau @ 4:45 pm

SearchReSearch by Daniel M. Russell.

WARNING: SearchReSearch looks very addictive!

Truly, it really looks addictive.

The description reads:

A blog about search, search skills, teaching search, learning how to search, learning how to use Google effectively, learning how to do research. It also covers a good deal of sensemaking and information foraging.

If you like searching, knowing why searches work (or don’t), sensemaking and information foraging, this is the blog for you.

Among other features, Daniel posts search challenges that are solved by commenters and himself. Interesting search challenges.

Spread the news about Daniel’s blog to every librarian (or researcher) you know.

I first saw this at Pete Warden’s Five Short Links February 13, 2014.

February 6, 2014

Knowledge Base Completion…

Filed under: Knowledge,Knowledge Discovery,Searching — Patrick Durusau @ 8:01 pm

Knowledge Base Completion via Search-Based Question Answering by Robert West, et al.

Abstract:

Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search–based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask, such that the answer snippets returned by the search engine are most likely to contain the correct value for that attribute. For example, if we want to find Frank Zappa’s mother, we could ask the query “who is the mother of Frank Zappa”. However, this is likely to return “The Mothers of Invention”, which was the name of his band. Our system learns that it should (in this case) add disambiguating terms, such as Zappa’s place of birth, in order to make it more likely that the search results contain snippets mentioning his mother. Our system also learns how many different queries to ask for each attribute, since in some cases, asking too many can hurt accuracy (by introducing false positives). We discuss how to aggregate candidate answers across multiple queries, ultimately returning probabilistic predictions for possible values for each attribute. Finally, we evaluate our system and show that it is able to extract a large number of facts with high confidence.

I was glad to see this paper was relevant to searching because any paper with Frank Zappa and “The Mothers of Invention” in the abstract deserves to be cited. 😉 I will tell you that story another day.

It’s heavy reading and I have just begun but I wanted to mention something from early in the paper:

We show that it is better to ask multiple queries and aggregate the results, rather than rely on the answers to a single query, since integrating several pieces of evidence allows for more robust estimates of answer correctness.
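To make that aggregation concrete, here is a toy sketch; the search function is hypothetical, standing in for whatever snippet-extraction backend is in play:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class AnswerAggregator {
    // Combine ranked candidate answers from several reformulations of one question.
    static String bestAnswer(List<String> queries, Function<String, List<String>> search) {
        Map<String, Double> votes = new HashMap<>();
        for (String query : queries) {
            List<String> candidates = search.apply(query);
            for (int rank = 0; rank < candidates.size(); rank++) {
                // Reciprocal-rank weighting: evidence from top-ranked snippets counts more.
                votes.merge(candidates.get(rank), 1.0 / (rank + 1), Double::sum);
            }
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best; // a single query's false positive rarely survives this vote
    }
}
```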

Does the use of multiple queries run counter to the view that querying a knowledge base, be it RDF or topic maps or other, should result in a single answer?

If you were to ask me a non-trivial question five (5) days in a row (same question) you would get at least five different answers. All in response to the same question but eliciting slightly different information.

Should we take the same approach to knowledge bases? Or do we in fact already do take that approach by querying search engines with slightly different queries?

Thoughts?

I first saw this in a tweet by Stefano Bertolo.

February 5, 2014

Patent Search and Analysis Tools

Filed under: Intellectual Property (IP),Patents,Searching — Patrick Durusau @ 2:54 pm

Free and Low Cost Patent Search and Analysis Tools: Who Needs Expensive Name Brand Products? by Jackie Hutter.

From the post:

In private conversations, some of my corporate peers inform me that they pay $1000s per year (or even per quarter for larger companies) for access to “name brand” patent search tools that nonetheless do not contain accurate and up to date information. For example, a client tells me that one of these expensive tools fails to update USPTO records on a portfolio her company is monitoring and that the PAIR data is more than 1 year out of date. This limits the effectiveness of the expensive database by requiring her IP support staff to check each individual record on a regular basis to update the data. Of course, this limitation defeats the purpose of spending the big bucks to engage with a “name brand” search tool.

Certainly, one need not have sympathy for corporate IP professionals who manage large department budgets–if they spend needlessly on “name brand” tools and staff to manage the quality of such tools, so be it. But most companies with IP strategy needs do not have money and staff to purchase such tools, let alone to fix the errors in the datasets obtained from them. Others might wish not to waste their department budgets on worthless tools. To this end, over the last 5 years, I have used a number of free and low cost tools in my IP strategy practice. I use all of these tools on a regular basis and have personally validated the quality and validity of each one for my practice.
….

Jackie makes two cases:

First, there are free tools that perform as well or better than commercial patent tools. A link is offered to a list of them.

Second, and more importantly from my perspective, the low cost tools leave a lot to be desired in terms of UI and usability.

Certainly enough room for an “inexpensive” but better than commercial-grade patent search service to establish a market.

Or perhaps a more expensive “challenge” tool that warns subscribers about patents close to theirs.

I first saw this in a tweet by Lutz Maicher.

January 23, 2014

Finding long tail suggestions…

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:26 pm

Finding long tail suggestions using Lucene’s new FreeTextSuggester by Mike McCandless.

From the post:

Lucene’s suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user’s query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one in action at the Lucene/Solr Jira search application (e.g., try “python”) that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix’s search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.
….

Since you are unlikely to be Google, Mike goes on to show how FreeTextSuggester can ride to your rescue!
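A minimal sketch of using it (assuming Lucene 4.6’s suggest module; corpus.txt is an invented plain-text file, one entry per line):

```java
import java.io.FileReader;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.search.suggest.analyzing.FreeTextSuggester;
import org.apache.lucene.util.Version;

public class FreeTextDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        // Build a word-bigram language model from the raw text corpus.
        FreeTextSuggester suggester = new FreeTextSuggester(analyzer, analyzer, 2);
        suggester.build(new PlainTextDictionary(new FileReader("corpus.txt")));

        // Long-tail lookup: no list of canned suggestions required.
        List<Lookup.LookupResult> results = suggester.lookup("lucene sug", false, 5);
        for (Lookup.LookupResult result : results) {
            System.out.println(result.key + " " + result.value);
        }
    }
}
```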

As always, Mike’s post is a pleasure to read.

January 22, 2014

Google shows missing search terms

Filed under: Search Engines,Searching — Patrick Durusau @ 7:40 pm

Google shows missing search terms by Karen Blakeman.

From the post:

Several weeks ago I noticed that Google was displaying the terms it had dropped from your search as ‘Missing’. Google started routinely ignoring selected search terms towards the end of 2011 (see http://www.rba.co.uk/wordpress/2011/11/08/dear-google-stop-messing-with-my-search/). Google’s response to the outcry from searchers was to introduce the Verbatim search option. However, there was no way of checking whether all of your terms appeared in a result other than viewing the whole page. Irritating, to say the least, if you found that the top 10 results did not include all of your keywords.

Fast forward to December 2013, and some people started seeing results lists that showed missing keywords as strikethroughs. I saw them for a few days and then, just as I was preparing a blog posting on the feature, they disappeared! I assumed that they were one of Google’s live experiments never to be seen again but it seems they are back. Two people contacted me today to say that they are seeing strikethroughs on missing terms. I ran my test searches again and, yes, I’m seeing them as well.

I ran the original search that prompted my November 2011 article (parrots heron island Caversham UK) and included -site:rba.co.uk in the strategy to exclude my original blog postings. Sure enough, the first two results were missing parrots and had “Missing parrots” underneath their entry in the list.

At least as of today, try: parrots heron island Caversham UK -site:rba.co.uk in Google and you will see the same result.

A welcome development, although more transparency would be better still.

A non-transparent search process isn’t searching. It’s guessing.

January 6, 2014

Needles in Stacks of Needles:…

Filed under: Bioinformatics,Biomedical,Genomics,Searching,Visualization — Patrick Durusau @ 3:33 pm

Needles in Stacks of Needles: genomics + data mining by Martin Krzywinski. (ICDM2012 Keynote)

Abstract:

In 2001, the first human genome sequence was published. Now, just over 10 years later, we are capable of sequencing a genome in just a few days. Massive parallel sequencing projects now make it possible to study the cancers of thousands of individuals. New data mining approaches are required to robustly interrogate the data for causal relationships among the inherently noisy biology. How does one identify genetic changes that are specific and causal to a disease within the rich variation that is either natural or merely correlated? The problem is one of finding a needle in a stack of needles. I will provide a non-specialist introduction to data mining methods and challenges in genomics, with a focus on the role visualization plays in the exploration of the underlying data.

This page links to the slides Martin used in his presentation.

Excellent graphics and a number of amusing points, even without the presentation itself:

Cheap Date: A fruit fly that expresses high sensitivity to alcohol.

Kenny: A fruit fly without this gene dies in two days, named for the South Park character who dies in each episode.

Ken and Barbie: Fruit flies that fail to develop external genitalia.

One observation that rings true across disciplines:

Literature is still largely composed and published opaquely.

I searched for a video recording of the presentation but came up empty.

December 29, 2013

How semantic search is killing the keyword

Filed under: Searching,Semantic Search,Semantics,Topic Maps — Patrick Durusau @ 2:23 pm

How semantic search is killing the keyword by Rich Benci.

From the post:

Keyword-driven results have dominated search engine results pages (SERPs) for years, and keyword-specific phrases have long been the standard used by marketers and SEO professionals alike to tailor their campaigns. However, Google’s major new algorithm update, affectionately known as Hummingbird because it is “precise and fast,” is quietly triggering a wholesale shift towards “semantic search,” which focuses on user intent (the purpose of a query) instead of individual search terms (the keywords in a query).

Attempts have been made (in the relatively short history of search engines) to explore the value of semantic results, which address the meaning of a query, rather than traditional results, which rely on strict keyword adherence. Most of these efforts have ended in failure. However, Google’s recent steps have had quite an impact in the internet marketing world. Google began emphasizing the importance of semantic search by showcasing its Knowledge Graph, a clear sign that search engines today (especially Google) care a lot more about displaying predictive, relevant, and more meaningful sites and web pages than ever before. This “graph” is a massive mapping system that connects real-world people, places, and things that are related to each other and that bring richer, more relevant results to users. The Knowledge Graph, like Hummingbird, is an example of how Google is increasingly focused on answering questions directly and producing results that match the meaning of the query, rather than matching just a few words.

“Hummingbird” takes flight

Google’s search chief, Amit Singhal, says that the Hummingbird update is “the first time since 2001 that a Google algorithm has been so dramatically rewritten.” This is how Danny Sullivan of Search Engine Land explains it: “Hummingbird pays more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words.”

The point of this new approach is to filter out less-relevant, less-desirable results, making for a more satisfying, more accurate answer that includes rich supporting information and easier navigation. Google’s Knowledge Graph, with its “connect the dots” type of approach, is important because users stick around longer as they discover more about related people, events, and topics. The results of a simple search for Hillary Clinton, for instance, include her birthday, her hometown, her family members, the books she’s written, a wide variety of images, and links to “similar” people, like Barack Obama, John McCain, and Joe Biden.

The key to making your website more amenable to “semantic search” is the use of the microdata vocabulary you will find at Schema.org.

That is to say, Google has pre-fabricated information in its Knowledge Graph that it can match up with information specified using Schema.org markup.

Sounds remarkably like a topic map, doesn’t it?

Useful if you are looking for “popular” people, places and things. Not so hot with intra-enterprise search results. Unless of course your enterprise is driven by “pop” culture.

Impressive if you want coarse semantic searching sufficient to sell advertising. (See Type Hierarchy at Schema.org for all available types.)

I say coarse semantic searching because my count of the types at Schema.org, as of today, is seven hundred and nineteen (719). Is that what you get?

I ask because in scanning “InteractAction,” I don’t see SexAction or any of its sub-categories. Under “ConsumeAction” I don’t see SmokeAction or SmokeCrackAction or SmokeWeedAction or any of the other sub-categories of “ConsumeAction.” Under “LocalBusiness” I did not see WhoreHouse, DrugDealer, S/MShop, etc.

I felt like I had fallen into BradyBunchville. 😉

Seriously, if they left out those mainstream activities, what are the chances they included what you need for your enterprise?

Not so good. That’s what I thought.

A topic map when paired with a search engine and your annotated content can take your enterprise beyond keyword search.

December 24, 2013

Undated Search Results

Filed under: Search Data,Searching — Patrick Durusau @ 3:12 pm

Looking for HTML5 resources to mention in Design, Math, and Data was complicated by the lack of dating in search results.

Searching on “html5 interfaces examples,” the highest ranked result was:

HTML5 Website Showcase: 48 Potential Flash-Killing Demos (2009, est.)

That’s right, a four-year-old post.

Never mind the changes in CSS, jQuery, etc. over the last four years.

Several pages into the first search results I found:

40+ Useful HTML5 Examples and Tutorials (2012)

It was in a mixture of undated or variously dated resources.

Finally, after following an old post and then searching that site, I uncovered:

21 Fresh Examples of Websites Using HTML5 (2013)

Even there it wasn’t the highest ranked page at the site.

I realize that parsing dates for sites could be difficult but surely search engines know the date when they first encountered a page? That would make it trivial to order search results by time.

Pages would not be in a strict chronological sequence, but it would be a better time ordering than the current hodgepodge of results.

elasticsearch-entity-resolution

Filed under: Duke,ElasticSearch,Entity Resolution,Search Engines,Searching — Patrick Durusau @ 2:17 pm

elasticsearch-entity-resolution

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute the probability of a match. You can pretty much use it as an interactive deduplication engine.

To understand the basics, go to the Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.
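The Bayesian flavor of that pairing can be sketched in a few lines; this is the generic evidence-combination formula, not Duke’s actual code:

```java
public class BayesCombine {
    // Fold per-property match probabilities into one posterior, starting from
    // an indifferent 0.5 prior. A 0.9 says "these field values match strongly";
    // a 0.1 says "they strongly disagree".
    static double combine(double... fieldProbabilities) {
        double posterior = 0.5;
        for (double p : fieldProbabilities) {
            posterior = (posterior * p) / (posterior * p + (1 - posterior) * (1 - p));
        }
        return posterior;
    }

    public static void main(String[] args) {
        // Name matches well, address weakly disagrees, phone matches:
        System.out.println(combine(0.9, 0.4, 0.8)); // ~0.96: very likely the same entity
    }
}
```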

A step more likely to be followed if it were under an Apache License as opposed to its current LGPLv3.

December 20, 2013

Principles of Solr application design

Filed under: Searching,Solr — Patrick Durusau @ 7:35 pm

Principles of Solr application design – part 1 of 2

Principles of Solr application design – part 2 of 2

From part 1:

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! So without further ado here’s the first part:

Over two posts you get thirteen (13) points to check off while building a Solr application.

You won’t find anything startling but it will make a useful checklist.

Solr Cluster

Filed under: LucidWorks,Search Engines,Searching,Solr — Patrick Durusau @ 7:30 pm

Solr Cluster

From the webpage:

Join us weekly for tips and tricks, product updates and Q&A on topics you suggest. Guest appearances from Lucene/Solr committers and PMC members. Send questions to SolrCluster@lucidworks.com

So far:

#1 Entity Recognition

Enhance Search applications beyond simple keyword search by adding intelligence through metadata. Help classify common patterns from unstructured data/content into predefined categories. Examples include names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages etc. Entity recognition is usually built using either linguistic grammar-based techniques or statistical models.

#2 On Enterprise and Intranet Search

What use is search to an enterprise? What is the purpose of intranet search? How hard is it to implement? In this episode we speak with LucidWorks consultant Evan Sayer about the benefits of internal search and how to prepare your business data to best take advantage of full-text search.

Well, the lead-in music isn’t Beaker Street, but it’s not that long.

I think the discussion would be easier to follow with a webpage with common terms and an outline of the topic for the day.

Has real potential so I urge you to listen, send in questions and comments.

Search …Business Critical in 2014

Filed under: Merging,Search Requirements,Searching,Topic Maps — Patrick Durusau @ 7:14 pm

Search Continues to Be Business Critical in 2014 by Martin White.

From the post:

I offer two topics that I see becoming increasingly important in 2014. One of these is cross-device search, where a search is initially conducted on a desktop and is continued on a smartphone, and vice-versa. There is a very good paper from Microsoft that sets out some of the issues. The second topic is continuous information seeking, where search tasks are carried out by more than one “searcher,” often in support of collaborative working. The book on this topic by Chirag Shah, a member of staff of Rutgers University, is a very good place to start.

Editor’s Note: Read more of Martin’s thoughts on search in Why All Search Projects Fail.

Gee, let me see, what would more than one searcher need to make their collaborative search results usable by the entire team?

Can you say merging? 😉

Martin has other, equally useful insights in the search space so don’t miss the rest of his post.

But also catch his “Why All Search Projects Fail.” Good reading before you sign a contract with a client.

December 14, 2013

Everything is Editorial:..

Filed under: Algorithms,Law,Legal Informatics,Search Algorithms,Searching,Semantics — Patrick Durusau @ 7:57 pm

Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.

From the post:

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything—the document corpus, the index, the query, and the results—were mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

[Image: delegated search]

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

A delightful piece on search algorithms that touches at the heart of successful search and/or data integration.

Its first three words capture the issue: Everything is Editorial….

Despite the pretensions of scholars, sages and rogues, everything is editorial; there are no universal semantic primitives.

For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.

Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.

See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.

If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.

December 12, 2013

Use your expertise – build a topical search engine

Filed under: Search Engines,Searching — Patrick Durusau @ 7:27 pm

Use your expertise – build a topical search engine

From the post:

Did you know that a topical search engine can help your users find content from more than a single domain? You can use your expertise to provide a delightful user experience targeting a particular topic on the Web.

There are two main types of engines built with Google Custom Search: site search and topical search. While site search is relatively straightforward – it lets you implement a search for a blog or a company website – topical search is an entirely different story.

Topical search engines focus on a particular topic and therefore usually cover a part of the Web that is larger than a single domain. Because of this topical engines need to be carefully fine-tuned to bring the best results to the users.

OK, yes, it is a Google API and run by Google.

That doesn’t trouble me overmuch. My starting assumption is that anything that leaves my subnet is being recorded.

Recorded and sold if there is a buyer for the information.

Doesn’t even have to leave my subnet if they have the right equipment.

Anyway, think of Google’s Custom Search API as another source of data like Common Crawl.

It’s more current than Common Crawl if that is one of your project requirements. And probably easier to use for most folks.

And you can experiment at very low risk to see if your custom search engine is likely to be successful.

Whether you want a public or private custom search engine, I am interested in hearing about your experiences.

December 10, 2013

ThisPlusThat.me: [Topic Vectors?]

Filed under: Search Algorithms,Search Engines,Searching,Vectors — Patrick Durusau @ 7:30 pm

ThisPlusThat.me: A Search Engine That Lets You ‘Add’ Words as Vectors by Christopher Moody.

From the post:

Natural language isn’t that great for searching. When you type a search query into Google, you miss out on a wide spectrum of human concepts and human emotions. Queried words have to be present in the web page, and then those pages are ranked according to the number of inbound and outbound links. That’s great for filtering out the cruft on the internet — and there’s a lot of that out there. What it doesn’t do is understand the relationships between words and understand the similarities or dissimilarities.

That’s where ThisPlusThat.me comes in — a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through its paces.

For example, word2vec allows you to evaluate a query like King – Man + Woman and get the result Queen. This means you can do some totally new searches.

… (examples omitted)

word2vec is a type of distributed word representation algorithm that trains a neural network in order to assign a vector to every word. Each of the dimensions in the vector tries to encapsulate some property of the word. Crudely speaking, one dimension could encode that man, woman, king and queen are all ‘people,’ whereas other dimensions could encode associations or dissociations with ‘royalty’ and ‘gender’. These traits are learned by trying to predict the context in a sentence and learning from correct and incorrect guesses.

Precisely!!!

😉

Doing it with word2vec requires large training sets of data. No doubt a useful venture if you are seeking to discover or document the word vectors in a domain.

But what if you wanted to declare vectors for words?

And then run word2vec (or something very similar) across the declared vectors.

Thinking along the lines of a topic map construct that has a “word” property with a non-null value. All the properties that follow are key/value pairs representing the positive and negative dimensions that give that word meaning.

Associations are collections of vector sums that identify subjects that take part in an association.

If we do all addressing by vector sums, we lose the need to track and collect system identifiers.

I think this could have legs.
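Here is a toy sketch of that idea, with three hand-declared dimensions standing in for the hundreds a trained word2vec model learns; the words and weights are invented:

```java
import java.util.HashMap;
import java.util.Map;

public class DeclaredVectors {
    // Hand-declared dimensions, roughly (royalty, maleness, other). Purely illustrative.
    static final Map<String, double[]> WORDS = new HashMap<String, double[]>();
    static {
        WORDS.put("king",  new double[]{0.9, 0.9, 0.1});
        WORDS.put("queen", new double[]{0.9, 0.1, 0.1});
        WORDS.put("man",   new double[]{0.1, 0.9, 0.1});
        WORDS.put("woman", new double[]{0.1, 0.1, 0.1});
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // King - Man + Woman, summed dimension by dimension:
        double[] sum = new double[3];
        for (int i = 0; i < 3; i++) {
            sum[i] = WORDS.get("king")[i] - WORDS.get("man")[i] + WORDS.get("woman")[i];
        }
        // Nearest declared vector (excluding the starting word) by cosine similarity:
        String nearest = null;
        double best = -1;
        for (Map.Entry<String, double[]> e : WORDS.entrySet()) {
            double sim = cosine(sum, e.getValue());
            if (!e.getKey().equals("king") && sim > best) {
                best = sim;
                nearest = e.getKey();
            }
        }
        System.out.println(nearest); // queen
    }
}
```

Nothing here needs a neural network; the arithmetic works over declared vectors exactly as it does over learned ones.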

Comments?

PS: For efficiency reasons, I suspect we should allow storage of computed vector sum(s) on a construct. But that would not prohibit another analysis reaching a different vector sum for different purposes.

December 9, 2013

Building Client-side Search Applications with Solr

Filed under: Lucene,Search Interface,Searching,Solr — Patrick Durusau @ 7:46 pm

Building Client-side Search Applications with Solr by Daniel Beach.

Description:

Solr is a powerful search engine, but creating a custom user interface can be daunting. In this fast paced session I will present an overview of how to implement a client-side search application using Solr. Using open-source frameworks like SpyGlass (to be released in September) can be a powerful way to jumpstart your development by giving you out-of-the box results views with support for faceting, autocomplete, and detail views. During this talk I will also demonstrate how we have built and deployed lightweight applications that are able to be performant under large user loads, with minimal server resources.

If you need a compelling reason to watch this video, check out:

Global Patent Search Network.

What is the Global Patent Search Network?

As a result of cooperative effort between the United States Patent and Trademark Office (USPTO) and State Intellectual Property Office (SIPO) of the People’s Republic of China, Chinese patent documentation is now available for search and retrieval from the USPTO website via the Global Patent Search Network. This tool will enable the user to search Chinese patent documents in the English or Chinese language. The data available include fulltext Chinese patents and machine translations. Also available are full document images of Chinese patents which are considered the authoritative Chinese patent document. Users can search documents including published applications, granted patents and utility models from 1985 to 2012.

Something over four (4) million patents.

Try the site, then watch the video.

Software mentioned: Spyglass, Ember.js.

Introducing Luwak,…

Filed under: Java,Lucene,Searching — Patrick Durusau @ 5:04 pm

Introducing Luwak, a library for high-performance stored queries by Charlie Hull.

From the post:

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.

We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

That may sound odd, using the article as the query, but be aware that Charlie reports “speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware.”

Perhaps not “big data speed” but certainly enough speed to get your attention.

Charlie mentions in his Dublin slides that Luwak could be used to “Add metadata to items based on their content.”

That is one use case, but creating topics/associations out of content would be another.
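Luwak’s own API aside, the “article as query” core can be sketched with stock Lucene 4.x and MemoryIndex (query strings and article text invented; Luwak’s real contribution is pre-filtering so only a fraction of the stored queries ever run):

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class StoredQueriesDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        QueryParser parser = new QueryParser(Version.LUCENE_46, "text", analyzer);

        // The stored expressions, e.g. one per media-monitoring client.
        Map<String, Query> stored = new LinkedHashMap<String, Query>();
        stored.put("maps-alert", parser.parse("\"topic maps\""));
        stored.put("search-alert", parser.parse("+stored +queries"));

        // Index the incoming article once, in memory...
        MemoryIndex article = new MemoryIndex();
        article.addField("text", "Indexing stored queries turns search upside down.", analyzer);

        // ...then run each stored query against that single-document index.
        for (Map.Entry<String, Query> e : stored.entrySet()) {
            if (article.search(e.getValue()) > 0.0f) {
                System.out.println("matched: " + e.getKey());
            }
        }
    }
}
```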

December 2, 2013

Top search tips from Exeter and Bristol

Filed under: Searching — Patrick Durusau @ 7:28 pm

Top search tips from Exeter and Bristol by Karen Blakeman.

From the post:

A couple of weeks ago I was in Exeter and Bristol leading workshops for NHS South West on “Google & Beyond”. We covered advanced Google commands, Google Scholar and alternatives to Google. Below are the combined top tips from the two sessions. I may have missed a couple from the list as I could not read my writing, so if you attended one of the workshops let me know if I’ve omitted your suggested tip.

All of these tips are no doubt old hat to readers of this blog but Karen gives a nice list of search tips you can forward to your users. 😉

Enjoy!

A language for search and discovery

Filed under: Search Behavior,Searching,Users — Patrick Durusau @ 7:22 pm

A language for search and discovery by Tony Russell-Rose.

Abstract:

In order to design better search experiences, we need to understand the complexities of human information-seeking behaviour. In this paper, we propose a model of information behaviour based on the needs of users across a range of search and discovery scenarios. The model consists of a set of modes that users employ to satisfy their information goals.

We discuss how these modes relate to existing models of human information seeking behaviour, and identify areas where they differ. We then examine how they can be applied in the design of interactive systems, and present examples where individual modes have been implemented in interesting or novel ways. Finally, we consider the ways in which modes combine to form distinct chains or patterns of behaviour, and explore the use of such patterns both as an analytical tool for understanding information behaviour and as a generative tool for designing search and discovery experiences.

Tony’s post is also available as a pdf file.

A deeply interesting paper but consider the evidence that underlies it:

The scenarios were collected as part of a series of requirements workshops involving stakeholders and customer-facing staff from various client organisations. A proportion of these engagements focused on consumer-oriented site search applications (resulting in 277 scenarios) and the remainder on enterprise search applications (104 scenarios).

The scenarios were generated by participants in breakout sessions and subsequently moderated by the workshop facilitator in a group session to maximise consistency and minimise redundancy or ambiguity. They were also prioritised by the group to identify those that represented the highest value both to the end user and to the client organisation.

This data possesses a number of unique properties. In previous studies of information seeking behaviour (e.g. [5], [10]), the primary source of data has traditionally been interview transcripts that provide an indirect, verbal account of end user information behaviours. By contrast, the current data source represents a self-reported account of information needs, generated directly by end users (although a proportion were captured via proxy, e.g. through customer facing staff speaking on behalf of the end users). This change of perspective means that instead of using information behaviours to infer information needs and design insights, we can adopt the converse approach and use the stated needs to infer information behaviours and the interactions required to support them.

Moreover, the scope and focus of these scenarios represents a further point of differentiation. In previous studies, (e.g. [8]), measures have been taken to address the limitations of using interview data by combining it with direct observation of information seeking behaviour in naturalistic settings. However, the behaviours that this approach reveals are still bounded by the functionality currently offered by existing systems and working practices, and as such do not reflect the full range of aspirational or unmet user needs encompassed by the data in this study.

Finally, the data is unique in that is constitutes a genuine practitioner-oriented deliverable, generated expressly for the purpose of designing and delivering commercial search applications. As such, it reflects a degree of realism and authenticity that interview data or other research-based interventions might struggle to replicate.

It’s not a bad thing to use data from commercial engagements for research and is certainly better than usability studies based on 10 to 12 undergraduates, two of whom did not complete the study. 😉

However, I would be very careful about trying to generalize from a self-selected group even for commercial search, much less the fuller diversity of other search scenarios.

On the other hand, the care with which the data was analyzed makes it an excellent data point against which to compare other data points, hopefully with more diverse populations.

November 29, 2013

OpenSearchServer

Filed under: Search Engines,Searching — Patrick Durusau @ 8:11 pm

OpenSearchServer by Emmanuel Keller.

From the webpage:

OpenSearchServer is a powerful, enterprise-class, search engine program. Using the web user interface, the crawlers (web, file, database, …) and the REST/RESTFul API you will be able to integrate quickly and easily advanced full-text search capabilities in your application. OpenSearchServer runs on Linux/Unix/BSD/Windows.

Search functions

  • Advanced full-text search features
  • Phonetic search
  • Advanced boolean search with query language
  • Clustered results with faceting and collapsing
  • Filter search using sub-requests (including negative filters)
  • Geolocation
  • Spell-checking
  • Relevance customization
  • Search suggestion facility (auto-completion)

Indexation

  • Supports 17 languages
  • Fields schema with analyzers in each language
  • Several filters: n-gram, lemmatization, shingle, stripping diacritic from words,…
  • Automatic language recognition
  • Named entity recognition
  • Word synonyms and expression synonyms
  • Export indexed terms with frequencies
  • Automatic classification

Document supported

  • HTML / XHTML
  • MS Office documents (Word, Excel, Powerpoint, Visio, Publisher)
  • OpenOffice documents
  • Adobe PDF (with OCR)
  • RTF, Plaintext
  • Audio files metadata (wav, mp3, AIFF, Ogg)
  • Torrent files
  • OCR over images

Crawlers

  • The web crawler for internet, extranet and intranet
  • The file systems crawler for local and remote files (NFS, SMB/CIFS, FTP, FTPS, SWIFT)
  • The database crawler for all JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, …)
  • Filter inclusion or exclusion with wildcards
  • Session parameters removal
  • SQL join and linked files support
  • Screenshot capture
  • Sitemap import

General

  • REST API (XML and JSON)
  • SOAP Web Service
  • Monitoring module
  • Index replication
  • Scheduler for management of periodic tasks
  • WordPress plugin and Drupal module

OpenSearchServer is something to consider if your project is GPL v3 compatible.

Even in an enterprise context, you don’t have to be better than Google at searching the entire WWW.

You just have to be better at searching content of interest to a user, project, department, etc.

The difference between your search results and Google’s should be the difference between a breakfast of near-food at McDonald’s and the best home-cooked breakfast you can imagine.

One is a mass-produced product that is the same over the world, the other is customized to your taste.

Which one would you prefer?

November 27, 2013

Redesigned percolator

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 11:15 am

Redesigned percolator by Martijn Vangroningen.

From the post:

The percolator is essentially search in reverse, which can be confusing initially for many people. This post will help to solve that problem and give more information on the redesigned percolator. We have added a lot more features to it to help users work with percolated documents/queries more easily.

In normal search systems, you store your data as documents and then send your questions as queries. The search results are a list of documents that matched your query.

With the percolator, this is reversed. First, you store the queries and then you send your ‘questions’ as documents. The percolator results are a list of queries that matched the document.

So what can the percolator do for you? The percolator can be used for a number of use cases, but the most common is for alerting and monitoring. By registering queries in Elasticsearch, your data can be monitored in real-time. If data with certain properties is being indexed, the percolator can tell you what queries this data matches.

For example, imagine a user “saving” a search. As new documents are added to the index, documents are percolated against this saved query and the user is alerted when new documents match. The percolator can also be used for data classification and user query feedback.
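Both halves in one sketch against the Elasticsearch 1.x Java API (index, type, field, and query names invented; method names from memory, so verify against the client docs before relying on them):

```java
import org.elasticsearch.action.percolate.PercolateResponse;
import org.elasticsearch.client.Client;

public class PercolatorDemo {
    public static void run(Client client) {
        // 1. Store the query: percolator queries live in the special ".percolator" type.
        client.prepareIndex("alerts", ".percolator", "bonsai-alert")
              .setSource("{\"query\":{\"match\":{\"body\":\"bonsai tree\"}}}")
              .execute().actionGet();

        // 2. Send the 'question' as a document and get back the matching queries.
        PercolateResponse response = client.preparePercolate()
                .setIndices("alerts")
                .setDocumentType("message")
                .setSource("{\"doc\":{\"body\":\"A new bonsai tree arrived today\"}}")
                .execute().actionGet();

        for (PercolateResponse.Match match : response) {
            System.out.println("matched query: " + match.getId()); // bonsai-alert
        }
    }
}
```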

Even as a beta feature, this sounds interesting.

Another use case could be adhering to a Service Level Agreement (SLA).

You could have tiered search result packages that guarantee the freshness of search results. Near real-time would be more expensive than within six (6) hours or within the next business day. The match to a stored query could be queued up for delivery in accordance with your SLA.

I pay more for faster delivery times from FedEx, UPS, and, the US Post Office.

Why shouldn’t faster information cost more than slower information?

True, there are alternative suppliers of information but then you remind your prospective client of the old truism, you get what you pay for.

That is not contradicted by IT disasters such as HealthCare.gov.

The government hired contractors that are hard to distinguish from their agency counterparts and who are interested in “butts in seats” and not any useful results.

In that sense, the government literally got what it paid for. Had it wanted a useful healthcare IT project, it would not have put government drones in charge of the project.

November 25, 2013

MAST Discovery Portal

Filed under: Astroinformatics,Data Mining,Searching,Space Data — Patrick Durusau @ 8:11 pm

A New Way To Search, A New Way To Discover: MAST Discovery Portal Goes Live

From the post:

MAST is pleased to announce that the first release of our Discovery Portal is now available. The Discovery Portal is a one-stop web interface to access data from all of MAST’s supported missions, including HST, Kepler, GALEX, FUSE, IUE, EUVE, Swift, and XMM. Currently, users can search using resolvable target names or coordinates (RA and DEC). The returned data include preview plots of the data (images, spectra, or lightcurves), sortable columns, and advanced filtering options.

An accompanying AstroViewer projects celestial sky backgrounds from DSS, GALEX, or SDSS on which to overlay footprints from your search results. A details panel allows you to see header information without downloading the file, visit external sites like interactive displays or MAST preview pages, and cross-search with the Virtual Observatory. In addition to searching MAST, users can also search the Virtual Observatory based on resolvable target names or coordinates, and download data from the VO directly through the Portal (Spitzer, 2MASS, WISE, ROSAT, etc.) You can quickly download data one row at a time, or add items to your Download Cart as you browse for download when finished, much like shopping online.

Basic plotting tools allow you to visualize metadata from your search results. Users can also upload their own tables of targets (IDs and coordinates) for use within the Portal. Cross-matching can be done with all MAST data or any data available through the CDS at Strasbourg. All of these features interact with each other: you can use the charts to drag and select data points on a plot, whose footprints are highlighted in the AstroViewer and whose returned rows are brought to the top of your search results grid for further download or exploration.

Just a quick reminder that not every data mining project is concerned with recommendations of movies or mining reviews.

Seriously, astronomy has been dealing with “big data” long before it became a buzz word.

When you are looking for new techniques or insights into data exploration, check my posts under astroinformatics.

November 20, 2013

…Scorers, Collectors and Custom Queries

Filed under: Lucene,Search Engines,Searching — Patrick Durusau @ 7:30 pm

Lucene Search Essentials: Scorers, Collectors and Custom Queries by Mikhail Khludnev.

From the description:

My team is building a next generation eCommerce search platform for a major online retailer with quite challenging business requirements. Turns out, the default Lucene toolbox doesn’t ideally fit those challenges. Thus, the team had to hack deep into Lucene core to achieve our goals. We accumulated quite a deep understanding of Lucene search internals and want to share our experience. We will start with an API overview, and then look at essential search algorithms and their implementations in Lucene. Finally, we will review a few cases of query customization, pitfalls and common performance problems.

Don’t be frightened of the slide count at 179!

Multiple slides are used with single illustrations to demonstrate small changes.

Having said that, this is a “close to the metal” type presentation.

Worth your time but read along carefully.
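As a taste of the “close to the metal” level the talk works at, here is a minimal custom Collector (Lucene 4.x API) that counts hits without scoring or ranking, the kind of primitive the presentation builds on:

```java
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Counts matching documents without computing scores or keeping a hit queue.
public class CountingCollector extends Collector {
    private int count;

    @Override
    public void setScorer(Scorer scorer) {
        // Scores are never read, so the scorer can be ignored entirely.
    }

    @Override
    public void collect(int doc) {
        count++;
    }

    @Override
    public void setNextReader(AtomicReaderContext context) {
        // No per-segment state to reset for a simple count.
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // order is irrelevant when only counting
    }

    public int getCount() {
        return count;
    }
}
```

Pass an instance to IndexSearcher.search(query, collector) and read getCount() when it returns: no scoring work, no priority queue.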

Don’t miss the extremely fine index on slide 18.

Follow http://www.lib.rochester.edu/index.cfm?PAGE=489 for images of pages that go with the index. This copy of Fasciculus Temporum dates from 1480.

