Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 2, 2013

solrconfig.xml: …

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 11:58 am

solrconfig.xml: Understanding SearchComponents, RequestHandlers and Spellcheckers by Mark Bennett.

I spend most of my configuration time in Solr’s schema.xml, but the solrconfig.xml is also a really powerful tool. I wanted to use my recent spellcheck configuration experience to review some aspects of this important file. Sure, solrconfig.xml lets you configure a bunch of mundane sounding things like caching policies and library load paths, but it also has some high-tech configuration “LEGO blocks” that you can mix and match and re-assemble into all kinds of interesting Solr setups.

What is spell checking if it isn’t validation of a name? 😉

If you like knowing the details, this is a great post!
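Bennett's post covers the moving parts; as a concrete illustration, a minimal spellcheck wiring in solrconfig.xml looks roughly like this (the field name "text" and handler name "/select" are illustrative defaults of mine, not taken from his post):

```xml
<!-- A spellcheck SearchComponent backed by the main text field -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>

<!-- Chain the component into a RequestHandler so it runs on each query -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

The "LEGO block" character shows up in that last-components array: the same component can be chained into any number of handlers.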

May 27, 2013

Crawl-Anywhere

Filed under: Search Engines,Searching,Solr,Webcrawler — Patrick Durusau @ 1:24 pm

Crawl-Anywhere

From the webpage:

April 2013 – Starting with version 4.0, Crawl-Anywhere becomes an open-source project. The current version is 4.0.0-alpha.

Stable version 3.x is still available at http://www.crawl-anywhere.com/

(…)

Crawl Anywhere is mainly a web crawler. However, Crawl-Anywhere includes all components in order to build a vertical search engine.

Crawl Anywhere includes:

Project home page: http://www.crawl-anywhere.com/

A web crawler is a program that discovers and reads all pages or documents (HTML, PDF, Office, …) on a web site in order, for example, to index the data and build a search engine (like Google). Wikipedia provides a great description of what a Web crawler is: http://en.wikipedia.org/wiki/Web_crawler.

If you are gathering “very valuable intel” as in Snow Crash, a search engine will help.

It won't do the heavy lifting, but it will help.

May 24, 2013

How Does A Search Engine Work?…

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:59 pm

How Does A Search Engine Work? An Educational Trek Through A Lucene Postings Format by Doug Turnbull.

From the post:

A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.

The centerpiece of a Lucene codec is its postings format. "Postings" are a commonly thrown-around word in the Lucene space. A postings format is the representation of the inverted search index – the core data structure used to look up documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:

A first step towards moving beyond being a search engine result consumer.
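To make the postings idea concrete, here is a toy inverted index in Python that mirrors the shape of a SimpleText posting list: each term maps to the documents, and the positions within them, where it occurs. This is a sketch of the data structure only, not Lucene's actual code.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each term to a sorted list of (doc_id, positions) pairs."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    # Freeze into a SimpleText-like shape: term -> [(doc, [positions]), ...]
    return {t: sorted(d.items()) for t, d in postings.items()}

docs = ["the quick brown fox", "the lazy dog", "quick quick dog"]
idx = build_postings(docs)
# Looking up a term is just walking its posting list:
print(idx["quick"])   # [(0, [1]), (2, [0, 1])]
print(idx["dog"])     # [(1, [2]), (2, [2])]
```

Search is fast precisely because the per-term lists are precomputed at index time; a query never scans documents, only posting lists.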

May 22, 2013

Dynamic faceting with Lucene

Filed under: Faceted Search,Facets,Indexing,Lucene,Search Engines — Patrick Durusau @ 2:08 pm

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch‘s and Solr‘s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.
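The index-time/search-time split McCandless describes can be sketched in a few lines: intern each facet label to an integer id at index time, then facet a query by counting ids in an array. This is a sketch of the idea only, not the facet module's API.

```python
# Index time: intern each facet label to a stable integer id.
label_to_id = {}
doc_facets = []  # per-document list of label ids

def index_doc(labels):
    """Map each label to an id (assigning new ids as needed) and store them."""
    ids = []
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
        ids.append(label_to_id[label])
    doc_facets.append(ids)

for labels in (["lang/java"], ["lang/java", "os/linux"], ["os/linux"]):
    index_doc(labels)

# Search time: for each matched doc, bump a counter per id -- cheap array work.
def facet_counts(matched_docs):
    counts = [0] * len(label_to_id)
    for doc in matched_docs:
        for fid in doc_facets[doc]:
            counts[fid] += 1
    id_to_label = {v: k for k, v in label_to_id.items()}
    return {id_to_label[i]: c for i, c in enumerate(counts) if c}

print(facet_counts([0, 1, 2]))  # {'lang/java': 2, 'os/linux': 2}
```

The taxonomy index plays the role of `label_to_id` here: because the mapping is global and persistent, nothing has to be rebuilt on a near-real-time reopen.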

May 8, 2013

How Impoverished is the “current world of search?”

Filed under: Context,Indexing,Search Engines,Searching — Patrick Durusau @ 12:34 pm

Internet Content Is Looking for You

From the post:

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the search text for which a user is seeking. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

(…)

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches are context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person who has subject matter knowledge beyond the average reader’s and so is able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

May 2, 2013

FindZebra

Filed under: Medical Informatics,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 4:48 am

FindZebra

From the about page:

FindZebra is a specialised search engine supporting medical professionals in diagnosing difficult patient cases. Rare diseases are especially difficult to diagnose, and this online medical search engine comes in support of medical personnel looking for diagnostic hypotheses. With a simple and consistent interface across all devices, it can be easily used as an aid at the time and place where medical decisions are made. The retrieved information is collected from reputable sources across the internet that store public medical articles on rare and genetic diseases.

A search engine with: WARNING! This is a research project to be used only by medical professionals.

To avoid overwhelming researchers with search result “noise,” FindZebra deliberately restricts the content it indexes.

It is an illustration of the crudeness of current search algorithms that altering the inputs is the easiest way to improve outcomes for particular types of searches.

That seems to be an argument in favor of smaller-than-enterprise search engines, which could roll up into broader search applications.

Of course, with a topic map you could retain the division between departments even as you roll up the content into broader search applications.

April 21, 2013

Google search:… [GDM]

Filed under: Search Behavior,Search Engines,Searching — Patrick Durusau @ 12:46 pm

Google search: three bugs to fix with better data science by Vincent Granville.

Vincent outlines three issues with Google search results:

  1. Outdated search results
  2. Wrongly attributed articles
  3. Favoring irrelevant pages

See Vincent’s post for advice on how Google can address these issues. (Might help with a Google interview to tell them how to fix such long standing problems.)

More practically, how does your TM application rate on the outdated search results?

Do you just dump content on the user to sort out (the Google dump model (GDM)) or are your results a bit more user friendly?

April 19, 2013

Broccoli: Semantic Full-Text Search at your Fingertips

Filed under: Indexing,Search Algorithms,Search Engines,Semantic Search — Patrick Durusau @ 4:49 pm

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.
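A toy model of a context list, assuming contexts are sentences and entities are pre-annotated (both assumptions of mine; the paper's actual index is far more compact), might look like this:

```python
# Each context (here: a sentence) holds its words plus annotated entities.
contexts = [
    (["broccoli", "is", "edible"], ["Entity:Broccoli"]),
    (["plants", "are", "green"], ["Entity:Plant"]),
    (["broccoli", "native", "to", "italy"], ["Entity:Broccoli", "Entity:Italy"]),
]

def context_list(prefix):
    """One item per word occurrence matching the prefix, plus one item per
    entity occurring in the same context (the paper's key addition)."""
    items = []
    for cid, (words, entities) in enumerate(contexts):
        hits = [w for w in words if w.startswith(prefix)]
        if hits:
            items.extend((cid, w) for w in hits)
            items.extend((cid, e) for e in entities)
    return items

print(context_list("broc"))
# [(0, 'broccoli'), (0, 'Entity:Broccoli'),
#  (2, 'broccoli'), (2, 'Entity:Broccoli'), (2, 'Entity:Italy')]
```

Because the entities ride along in the same list, a query like "edible plants" can intersect word hits with entity hits in one pass, without a separate join against an ontology store.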

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

April 9, 2013

Improving Twitter search with real-time human computation [“semantics supplied”]

Filed under: Human Computation,Search Engines,Searching,Semantics,Tweets — Patrick Durusau @ 1:54 pm

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….
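Stripped of Storm, Thrift and Mechanical Turk, the control flow Chen describes reduces to: count queries, and when one spikes, enqueue it for human judgment. A minimal sketch, in which the threshold and names are invented (the real system uses streaming statistics, not a fixed count):

```python
from collections import Counter

SPIKE_THRESHOLD = 3          # invented for illustration
query_counts = Counter()
judgment_queue = []          # stands in for the Mechanical Turk dispatch

def observe(query):
    """Count a search; enqueue the query for human judgment on a spike."""
    query_counts[query] += 1
    if query_counts[query] == SPIKE_THRESHOLD:
        judgment_queue.append(query)

for q in ["big bird", "weather", "big bird", "big bird", "big bird"]:
    observe(q)

print(judgment_queue)  # ['big bird'] -- enqueued exactly once, on the 3rd hit
```

The equality test (rather than `>=`) is what makes each spiking query go to the judges once instead of on every subsequent search.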

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplied semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between asking an expert (you get his/her semantics) and asking all of us (too many answers to be useful).

More to follow.

March 30, 2013

ElasticSearch: Text analysis for content enrichment

Filed under: ElasticSearch,Indexing,Search Engines,Searching — Patrick Durusau @ 6:15 pm

ElasticSearch: Text analysis for content enrichment by Jaibeer Malik.

From the post:

Taking an example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by any search solution plays a very big role in it. As a search user, I would expect some typical search behaviors to apply automatically to my query:

  • should look for synonyms matching my query text
  • should match singular and plural words, or words sounding similar to the entered query text
  • should not allow searching on protected words
  • should allow search for words mixed with numeric or special characters
  • should not allow search on HTML tags
  • should allow search text based on proximity of the letters and number of matching letters

Enriching the content here would be to add the above search capabilities to your content while indexing and searching for the content.
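The first item on that wish list, synonym matching, is easy to sketch outside any search engine: expand the query terms through a synonym table before matching. A toy Python illustration (real engines such as ElasticSearch do this with a synonym token filter at analysis time):

```python
# Toy synonym table; a real deployment would load one (e.g. WordNet-derived).
SYNONYMS = {"tv": {"television"}, "television": {"tv"}}

def expand(query_terms):
    """Return the query terms plus any synonyms, as a set."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

docs = {1: {"cheap", "television", "sets"}, 2: {"garden", "hose"}}

def search(query):
    """Match any document sharing a term with the expanded query."""
    terms = expand(query.lower().split())
    return [d for d, words in docs.items() if terms & words]

print(search("tv"))  # [1] -- matched via the synonym "television"
```

The same expansion step is what a topic map builder needs when gathering occurrences: the query "tv" must pull in documents that only ever say "television."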

I thought the “…look for synonyms matching my query text…” might get your attention. 😉

Not quite a topic map because there isn’t any curation of the search results, saving the next searcher time and effort.

But in order to create and maintain a topic map, you are going to need expansion of your queries by synonyms.

You will take the results of those expanded queries and fashion them into a topic map.

Think of it this way:

Machines can rapidly harvest, even sort content at your direction.

What they can’t do is curate the results of their harvesting.

That requires a secret ingredient.

That would be you.

I first saw this at DZone.

March 28, 2013

Build a search engine in 20 minutes or less

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 7:15 pm

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.

Enjoy!

PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.
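If you want the gist before reading the post: the vector space model represents each document as a term-count vector and ranks documents by cosine similarity to the query vector. A minimal, self-contained sketch:

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = ["the quick brown fox", "the slow brown dog", "lorem ipsum"]
query = vectorize("brown fox")
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(d)), reverse=True)
print(ranked[0])  # 'the quick brown fox'
```

Note how the ranking works purely on shared occurrences, which is exactly the PS point above: the vectors capture where words occur, not what they mean.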

March 6, 2013

URL Search Tool!

Filed under: Common Crawl,Search Data,Search Engines,Searching — Patrick Durusau @ 7:22 pm

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs, since you can now run them only on the files of interest instead of the entire corpus.
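The workflow described, downloading the JSON metadata and running your job only against the listed files, amounts to filtering your input paths. A hedged sketch (the field names here are invented; check them against the JSON you actually download):

```python
import json

# Hypothetical shape of the downloaded metadata -- verify against a real file.
metadata_json = """[
  {"arcFileName": "crawl/seg1/file_a.arc.gz", "arcFileOffset": 0},
  {"arcFileName": "crawl/seg1/file_b.arc.gz", "arcFileOffset": 4096}
]"""

def files_of_interest(raw):
    """Collect the distinct archive files the job should read."""
    return sorted({entry["arcFileName"] for entry in json.loads(raw)})

print(files_of_interest(metadata_json))
# ['crawl/seg1/file_a.arc.gz', 'crawl/seg1/file_b.arc.gz']
```

Feeding that list to your job's input configuration, instead of the whole corpus, is where the time and money savings come from.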

Imagine that.

Searching relevant data instead of “big data.”

What a concept!

March 5, 2013

How Search Works [Promoting Your Interface]

Filed under: Search Engines,Searching — Patrick Durusau @ 12:51 pm

How Search Works (Google)

Clever graphics and I rather liked the:

By the way, in the **** seconds you’ve been on this page, approximately *********** searches were performed.

Not that you want that sort of tracking if your topic map interface only gets two or three “hits” a day but in an enterprise context…, might be worth thinking about.

Evidence of the popularity of your topic map interface with the troops.

I first saw this in a tweet by Christian Jansen.

February 19, 2013

dtsearch Tutorial Videos

Filed under: dtSearch,e-Discovery,Search Engines — Patrick Durusau @ 7:26 am

Tutorials for the dtsearch engine have been posted to ediscovery TV.

In five parts:

Part 1

Part 2

Part 3

Part 4

Part 5

I skipped over the intro videos only to find:

Not being able to “select all” in Excel doesn’t increase my confidence in the presentation. (part 3)

The copying of files that are “responsive” to a search request is convenient but not all that impressive. (part 4)

The presenter isn’t familiar with basic operations in dtsearch, such as why some files were not copied. An answer does finally appear. (part 5)

Disappointing because I remember dtsearch from years ago and it was (and still is) an impressive bit of work.

Suggestion: Don’t judge dtsearch by these videos.

I started to suggest you download all the brochures/white papers you will find at: http://www.dtsearch.com/contact.html

There is a helpful “Download All: PDF Portfolio” link. Except that it doesn’t work, in Chrome at least. It keeps giving me an Adobe Acrobat 10 download window, even after I install Adobe Acrobat 10.

Here’s a general hint for vendors: Don’t try to help. You will get it wrong. If you want to give users access to files, great, but let viewing/use be on their watch.

So, download the brochures/white papers individually until dtsearch recovers from another self-inflicted marketing wound.

Then grab a 30-day evaluation copy of the software.

It may or may not fit your needs but you will get a fairer sense of the product than you will from the videos or parts of the dtsearch website.

Maybe that’s the key: They are great search engineers, not so hot at marketing or websites.

I first saw this at dtSearch Harnesses TV Power. Where videos are cited, but not watched.

February 15, 2013

Solr Unleashed

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 10:14 am

Solr Unleashed: A Hands-On Workshop for Building Killer Search Apps (LucidWorks)

From the post:

Having consulted with clients on Lucene and Solr for the better part of a decade, we’ve seen the same mistakes made over and over again: applications built on shaky foundations, stretched to the breaking point. In this two day class, learn from the experts about how to do it right and make sure your apps are rock solid, scalable, and produce relevant results. Also check the course outline.

The course looks great, but if you don’t have the fees, I have reproduced the course outline below.

Using online documentation, mailing lists and other online resources, track the outline and fill it in for yourself.

If you want a real challenge, work through the outline and then build a Solr application around the outline.

To keep your newly acquired skills polished to a fine sheen.

1. The Fundamentals

  • About Solr
  • Installing and running Solr
  • Adding content to Solr
  • Reading a Solr XML response
  • Changing parameters in the URL
  • Using the browse interface

2. Searching

  • Sorting results
  • Query parsers
  • More queries
  • Hardwiring request parameters
  • Adding fields to default search
  • Faceting on fields
  • Range faceting
  • Date range faceting
  • Hierarchical faceting
  • Result grouping

3. Indexing

  • Adding your own content to Solr
  • Deleting data from solr
  • Building a bookstore search
  • Adding book data
  • Exploring the book data
  • Dedupe updateprocessor

4. Updating your schema

  • Adding fields to the schema
  • Analyzing text

5. Relevance

  • Field weighting
  • Phrase queries
  • Function queries

6. Extended features

  • More-like-this
  • Fuzzier search
  • Sounds-like
  • Geospatial
  • Spell checking
  • Suggestions
  • Highlighting

7. Multilanguage

  • Working with English
  • Working with other languages
  • Non-whitespace languages
  • Identifying languages
  • Language specific sorting

8. SolrCloud

  • Introduction
  • How SolrCloud works
  • Commit strategies
  • ZooKeeper
  • Managing Solr config files
Not the same as the class but will help you ask better questions of LucidWorks experts when you need them.

February 3, 2013

DuckDuckGo Architecture…

Filed under: Search Engines,Search Interface,Search Requirements,Searching — Patrick Durusau @ 6:58 pm

DuckDuckGo Architecture – 1 Million Deep Searches A Day And Growing Interview with Gabriel Weinberg.

From the post:

This is an interview with Gabriel Weinberg, founder of Duck Duck Go and general all around startup guru, on what DDG’s architecture looks like in 2012.

Innovative search engine upstart DuckDuckGo had 30 million searches in February 2012 and averages over 1 million searches a day. It’s being positioned by super investor Fred Wilson as a clean, private, impartial and fast search engine. After talking with Gabriel, I like what Fred Wilson said earlier; it seems closer to the heart of the matter: We invested in DuckDuckGo for the Reddit, Hacker News anarchists.
                  
Choosing DuckDuckGo can be thought of as not just a technical choice, but a vote for revolution. In an age when knowing your essence is not about love or friendship, but about more effectively selling you to advertisers, DDG is positioning themselves as the do not track alternative, keepers of the privacy flame. You will still be monetized of course, but in a more civilized and anonymous way.

Pushing privacy is a good way to carve out a competitive niche against Google et al, as by definition they can never compete on privacy. I get that. But what I found most compelling is DDG’s strong vision of a crowdsourced network of plugins giving broader search coverage by tying an army of vertical data suppliers into their search framework. For example, there’s a specialized Lego plugin for searching against a complete Lego database. Use the name of a spice in your search query, for example, and DDG will recognize it and may trigger a deeper search against a highly tuned recipe database. Many different plugins can be triggered on each search and it’s all handled in real-time.

Can’t searching the Open Web provide all this data? Not really. This is structured data with semantics, not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets, and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem of course is that now that data has become valuable, many grown-ups don’t want to share anymore.

Being ad supported puts DDG in a tricky position. Targeted ads are more lucrative, but ironically DDG’s do not track policies means they can’t gather targeting data. Yet that’s also a selling point for those interested in privacy. But as search is famously intent driven, DDG’s technology of categorizing queries and matching them against data sources is already a form of high value targeting.

It will be fascinating to see how these forces play out. But for now let’s see how DuckDuckGo implements their search engine magic…

Some topic map centric points from the post:

Dream is to appeal to more niche audiences to better serve people who care about a particular topic. For example: Lego parts. There’s a database of Lego parts, and pictures of parts and part numbers can be automatically displayed from a search.

  • Some people just use different words for things. Goal is not to rewrite the query, but give suggestions on how to do things better.
  • “phone reviews” for example, will replace phone with telephone. This happens through an NLP component that tries to figure out what phone you meant and if there are any synonyms that should be used in the query.
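The plugin idea, where a recognized trigger term fires a search against a vertical data source, can be sketched as a keyword-to-plugin dispatch (plugin names and result strings here are invented, not DDG's actual code):

```python
# Each "plugin" recognizes trigger terms and answers from its own vertical data.
PLUGINS = {
    "lego": lambda q: "lego-parts-db results for: " + q,
    "paprika": lambda q: "recipe-db results for: " + q,
}

def deep_search(query):
    """Fire every plugin whose trigger term appears in the query."""
    results = []
    for trigger, plugin in PLUGINS.items():
        if trigger in query.lower().split():
            results.append(plugin(query))
    return results or ["fallback: general web results"]

print(deep_search("lego 3001 brick"))  # ['lego-parts-db results for: lego 3001 brick']
print(deep_search("weather"))          # ['fallback: general web results']
```

In the real system many plugins can fire on one query, in real time; the interesting engineering is in recognizing the triggers, which is where the NLP and synonym handling above come in.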

Those are the ones that caught my eye, there are no doubt others.

Not to mention a long list of DuckDuckGo references at the end of the post.

What place(s) would you suggest to DuckDuckGo where topic maps would make a compelling difference?

January 31, 2013

Google Guide [Improve Google Searching]

Filed under: Query Language,Search Engines,Searching — Patrick Durusau @ 7:24 pm

Google Guide by Nancy Blachman.

Non-official documentation for Google searching but very nice non-official documentation.

If you want to improve your Google searching, this is a good place to start!

Available in English, Dutch, German, Hebrew and Italian.

January 30, 2013

Identity In A Context

Filed under: Search Engines,Searching,Semantics — Patrick Durusau @ 8:44 pm

Jasmine Ashton frames a quote about Julie Lynch, an archivist saying:

Due to the nature of her work, Lynch is the human equivalent of a search engine. However, she differs in one key aspect:

“Unlike Google, Lynch delivers more than search results, she provides context. That sepia-tinged photograph of the woman in funny-looking clothes on a funny-looking bicycle actually offers a window into the impact bicycles had on women’s independence. An advertisement touting “can build frame houses” demonstrates construction restrictions following the Great Chicago Fire. Surprisingly, high school yearbooks — the collection features past editions from Lane Tech, Amundsen and Lake View High Schools — serve as more than a cautionary tale in the evolution of hairstyles.”

Despite the increase in technology that makes searching information as easy as tapping a touch screen, this article reiterates the importance of having real people to contextualize these documents. (How Librarians Play an Integral Role When Searching for Historical Documents)

Rather than say “contextualize,” I would prefer to say that librarians provide alternative “contexts” for historical documents.

Recognition of a document, or any other subject, takes place in a context. A librarian can offer the user different contexts in which to understand a document.

Doesn’t invalidate the initial context of understanding, simply becomes an alternative one.

Quite different from our search engines, which see only “matches” and no context for those matches.

January 26, 2013

Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Filed under: Search Data,Search Engines,Searching,Semantics,Structured Data — Patrick Durusau @ 1:42 pm

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with schema.org. However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hijacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

  1. Google develops a service for webmasters to add semantic annotations to their webpages.
  2. Google allows webmasters to use that service at no charge.
  3. Google uses those annotations to improve the search results it provides users (for free).

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:

Hijacking

(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.

You?

I first saw this at: Google’s Structured Data Take Over by Angela Guess.

January 17, 2013

MongoDB Text Search Tutorial

Filed under: MongoDB,Search Engines,Searching,Text Mining — Patrick Durusau @ 7:26 pm

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature providing some examples for query syntax (negation, phrase search)—according to the previous post even more advanced queries should be supported, filtering and projections, multiple text fields indexing, and adding details about the stemming solution used (Snowball).

Alex also has a list of his posts on the text search feature for MongoDB.

January 15, 2013

Symbolab

Filed under: Mathematics,Mathematics Indexing,Search Engines,Searching — Patrick Durusau @ 8:31 pm

Symbolab

Described as:

Symbolab is a search engine for students, mathematicians, scientists and anyone else looking for answers in the mathematical and scientific realm. Other search engines that do equation search use LaTeX, the document markup language for mathematical symbols, treating it the same as keywords, which unfortunately gives poor results.

Symbolab uses proprietary machine learning algorithms to provide the most relevant search results that are theoretically and semantically similar, rather than visually similar. In other words, it does a semantic search, understanding the query behind the symbols, to get results.

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation. This means scale can come much quicker than other search engines that are limited by language.

From: The guys from The Big Bang Theory will love mathematical search engine Symbolab by Shira Abel (includes an interview with Michael Avny, the CEO of Symbolab).
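Symbolab's algorithms are proprietary, but the gap between visual and semantic matching is easy to illustrate: x^2 + y^2 = r^2 and a^2 + b^2 = c^2 are different strings yet the same form. A toy canonicalizer that renames variables in order of first appearance (a made-up heuristic, nothing like Symbolab's actual machine learning) captures the idea:

```python
import re

def canonicalize(expr):
    """Rename variables in order of first appearance (x, y, r... -> v0, v1, v2...),
    so semantically identical forms compare equal regardless of letter choice."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    return re.sub(r"[a-zA-Z]+", rename, expr.replace(" ", ""))

print(canonicalize("x^2 + y^2 = r^2") == canonicalize("a^2 + b^2 = c^2"))  # True
print(canonicalize("x^2 + y^2") == canonicalize("x^3 + y^3"))  # False
```

The same trick fails across disciplines where the *symbols themselves* carry different meanings, which is exactly the point Michael Avny makes below.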

Limited to web content at the moment but a “scholar” option is in the works. I assume that will extend into academic journals.

Focused now on mathematics, physics and chemistry, but in principle should be extensible to related areas. I am particularly eager to hear that they are indexing CS publications!

Would be really nice if Springer, Elsevier, the AMS and others would permit indexing of their equations.

That presumes publishers would realize that shutting out users not at institutions is a bad marketing plan. With a marginal delivery cost of near zero and sunk costs from publication already fixed, every user a publisher gains at $200/year for their entire collection is $200 they did not have before.

Not to mention the citation and use of their publication, which just drives more people to publish there. A virtuous circle if you will.

The only concern I have is the comment:

The nice thing about math and science is that it’s universal – there’s no need for translation in order to understand an equation.

Which is directly contrary to what Michael is quoted as saying in the interview:

You say “Each symbol can mean different things within and across disciplines, order and position of elements matter, priority of features, etc.” Can you give an example of this?

The authors of the Foundations of Rule Learning spent five years attempting to reconcile the notations used in rule learning. Some symbols had different meanings. They resorted to inventing yet another notation as a solution.

Why the popular press perpetuates the myth of a universal language isn’t clear.

It isn’t useful and in some cases, such as national security, it leads to waste of time and resources on attempts to invent a universal language.

The phrase “myth of a universal language” should be a clue. Universal languages don’t exist. They are myths, by definition.

Anyone who says differently is trying to sell you something, Something that is in their interest and perhaps not yours.

I first saw this at Introducing Symbolab: Search for Equations by Angela Guess.

January 12, 2013

Advanced Power Searching [January 23, 2013]

Filed under: Search Engines,Searching — Patrick Durusau @ 7:07 pm

Advanced Power Searching

From the post:

Advanced Power Searching with Google begins on January 23, 2013!

Register now to sharpen your research skills and strengthen your use of advanced Google search techniques to answer complex questions. Throughout this course you’ll also:

  • Take your search strategies to a new level with sophisticated, independent search challenges.
  • Join a community of Advanced Searchers working together to solve search challenges.
  • Pose questions to Google search experts live in Hangouts and through a course forum.
  • Receive an Advanced Power Searching certificate upon completion.

Not sure if you’re ready for Advanced Power Searching? Brush up on your search skills by visiting the Power Searching with Google course.

Topic maps help keep found information found but you have to find it first. 😉

Enjoy!

January 11, 2013

Solr vs ElasticSearch: Part 5 – Management API Capabilities

Filed under: ElasticSearch,Search Engines,Searching,Solr — Patrick Durusau @ 7:35 pm

Solr vs ElasticSearch: Part 5 – Management API Capabilities by Rafał Kuć.

From the post:

In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregations possibilities. However, till now we have not discussed any of the administration and management options and things you can do on a live cluster without any restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer.

Rafał continues this excellent series on Solr and ElasticSearch and promises there is more to come!

This series sets a high standard for posts comparing search capabilities!

January 9, 2013

izik Debuts as #1 Free Reference App on iTunes

Filed under: Interface Research/Design,izik,Search Engines,Search Interface — Patrick Durusau @ 12:04 pm

izik Debuts as #1 Free Reference App on iTunes

From the post:

We launched izik, our search app for tablets, last Friday and are amazed at the responses we’ve received! Thanks to our users, on day one izik was the #1 free reference app on iTunes and #49 free app overall. Yesterday we were mentioned twice in the New York Times, here and here (also in the B1 story in print). We are delighted that there is such a strong desire to see something fresh and new in search, and that our vision with izik is so well received.

The twitterverse has been especially active in spreading the word about izik. We’ve seen a lot of comments about the beautiful design and interface, the useful categories, and most importantly the high quality results that make izik a truly viable choice for searching on tablets.

Just last Monday I remarked: “From the canned video I get the sense that the interface is going to make search different.” (izik: Take Search for a Joy Ride on Your Tablet)

Users with tablets have supplied the input I asked for in that post and it is overwhelmingly in favor of izik.

To paraphrase Ray Charles in the Blues Brothers:

“E-excuse me, uh, I don’t think there’s anything wrong with the action on [search applications].”

There is plenty of “action” left in the search space.

izik is fresh evidence for that proposition.

December 10, 2012

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Filed under: Gora,HBase,Nutch,Search Engines,Solr — Patrick Durusau @ 10:45 am

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, as many improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison, but roughly: Nutch 1.x keeps crawl data in its own segment storage and pushes documents to Solr for indexing, while Nutch 2.x abstracts storage through Apache Gora, typically backed by HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

December 9, 2012

Fun with Lucene’s faceted search module

Filed under: Faceted Search,Lucene,Search Engines,Searching — Patrick Durusau @ 8:16 pm

Fun with Lucene’s faceted search module by Mike McCandless.

From the post:

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene’s facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice “getting started” examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I’m sure there are more…

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
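Since all facets in the module are hierarchical, counting a hit under a deep facet path also credits its ancestors. Here is a sketch of that counting arithmetic in Python (the facet module itself is Java and works off a taxonomy index, so this is only the idea, not the implementation):

```python
from collections import Counter

def count_facets(hit_facet_paths):
    """Count hits per hierarchical facet path, crediting every ancestor:
    a hit under Books/Science/Physics also counts for Books/Science and Books."""
    counts = Counter()
    for path in hit_facet_paths:
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            counts["/".join(parts[:i])] += 1
    return counts

hits = ["Books/Science/Physics", "Books/Science/Biology", "Books/Fiction"]
counts = count_facets(hits)
print(counts["Books"], counts["Books/Science"], counts["Books/Science/Physics"])
# 3 2 1
```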

Take some time over the holidays to play with faceted searches in Lucene.

November 22, 2012

eGIFT: Mining Gene Information from the Literature

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.
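The score the abstract describes compares a term's frequency in the gene's literature to its frequency in documents about genes in general. A minimal frequency-ratio sketch of that idea (the smoothing and exact formula here are my assumptions, not eGIFT's published definition):

```python
import math
from collections import Counter

def iterm_scores(gene_docs, background_docs):
    """Score each term by the log ratio of its document frequency in the
    gene's literature versus the background corpus (add-one smoothed)."""
    def doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc.lower().split()))
        return df
    gene_df, bg_df = doc_freq(gene_docs), doc_freq(background_docs)
    n_gene, n_bg = len(gene_docs), len(background_docs)
    return {t: math.log(((gene_df[t] + 1) / (n_gene + 1)) /
                        ((bg_df[t] + 1) / (n_bg + 1)))
            for t in gene_df}

gene = ["kinase activity regulates apoptosis", "kinase signaling pathway"]
background = ["gene expression study", "protein structure analysis",
              "kinase unrelated mention", "cell cycle regulation"]
scores = iterm_scores(gene, background)
# terms rare in the background outrank terms common to both corpora
print(scores["apoptosis"] > scores["kinase"] > 0)  # True
```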

Website: http://biotm.cis.udel.edu/eGIFT

Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities and discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed, which is where most researchers stop.)

Developing New Ways to Search for Web Images

Developing New Ways to Search for Web Images by Shar Steed.

From the post:

Collections of photos, images, and videos are quickly coming to dominate the content available on the Web. Currently internet search engines rely on the text with which the images are labeled to return matches. But why is only text being used to search visual mediums? These labels can be unreliable, unhelpful and sometimes not available at all.

To solve this problem, scientists at Stanford and Princeton have been working to “create a new generation of visual search technologies.” Dr. Fei-Fei Li, a computer scientist at Stanford, has built the world’s largest visual database, containing more than 14 million labeled objects.

A system called ImageNet applies the data gathered from the database to recognize similar, unlabeled objects with much greater accuracy than past algorithms.

A remarkable amount of material to work with, either via the API or downloading for your own hacking.

Another tool for assisting in the authoring of topic maps (or other content).
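One simple way a labeled collection like ImageNet can be applied to unlabeled objects is nearest-neighbor matching over feature vectors. The sketch below uses made-up three-dimensional features and labels; real systems extract high-dimensional features from pixels and use far more sophisticated classifiers:

```python
import math

def nearest_label(query, labeled_examples):
    """Assign the label of the closest feature vector (Euclidean distance).
    Feature vectors here are invented; real systems derive them from pixels."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(labeled_examples, key=lambda ex: dist(query, ex[0]))[1]

examples = [((0.9, 0.1, 0.0), "cat"),
            ((0.1, 0.9, 0.1), "dog"),
            ((0.0, 0.2, 0.9), "car")]
print(nearest_label((0.8, 0.2, 0.1), examples))  # cat
```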

November 10, 2012

Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20

Filed under: ElasticSearch,Search Engines,Searching,Solr,SolrCloud — Patrick Durusau @ 8:42 am

Battle of the Giants: Apache Solr 4.0 vs. ElasticSearch 0.20 by Rafał Kuć.

A very nice summary (slides) of Rafał’s comparison of the latest releases of Solr and ElasticSearch.

They differ and those differences fit some use cases better than others.

And the winner is: … (well, I won’t spoil the surprise.)

Read the slides.

Unless you are Rafał, you will learn something you didn’t know before.

October 10, 2012

Information Retrieval and Search Engines [Committers Needed!]

Filed under: Information Retrieval,Search Engines — Patrick Durusau @ 4:18 pm

Information Retrieval and Search Engines

A proposal is pending to create a Q&A site for people interested in information retrieval and search engines.

But it needs people to commit to using it and answering questions!

That could be you!

There’s a lot of action left in information retrieval and search engines.

Don’t have to believe me. Have you tried one lately? 😉
