Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 9, 2015

Solr 2014: A Year in Review

Filed under: Search Engines,Solr — Patrick Durusau @ 8:25 pm

Solr 2014: A Year in Review by Anshum Gupta.

If you aren’t already excited about Solr 5, targeted for later this month, perhaps these section headings from Anshum’s post will capture your interest:

Usability – Ease of use and management

SolrCloud and Collection APIs

Scalability and optimizations

CursorMark: Distributed deep paging

TTL: Auto-expiration for documents

Distributed Pivot Faceting

Query Parsers

Distributed IDF

Solr Scale Toolkit

Testing

No more war

Solr 5

Community

That is a lot of improvement for a single year! See Anshum’s post and you will be excited about Solr 5 too!
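If any of those headings are unfamiliar, a quick example may help. Here is a minimal sketch of one item on the list, cursorMark deep paging, against a local Solr instance using Python’s requests library; the URL, collection and field names are placeholders, not anything from Anshum’s post.

```python
import requests

SOLR = "http://localhost:8983/solr/collection1/select"  # placeholder collection

cursor = "*"
while True:
    resp = requests.get(SOLR, params={
        "q": "*:*",
        "sort": "id asc",        # cursor paging needs a total order ending on the uniqueKey
        "rows": 100,
        "cursorMark": cursor,
        "wt": "json",
    }).json()
    for doc in resp["response"]["docs"]:
        print(doc["id"])         # process each page of documents here
    next_cursor = resp["nextCursorMark"]
    if next_cursor == cursor:    # cursor stopped moving: no more results
        break
    cursor = next_cursor
```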

January 4, 2015

Find In-Depth Articles (Google Hack)

Filed under: Search Engines,Searching — Patrick Durusau @ 7:56 pm

Find In-Depth Articles by Alex Chitu.

From the post:

Sometimes you want to find more about a topic and you find a lot of superficial news articles and blog posts that keep rehashing the same information. Google shows a list of in-depth articles for some queries, but this feature seems to be restricted to the US and it’s only displayed for some queries.

See Alex’s post for the search string addition.

Works with the examples but be forewarned, it doesn’t work with every search.

I tried “deep+learning” and got the usual results.

If you are researching topics where Google has in-depth articles, this could be quite useful.

Just glancing at some of the other posts, this looks like a blog to follow if you do any amount of searching.

Enjoy!

I first saw this in a tweet by Aaron Kirschenfeld.

December 11, 2014

Wouldn’t it be fun to build your own Google?

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such as Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding the technological issues would be many, Martin contends that the payoff would be immense, but in ways we won’t know until it is available. I suspect Martin is right, but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies those problems, it doesn’t make them easier to solve.

On the whole I think the idea of a distributed crawl of the web is a great idea, but while that develops, we best hone our skills at extracting value from the partial crawls that already exist.

December 2, 2014

Nonsensical ‘Unbiased Search’ Proposal

Filed under: EU,Governance,Search Engines,Searching — Patrick Durusau @ 4:50 pm

Forget EU’s Toothless Vote To ‘Break Up’ Google; Be Worried About Nonsensical ‘Unbiased Search’ Proposal by Mike Masnick.

Mike uncovers (in plain sight) the real danger of the recent EU proposal to “break up” Google.

Reading the legislation (which I neglected to do), Mike writes:

But within the proposal, a few lines down, there was something that might be even more concerning, and more ridiculous, even if it generated fewer (actually, almost no) headlines. And it’s that, beyond “breaking up” search engines, the resolution also included this bit of nonsense, saying that search engines need to be “unbiased”:

Stresses that, when operating search engines for users, the search process and results should be unbiased in order to keep internet searches non-discriminatory, to ensure more competition and choice for users and consumers and to maintain the diversity of sources of information; notes, therefore, that indexation, evaluation, presentation and ranking by search engines must be unbiased and transparent; calls on the Commission to prevent any abuse in the marketing of interlinked services by search engine operators;

But what does that even mean? Search is inherently biased. That’s the point of search. You want the best results for what you’re searching for, and the job of the search engine is to rank results by what it thinks is the best. An “unbiased” search engine isn’t a search engine at all. It just returns stuff randomly.

See Mike’s post for additional analysis of this particular mummer’s farce.

Another example of why the Internet should be governed by a new structure, staffed by people with the technical knowledge to make sensible decisions. By “new structure” I mean one separate from and not subject to any existing government. Including the United States, where the head of the NSA thinks local water supplies are controlled over the Internet (FALSE).

I first saw this in a tweet by Joseph Esposito.

November 17, 2014

Apache Lucene™ 5.0.0 is coming!

Filed under: Lucene,Search Engines — Patrick Durusau @ 4:16 pm

Apache Lucene™ 5.0.0 is coming! by Michael McCandless.

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it’s done when it’s done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Michael has a great list and explanation of changes you will be seeing in Lucene 5.0.0. Pick your favorite(s) to follow and/or contribute to the next release.

October 25, 2014

Building Scalable Search from Scratch with ElasticSearch

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 5:46 pm

Building Scalable Search from Scratch with ElasticSearch by Ram Viswanadha.

From the post:

1 Introduction

Savvy is an online community for the world’s product enthusiasts. Our communities are the product trendsetters that the rest of the world follows. Across the site, our users are able to compare products, ask and answer product questions, share product reviews, and generally share their product interests with one another. Savvy1.com boasts a vibrant community that save products on the site at the rate of 1 product every second. We wanted to provide a search bar that can search across various entities in the system – users, products, coupons, collections, etc. – and return the results in a timely fashion.

2 Requirements

The search server should satisfy the following requirements:

  1. Full Text Search: The ability to not only return documents that contain the exact keywords, but also documents that contain words that are related or relevant to the keywords.
  2. Clustering: The ability to distribute data across multiple nodes for load balancing and efficient searching.
  3. Horizontal Scalability: The ability to increase the capacity of the cluster by adding more nodes.
  4. Read and Write Efficiency: Since our application is both read and write heavy, we need a system that allows for high write loads and efficient read times on heavy read loads.
  5. Fault Tolerant: The loss of any node in the cluster should not affect the stability of the cluster.
  6. REST API with JSON: The server should support a REST API using JSON for input and output.

At the time, we looked at Sphinx, Solr and ElasticSearch. The only system that satisfied all of the above requirements was ElasticSearch, and — to sweeten the deal — ElasticSearch provided a way to efficiently ingest and index data in our MongoDB database via the River API so we could get up and running quickly.

If you need an outline for building a basic ElasticSearch system, this is it!

It has the advantage of introducing you to a number of other web technologies that will be handy with ElasticSearch.
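To make the outline concrete, here is a minimal sketch of that kind of setup using the plain Elasticsearch REST API from Python; the index name, field names and localhost address are placeholders rather than anything from Ram’s post.

```python
import requests

ES = "http://localhost:9200"

# Create the index; shard and replica counts address the clustering,
# scalability and fault tolerance requirements (2, 3 and 5).
requests.put(f"{ES}/savvy", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}
})

# Index a product document (the site is write heavy, requirement 4).
requests.put(f"{ES}/savvy/product/1", json={
    "name": "Espresso Machine",
    "description": "Compact espresso maker with steam wand",
    "tags": ["coffee", "kitchen"],
})

# Full text search, returned as JSON over REST (requirements 1 and 6).
resp = requests.post(f"{ES}/savvy/_search", json={
    "query": {"multi_match": {"query": "espresso",
                              "fields": ["name", "description"]}}
})
print(resp.json()["hits"]["total"])
```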

Enjoy!

The Anatomy of a Large-Scale Hypertextual Web Search Engine (Ambiguity)

Filed under: Search Engines,WWW — Patrick Durusau @ 12:26 pm

If you search for “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page, will you get the “long” version or the “short” version?

The version found at: http://infolab.stanford.edu/pub/papers/google.pdf reports in its introduction:

(Note: There are two versions of this paper — a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

However, it doesn’t say whether it is the “longer full version” or the “shorter printed version.” Length, twenty (20) pages.

The version found at: http://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf claims the following citation: “Computer Networks and ISDN Systems 30 (1998) 107-117.” Length, eleven (11) pages. It “looks” like a journal printed article.

Ironic that the search engine fails to distinguish between these two versions of such an important paper.

Perhaps the search confusion is justified to some degree because Lawrence Page’s publication listing at http://research.google.com/pubs/LawrencePage.html reports:

[image omitted: Lawrence Page publication info]

But if you access the PDF, you get the twenty (20) page version, not the eleven page version published at: Computer Networks and ISDN Systems 30 (1998) 107-117.

BTW, if you want to automatically distinguish the files, the file sizes on the two versions referenced above are: 123603 (the twenty (20) page version) and 1492735 (the eleven (11) page version). (The published version has the publisher logo, etc. that boosts the file size.)
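If you wanted to automate that check, a rough sketch (assuming the URLs above still resolve and the size difference still holds) might look like:

```python
import requests

def classify(url):
    size = len(requests.get(url).content)
    # The published version carries the publisher's artwork, so it is the
    # much larger file (~1.4 MB vs. ~120 KB for the 20-page full version).
    return "11-page published version" if size > 1_000_000 else "20-page full version"

for url in (
    "http://infolab.stanford.edu/pub/papers/google.pdf",
    "http://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf",
):
    print(classify(url), url)
```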

If Google had a mechanism to accept explicit crowd input, that confusion and the typical confusion between slides and papers with the same name could be easily solved.

The first reader who finds either the paper or slides, types it as paper or slides. The characteristics of that file become the basis for distinguishing those files into paper or slides. When the next searcher is returned results including those files, they get a pointer to paper or slides?

If they don’t submit a change for paper or slides, that distinction becomes more certain.

I don’t know what the carrot would be for typing resources returned in search results, perhaps five (5) minutes of freedom from ads! 😉

Thoughts?

I first saw this in a tweet by onepaperperday.

September 12, 2014

Connected Histories: British History Sources, 1500-1900

Filed under: History,Search Engines,Searching — Patrick Durusau @ 4:24 pm

Connected Histories: British History Sources, 1500-1900

From the webpage:

Connected Histories brings together a range of digital resources related to early modern and nineteenth century Britain with a single federated search that allows sophisticated searching of names, places and dates, as well as the ability to save, connect and share resources within a personal workspace. We have produced this short video guide to introduce you to the key features.

Twenty-two remarkable resources can be searched by place, person, or keyword. Some of the sources require subscriptions but the vast majority do not. A summary of the resources would fail to do them justice so here is a list of the currently searchable resources:

[list of resources omitted]

As you probably assume, there is no binding point for any person, object, date or thing across all twenty-two resources with its associations to other persons, objects, dates or things.

As you explore Connected Histories, keep track of where you found information on a person, object, date or thing. Depending on the granularity of pointing, you might want to create a topic map to capture that information.

September 8, 2014

Demystifying The Google Knowledge Graph

Filed under: Entities,Google Knowledge Graph,Search Engines,Searching — Patrick Durusau @ 3:28 pm

Demystifying The Google Knowledge Graph by Barbara Starr.

[image omitted: knowledge graph]

Barbara covers:

  • Explicit vs. Implicit Entities (and how to determine which is which on your webpages)
  • How to improve your chances of being in “the Knowledge Graph” using Schema.org and JSON-LD.
  • Thinking about “things, not strings.”

Is there something special about “events?” I remember the early Semantic Web motivating examples being about setting up tennis matches between colleagues. The examples here are of sporting and music events.

If your users don’t know how to use TicketMaster, repeating that data on your site isn’t going to help them.

On the other hand, this is a good reminder to extract from Schema.org all the “types” that would be useful for my blog.
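For example, here is a minimal sketch of what Schema.org markup as JSON-LD could look like for a post on this blog; the values are placeholders and only the @context/@type structure comes from Schema.org.

```python
import json

blog_posting = {
    "@context": "http://schema.org",
    "@type": "BlogPosting",
    "headline": "Demystifying The Google Knowledge Graph",
    "author": {"@type": "Person", "name": "Patrick Durusau"},
    "datePublished": "2014-09-08",
    "about": {"@type": "Thing", "name": "Google Knowledge Graph"},
}

# Emit the <script> block you would embed in the page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(blog_posting, indent=2))
print("</script>")
```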

PS: A “string” doesn’t become a “thing” simply because it has a longer token. Having an agreed upon “longer token” from a vocabulary such as Schema.org does provide more precise identification than an unadorned “string.”

Having said that, the power of having several key/value pairs and a declaration of which ones must, may or must not match, should be readily obvious. Particularly when those keys and values may themselves be collections of key/value pairs.

August 5, 2014

A new proximity query for Lucene, using automatons

Filed under: Automata,Lucene,Search Engines — Patrick Durusau @ 6:34 pm

A new proximity query for Lucene, using automatons by Michael McCandless.

From the post:


As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop.

[image omitted: automata]

This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that’s done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

Michael walks through current proximity queries before diving into the new proximity query for Lucene 4.10.
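To get a feel for the idea without Lucene’s API, here is a toy sketch of matching a hand-built automaton against token positions, with an “any token” transition standing in for slop. It illustrates the concept only; it is not Lucene code.

```python
ANY = object()  # wildcard transition, playing the role of slop

def build_automaton():
    # Accepts "quick <any one token> fox", e.g. "quick brown fox".
    return {
        0: {"quick": 1},
        1: {ANY: 2},
        2: {"fox": 3},
        3: {},            # accepting state
    }

def matches(automaton, tokens, accepting={3}):
    # Try every start position; advance through states as the tokens permit.
    for start in range(len(tokens)):
        state = 0
        for tok in tokens[start:]:
            trans = automaton[state]
            if tok in trans:
                state = trans[tok]
            elif ANY in trans:
                state = trans[ANY]
            else:
                break
            if state in accepting:
                return True
    return False

print(matches(build_automaton(), "the quick brown fox jumps".split()))  # True
```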

As always, this is a real treat!

July 30, 2014

Multi-Term Synonyms [Bags of Properties?]

Filed under: Lucene,Search Engines,Synonymy — Patrick Durusau @ 12:34 pm

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.

From the post:

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length increment. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:

Posts like this one are a temptation to sign off Twitter and read the ticket feeds for Lucene/Solr instead. Seriously.

Ted proposes a workaround to the multi-term synonym problem using the auto phrasing tokenfilter. Equally important is his conclusion:

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can improve another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Or more precisely:

…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…

Ambiguity at the token level remains, even if for particular cases phrases can be treated as semantic entities.

Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties?” Where the lowliest token or a higher semantic unit can be indexed as a bag of properties.

Imagine indexing these properties* for a single token:

  • string: value
  • pubYear: value
  • author: value
  • journal: value
  • keywords: value

Would that suffice to distinguish a term in a medical journal from Vanity Fair?

Ambiguity is predicated upon a lack of information.

That should be suggestive of a potential cure.

*(I’m not suggesting that all of those properties or even most of them would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)
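To make the idea concrete, here is a toy sketch of indexing tokens as bags of properties and querying with must/may constraints; the property names, values and matching rule are assumptions for illustration only.

```python
docs = [
    {"string": "mercury", "journal": "The Lancet", "keywords": ["toxicology"]},
    {"string": "mercury", "journal": "Vanity Fair", "keywords": ["astrology"]},
]

def find(index, must, may=None):
    may = may or {}
    hits = []
    for bag in index:
        # "must" properties decide inclusion...
        if all(bag.get(k) == v for k, v in must.items()):
            # ..."may" properties only affect ranking.
            score = sum(1 for k, v in may.items() if bag.get(k) == v)
            hits.append((score, bag))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [bag for _, bag in hits]

# "mercury" as used in a medical journal, not the magazine usage:
print(find(docs, must={"string": "mercury", "journal": "The Lancet"}))
```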

I first saw this in a tweet by SolrLucene.

June 30, 2014

Fess

Filed under: Search Engines,Solr — Patrick Durusau @ 8:31 pm

Fess: Open Source Enterprise Search Server

From the homepage:

Fess is very powerful and easily deployable Enterprise Search Server. You can install and run Fess quickly on any platforms, which have Java runtime environment. Fess is provided under Apache license.

[image omitted]

Fess is Solr based search server, but knowledge/experience about Solr is NOT needed because of All-in-One Enterprise Search Server. Fess provides Administration GUI to configure the system on your browser. Fess also contains a crawler, which can crawl documents on Web/File System/DB and support many file formats, such as MS Office, pdf and zip.

Features

  • Very Easy Installation/Configuration
  • Apache License (OSS)
  • OS-independent (Runs on Java)
  • Crawl documents on Web/File System/DB/Windows Shared Folder
  • Support many document types, such as MS Office, PDF, Zip archive,…
  • Support a web page for BASIC/DIGEST/NTLM authentication
  • Contain Apache Solr as a search engine
  • Provide UI as a responsive web design
  • Provide a browser based administative page
  • Support a role authentication
  • Support XML/JSON/JSONP format
  • Provide a search/click log and statistics
  • Provide auto-complete(suggest)

Sounds interesting enough.

I don’t have a feel for the trade-offs between a traditional Solr/Tomcat install and what appears to be a Solr-out-of-the-box solution. At least not today.

I recently built a Solr/Tomcat install on a VM so this could be a good comparison to that process.

Any experience with Fess?

June 20, 2014

Poor Reasoning Travels Faster Than Bad News

Filed under: Search Engines — Patrick Durusau @ 1:32 pm

Google forced to e-forget a company worldwide by Lisa Vaas.

From the post:

Forcing Google to develop amnesia is turning out to be contagious.

Likely inspired by Europeans winning the right to be forgotten in Google search results last month, a Canadian court has ruled that Google has to remove search results for a Canadian company’s competitor, not just in Canada but around the world.

The Supreme Court of British Columbia ruled on 13 June that Google had two weeks to forget the websites of a handful of companies with “Datalink” in their names.

I didn’t know I was being prescient in Feathers, Gossip and the European Union Court of Justice (ECJ) when I said:

Even if Google, removes all of its references from a particular source, the information could be re-indexed in the future from new sources.

That is precisely the issue in the Canadian case. Google removes specific URLs only to have different URLs for the same company appear in their search results.

The tenor of the decision is best captured by:

The Court must adapt to the reality of e-commerce with its potential for abuse by those who would take the property of others and sell it through the borderless electronic web of the internet. I conclude that an interim injunction should be granted compelling Google to block the defendants’ websites from Google’s search results worldwide. That order is necessary to preserve the Court’s process and to ensure that the defendants cannot continue to flout the Court’s orders. [159]

What you won’t find in the decision is any mention of the plaintiffs tracking down the funds from e-commerce sites alleged to be selling the product in question. Odd isn’t it? The plaintiffs are concerned about enjoined sales but make no effort to recover funds from those sales?

Of course, allowing a Canadian court (or any court) to press-gang anyone at hand to help enforce its order is very attractive, to courts at least. It should not be so attractive to anyone concerned with robust e-commerce.

If enjoined sales were occurring, there may have been evidence of them, but the court fails to mention any; the plaintiffs had more than enough remedies to pursue those transactions.

Instead, with the aid of a local court, the plaintiffs are forcing Google to act as its unpaid worldwide e-commerce assassin.

More convenient for the local plaintiff but bad news for global e-commerce.

PS: I don’t suppose anyone will be registering new websites with the name “Datalink” in them just to highlight the absurdity of this decision.

June 16, 2014

You complete me

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 6:50 pm

You complete me by Alexander Reelsen.

From the post:

Effective search is not just about returning relevant results when a user types in a search phrase, it’s also about helping your user to choose the best search phrases. Elasticsearch already has did-you-mean functionality which can correct the user’s spelling after they have searched. Now, we are adding the completion suggester which can make suggestions while-you-type. Giving the user the right search phrase before they have issued their first search makes for happier users and reduced load on your servers.

In the context of search you can suggest search phrases. (Alexander’s post is a bit dated, so see the Elasticsearch documentation as well.)
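Here is a minimal sketch of the completion suggester against a local node, following the 1.x-era API the post describes; the index and field names are placeholders.

```python
import requests

ES = "http://localhost:9200"

# Map a field of type "completion" so the suggester can build its FST.
requests.put(f"{ES}/music", json={
    "mappings": {"song": {"properties": {
        "name": {"type": "string"},
        "suggest": {"type": "completion"},
    }}}
})

# Index a document with explicit suggestion inputs.
requests.put(f"{ES}/music/song/1", json={
    "name": "Nevermind",
    "suggest": {"input": ["Nevermind", "Nirvana Nevermind"]},
})

# Ask for suggestions while the user types "nev".
resp = requests.post(f"{ES}/music/_suggest", json={
    "song-suggest": {"text": "nev", "completion": {"field": "suggest"}}
})
print(resp.json())
```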

How much further can you go with suggestions? Search syntax?

June 8, 2014

Laboratory for Web Algorithmics

Filed under: Algorithms,Search Algorithms,Search Engines,Webcrawler — Patrick Durusau @ 2:53 pm

Laboratory for Web Algorithmics

From the homepage:

The Laboratory for Web Algorithmics (LAW) was established in 2002 at the Dipartimento di Scienze dell’Informazione (now merged in the Computer Science Department) of the Università degli studi di Milano.

The LAW is part of the NADINE FET EU project.

Research at LAW concerns all algorithmic aspects of the study of the web and of social networks. More in detail…

The details include:

  • High-performance web crawling: Including an open source web crawler
  • Compression of web graphs and social networks: compression of web crawling results
  • Analysis of web graphs and social networks: research and algorithms for exploration of web graphs

Deeply impressive project and one with several papers and resources that I will be covering in more detail in future posts.

I first saw this in a tweet by Network Fact.

The Clojure Style Guide

Filed under: Clojure,Programming,Search Engines — Patrick Durusau @ 1:36 pm

The Clojure Style Guide by Bozhidar Batsov.

From the webpage:

Role models are important.
— Officer Alex J. Murphy / RoboCop

This Clojure style guide recommends best practices so that real-world Clojure programmers can write code that can be maintained by other real-world Clojure programmers. A style guide that reflects real-world usage gets used, and a style guide that holds to an ideal that has been rejected by the people it is supposed to help risks not getting used at all — no matter how good it is.

The guide is separated into several sections of related rules. I’ve tried to add the rationale behind the rules (if it’s omitted, I’ve assumed that it’s pretty obvious).

I didn’t come up with all the rules out of nowhere; they are mostly based on my extensive career as a professional software engineer, feedback and suggestions from members of the Clojure community, and various highly regarded Clojure programming resources, such as “Clojure Programming” and “The Joy of Clojure“.

The guide is still a work in progress; some sections are missing, others are incomplete, some rules are lacking examples, some rules don’t have examples that illustrate them clearly enough. In due time these issues will be addressed — just keep them in mind for now.

Please note, that the Clojure developing community maintains a list of coding standards for libraries, too.

You can generate a PDF or an HTML copy of this guide using Transmuter.

Another example where Ungoogleable Symbols from Clojure may be of interest.

A good index to Clojure resources needs to overcome the limitations of Google‘s search engine as well as others.

I first saw this in a tweet by LE Minh Triet.

June 6, 2014

Ungoogleable Symbols from Clojure

Filed under: Clojure,Search Engines — Patrick Durusau @ 4:03 pm

The Weird and Wonderful Characters of Clojure by James Hughes.

From the post:

A reference collection of characters used in Clojure that are difficult to “google”. Descriptions sourced from various blogs, StackOverflow, Learning Clojure and the official Clojure docs – sources attributed where necessary. Type the symbols into the box below to search (or use CTRL-F). Sections not in any particular order but related items are grouped for ease. If I’m wrong or missing anything worthy of inclusion tweet me @kouphax or mail me at james@yobriefca.se.

Before reading further, do you agree/disagree that symbols are hard to search for?

Jot your reasons down.

Now try searching for each of the following strings:

#

#{

#”

Hmmm, the post is on the WWW and indexed by Google.

I can prove that: search Google for “The Weird and Wonderful Characters of Clojure”.

I can understand the result for “#.” There are a variety of subjects that are all represented by “#” so that result isn’t surprising. You would have to distinguish the different subjects represented by “#,” something search engines don’t do.

That is, search engines operate on surface strings only.

What is less understandable is the total failure on #{ and #”, with and without surrounding quotes.

If you are going to return results on “#,” it seems like you would return results on other arbitrary strings.

Can someone comment without violating their NDA with Google?

I first saw this in a tweet by Rob Stuttaford.

May 24, 2014

Elasticsearch 1.2.0 and 1.1.2 released

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:59 pm

Elasticsearch 1.2.0 and 1.1.2 released by Clinton Gormley.

From the post:

Today, we are happy to announce the release of Elasticsearch 1.2.0, based on Lucene 4.8.1, along with a bug fix release Elasticsearch 1.1.2.

You can download them and read the full change lists here:

Elasticsearch 1.2.0 is a bumper release, containing over 300 new features, enhancements, and bug fixes. You can see the full changes list in the Elasticsearch 1.2.0 release notes, but we will highlight some of the important ones below:

Highlights of the more important changes for Elasticsearch 1.2.0:

  • Java 7 required
  • dynamic scripting disabled by default
  • field data and filter caches
  • gateways removed
  • indexing and merging
  • aggregations
  • context suggester
  • improved deep scrolling
  • field value factor

See Clinton’s post or the release notes for more complete coverage. (Aggregation looks particularly interesting.)
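As a taste of another item on the list, improved deep scrolling, here is a minimal sketch of the 1.x-era scan/scroll pattern over the REST API; the index name and match-all query are placeholders.

```python
import requests

ES = "http://localhost:9200"

# Open a scroll context; scan skips scoring and sorting for bulk export.
first = requests.post(
    f"{ES}/logs/_search",
    params={"search_type": "scan", "scroll": "1m", "size": 100},
    json={"query": {"match_all": {}}},
).json()

scroll_id = first["_scroll_id"]
while True:
    page = requests.post(
        f"{ES}/_search/scroll", params={"scroll": "1m"}, data=scroll_id
    ).json()
    hits = page["hits"]["hits"]
    if not hits:                # empty page means the scroll is exhausted
        break
    scroll_id = page["_scroll_id"]
    for hit in hits:
        print(hit["_id"])       # process each document here
```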

May 21, 2014

Your own search engine…

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:46 pm

Your own search engine (based on Apache Solr open-source enterprise-search)

From the webpage:

Tools for easier searching with free software on your own server

  • search in many documents, images and files
    • full text search with powerful search operators
    • in many different formats (text, word, openoffice, PDF, sheets, csv, doc, images, jpg, video and many more)
    • get a overview by explorative search and comfortable and powerful navigation with faceted search (easy to use interactive filters)
  • analyze documents (preview, extracted text, wordlists and visualizations with wordclouds and trend charts)
  • structure your research, investigation, navigation, metadata or notes (semantic wiki for tagging documents, annotations and structured notes)
  • OCR: automatic text recognition for images and graphical content or scans inside PDF, i.e. for scanned or photographed documents

Do you think this would be a way to pull back the curtain on search a bit? To show people that even results like we see from Google require more than casual effort?

I ask because Jeni Tennison tweeted earlier today:

#TDC14 @emckean “search is the hammer that makes us think everything is a nail that can be searched for”

Is a common misunderstanding of search making “improved” finding methods a difficult sell?

Not that I have a lot of faith or interest in educating potential purchasers. Finding a way to use the misunderstanding seems like a better marketing strategy to me.

Suggestions?

May 17, 2014

Building a Recipe Search Site…

Filed under: ElasticSearch,Lucene,Search Engines,Solr — Patrick Durusau @ 4:32 pm

Building a Recipe Search Site with Angular and Elasticsearch by Adam Bard.

From the post:

Have you ever wanted to build a search feature into an application? In the old days, you might have found yourself wrangling with Solr, or building your own search service on top of Lucene — if you were lucky. But, since 2010, there’s been an easier way: Elasticsearch.

Elasticsearch is an open-source storage engine built on Lucene. It’s more than a search engine; it’s a true document store, albeit one emphasizing search performance over consistency or durability. This means that, for many applications, you can use Elasticsearch as your entire backend. Applications such as…

Think of this as a snapshot of the capabilities of most search solutions.

Which makes this a great baseline for answering the question: What does your app do that Elasticsearch + Angular cannot?

That’s a serious question.

Responses that don’t count include:

  1. My app is written in the Linear B programming language.
  2. My app uses a Post-Pre-NOSQL DB engine.
  3. My app will bring freedom and health to the WWW.
  4. (insert your reason)

You can say all those things if you like, but the convincing point for users is going to be exceeding their expectations about current solutions.

Do the best you can with Elasticsearch and Angular and use that as your basepoint for comparison.

May 14, 2014

Feathers, Gossip and the European Union Court of Justice (ECJ)

Filed under: EU,Privacy,Search Engines — Patrick Durusau @ 2:52 pm

It is a common comment that the United States Supreme Court has difficulty with technology issues. Not terribly surprising since digital technology evolves several orders of magnitude faster than legal codes and customs.

But even if judicial digital illiteracy isn’t surprising, judicial theological illiteracy should be.

I am referring, of course, to the recent opinion by the European Court of Justice that there is a right to be “forgotten” in the records of the search giant Google.

In the informal press release about its decision, the ECJ states:

Finally, in response to the question whether the directive enables the data subject to request that links to web pages be removed from such a list of results on the grounds that he wishes the information appearing on those pages relating to him personally to be ‘forgotten’ after a certain time, the Court holds that, if it is found, following a request by the data subject, that the inclusion of those links in the list is, at this point in time, incompatible with the directive, the links and information in the list of results must be erased. The Court observes in this regard that even initially lawful processing of accurate data may, in the course of time, become incompatible with the directive where, having regard to all the circumstances of the case, the data appear to be inadequate, irrelevant or no longer relevant, or excessive in relation to the purposes for which they were processed and in the light of the time that has elapsed. The Court adds that, when appraising such a request made by the data subject in order to oppose the processing carried out by the operator of a search engine, it should in particular be examined whether the data subject has a right that the information in question relating to him personally should, at this point in time, no longer be linked to his name by a list of results that is displayed following a search made on the basis of his name. If that is the case, the links to web pages containing that information must be removed from that list of results, unless there are particular reasons, such as the role played by the data subject in public life, justifying a preponderant interest of the public in having access to the information when such a search is made. (The press release version, The official judgement).

Which doesn’t sound unreasonable, particularly if you are a theological illiterate.

One contemporary retelling of a story about St. Philip Neri goes as follows:

The story is often told of the most unusual penance St. Philip Neri assigned to a woman for her sin of spreading gossip. The sixteenth-century saint instructed her to take a feather pillow to the top of the church bell tower, rip it open, and let the wind blow all the feathers away. This probably was not the kind of penance this woman, or any of us, would have been used to!

But the penance didn’t end there. Philip Neri gave her a second and more difficult task. He told her to come down from the bell tower and collect all the feathers that had been scattered throughout the town. The poor lady, of course, could not do it-and that was the point Philip Neri was trying to make in order to underscore the destructive nature of gossip. When we detract from others in our speech, our malicious words are scattered abroad and cannot be gathered back. They continue to dishonor and divide many days, months, and years after we speak them as they linger in people’s minds and pass from one tale-bearer to the next. (From The Feathers of Gossip: How our Words can Build Up or Tear Down by Edward P. Sri)*

The problem with “forgetting” is the same one as the gossip penitent. Information is copied and replicated by sites for their own purposes. Nothing Google can do will impact those copies. Even if Google, removes all of its references from a particular source, the information could be re-indexed in the future from new sources.

This decision is a “feel good” one for privacy advocates. But the ECJ should have recognized the gossip folktale parallel and decided that effective relief is impossible. Ordering an impossible solution diminishes the stature of the court and the seriousness with which its decisions are regarded.

Not to mention the burden this will place on Google and other search result providers, with no guarantee that the efforts will be successful.

Sometimes the best solution is to simply do nothing at all.

* There isn’t a canonical form for this folktale, which has been told and re-told by many cultures.

May 7, 2014

New in Solr 4.8: Document Expiration

Filed under: Search Engines,Solr,Topic Maps — Patrick Durusau @ 7:07 pm

New in Solr 4.8: Document Expiration

From the post:

Lucene & Solr 4.8 were released last week and you can download Solr 4.8 from the Apache mirror network. Today I’d like to introduce you to a small but powerful feature I worked on for 4.8: Document Expiration.

The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:

  • Periodically delete documents from the index based on an expiration field
  • Computing expiration field values for documents from a “time to live” (TTL)

Assuming you are using a topic maps solution that presents topics as merged, this could be an interesting feature to emulate.

After all, if you are listing ticket sale outlets for concerts in a music topic map, good maintenance suggests those occurrences should go away after the concert has occurred.

Or if you need the legacy information for some purpose, at least not have it presented as currently available. Perhaps a change of its occurrence type?

Would you actually delete topics or add an “internal” occurrence so they would not participate in future presentations of merged topics?
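Either way, the ingest side is straightforward. A minimal sketch, assuming an update chain where DocExpirationUpdateProcessorFactory reads a “_ttl_” field; the collection, URL and field names here are placeholders that depend on your solrconfig.xml.

```python
import requests

SOLR = "http://localhost:8983/solr/concerts"  # placeholder collection

# Index the ticket-outlet occurrence with a 30-day time to live; once the
# expiration sweep runs, the document can drop out of merged presentations.
requests.post(
    f"{SOLR}/update",
    params={"commit": "true"},
    json=[{
        "id": "outlet-42",
        "concert": "Example Hall, 2014-06-07",
        "ticket_url": "http://tickets.example.com/42",
        "_ttl_": "+30DAYS",
    }],
)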

Go ahead, compete with Google Search

Filed under: Marketing,Search Engines — Patrick Durusau @ 3:53 pm

Go ahead, compete with Google Search: Why it’s not that crazy to go build a search engine. by Alexis Smirnov.

Alexis doesn’t sound very promising at the start:

Google’s mission is to organize the world’s information and make it universally accessible and useful. Google Search has become a shining example of progress towards accomplishing this mission.

Google Search is the best general-purpose search engine and it gets better all the time. Over the years it killed off most of it’s competitors.

But after a very interesting and useful review of specialized, non-general-purpose search engines, he concludes:

To sum up, the best way to compete against Google is not to build another general-purpose search engine. It is to build another vertical semantic search engine. The better engines understand the specific domain, the better chance they have to be better than Google.

See the post for the details and get thee to the software forge to build a specialized search engine!

PS: We all realize the artwork that accompanies the post isn’t an accurate depiction of Google. Too good looking. 😉

PPS: I am particularly aware of the need for date/version ordered searching for software issues. Just today I was searching for an error that turned out to be a bad symbolic link, but the results from one search engine included material from 2 or 3 years ago. Not all that helpful when you are running the latest release.

May 4, 2014

“Credibility” As “Google Killer”?

Filed under: Facebook,Relevance,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 6:26 pm

Nancy Baym tweets: “Nice article on flaws of ”it’s not our fault, it’s the algorithm” logic from Facebook with quotes from @TarletonG” pointing to: Facebook draws fire on ‘related articles’ push.

From the post:

A surprise awaited Facebook users who recently clicked on a link to read a story about Michelle Obama’s encounter with a 10-year-old girl whose father was jobless.

Facebook responded to the click by offering what it called “related articles.” These included one that alleged a Secret Service officer had found the president and his wife having “S*X in Oval Office,” and another that said “Barack has lost all control of Michelle” and was considering divorce.

A Facebook spokeswoman did not try to defend the content, much of which was clearly false, but instead said there was a simple explanation for why such stories are pushed on readers. In a word: algorithms.

The stories, in other words, apparently are selected by Facebook based on mathematical calculations that rely on word association and the popularity of an article. No effort is made to vet or verify the content.

Facebook’s explanation, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. (emphasis added)

Just imagine the hue and outcry had that last line read:

Imaginary Quote Google’s explanation of search results, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. End Imaginary Quote

Is demanding “credibility” of search results the long sought after “Google Killer?”

“Credibility” is closely related to the “search” problem but I think it should be treated separately from search.

In part because answering a “credibility” question can require multiple additional searches: on the author of the search result content, for reviews and comments on that content, and of other sources of data on the same subject, followed by a collation of that additional material to make a credibility judgement. The procedure isn’t always that elaborate but the main point is that it requires additional searching and evaluation of content to even begin to answer a credibility question.

Not to mention why the information is being sought has a bearing on credibility. If I want to find examples of nutty things said about President Obama to cite, then finding the cases mentioned above is not only relevant (the search question) but also “credible” in the sense that Facebook did not make them up. They are published nutty statements about the current President.

What if a user wanted to search for “coffee and bagels?” The top hit on one popular search engine today is: Coffee Meets Bagel: Free Online Dating Sites, along with numerous other links to information on the first link. Was this relevant to my search? No, but search results aren’t always predictable. They are relevant to someone’s search using “coffee and bagels.”

It is the responsibility of every reader to decide for themselves what is relevant, credible, useful, etc. in terms of content, whether it is hard copy or digital.

Any other solution takes us to Plato’s Republic, which was great to read about, but I would not want to live there.

May 2, 2014

Apache Solr 4.8 Documentation

Filed under: Search Engines,Solr — Patrick Durusau @ 7:22 pm

Apache Solr 4.8 Reference Guide (pdf)

Apache Solr 4.8.0 Documentation

From the documentation page:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

This is the official documentation for Apache Solr 4.8.0.

I haven’t had good experiences with either the “official” Solr documentation or commercial publications on the same.

Not that any of it in particular was wrong so much as it was incomplete. Not that any of it was short. 😉

Perhaps it was more of an organizational problem than anything else.

I will be using the documentation on a regular basis for a while so I will start contributing suggestions as issues arise.

Curious to know if your experience with the Solr documentation has been the same? Different?

April 11, 2014

Google Top 10 Search Tips

Filed under: Search Engines,Searching — Patrick Durusau @ 6:47 pm

Google Top 10 Search Tips by Karen Blakeman.

From the post:

These are the top 10 tips from the participants of a recent workshop on Google, organised by UKeiG and held on 9th April 2014. The edited slides from the day can be found on authorSTREAM at http://www.authorstream.com/Presentation/karenblakeman-2121264-making-google-behave-techniques-better-results/ and on Slideshare at http://www.slideshare.net/KarenBlakeman/making-google-behave-techniques-for-better-results

Ten search tips from the trenches. Makes a very nice cheat sheet.

April 9, 2014

Revealing the Uncommonly Common…

Filed under: Algorithms,ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:34 pm

Revealing the Uncommonly Common with Elasticsearch by Mark Harwood.

From the summary:

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly tagged movies and the UK’s most unexpected hotspot for weapon possession.
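The “uncommonly common” part is the significant_terms aggregation. Here is a minimal sketch of that kind of query over the REST API; the index, field names and query are placeholders rather than Mark’s actual data.

```python
import requests

resp = requests.post("http://localhost:9200/reports/_search", json={
    "query": {"match": {"force": "Example Constabulary"}},
    "aggs": {
        "unusual_offences": {
            # terms that are uncommonly common in this subset vs. the background
            "significant_terms": {"field": "offence"}
        }
    },
    "size": 0,
})
for bucket in resp.json()["aggregations"]["unusual_offences"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```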

Makes me curious about the market for a “Mr./Ms. Normal” service?

A service that enables you to enter your real viewing/buying/entertainment preferences and, for a fee, generates a paper trail that hides your real habits in digital dust.

If you order porn from NetFlix then the “Mr./Ms. Normal” service will order enough PBS and NatGeo material to even out your renting record.

Depending on how extreme your buying habits happen to be, you may need a “Mr./Ms. Abnormal” service that shields you from any paper trail at all.

As data surveillance grows, having a pre-defined Mr./Ms. Normal/Abnormal account may become a popular high school/college graduation or even a wedding present.

The usefulness of data surveillance depends on the cooperation of its victims. Have you ever considered not cooperating? But appearing to?

March 31, 2014

150 Million Topics

Filed under: Bing,Merging,Search Engines,Topic Maps — Patrick Durusau @ 9:02 pm

150 Million More Reasons to Love Bing Everyday by Richard Qian.

From the post:

At Bing, we understand that search is more than simply finding information and browsing a collection of blue links pointing to pages around the web. We’ve talked about doing instead of searching and how Bing continues to expand its approach to understand the actual world around us.

Today, you’ll see this come to life on Bing.com in a feature called Snapshot. Snapshot brings together information that you need at a glance, with rich connections to deeper information on the people, places, and things you care about made possible by our deep understanding of the real world. To accomplish this, Bing now tracks billions of entities and perhaps more importantly, the billions of relationships between them, all to get you the right data instantly while you search.

New Entities: Introducing Doctors, Lawyers, Dentists and Real Estate Properties
….

In case you are interested, the “Snapshot” is what ISO/IEC 13250 (Dec., 1999) defined as a topic.

topic: An aggregate of topic characteristics, including zero or more names, occurrences, and roles played in association with other topics, whose organizing principle is a single subject.

Unlike the topic maps effort, Bing conceals all the ugliness that underlies merging of information and delivers to users an immediately useful and consumable information product.

But also unlike the topic maps effort, Bing is about as useful as a wedgie when you are looking for information internal to your enterprise.

Why?

Mostly because the subjects internal to your organization don’t get mapped by Bing.

Can you guess who is going to have to do that work? Got a mirror handy?

Use Bing’s Snapshots to see what collated information can look like. Decide if that’s a look you want for all or some of your information.

Hard to turn down free advertising/marketing from MS. It happens so rarely.

Thanks MS!

March 29, 2014

Installing Apache Solr 4.7 multicore…

Filed under: Search Engines,Solr — Patrick Durusau @ 6:28 pm

Installing Apache Solr 4.7 multicore on Ubuntu 12.04 and Tomcat7

From the post:

I will show you how to install the ApacheSolr search engine under Tomcat7 servlet container on Ubuntu 12.04.4 LTS (Precise Pangolin) to be used later with Drupal 7. In this writeup I’m gonna discuss only the installation and setup of the ApacheSolr server. Specific Drupal configuration and/or Drupal side configuration to be discussed in future writeup.

Nothing you don’t already know but a nice checklist for the installation.

I’m glad I found it because I am writing a VM script to auto-install Solr as part of a VM distribution.

Manually I do ok but am likely to forget something the script needs explicitly.

March 26, 2014

Elasticsearch 1.1.0,…

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 7:00 pm

Elasticsearch 1.1.0, 1.0.2 and 0.90.13 released by Clinton Gormley.

From the post:

Today we are happy to announce the release of Elasticsearch 1.1.0, based on Lucene 4.7, along with bug fix releases Elasticsearch 1.0.2 and Elasticsearch 0.90.13:

You can download them and read the full changes list here:

New features in 1.1.0

Elasticsearch 1.1.0 is packed with new features: better multi-field search, the search templates and the ability to create aliases when creating an index manually or with a template. In particular, the new aggregations framework has enabled us to support more advanced analytics: the cardinality agg for counting unique values, the significant_terms agg for finding uncommonly common terms, and the percentiles agg for understanding data distribution.

We will be blogging about all of these new features in more detail, but for now we’ll give you a taste of what each feature adds:

….

Well, there goes the rest of the week! 😉

