### Solr 4, the NoSQL Search Server [Webinar]

Friday, May 17th, 2013

Solr 4, the NoSQL Search Server by Yonik Seeley

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
Yonik Seeley – Creator of Apache Solr and the Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.
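
To make the feature list a little more concrete, here is a minimal sketch of what an atomic update with optimistic concurrency might look like on the wire, assuming Solr's JSON update format (field modifiers such as `set` and `inc`, plus the `_version_` field) and the real-time get handler at `/get`. The core name `books`, the field names, and the helper functions are my own inventions for illustration, and nothing here is actually sent to a server.

```python
import json
from urllib.parse import urlencode

# Assumed local Solr core; the name "books" is hypothetical.
SOLR = "http://localhost:8983/solr/books"


def atomic_update(doc_id, set_fields=None, inc_fields=None, version=None):
    """Build the JSON body for an atomic update of a single document."""
    doc = {"id": doc_id}
    for field, value in (set_fields or {}).items():
        doc[field] = {"set": value}      # replace the field's value
    for field, amount in (inc_fields or {}).items():
        doc[field] = {"inc": amount}     # increment a numeric field
    if version is not None:
        # Optimistic concurrency: send back the version we last read.
        doc["_version_"] = version
    return json.dumps([doc])


def realtime_get_url(doc_id):
    """URL for fetching the latest version of a document by id."""
    return f"{SOLR}/get?{urlencode({'id': doc_id})}"


body = atomic_update("book1", set_fields={"price": 9.99},
                     inc_fields={"copies": 1}, version=1234567890)
print(body)
print(realtime_get_url("book1"))
```

The idea is that the update should be rejected if someone else changed the document after we read version 1234567890, at which point the client re-reads with real-time get and tries again.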

### Automated Archival and Visual Analysis of Tweets…

Thursday, May 16th, 2013

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the Twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t’s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen promises in his post that the script updates automatically and you may find “unsavory” tweets.

I didn’t, but that may be a matter of happenstance or sensitivity.

### Keyword Search, Plus a Little Magic

Wednesday, May 15th, 2013

Keyword Search, Plus a Little Magic by Geoffrey Pullum.

I promised last week that I would discuss three developments that turned almost-useless language-connected technological capabilities into something seriously useful. The one I want to introduce first was introduced by Google toward the end of the 1990s, and it changed our whole lives, largely eliminating the need for having full sentences parsed and translated into database query language.

The hunch that the founders of Google bet on was that simple keyword search could be made vastly more useful by taking the entire set of pages containing all of the list of search words and not just returning it as the result but rather ranking its members by influentiality and showing the most influential first. What a page contains is not the only relevant thing about it: As with any academic publication, who values it and refers to it is also important. And that is (at least to some extent) revealed in the link structure of the Web.
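
A toy version of that idea, as I read it: first take every page containing all the search words, then order those pages by a score computed from the link structure. The ranking below is the textbook power-iteration form of a PageRank-style score, not Google's actual algorithm, and the page ids, texts, links and function names are all made up for the example.

```python
def matching_pages(pages, query):
    """Return ids of pages whose text contains every query word."""
    words = query.lower().split()
    return [pid for pid, text in pages.items()
            if all(w in text.lower().split() for w in words)]


def link_rank(links, damping=0.85, iterations=50):
    """Power-iteration rank over a dict {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for page in nodes:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for t in nodes:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


def search(pages, links, query):
    """Keyword match first, then order matches by link-based rank."""
    rank = link_rank(links)
    return sorted(matching_pages(pages, query),
                  key=lambda p: rank.get(p, 0.0), reverse=True)


pages = {
    "a": "solr search server",
    "b": "search engine ranking",
    "c": "cooking recipes",
    "d": "search tips",
}
links = {"a": ["b"], "b": [], "c": ["b"], "d": ["a", "b"]}
print(search(pages, links, "search"))
```

Page "b" comes first because three other pages point to it, even though "a" and "d" match the keyword just as well.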

In his first post, which wasn’t sympathetic to natural language processing, Geoffrey baited his critics into fits of frenzied refutation.

Fits of refutation that failed to note Geoffrey hadn’t completed his posts on natural language processing.

Take the keyword search posting for instance.

I won’t spoil the surprise for you but the fourth fact that Geoffrey says Google relies upon could have serious legs for topic map authoring and interface design.

And not a little insight into what we call natural language processing.

I suggest we savor each one as it appears and after reflection on the whole, sally forth onto the field of verbal combat.

### Seventh ACM International Conference on Web Search and Data Mining

Monday, May 13th, 2013

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

### Enigma

Friday, May 10th, 2013

Enigma

I suppose it had to happen: with all the noise about public data sets, someone was bound to create a startup to search them.

Not a lot of detail at the site but you can sign up for a free trial.

Features:

100,000+ Public Data Sources: Access everything from import bills of lading, to aircraft ownership, lobbying activity, real estate assessments, spectrum licenses, financial filings, liens, government spending contracts and much, much more.

Augment Your Data: Get a more complete picture of investments, customers, partners, and suppliers. Discover unseen correlations between events, geographies and transactions.

API Access: Get direct access to the data sets, relational engine and NLP technologies that power Enigma.

Request Custom Data: Can’t find a data set anywhere else? Need to synthesize data from disparate sources? We are here to help.

Discover While You Work: Never miss a critical piece of information. Enigma uncovers entities in context, adding intelligence and insight to your daily workflow.

Powerful Context Filters: Our vast collection of public data sits atop a proprietary data ontology. Filter results by topics, tags and source to quickly refine and scope your query.

Focus on the Data: Immerse yourself in the details. Data is presented in its raw form, full screen and without distraction.

Curated Metadata: Source data is often unorganized and poorly documented. Our domain experts focus on sanitizing, organizing and annotating the data.

Easy Filtering: Rapidly prototype hypotheses by refining and shaping data sets in context. Filter tools allow the sorting, refining, and mathematical manipulation of data sets.

The “proprietary data ontology” jumps out at me as an obvious question. Do users get to know what the ontology is?

Not to mention “our domain experts focus on sanitizing,….” That works in some cases; take legal research, for example. I am not sure that “your” experts work as well as “my” experts in less focused areas.

Looking forward to learning more about Enigma!

### Moloch

Friday, May 10th, 2013

Moloch

Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly. Simple security is implemented by using HTTPS and HTTP digest password support or by using Apache in front. Moloch is not meant to replace IDS engines but instead work alongside them to store and index all the network traffic in standard PCAP format, providing fast access. Moloch is built to be deployed across many systems and can scale to handle multiple gigabits/sec of traffic.

Where do you think you are most likely to find dirty laundry?

In data you have been given permission to see?

Or, in data that others don’t want you to see?

Time’s up!

I first saw this in Nat Torkington’s Four short links: 8 May 2013.

### How Impoverished is the “current world of search?”

Wednesday, May 8th, 2013

Internet Content Is Looking for You

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the search text for which a user is seeking. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches is context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person who has subject matter knowledge beyond that of the average reader and so is able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

### Is Search a Thing of the Past

Friday, May 3rd, 2013

Is Search a Thing of the Past by April Holmes.

April covers a survey of 2277 private technology firms that were acquired in 2012.

See her post for the details but the bottom line was:

None of them were search companies.

I can’t remember anyone ever saying they had a “great” search experience.

Can you?

If not, what would you want to replace present search interfaces? (Leaving technical feasibility aside for the moment.)

### FindZebra

Thursday, May 2nd, 2013

FindZebra

FindZebra is a specialised search engine supporting medical professionals in diagnosing difficult patient cases. Rare diseases are especially difficult to diagnose and this online medical search engine comes in support of medical personnel looking for diagnostic hypotheses. With a simple and consistent interface across all devices, it can be easily used as an aid tool at the time and place where medical decisions are made. The retrieved information is collected from reputable sources across the internet storing public medical articles on rare and genetic diseases.

A search engine with: WARNING! This is a research project to be used only by medical professionals.

To avoid overwhelming researchers with search result “noise,” FindZebra deliberately restricts the content it indexes.

It says something about the crudeness of current search algorithms that altering the inputs is the easiest way to improve outcomes for particular types of searches.

That seems to be an argument in favor of smaller-than-enterprise search engines, which could roll up into broader search applications.

Of course, with a topic map you could retain the division between departments even as you roll up the content into broader search applications.

### Designing Search: Displaying Results

Saturday, April 27th, 2013

Designing Search: Displaying Results by Tony Russell-Rose.

Search is a conversation: a dialogue between user and system that can be every bit as rich as human conversation. Like human dialogue, it is bidirectional: on one side is the user with their information need, which they articulate as some form of query.

On the other is the system and its response, which it expresses as a set of search results. Together, these two elements lie at the heart of the search experience, defining and shaping much of the information seeking dialogue. In this piece, we examine the most universal of elements within that response: the search result.

Basic Principles

Search results play a vital role in the search experience, communicating the richness and diversity of the overall result set, while at the same time conveying the detail of each individual item. This dual purpose creates the primary tension in the design: results that are too detailed risk wasting valuable screen space while those that are too succinct risk omitting vital information.

Suppose you’re looking for a new job, and you browse to the 40 or so open positions listed on UsabilityNews. The results are displayed in concise groups of ten, occupying minimal screen space. But can you tell which ones might be worth pursuing?

As always a great post by Tony but a little over the top with:

“…a dialogue between user and system that can be every bit as rich as human conversation.”

Not in my experience but that’s not everyone’s experience.

Has anyone tested the thesis that dialogue between a user and search engine is as rich as between user and reference librarian?

### Developing a Solr Plugin

Saturday, April 27th, 2013

Developing a Solr Plugin by Andrew Janowczyk.

For our flagship product, Searchbox.com, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.

There are mainly 4 types of custom plugins which can be created. We’ll discuss their differences here:

Sometimes Andrew says three (3) types of plugins and sometimes he says four (4).

I tried to settle the question by looking at the Solr Wiki on plugins.

Depends on how you want to count separate plugins.

But, Andrew’s advice about learning to write plugins is sound. It will put your results above those of others.

### Ack 2.0 enhances the “grep for source code”

Tuesday, April 23rd, 2013

Ack 2.0 enhances the “grep for source code”

The developers of ack have released version 2.0 of their grep-like tool optimised for searching source code. Described as “designed for programmers”, ack has been available since 2005 and is based on Perl’s regular expressions engine. It minimises false positives by ignoring version control directories by default and has flexible highlighting for matches. The newly released ack 2.0 introduces a more flexible identification system, better support for ackrc configuration files and the ability to read the list of files to be searched from stdin.

Its developers say that ack is designed to perform in a similar fashion to GNU grep but to improve on it when searching source code repositories. The program’s web site at beyondgrep.com lists a number of reasons why programmers might want to use ack instead of grep when searching through source code, the least of which being that the ack command is quicker to type than grep. But ack brings a lot more to the table than that as it is specifically designed to deal with source code and understand a large number of programming languages and tools such as build systems and version control software.
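
To show why skipping version control directories cuts down on false positives, here is a rough sketch of the behavior in Python. It is not how ack itself is implemented (ack is written in Perl), and the directory list, function name and parameters are my own assumptions for illustration.

```python
import os
import re

# Assumed list of directories to skip; not ack's actual defaults.
VCS_DIRS = {".git", ".hg", ".svn", "CVS"}


def search_tree(root, pattern, extensions=None):
    """Yield (path, line_number, line) for regex matches under root,
    never descending into version control directories."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not enter these directories.
        dirnames[:] = [d for d in dirnames if d not in VCS_DIRS]
        for name in sorted(filenames):
            if extensions and os.path.splitext(name)[1] not in extensions:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
```

Something like `list(search_tree(".", r"def \w+", extensions={".py"}))` would then list function definitions in Python files without wading through copies buried under `.git`.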

Is there any ongoing discussion of semantic searching for source code?

### Enabling action: Digging deeper into strategies for learning

Sunday, April 21st, 2013

Enabling action: Digging deeper into strategies for learning by Thom Haller. (Haller, T. (2013), Enabling action: Digging deeper into strategies for learning. Bul. Am. Soc. Info. Sci. Tech., 39: 42–43. doi: 10.1002/bult.2013.1720390413)

A central goal for information architects is to understand how people use information, make choices as they navigate a website and accomplish their objectives. If the goal is learning, we often assume it relates to an end point, a question to answer, a problem to which one applies new understanding. Benjamin Bloom’s 1956 taxonomy of learning breaks down the cognitive process, starting from understanding needs and progressing to action and final evaluation. Carol Kuhlthau’s 1991 outline of the information search process similarly starts with awareness of a need, progresses through exploring options, refining requirements and collecting solutions, and ends with decision making and action. Recognizing the stages of information browsing, learning and action can help information architects build sites that better meet searchers’ needs.

Thom starts with Bloom, cruises by Kuhlthau and ends up with Jared Pomranky restating Kuhlthau in: Seeking Knowledge: Denver, Web Design, And The Stages of Learning:

According to Kuhlthau, the six stages of learning are:

• Initiation — the person becomes aware that they need information. Generally, it’s assumed that visitors to your website have this awareness already, but there are circumstances in which you can generate this kind of awareness as well.
• Exploration — the person sees the options that are available to choose between. Quite often, especially online, ‘analysis paralysis’ can set in and make a learner quit at this stage because they can’t decide which of the options are worth further pursuit.
• Formulation — the person sees that they’re going to have to create further requirements before they’re able to make a final selection, and they make decisions to narrow the field. Confidence returns.
• Collection — the person has clearly articulated their precise needs and is able to evaluate potential solutions. They gather all available solutions and begin to weigh them based on relevant criteria.
• Action — the person makes their final decision and acts on it based on their understanding.

Many web designers assume that their surfers are at the Collection stage, and craft their entire webpage toward moving their reader from Collection to Action — but statistically, most people are going to be at Exploration or Formulation when they arrive at your site.

Does that mean that you should build a website that encourages people to go read other options and learn more, hoping they’ll return to your site for their Action? Not at all — but it does mean that by understanding what people are looking for at each stage of their learning process, we can design websites that guide them through the whole thing. This, by no coincidence whatsoever, also results in websites and web content that is useful, user-friendly, and entirely Google-appropriate.

We all use models of online behavior, learning if you like, but I would caution against using models disconnected from your users.

Particularly models disconnected from your users and re-interpreted by you as reflecting your users.

A better course would be to study the behavior of your users and to model your content on their behavior.

Otherwise you will be the seekers who: “… came looking for [your users], only to find Zarathustra.” Thus Spake Zarathustra

### Google search: three bugs to fix with better data science

Sunday, April 21st, 2013

Google search: three bugs to fix with better data science by Vincent Granville.

Vincent outlines three issues with Google search results:

1. Outdated search results
2. Wrongly attributed articles
3. Favoring irrelevant pages

See Vincent’s post for advice on how Google can address these issues. (Might help with a Google interview to tell them how to fix such long standing problems.)

More practically, how does your TM application rate on the outdated search results?

Do you just dump content on the user to sort out (the Google dump model (GDM)) or are your results a bit more user friendly?

### 2ND International Workshop on Mining Scientific Publications

Monday, April 15th, 2013

2ND International Workshop on Mining Scientific Publications

May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 26, 2013 – Workshop

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines to analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.

This workshop aims to bring together people from different backgrounds who:

(a) are interested in analysing and mining databases of scientific publications,

(b) develop systems, infrastructures or datasets that enable such analysis and mining,

(c) design novel technologies that improve the way research is being accomplished or

(d) support the openness and free availability of publications and research data.

2. TOPICS

The topics of the workshop will be organised around the following three themes:

1. Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.

Of particular interest for topic mappers:

Topics of interest relevant to theme 2 include, but are not limited to:

• Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
• Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
• New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
• Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
• Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.

The other themes could be viewed through a topic map lens but semantic enrichment seems like a natural.

### Improving Twitter search with real-time human computation ["semantics supplied"]

Tuesday, April 9th, 2013

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
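
The flow described above (spot a spike, ask a human, store the answer for later searches) can be sketched in a few lines. This is only an illustration of the idea: Twitter's system uses a Storm topology, a Thrift API and Mechanical Turk, none of which appear here, and the thresholds, class name and `ask_judge` callback are assumptions of mine.

```python
from collections import defaultdict, deque


class SpikeDetector:
    """Flag a query when its latest count far exceeds its recent average."""

    def __init__(self, history=24, min_count=50, ratio=5.0):
        self.min_count = min_count   # ignore queries that are still rare
        self.ratio = ratio           # how far above average counts as a spike
        self.counts = defaultdict(lambda: deque(maxlen=history))

    def observe(self, query, count):
        """Record one time window's count; return True if it is a spike."""
        past = self.counts[query]
        baseline = sum(past) / len(past) if past else 0.0
        past.append(count)
        return count >= self.min_count and count > self.ratio * max(baseline, 1.0)


def handle(detector, labels, query, count, ask_judge):
    """Send a newly spiking query to a human judge and remember the answer."""
    if detector.observe(query, count) and query not in labels:
        labels[query] = ask_judge(query)   # stand-in for the human evaluation step
    return labels.get(query)


detector = SpikeDetector()
labels = {}
for _ in range(10):
    handle(detector, labels, "big bird", 0, lambda q: "politics")
print(handle(detector, labels, "big bird", 500, lambda q: "politics"))
```

Once a label is stored, later searches for the same query can use it without asking a judge again, which is the point of the final step in the quoted overview.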

Let’s now explore the first two sections above in more detail.

….

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplied semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between ask an expert (you get his/her semantics) and ask all of us (too many answers to be useful).

More to follow.

### ElasticSearch: Text analysis for content enrichment

Saturday, March 30th, 2013

ElasticSearch: Text analysis for content enrichment by Jaibeer Malik.

Taking an example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by any search solution plays a very big role in it. As a search user, I would prefer some typical search behavior to apply to my query automatically:

• should look for synonyms matching my query text
• should match singular and plural words or words sounding similar to the entered query text
• should not allow searching on protected words
• should allow search for words mixed with numeric or special characters
• should not allow search on html tags
• should allow search text based on proximity of the letters and number of matching letters

Enriching the content here would be to add the above search capabilities to your content while indexing and searching for the content.
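
For a sense of how a few of those wishes translate into configuration, here is a hedged sketch of index analysis settings, written as a Python dict that would be serialized to JSON when creating an index. It uses analysis components I believe ElasticSearch ships with (`html_strip`, `standard`, `lowercase`, `synonym`, `keyword_marker`, `porter_stem`), but the analyzer and filter names, the synonym pairs, and my reading of “protected words” as words shielded from stemming are all assumptions, not taken from Jaibeer's post.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical synonym list for an eCommerce catalog.
                "product_synonyms": {
                    "type": "synonym",
                    "synonyms": ["laptop, notebook", "tv, television"],
                },
                # Words the stemmer should leave untouched (my assumption
                # about what "protected words" means).
                "protected_words": {
                    "type": "keyword_marker",
                    "keywords": ["skies"],
                },
            },
            "analyzer": {
                "product_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],   # drop HTML tags before tokenizing
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "product_synonyms",
                        "protected_words",
                        "porter_stem",               # fold singular and plural forms
                    ],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The order of the filters matters: the protected-word marker has to come before the stemmer for it to have any effect.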

I thought the “…look for synonyms matching my query text…” might get your attention.

Not quite a topic map because there isn’t any curation of the search results, saving the next searcher time and effort.

But in order to create and maintain a topic map, you are going to need expansion of your queries by synonyms.

You will take the results of those expanded queries and fashion them into a topic map.

Think of it this way:

Machines can rapidly harvest, even sort content at your direction.

What they can’t do is curate the results of their harvesting.

That requires a secret ingredient.

That would be you.

I first saw this at DZone.

### HCIR [Human-Computer Information Retrieval] site gets publication page

Saturday, March 30th, 2013

HCIR site gets publication page by Gene Golovchinsky.

From the post:

Over the past six years of the HCIR series of meetings, we’ve accumulated a number of publications. We’ve had a series of reports about the meetings, papers published in the ACM Digital Library, and an up-coming Special Issue of IP&M. In the run-up to this year’s event (stay tuned!), I decided it might be useful to consolidate these publications in one place. Hence, we now have the HCIR Publications page.

Human-Computer Information Retrieval (HCIR) if the lingo is unfamiliar.

Will ease access to a great set of papers, at least in one respect.

One small improvement:

Do not rely upon the ACM Digital Library as the sole repository for these papers.

Access isn’t an issue for me but I suspect it may be for a number of others.

Hiding information behind a paywall diminishes its impact.

### elasticsearch 0.90.0.RC1 Released

Thursday, March 21st, 2013

elasticsearch 0.90.0.RC1 Released by Shay Banon.

elasticsearch version 0.90.0.RC1 is out, the first release candidate for the 0.90 release. You can download it here.

This release includes an upgrade to Lucene 4.2, many improvements to the suggester feature (including its own dedicated API), another round of memory improvements to field data (long type will now automatically “narrow” to the smallest type when loaded to memory) and several bug fixes. Upgrading to it from previous beta releases is highly recommended. (inserted URL to release notes)

Just to keep you on the cutting edge of search technology!

### MongoDB 2.4 Release

Tuesday, March 19th, 2013

MongoDB 2.4 Release

Developer Productivity

• Capped Arrays simplify development by making it easy to incorporate fixed, sorted lists for features like leaderboards and logging.
• Geospatial Enhancements enable new use cases with support for polygon intersections and analytics based on geospatial data.
• Text Search provides a simplified, integrated approach to incorporating search functionality into apps (Note: this feature is currently in beta release).

Operations

• Hash-Based Sharding simplifies deployment of large MongoDB systems.
• Working Set Analyzer makes capacity planning easier for ops teams.
• Improved Replication increases resiliency and reduces administration.
• Mongo Client creates an intuitive, consistent feature set across all drivers.

Performance

• Faster Counts and Aggregation Framework Refinements make it easier to leverage real-time, in-place analytics.
• V8 JavaScript Engine offers better concurrency and faster performance for some operations, including MapReduce jobs.

Monitoring

• On-Prem Monitoring provides comprehensive monitoring, visualization and alerting on more than 100 operational metrics of a MongoDB system in real time, based on the same application that powers 10gen’s popular MongoDB Monitoring Service (MMS). On-Prem Monitoring is only available with MongoDB Enterprise.

Security
….

• Kerberos Authentication enables enterprise and government customers to integrate MongoDB into existing enterprise security systems. Kerberos support is only available in MongoDB Enterprise.
• Role-Based Privileges allow organizations to assign more granular security policies for server, database and cluster administration.

You can read more about the improvements to MongoDB 2.4 in the Release Notes. Also, MongoDB 2.4 is available for download on MongoDB.org.

Lots to look at in MongoDB 2.4!

But I am curious about the beta text search feature.

Text search (SERVER-380) is one of the most requested features for MongoDB. 10gen is working on an experimental text-search feature, to be released in v2.4, and we’re already seeing some talk in the community about the native implementation within the server. We view this as an important step towards fulfilling a community need.

MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need. (emphasis added)

So, why isn’t MongoDB incorporating Solr/Lucene instead of a home grown text search feature?

Seems like users could leverage their Solr/Lucene skills with their MongoDB installations.

Yes?

### Semantic Queries by Example [Identity by Example (IBE)?]

Sunday, March 17th, 2013

Semantic Queries by Example by Lipyeow Lim, Haixun Wang, Min Wang.

Abstract:

With the ever increasing quantities of electronic data, there is a growing need to make sense out of the data. Many advanced database applications are beginning to support this need by integrating domain knowledge encoded as ontologies into queries over relational data. However, it is extremely difficult to express queries against graph structured ontology in the relational SQL query language or its extensions. Moreover, semantic queries are usually not precise, especially when data and its related ontology are complicated. Users often only have a vague notion of their information needs and are not able to specify queries precisely. In this paper, we address these challenges by introducing a novel method to support semantic queries in relational databases with ease. Instead of casting ontology into relational form and creating new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using those examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology. Our approach consists of three steps. In the first step, the user provides several examples that satisfy the query. In the second step, we use machine learning techniques to mine the semantics of the query from the given examples and related ontologies. Finally, we apply the query semantics on the data to generate the full query result. We also implement an optional active learning mechanism to find the query semantics accurately and quickly. Our experiments validate the effectiveness of our approach.
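To make the three steps in the abstract concrete, here is a toy sketch. It is not the authors' method: their second step uses machine learning, while this stand-in simply picks the most specific ontology concept that covers every example. The ontology, the row values and all function names are hypothetical.

```python
# Toy ontology: child concept -> parent concept (hypothetical data).
PARENT = {
    "aspirin": "nsaid",
    "ibuprofen": "nsaid",
    "nsaid": "analgesic",
    "morphine": "opioid",
    "opioid": "analgesic",
    "analgesic": "drug",
    "amoxicillin": "antibiotic",
    "antibiotic": "drug",
}


def ancestors(concept):
    """Return the concept followed by its ancestors, most specific first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def infer_concept(examples):
    """Step 2 stand-in: the most specific concept covering every example."""
    shared = set(ancestors(examples[0]))
    for example in examples[1:]:
        shared &= set(ancestors(example))
    for concept in ancestors(examples[0]):
        if concept in shared:
            return concept
    return None


def run_query(rows, examples):
    """Step 3: keep the rows that fall under the inferred concept."""
    concept = infer_concept(examples)
    return [row for row in rows if concept in ancestors(row)]


rows = ["aspirin", "ibuprofen", "morphine", "amoxicillin"]
# Step 1: the user supplies examples instead of writing a query.
print(infer_concept(["aspirin", "morphine"]))
print(run_query(rows, ["aspirin", "morphine"]))
```

The user never names “analgesic”; the examples carry the semantics, which is the point the authors make.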

Potentially deeply important work for both a topic map query language and topic map authoring.

The authors conclude:

In this paper, we introduce a machine learning approach to support semantic queries in relational database. In semantic query processing, the biggest hurdle is to represent ontological data in relational form so that the relational database engine can manipulate the ontology in a way consistent with manipulating the data. Previous approaches include transforming the graph ontological data into tabular form, or representing ontological data in XML and leveraging database extenders on XML such as DB2’s Viper. These approaches, however, are either expensive (materializing a transitive relationship represented by a graph may increase the data size exponentially) or requiring changes in the database engine and new extensions to SQL. Our approach shields the user from the necessity of dealing with the ontology directly. Indeed, as our user study indicates, the difficulty of expressing ontology-based query semantics in a query language is the major hurdle of promoting semantic query processing. With our approach, the users do not even need to know ontology representation. All that is required is that the user gives some examples that satisfy the query he has in mind. The system then automatically finds the answer to the query. In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language. Our experiments and user study results show that the approach is efficient, effective, and general in supporting semantic queries in terms of both accuracy and usability. (emphasis added)

I rather like: “In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language.”

To take it a step further, it should apply to the authoring of topic maps as well.

A user selects from a set of examples the subjects they want to talk about. Quite different from any topic map authoring interface I have seen to date.

The “details” of capturing and querying semantics have stymied RDF:

And topic map authoring as well.

Is your next authoring/querying interface going to be by example?

I first saw this in a tweet by Stefano Bertolo.

### Leading People to Longer Queries

Thursday, March 14th, 2013

Leading People to Longer Queries by Elena Agapie, Gene Golovchinsky, Pernilla Qvarfordt.

Abstract:

Although longer queries can produce better results for information seeking tasks, people tend to type short queries. We created an interface designed to encourage people to type longer queries, and evaluated it in two Mechanical Turk experiments. Results suggest that our interface manipulation may be effective for eliciting longer queries.

The researchers encouraged longer queries by varying a halo around the search box.

Not conclusive but enough evidence to ask the questions:

What does your search interface encourage?

What other ways could you encourage query construction?

How would you encourage graph queries?

I first saw this in a tweet by Gene Golovchinsky.

### squirro

Wednesday, March 13th, 2013

squirro

I am not sure how “hard” the numbers are but the CRM application claims:

Up to 15% increase in revenues

66% less time wasted on finding and re-finding information

15% increase in win rates

I take this as evidence there is a market for less noisy data streams.

If filtered search can produce this kind of ROI, imagine what curated search can do.

Yes?

### SURAAK – When Search Is Not Enough [A "google" of search results, new metric]

Wednesday, March 13th, 2013

A new way to do research. SURAAK is a web application that uses natural language processing techniques to analyze big data of published healthcare articles in the area of geriatrics and senior care. See how SURAAK uses text causality to find and analyze word relationships in this and other areas of interest.

SURAAK = Semantic Understanding Research in the Automatic Acquisition of Knowledge.

NLP based system that extracts “causal” sentences.

Differences from Google (according to the video)

• Extracts text from PDFs
• Links concepts together building relationships found in extracted text
• Links articles together based on shared concepts

Search demo was better than using Google but that’s not hard to do.

The “notes” that are extracted from texts are sentences.

I am uneasy about the use of sentences in isolation from the surrounding text as a “note.”

It’s clearly “doable,” but whether it is a good idea, remains to be seen. Particularly since users are rating sentences/notes in isolation from the text in which they occur.

BTW, funded with tax dollars from the National Institutes of Health and the National Institute on Aging, to the tune of $844K.

I am still trying to track down the resulting software.

I take this as an illustration that anything over a “google” of search results (a new metric), is of interest and fundable.

### Studying PubMed usages in the field…

Monday, March 11th, 2013

Studying PubMed usages in the field for complex problem solving: Implications for tool design by Barbara Mirel, Jennifer Steiner Tonks, Jean Song, Fan Meng, Weijian Xuan, Rafiqa Ameziane.

(Mirel, B., Tonks, J. S., Song, J., Meng, F., Xuan, W. and Ameziane, R. (2013), Studying PubMed usages in the field for complex problem solving: Implications for tool design. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22796)

Abstract:

Many recent studies on MEDLINE-based information seeking have shed light on scientists’ behaviors and associated tool innovations that may improve efficiency and effectiveness. Few, if any, studies, however, examine scientists’ problem-solving uses of PubMed in actual contexts of work and corresponding needs for better tool support. Addressing this gap, we conducted a field study of novice scientists (14 upper-level undergraduate majors in molecular biology) as they engaged in a problem-solving activity with PubMed in a laboratory setting. Findings reveal many common stages and patterns of information seeking across users as well as variations, especially variations in cognitive search styles. Based on these findings, we suggest tool improvements that both confirm and qualify many results found in other recent studies. Our findings highlight the need to use results from context-rich studies to inform decisions in tool design about when to offer improved features to users.

From the introduction:

For example, our findings confirm that additional conceptual information integrated into retrieved results could expedite getting to relevance. Yet—as a qualification—evidence from our field cases suggests that presentations of this information need to be strategically apportioned and staged or they may inadvertently become counterproductive due to cognitive overload.

Curated data raises its ugly head, again.

Topic maps curate data and search results.

Search engines don’t curate data or search results.

How important is it for your doctor to find the right answers? In a timely manner?

### URL Search Tool!

Wednesday, March 6th, 2013

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified.

URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs since you can now run them across only on the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!

### How Search Works [Promoting Your Interface]

Tuesday, March 5th, 2013

How Search Works (Google)

Clever graphics and I rather liked the:

By the way, in the **** seconds you’ve been on this page, approximately *********** searches were performed.

Not that you want that sort of tracking if your topic map interface only gets two or three “hits” a day but in an enterprise context…, might be worth thinking about.

Evidence of the popularity of your topic map interface with the troops.

I first saw this in a tweet by Christian Jansen.

### Are Googly Eyes Spying (on you)?

Monday, February 25th, 2013

Felix Salmon’s The long arm of the Google raises serious privacy issues.

A bond king recovering $10 million in stolen art warms everyone’s heart, but what other law enforcement searches are being done with Google’s assistance?

Are they collecting data on searches for:

• “Root kit”
• Bomb making
• Cybersecurity
• Sources of guns or ammunition
• Partners with sexual preferences
• Your searches correlated with those of others

Hard to say and I would not trust any answer from Google or law enforcement on the subject.

Avoiding script kiddie spying by search engines requires the use of proxy servers or services such as Tor (anonymity network).

But none of those methods is immune from attack and all require technical skill and vigilance on the part of a user.

Let me sketch out a possible solution, at least for web searching.

1. A human search service to do a curated search
2. The search results are packaged for HTTP pickup
3. A web server running in no-log mode. Never logs any data. Can pass the ID of your search for retrieval but that is all that it knows.
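Step 3 can be sketched with nothing more than Python's standard library. This is a minimal illustration under stated assumptions, not a hardened service: the `RESULTS` store, the search ID and the `make_server` helper are all hypothetical, and silencing the handler's log does nothing about logs kept by the operating system, a proxy in front of the server, or the network.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical store of packaged results, keyed by an opaque search ID.
RESULTS = {"a1b2c3": b"curated results go here\n"}


class NoLogHandler(BaseHTTPRequestHandler):
    def log_message(self, format, *args):
        # Overriding this hook suppresses the handler's default request log,
        # including the entries normally written for error responses.
        pass

    def do_GET(self):
        # The path is the only thing the server learns: the search ID.
        body = RESULTS.get(self.path.lstrip("/"))
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), NoLogHandler)


# To run it: make_server(8080).serve_forever()
```

A real deployment would also want TLS and IDs that cannot be guessed, both outside the scope of this sketch.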

Thinking of a curated search because you don’t have the full interactivity of a live search.

Having a person curate the results would get you higher quality results. Like using a librarian.

Would not be free but you would not have Google, local, state and federal law enforcement looking over your shoulder.

What is it they say?

Freedom is never free.

### Drill Sideways faceting with Lucene

Monday, February 25th, 2013

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]

….

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
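The recap can be sketched in a few lines of plain Python. This is not Lucene's DrillSideways API, only an illustration of the counting rule: for a field that already has a filter, facet counts are taken over documents that pass all the other filters. The catalog, field names and function names are made up for the example.

```python
from collections import Counter

# Hypothetical catalog; field names and values are made up for illustration.
DOCS = [
    {"brand": "Sony", "size": "40in"},
    {"brand": "Sony", "size": "50in"},
    {"brand": "LG", "size": "40in"},
    {"brand": "Samsung", "size": "50in"},
    {"brand": "Samsung", "size": "40in"},
]


def matches(doc, filters, skip=None):
    """True if doc passes every filter; values within one field are OR'd."""
    return all(doc[f] in vals for f, vals in filters.items() if f != skip)


def drill_sideways(docs, filters, fields):
    """Return (hits, facet counts per field)."""
    hits = [d for d in docs if matches(d, filters)]
    counts = {}
    for field in fields:
        # For a filtered field, count over docs that pass all OTHER filters,
        # so unselected values stay visible with meaningful counts.
        pool = [d for d in docs if matches(d, filters, skip=field)]
        counts[field] = Counter(d[field] for d in pool)
    return hits, counts


# Multi-select on brand: Sony OR LG.
hits, counts = drill_sideways(DOCS, {"brand": {"Sony", "LG"}}, ["brand", "size"])
print(len(hits))
print(counts["brand"])
print(counts["size"])
```

Samsung still shows up in the brand counts even though it is not selected, which is what lets the user drill sideways to it; the size counts reflect only the Sony and LG hits.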

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

### Creating a Simple Bloom Filter

Friday, February 22nd, 2013

Creating a Simple Bloom Filter by Max Burstein.

From the post:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

You may have never heard of a bloom filter before but you’ve probably interacted with one at some point. For instance if you use Chrome, Chrome has a bloom filter of malicious URLs. When you visit a website it checks if that domain is in the filter. This prevents you from having to ping Google’s servers every time you visit a website to check if it’s malicious or not. Large databases such as Cassandra and Hadoop use bloom filters to see if it should do a large query or not.

I think you will appreciate the lookup performance difference.
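For a sense of how little code is involved, here is a minimal sketch of the idea. It is not Burstein's code: the class layout, the default sizes and the choice of salted SHA-256 for the hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All bits set: "probably present". Any bit clear: "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("malicious.example")
print("malicious.example" in seen)  # an added item is always reported present
```

More bits and more hash functions lower the false positive rate at the cost of memory and time, which is the trade-off the post describes.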

I first saw this at: Alex Popescu: Creating a Simple Bloom Filter in Python.