Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 26, 2014

Elasticsearch 1.1.0,…

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 7:00 pm

Elasticsearch 1.1.0, 1.0.2 and 0.90.13 released by Clinton Gormley.

From the post:

Today we are happy to announce the release of Elasticsearch 1.1.0, based on Lucene 4.7, along with bug fix releases Elasticsearch 1.0.2 and Elasticsearch 0.90.13:

You can download them and read the full changes list here:

New features in 1.1.0

Elasticsearch 1.1.0 is packed with new features: better multi-field search, the search templates and the ability to create aliases when creating an index manually or with a template. In particular, the new aggregations framework has enabled us to support more advanced analytics: the cardinality agg for counting unique values, the significant_terms agg for finding uncommonly common terms, and the percentiles agg for understanding data distribution.

We will be blogging about all of these new features in more detail, but for now we’ll give you a taste of what each feature adds:

….

Well, there goes the rest of the week! 😉
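The three new aggregations called out above are used through the ordinary search API. A minimal sketch of what the request bodies look like (index and field names here are invented for illustration, not taken from the release notes):

```python
import json

# Hypothetical field names; any suitable field works the same way.
cardinality_agg = {
    "aggs": {
        "unique_authors": {"cardinality": {"field": "author"}}
    }
}

significant_terms_agg = {
    "aggs": {
        "uncommonly_common": {"significant_terms": {"field": "tags"}}
    }
}

percentiles_agg = {
    "aggs": {
        "load_time_outliers": {"percentiles": {"field": "load_time"}}
    }
}

# Each body would be POSTed to an index's _search endpoint.
for body in (cardinality_agg, significant_terms_agg, percentiles_agg):
    print(json.dumps(body))
```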

March 21, 2014

Elasticsearch: The Definitive Guide

Filed under: ElasticSearch,Indexing,Search Engines,Searching — Patrick Durusau @ 5:52 pm

Elasticsearch: The Definitive Guide (Draft)

From the Preface, who should read this book:

This book is for anybody who wants to put their data to work. It doesn’t matter whether you are starting a new project and have the flexibility to design the system from the ground up, or whether you need to give new life to a legacy system. Elasticsearch will help you to solve existing problems and open the way to new features that you haven’t yet considered.

This book is suitable for novices and experienced users alike. We expect you to have some programming background and, although not required, it would help to have used SQL and a relational database. We explain concepts from first principles, helping novices to gain a sure footing in the complex world of search.

The reader with a search background will also benefit from this book. Elasticsearch is a new technology which has some familiar concepts. The more experienced user will gain an understanding of how those concepts have been implemented and how they interact in the context of Elasticsearch. Even in the early chapters, there are nuggets of information that will be useful to the more advanced user.

Finally, maybe you are in DevOps. While the other departments are stuffing data into Elasticsearch as fast as they can, you’re the one charged with stopping their servers from bursting into flames. Elasticsearch scales effortlessly, as long as your users play within the rules. You need to know how to setup a stable cluster before going into production, then be able to recognise the warning signs at 3am in the morning in order to prevent catastrophe. The earlier chapters may be of less interest to you but the last part of the book is essential reading — all you need to know to avoid meltdown.

I fully understand the need, nay, compulsion for an author to say that everyone who is literate needs to read their book. And, if you are not literate, their book is a compelling reason to become literate! 😉

As the author of a book (two editions) and more than one standard, I can assure you an author’s need to reach everyone serves no one very well.

Potential readers range from novices to intermediate users to experts.

A book that targets all three will “waste” space on material already known to experts but not to novices and/or intermediate users.

At the same time, space in a physical book being limited, some material relevant to the expert will be left out altogether.

I had that experience quite recently when the details of LukeRequestHandler (Solr) were described as:

Reports meta-information about a Solr index, including information about the number of terms, which fields are used, top terms in the index, and distributions of terms across the index. You may also request information on a per-document basis.

That’s it. Out of more than 600 pages of text, that is all the information you will find on LukeRequestHandler.

Fortunately I did find: https://wiki.apache.org/solr/LukeRequestHandler.

I don’t fault the author because several entire books could be written with the material they left out.

That is the hardest part of authoring, knowing what to leave out.

PS: Having said all that, I am looking forward to reading Elasticsearch: The Definitive Guide as it develops.

March 18, 2014

Automatic bulk OCR and full-text search…

Filed under: ElasticSearch,Search Engines,Solr,Topic Maps — Patrick Durusau @ 8:48 pm

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

From the post:

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization’s budget.

Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involves searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn’t considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work I’ve been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:

If this weren’t impressive enough, Chris has a number of research ideas, including:

the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like Github for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.

More grist for a topic map mill!

PS: Should you ever come across a treasure trove of not widely available documents, please replicate them to as many public repositories as possible.

Traditional news outlets protect people in leak situations who knew they were playing in the street. Why they merit more protection than the average person is a mystery to me. Let’s protect the average people first and the players last.

February 19, 2014

Troubleshooting Elasticsearch searches, for Beginners

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 2:46 pm

Troubleshooting Elasticsearch searches, for Beginners by Alex Brasetvik.

From the post:

Elasticsearch’s recent popularity is in large part due to its ease of use. It’s fairly simple to get going quickly with Elasticsearch, which can be deceptive. Here at Found we’ve noticed some common pitfalls new Elasticsearch users encounter. Consider this article a piece of necessary reading for the new Elasticsearch user; if you don’t know these basic techniques take the time to familiarize yourself with them now, you’ll save yourself a lot of distress.

Specifically, this article will focus on text transformation, more properly known as text analysis, which is where we see a lot of people get tripped up. Having used other databases, the fact that all data is transformed before getting indexed can take some getting used to. Additionally, “schema free” means different things for different systems, a fact that is often confused with Elasticsearch’s “Schema Flexible” design.

When Alex says “beginners” he means beginning developers, so this isn’t a post you can send to users with search troubles.

Sorry!

But if you are trying to debug search results in ElasticSearch as a developer, this is a good place to start.
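The core pitfall Alex describes, that all text is transformed by analysis before it is indexed, can be sketched without a cluster. The following is a toy illustration of what a standard-style analyzer does to a field, not Elasticsearch's actual code:

```python
import re

def toy_standard_analyzer(text):
    """Roughly what a standard-style analyzer does: split on
    non-alphanumeric characters, then lowercase each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

indexed = toy_standard_analyzer("Troubleshooting Elasticsearch Searches")
print(indexed)  # ['troubleshooting', 'elasticsearch', 'searches']

# A term-level lookup for the original-case token finds nothing,
# because only the analyzed tokens made it into the inverted index.
print("Elasticsearch" in indexed)  # False
print("elasticsearch" in indexed)  # True
```

This is exactly why a `term` query for `Elasticsearch` can come back empty against a field that obviously contains the word.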

February 18, 2014

ElasticSearch Analyzers – Parts 1 and 2

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 4:16 pm

Andrew Cholakian has written a two part introduction to analyzers in ElasticSearch.

All About Analyzers, Part One

From the introduction:

Choosing the right analyzer for an Elasticsearch query can be as much art as science. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms in an inverted index. If you need a refresher on the basics of inverted indexes and where analysis fits into Elasticsearch in general please see this chapter in Exploring Elasticsearch covering analyzers. In this article we’ll survey various analyzers, each of which showcases a very different approach to parsing text.

Ten tokenizers, thirty-one token filters, and three character filters ship with the Elasticsearch distribution; a truly overwhelming number of options. This number can be increased further still through plugins, making the choices even harder to wrap one’s head around. Combinations of these tokenizers, token filters, and character filters create what’s called an analyzer. There are eight standard analyzers defined, but really, they are simply convenient shortcuts for arranging tokenizers, token filters, and character filters yourself. While reaching an understanding of this multitude of options may sound difficult, becoming reasonably competent in the use of analyzers is merely a matter of time and practice. Once the basic mechanisms behind analysis are understood, these tools are relatively easy to reason about and compose.

All About Analyzers, Part Two (continues part 1).

Very much worth your time if you need a refresher on analyzers for ElasticSearch and/or are approaching them for the first time.
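Andrew's point that analyzers are simply arrangements of tokenizers, token filters, and character filters shows up directly in the index-settings syntax. A sketch of a custom analyzer definition (the analyzer name is invented; `standard`, `lowercase`, and `stop` are standard building blocks):

```python
import json

# Index settings composing a custom analyzer from stock components.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_lowercase_standard": {      # invented name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"]
                }
            }
        }
    }
}
print(json.dumps(settings, indent=2))
```

The same shape accommodates any of the ten tokenizers and thirty-one token filters he mentions; the eight predefined analyzers are just named shortcuts for compositions like this one.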

Of course I went hunting for the treatment of synonyms, only to find the standard fare.

Not bad by any means, but even a grade school student knows that synonyms depend upon any number of factors, something you would be hard pressed to find reflected in any search engine.

I suppose you could define synonyms as most engines do and then filter the results to eliminate from a gene search “hits” from Field and Stream, Guns & Ammo, and the like. Although your searchers may be interested in how to trick out an AR-15. 😉
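Defining synonyms “as most engines do” means a flat synonym token filter, which is precisely why there is no room for time, social class, or context of use. A sketch of that standard fare (filter name and terms invented for illustration):

```python
import json

# A flat synonym list: every occurrence is expanded the same way,
# regardless of domain, period, or context of use.
synonym_settings = {
    "filter": {
        "my_synonyms": {
            "type": "synonym",
            "synonyms": ["gun, firearm, weapon"]
        }
    }
}
print(json.dumps(synonym_settings))
```

Nothing in that structure can express “only when the document is about hunting,” which is the gap the post above is pointing at.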

It may be that simple bulk steps are faster than more sophisticated searching. Will have to give that some thought.

February 10, 2014

Data visualization with Elasticsearch aggregations and D3

Filed under: D3,ElasticSearch,Visualization — Patrick Durusau @ 1:53 pm

Data visualization with Elasticsearch aggregations and D3 by Shelby Sturgis.

From the post:

For those of you familiar with Elasticsearch, you know that its an amazing modern, scalable, full-text search engine with Apache Lucene and the inverted index at its core. Elasticsearch allows users to query their data and provides efficient and blazingly fast look up of documents that make it perfect for creating real-time analytics dashboards.

Currently, Elasticsearch includes faceted search, a functionality that allows users to compute aggregations of their data. For example, a user with twitter data could create buckets for the number of tweets per year, quarter, month, day, week, hour, or minute using the date histogram facet, making it quite simple to create histograms.

Faceted search is a powerful tool for data visualization. Kibana is a great example of a front-end interface that makes good use of facets. However, there are some major restrictions to faceting. Facets do not retain information about which documents fall into which buckets, making complex querying difficult. Which is why, Elasticsearch is pleased to introduce the aggregations framework with the 1.0 release. Aggregations rips apart its faceting restraints and provides developers the potential to do much more with visualizations.

Aggregations (=Awesomeness!)

Aggregations is “faceting reborn”. Aggregations incorporate all of the faceting functionality while also providing much more powerful capabilities. Aggregations is a “generic” but “extremely powerful” framework for building any type of aggregation. There are several different types of aggregations, but they fall into two main categories: bucketing and metric. Bucketing aggregations produce a list of buckets, each one with a set of documents that belong to it (e.g., terms, range, date range, histogram, date histogram, geo distance). Metric aggregations keep track and compute metrics over a set of documents (e.g., min, max, sum, avg, stats, extended stats).

Using Aggregations for Data Visualization (with D3)

Lets dive right in and see the power that aggregations give us for data visualization. We will create a donut chart and a dendrogram using the Elasticsearch aggregations framework, the Elasticsearch javascript client, and D3.

If you are new to Elasticsearch, it is very easy to get started. Visit the Elasticsearch overview page to learn how to download, install, and run Elasticsearch version 1.0.
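The tweets-per-interval example above translates directly into a date histogram, now expressed as an aggregation rather than a facet. A sketch of the request body (the field name is an assumption):

```python
import json

# Bucket tweets by month on an assumed "created_at" timestamp field.
tweets_per_month = {
    "aggs": {
        "tweets_over_time": {
            "date_histogram": {
                "field": "created_at",
                "interval": "month"
            }
        }
    }
}
print(json.dumps(tweets_per_month))
```

Swapping `"month"` for `"week"`, `"day"`, or `"hour"` gives the other bucketings the post mentions.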

The dendrogram of football (U.S.) touchdowns is particularly impressive.

BTW, https://github.com/stormpython/Elasticsearch-datasets/archive/master.zip, returns Elasticsearch-datasets-master.zip on your local drive. Just to keep you from hunting for it.

January 23, 2014

Foundation…

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 8:29 pm

Foundation: Learn and Play with Elasticsearch

I have posted about several of the articles here but missed posting about the homepage for this site.

Take a close look at Play. It offers you the opportunity to alter documents and search settings in ElasticSearch, what I would call online experimentation.

The idea of simple, interactive play with search software is a good one.

I wonder how that would translate into an interface for the same thing for topic maps?

The immediacy of feedback along with a non-complex interface would be selling points to me.

You will also find some twenty-five articles (as of today) ranging from beginner to more advanced topics on ElasticSearch.

January 10, 2014

CouchDB + ElasticSearch on Ubuntu 13.10 VPS

Filed under: CouchDB,ElasticSearch — Patrick Durusau @ 2:16 pm

How To Set Up CouchDB with ElasticSearch on an Ubuntu 13.10 VPS by Cooper Thompson.

From the post:

CouchDB is a NoSQL database that stores data as JSON documents. It is extremely helpful in situations where a schema would cause headaches and a flexible data model is required. CouchDB also supports master-master continuous replication, which means data can be continuously replicated between two databases without having to setup a complex system of master and slave databases.

ElasticSearch is a full-text search engine that indexes everything and makes pretty much anything searchable. This works extremely well with CouchDB because one of the limitations of CouchDB is that for all queries you have to either know the document ID or you have to use map/reduce.

This looks like a very useful installation guide if you are just starting with CouchDB and/or ElasticSearch.

I say “looks like” because the article is undated. The only way I know it is relatively recent is that it refers to ElasticSearch 0.90.8 and the latest release of ElasticSearch is 0.90.10.

Dating posts, guides, etc. really isn’t that hard and it helps readers avoid out-dated material.

December 24, 2013

elasticsearch-entity-resolution

Filed under: Duke,ElasticSearch,Entity Resolution,Search Engines,Searching — Patrick Durusau @ 2:17 pm

elasticsearch-entity-resolution

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute probability. You can pretty much use it as an interactive deduplication engine.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.

Interesting pairing of Duke (entity resolution/record linkage software by Lars Marius Garshol) with ElasticSearch.

Strings and user search behavior can only take an indexing engine so far. This is a step in the right direction.

A step more likely to be followed were it under an Apache License as opposed to its current LGPLv3.

December 21, 2013

…Titan Cluster on Cassandra and ElasticSearch on AWS EC2

Filed under: Cassandra,ElasticSearch,Graphs,Titan — Patrick Durusau @ 8:10 pm

Setting up a Titan Cluster on Cassandra and ElasticSearch on AWS EC2 by Jenny Kim.

From the post:

The purpose of this post is to provide a walkthrough of a Titan cluster setup and highlight some key gotchas I’ve learned along the way. This walkthrough will utilize the following versions of each software package:

Versions

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the Datastax Community AMI requires at minimum, M1.Large instances, the exact instance-type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.

Great post!

You will be gaining experience with cloud computing along with very high end graph software (Titan).

December 6, 2013

Instructions for deploying an Elasticsearch Cluster with Titan

Filed under: ElasticSearch,Graphs,Titan — Patrick Durusau @ 7:28 pm

Instructions for deploying an Elasticsearch Cluster with Titan by Benjamin Bengfort.

From the post:

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius, supports elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an elasticsearch cluster side by side with the graph engine.

This tutorial is how to quickly get an elasticsearch cluster up and running on EC2, then configuring Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note, that these instructions were for a particular deployment, so please forward any questions about specifics in the comments!

A great tutorial. Short, on point and references other resources.

Enjoy!

December 2, 2013

ElasticSearch 1.0.0.Beta2 released

Filed under: Aggregation,ElasticSearch,Search Engines — Patrick Durusau @ 4:08 pm

ElasticSearch 1.0.0.Beta2 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta2, the second beta release on the road to 1.0.0 GA. The new features we have planned for 1.0.0 have come together more quickly than we expected, and this beta release is chock full of shiny new toys. Christmas has come early!

We have added:

Please download elasticsearch 1.0.0.Beta2, try it out, break it, figure out what is missing and tell us about it. Our next release will focus on cleaning up inconsistent APIs and usability, plus fixing any bugs that are reported in the new functionality, so your early bug reports are an important part of ensuring that 1.0.0 GA is solid.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

Suggestion: Pay close attention to the documentation on the new aggregation capabilities.

For example:

There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into two main families:

Bucketing: A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, each bucket’s criterion is evaluated against every document in the context and, when it matches, the document is considered to “fall in” the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets – each one with a set of documents that “belong” to it.

Metric: Aggregations that keep track and compute metrics over a set of documents

The interesting part comes next: since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associate aggregations at the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!

Interesting, yes?
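The nesting described above, a metric aggregation running inside each bucket of a bucketing aggregation, looks like this in the request body (field names invented for illustration):

```python
import json

# A terms (bucketing) aggregation with a nested avg (metric) aggregation:
# average price computed separately within each category bucket.
nested_agg = {
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {
                "avg_price": {"avg": {"field": "price"}}
            }
        }
    }
}
print(json.dumps(nested_agg, indent=2))
```

The inner `aggs` block executes once per bucket, which is what the facets it replaces could not do.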

November 27, 2013

Redesigned percolator

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 11:15 am

Redesigned percolator by Martijn Vangroningen.

From the post:

The percolator is essentially search in reverse, which can be confusing initially for many people. This post will help to solve that problem and give more information on the redesigned percolator. We have added a lot more features to it to help users work with percolated documents/queries more easily.

In normal search systems, you store your data as documents and then send your questions as queries. The search results are a list of documents that matched your query.

With the percolator, this is reversed. First, you store the queries and then you send your ‘questions’ as documents. The percolator results are a list of queries that matched the document.

So what can the percolator do for you? The percolator can be used for a number of use cases, but the most common is for alerting and monitoring. By registering queries in Elasticsearch, your data can be monitored in real-time. If data with certain properties is being indexed, the percolator can tell you what queries this data matches.

For example, imagine a user “saving” a search. As new documents are added to the index, documents are percolated against this saved query and the user is alerted when new documents match. The percolator can also be used for data classification and user query feedback.

Even as a beta feature, this sounds interesting.
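The register-then-percolate flow reads roughly as follows. A sketch of the two request bodies involved (the exact endpoints differ between Elasticsearch versions, and the field names are invented):

```python
import json

# 1. Register a query: store it so documents can later be matched against it.
saved_query = {
    "query": {"match": {"body": "elasticsearch"}}
}

# 2. Later, percolate a new document against all registered queries;
#    the response lists which saved queries matched.
doc_to_percolate = {
    "doc": {"body": "A newspaper publishes an article mentioning Elasticsearch."}
}

print(json.dumps(saved_query))
print(json.dumps(doc_to_percolate))
```

The response to the second request is the list of matching query IDs, which is the hook for the alerting (or SLA-tiered delivery) scenarios discussed here.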

Another use case could be adhering to a Service Level Agreement (SLA).

You could have tiered search result packages that guarantee the freshness of search results. Near real-time would be more expensive than within six (6) hours or within the next business day. The match to a stored query could be queued up for delivery in accordance with your SLA.

I pay more for faster delivery times from FedEx, UPS, and, the US Post Office.

Why shouldn’t faster information cost more than slower information?

True, there are alternative suppliers of information but then you remind your prospective client of the old truism, you get what you pay for.

That is not contradicted by IT disasters such as HealthCare.gov.

The government hired contractors that are hard to distinguish from their agency counterparts and who are interested in “butts in seats” and not any useful results.

In that sense, the government literally got what it paid for. Had it wanted a useful healthcare IT project, it would not have put government drones in charge of the project.

Similarity in Elasticsearch

Filed under: ElasticSearch,Similarity,Similarity Retrieval — Patrick Durusau @ 10:43 am

Similarity in Elasticsearch by Konrad G. Beiske.

From the post:

A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval where the problem can be defined as ‘find the most relevant documents to a given query’. In this article I will look into the latter category.

Elasticsearch provides the following similarity models: default, bm25, dfr and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.

Konrad goes on to talk about the default similarity model in Elasticsearch, Tf/idf and BM25 (aka Okapi BM25), a probabilistic model.

He also points the reader to: The Probabilistic Relevance Framework: BM25 and Beyond for further details on BM25.

A good post if you want to learn more about tuning similarity in Elasticsearch.

BTW, documentation on similarity module for 0.90.
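Per the similarity module documentation linked above, the model can be chosen per field in the mapping. A minimal sketch (type and field names invented for illustration):

```python
import json

# Mapping that opts a single field into BM25 instead of the tf/idf default.
mapping = {
    "mappings": {
        "book": {
            "properties": {
                "title": {
                    "type": "string",
                    "similarity": "BM25"
                }
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```

A custom similarity model of the kind speculated about below would be plugged in the same way, under its own registered name.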

While the built-in similarity models no doubt offer a lot of mileage, I am more intrigued by the potential for creating a custom similarity model.

As you know, some people think English words are just English words. Search engines tend to ignore time, social class, context of use, etc., in returning all the “instances” of an English word.

That is to say the similarity model for one domain or period could be quite different from the similarity model for another.

Domain or period specific similarity models would be difficult to construct and certainly less general.

Given the choice, of being easy, general and less accurate versus being harder, less general and more accurate, which would you choose?

Does your answer change if you are a consumer looking for the best results or a developer trying to sell “good enough” results?

November 25, 2013

Fast Search and Analytics on Hadoop with Elasticsearch

Filed under: ElasticSearch,Hadoop YARN — Patrick Durusau @ 5:01 pm

Fast Search and Analytics on Hadoop with Elasticsearch by Lisa Sensmeier.

From the post:

Hortonworks customers can now enhance their Hadoop applications with Elasticsearch real-time data exploration, analytics, logging and search features, all designed to help businesses ask better questions, get clearer answers and better analyze their business metrics in real-time.

Hortonworks Data Platform and Elasticsearch make for a powerful combination of technologies that are extremely useful to anyone handling large volumes of data on a day-to-day basis. With the ability of YARN to support multiple workloads, customers with current investments in flexible batch processing can also add real-time search applications from Elasticsearch.

Not much in the way of substantive content but it does have links to good resources on Hadoop and Elasticsearch.

November 22, 2013

Spark and Elasticsearch

Filed under: ElasticSearch,Spark — Patrick Durusau @ 4:41 pm

Spark and Elasticsearch by Barnaby Gray.

From the post:

If you work in the Hadoop world and have not yet heard of Spark, drop everything and go check it out. It’s a really powerful, intuitive and fast map/reduce system (and some).

Where it beats Hadoop/Pig/Hive hands down is it’s not a massive stack of quirky DSLs built on top of layers of clunky Java abstractions – it’s a simple, pure Scala functional DSL with all the flexibility and succinctness of Scala. And it’s fast, and properly interactive – query, bam response snappiness – not query, twiddle fingers, wait a bit.. response.

And if you’re into search, you’ll no doubt have heard of Elasticsearch – a distributed restful search engine built upon Lucene.

They’re perfect bedfellows – crunch your raw data and spit it out into a search index ready for serving to your frontend. At the company I work for we’ve built the google-analytics-esque part of our product around this combination.

It so fast, it flies – we can process raw event logs at 250,000 events/s without breaking a sweat on a meagre EC2 m1.large instance. (bold emphasis added)

Don’t you just hate it when bloggers hold back? 😉

I’m not endorsing this solution but I do appreciate a post with attitude and useful information.

Enjoy!

November 6, 2013

elasticsearch 1.0.0.beta1 released

Filed under: ElasticSearch,Lucene,Search Engines,Searching — Patrick Durusau @ 8:04 pm

elasticsearch 1.0.0.beta1 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback is essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning “Elasticsearch”.

Percolation has been supported by Elasticsearch for a long time. In the current implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have the more percolation you can do.

After reading the news release I understand why Twitter traffic on the elasticsearch release surged today. 😉

A new major feature with each beta release? That should attract some attention.

Not to mention “distributed percolation.”

Getting closer to a result being the “result” at X time on the system clock.

October 27, 2013

Tiny Data: Rapid development with Elasticsearch

Filed under: ElasticSearch,Lucene,Ruby — Patrick Durusau @ 6:52 pm

Tiny Data: Rapid development with Elasticsearch by Leslie Hawthorn.

From the post:

Today we’re pleased to bring you the story of the creation of SeeMeSpeak, a Ruby application that allows users to record gestures for those learning sign language. Florian Gilcher, one of the organizers of the Berlin Elasticsearch User Group participated in a hackathon last weekend with three friends, resulting in this brand new open source project using Elasticsearch on the back end. (Emphasis in original.)

Project:

Sadly, there are almost no good learning resources for sign language on the internet. If material is available, licensing is a hassle or both the licensing and the material is poorly documented. Documenting sign language yourself is also hard, because producing and collecting videos is difficult. You need third-party recording tools, video conversion and manual categorization. That’s a sad state in a world where every notebook has a usable camera built in!

Our idea was to leverage modern browser technologies to provide an easy recording function and a quick interface to categorize the recorded words. The result is SeeMeSpeak.

Two lessons here:

  1. Data does not have to be “big” in order to be important.
  2. Browsers are very close to being the default UI for users.

October 26, 2013

Integrating Nutch 1.7 with ElasticSearch

Filed under: ElasticSearch,Nutch,Searching — Patrick Durusau @ 3:03 pm

Integrating Nutch 1.7 with ElasticSearch

From the post:

With Nutch 1.7 the possibility for integrating with ElasticSearch became available. However setting up the integration turned out to be quite a treasure hunt for me. For anybody else wanting to achieve the same result without tearing out as much hair as I did please find some simple instructions on this page that hopefully will help you in getting Nutch to talk to ElasticSearch.

I’m assuming you have both Nutch and ElasticSearch running fine, by which I mean that Nutch does its crawl, fetch, parse thing and ElasticSearch is doing its indexing and searching magic, however not yet together.

All of the work involved is in Nutch and you need to edit nutch-site.xml in the conf directory to get things going. First off you need to activate the elasticsearch indexer plugin by adding the following line to nutch-site.xml:

A post that will be much appreciated by anyone who wants to integrate Nutch with ElasticSearch.

A large number of software issues are matters of configuration, once you know the configuration.
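For the curious, a hedged sketch of what the nutch-site.xml changes amount to. The plugin and property names below follow the Nutch 1.7 indexer-elastic plugin as I understand it; treat them as assumptions and check the post for the exact values:

```xml
<!-- Sketch only: plugin and property names are assumptions, not quoted
     from the post. Activate the elastic indexer in plugin.includes, then
     point it at your cluster. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```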

The explorers who find those configurations and share them with others are under appreciated.

October 9, 2013

ElasticHQ

Filed under: ElasticSearch,Searching — Patrick Durusau @ 7:33 pm

ElasticHQ

From the homepage:

Real-Time Monitoring

From monitoring individual cluster nodes to viewing real-time threads, ElasticHQ enables up-to-the-second insight into ElasticSearch cluster runtime metrics and configurations, using the ElasticSearch REST API. ElasticHQ’s real-time update feature works by polling your ElasticSearch cluster intermittently, always pulling the latest aggregate information and deltas; keeping you up-to-date with the internals of your working cluster.

Full Cluster Management

Elastic HQ gives you complete control over your ElasticSearch clusters, nodes, indexes, and mappings. The sleek, intuitive UI gives you all the power of the ElasticSearch Admin API, without having to tangle with REST and large cumbersome JSON requests and responses.

Search and Query

Easily find what you’re looking for by querying a specific Index or several Indices at once. ElasticHQ provides a Query interface, along with all of the other Administration UI features.

No Software to Install

ElasticHQ does not require any software. It works in your web browser, allowing you to manage and monitor your ElasticSearch clusters from anywhere at any time. Built on responsive CSS design, ElasticHQ adjusts itself to any screen size on any device.

I don’t know of any compelling reason to make ElasticSearch management and monitoring difficult for sysadmins. 😉

If approaches like ElasticHQ make their lives easier, perhaps they won’t begrudge users having better UIs as well.
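The poll-and-diff pattern ElasticHQ describes boils down to something like this (a sketch of the idea, not ElasticHQ’s code; the metric names are illustrative):

```python
# Sketch of the polling pattern: fetch a stats snapshot from the REST API,
# diff it against the previous snapshot to get "what happened since last poll".

def stats_delta(previous, current):
    """Return per-metric deltas between two polled snapshots."""
    return {k: current[k] - previous.get(k, 0) for k in current}

# Two snapshots as they might come back from a cluster stats endpoint:
prev = {"indexing.index_total": 1200, "search.query_total": 560}
curr = {"indexing.index_total": 1350, "search.query_total": 610}

delta = stats_delta(prev, curr)
# delta["indexing.index_total"] is the number of docs indexed since last poll
```

Run that on a timer against the REST API and you have the skeleton of a real-time monitor.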

October 8, 2013

Elasticsearch Workshop

Filed under: ElasticSearch,JSON — Patrick Durusau @ 3:16 pm

Elasticsearch Workshop by David Pilato.

Nothing startling or new but a good introduction to Elasticsearch that you can pass along to programmers who like JSON. 😉

Nothing against JSON but “efficient” syntaxes are like using 7-bit encodings because it saves disk space.

October 3, 2013

Dublin Lucene Revolution 2013 Sessions

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:45 pm

Dublin Lucene Revolution 2013 Sessions

Just a sampling to whet your appetite:

With many more entries in the intermediate and introductory levels.

Of all of the listed sessions, which ones will set your sights on Dublin?

Reminder: Training: November 4-5, Conference: November 6-7

October 1, 2013

Elasticsearch internals: an overview

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:50 pm

Elasticsearch internals: an overview by Njal Karevoll.

From the post:

This article gives an overview of the Elasticsearch internals. I will present a 10,000 foot view of the different modules that Elasticsearch is composed of and how we can extend or replace built-in functionality using plugins.

Using Freemind, Njal has created maps of the namespaces and modules of ElasticSearch for your exploration.

The full module view reminds me of SGML productions, except less complicated.

September 26, 2013

Explore Your Data with Elasticsearch

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 2:36 pm

From the description:

As Honza Kral puts it, “Elasticsearch is a very buzz-word compliant piece of software.” By this he means, it’s open source, it can do REST, JSON, HTTP, it has real time, and even Lucene is somewhere in there. What does this all really mean? Well, simply, Elasticsearch is a distributed data store that’s very good at searching and analyzing data.

Honza, a Python programmer and Django core developer, visits SF Python, to show off what this powerful tool can do. He uses real data to demonstrate how Elasticsearch’s real-time analytics and visualizations tools can help you make sense of your application.

Follow along with Honza’s slides: http://crcl.to/6tdvs

There are clients for ElasticSearch, so don’t worry about the deeply nested brackets in the examples. 😉

A very good presentation on exploring data with ElasticSearch.

September 12, 2013

Elasticsearch Entity Resolution

Filed under: Deduplication,Duke,ElasticSearch,Entity Resolution — Patrick Durusau @ 2:24 pm

elasticsearch-entity-resolution by Yann Barraud.

From the webpage:

This project is an interactive entity resolution plugin for Elasticsearch based on Duke. Basically, it uses Bayesian probabilities to compute the probability of a match. You can pretty much use it as an interactive deduplication engine.

It is usable as is, though cleaners are not yet implemented.

To understand basics, go to Duke project documentation.

A list of available comparators is available here.
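The Bayesian combination Duke relies on folds per-field match probabilities together roughly like this (a sketch of the standard naive-Bayes combination, not code from the plugin; the probabilities are made up):

```python
# Sketch of the Bayesian combination behind Duke-style entity resolution:
# each field comparison yields a probability that the two records are the
# same entity, and the estimates are combined pairwise.

def combine(p1, p2):
    """Combine two independent probability estimates of 'same entity'."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))

def match_probability(field_probs, prior=0.5):
    """Fold per-field probabilities into an overall match probability."""
    p = prior
    for fp in field_probs:
        p = combine(p, fp)
    return p

# Two strongly agreeing fields push the estimate well above the prior:
p = match_probability([0.9, 0.8])
```

Records whose combined probability crosses a threshold become merge candidates, which is what makes the interactive workflow possible: the engine proposes, the human disposes.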

Interactive deduplication? Now that sounds very useful for topic map authoring.

Appropriate that I saw this in a Tweet by Duke‘s author, Lars Marius Garshol.

September 3, 2013

Elastisch 1.3.0-beta2 Is Released

Filed under: Clojure,ElasticSearch — Patrick Durusau @ 6:47 pm

Elastisch 1.3.0-beta2 Is Released

From the post:

Elastisch is a battle-tested, small but feature-rich and well documented Clojure client for ElasticSearch. It supports virtually every ElasticSearch feature and has solid documentation.

Solid documentation. Well, the guides page says “10 minutes” to study Getting Started. And, Getting Started says it will take about “15 minutes to read and study the provided code examples.” No estimate on reading the prose. 😉

Just teasing.

If you are developing or maintaining your Clojure skills, this is a good opportunity to add a popular search engine to your skill set.

August 29, 2013

Parsing arbitrary Text-based Guitar Tab…

Filed under: ElasticSearch,Music,Music Retrieval — Patrick Durusau @ 6:39 pm

RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode” for ElasticSearch by Ryan Robitalle.

Guitar tab is a form of tablature, a music notation that records finger positions.

Surfing just briefly, there appears to be a lot of music available in “tab” format.

Deeply interesting post that will take some time to work through.

It is one of those odd things that may suddenly turn out to be very relevant (or not) in another domain.

Looking forward to spending some time with tablature data.

August 26, 2013

Register To Watch? Let’s Not.

Filed under: ElasticSearch — Patrick Durusau @ 5:55 pm

webinar: Getting Started with Elasticsearch by Drew Raines.

I am sure Drew does a great job in this webinar. Just as I am sure if you really are a search newbie, it would be useful to you.

But let’s all start passing on the “register to watch/download” dance.

If they want to count views or downloads, hell, YouTube does that (without registering).

Many people have given “skip registering to view” advice before and many will in the future.

Mark me down as just one more.

Do blog about your decision and which “register to view” you decided to skip.

If that happens enough times, maybe marketing departments will look elsewhere for spam addresses.

August 22, 2013

You complete me

Filed under: AutoSuggestion,ElasticSearch,Interface Research/Design,Lucene — Patrick Durusau @ 2:03 pm

You complete me by Alexander Reelsen.

From the post:

Effective search is not just about returning relevant results when a user types in a search phrase, it’s also about helping your user to choose the best search phrases. Elasticsearch already has did-you-mean functionality which can correct the user’s spelling after they have searched. Now, we are adding the completion suggester which can make suggestions while-you-type. Giving the user the right search phrase before they have issued their first search makes for happier users and reduced load on your servers.

Warning: The completion suggester Alexander describes may “change/break in future releases.”

Two features that made me read the post were: readability and custom ordering.

Under readability, the example walks you through returning one output for several search completions.

Suggestions don’t have to be presented in TF/IDF relevance order. A weight assigned to the target of a completion controls the ordering of suggestions.
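Putting those two features together, a weighted completion document and the suggest request against it look something like this (field and index names are assumptions on my part; check Alexander’s post for the exact API):

```python
import json

# Sketch of the completion suggester: several inputs map to one readable
# output, and an explicit weight controls suggestion order (not TF/IDF).

song = {
    "name": "Nevermind",
    "suggest": {
        "input": ["Nevermind", "Nirvana"],  # phrases that trigger the suggestion
        "output": "Nirvana - Nevermind",    # the single string shown to the user
        "weight": 34,                       # higher weight ranks earlier
    },
}

# POSTed to /music/_suggest (illustrative path), this asks for completions
# of the prefix "n" against the mapped completion field:
suggest_request = {
    "song-suggest": {
        "text": "n",
        "completion": {"field": "suggest"},
    }
}

body = json.dumps(song)
```

Whether the user types “nev” or “nir,” they are offered the same, readable “Nirvana - Nevermind” entry, ranked by the weight you assigned.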

The post covers several other features and if you are using or considering using Elasticsearch, it is a good read.

duplitector

Filed under: Duplicates,ElasticSearch,Lucene — Patrick Durusau @ 1:17 pm

duplitector by Paweł Rychlik.

From the webpage:

duplitector

A duplicate data detector engine based on Elasticsearch. It’s been successfully used as a proof of concept, piloting a full-blown enterprise solution.

Context

In certain systems we have to deal with lots of low-quality data, containing some typos, malformed or missing fields, erroneous bits of information, sometimes coming from different sources, like careless humans, faulty sensors, multiple external data providers, etc. These kinds of datasets often contain vast numbers of duplicate or similar entries. If this is the case, then these systems might struggle to deal with such unnatural, often unforeseen, conditions. It might, in turn, affect the quality of service delivered by the system.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of various sorts of organizations (e.g. NPO databases). Still, it’s small and generic enough, so that it can be easily adjusted to handle other data schemes or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from IRS, partially intentionally modified, partially made up), so that it’s convenient to test the algorithm’s pieces.

Paweł also points to this article by Andrei Zmievski: Duplicates Detection with ElasticSearch. Andrei merges tags for locations based on their proximity to a particular coordinates.

I am looking forward to the use of indexing engines for deduplication of data in situ as it were. That is without transforming the data into some other format for processing.

