Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 24, 2018

‘Learning to Rank’ (No Unique Feature Name Fail – Update)

Filed under: Artificial Intelligence,ElasticSearch,Ranking,Searching — Patrick Durusau @ 8:02 pm

Elasticsearch ‘Learning to Rank’ Released, Bringing Open Source AI to Search Teams

From the post:

Search experts at OpenSource Connections, the Wikimedia Foundation, and Snagajob, deliver open source cognitive search capabilities to the Elasticsearch community. The open source Learning to Rank plugin allows organizations to control search relevance ranking with machine learning. The plugin is currently delivering search results at Wikipedia and Snagajob, providing significant search quality improvements over legacy solutions.

Learning to Rank lets organizations:

  • Directly optimize sales, conversions and user satisfaction in search
  • Personalize search for users
  • Drive deeper insights from a knowledge base
  • Customize ranking down to complex nuance
  • Avoid the sticker shock & lock-in of a proprietary "cognitive search" product

“Our mission is to empower search teams. This plugin gives teams deep control of ranking, allowing machine learning models to be directly deployed to the search engine for relevance ranking” said Doug Turnbull, author of Relevant Search and CTO, OpenSource Connections.

I need to work through all the documentation and examples but:

Feature Names are Unique

Because some model training libraries refer to features by name, Elasticsearch LTR enforces unique names for each feature. In the example above, we could not add a new user_rating feature without creating an error.

is a warning of what you (and I) are likely to find.

Really? Someone involved in the design thought globally unique feature names were a good idea? Or, at a minimum, didn’t realize they are a very bad idea?

Scope anyone? Either in the programming or topic map sense?

Despite the unique feature name fail, I’m sure ‘Learning to Rank’ will be useful. But not as useful as it could have been.

Doug Turnbull (https://twitter.com/softwaredoug) advises that features are scoped by feature stores, so the correct prose would read: “…LTR enforces unique names for each feature within a feature store.”

No fail, just bad writing.
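For anyone who wants to see that scoping in practice, here is a minimal sketch against the plugin’s REST API as I recall it from the docs. Endpoint shapes may vary by plugin version, and the index and field names are made up:

```python
import requests

ES = "http://localhost:9200"

# Initialize the default feature store. Feature stores scope feature
# names, so another store could define its own, different user_rating.
requests.put(f"{ES}/_ltr")

# Within this store, "user_rating" must be unique; re-adding it to the
# same feature set is what triggers the error the docs warn about.
featureset = {
    "featureset": {
        "features": [{
            "name": "user_rating",
            "params": [],
            "template_language": "mustache",
            "template": {
                "function_score": {
                    "functions": [{"field_value_factor": {
                        "field": "vote_average", "missing": 0}}],
                    "query": {"match_all": {}},
                },
            },
        }]
    }
}
r = requests.post(f"{ES}/_ltr/_featureset/movie_features", json=featureset)
print(r.status_code)
```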

February 14, 2017

We’re Bringing Learning to Rank to Elasticsearch [Merging Properties Query Dependent?]

Filed under: DSL,ElasticSearch,Merging,Search Engines,Searching,Topic Maps — Patrick Durusau @ 8:26 pm

We’re Bringing Learning to Rank to Elasticsearch.

From the post:

It’s no secret that machine learning is revolutionizing many industries. This is equally true in search, where companies exhaust themselves capturing nuance through manually tuned search relevance. Mature search organizations want to get past the “good enough” of manual tuning to build smarter, self-learning search systems.

That’s why we’re excited to release our Elasticsearch Learning to Rank Plugin. What is learning to rank? With learning to rank, a team trains a machine learning model to learn what users deem relevant.

When implementing Learning to Rank you need to:

  1. Measure what users deem relevant through analytics, to build a judgment list grading documents as exactly relevant, moderately relevant, not relevant, for queries
  2. Hypothesize which features might help predict relevance such as TF*IDF of specific field matches, recency, personalization for the searching user, etc.
  3. Train a model that can accurately map features to a relevance score
  4. Deploy the model to your search infrastructure, using it to rank search results in production

Don’t fool yourself: underneath each of these steps lie complex, hard technical and non-technical problems. There’s still no silver bullet. As we mention in Relevant Search, manual tuning of search results comes with many of the same challenges as a good learning to rank solution. We’ll have more to say about the many infrastructure, technical, and non-technical challenges of mature learning to rank solutions in future blog posts.

… (emphasis in original)
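To put some flesh on steps 1 through 3 above, here is a deliberately tiny sketch: a toy judgment list, two hypothetical features (a title TF*IDF score and a recency score), and a least-squares fit standing in for the training step. Real pipelines use RankLib, XGBoost and the like, so read this as the shape of the problem, not the tooling:

```python
import numpy as np

# Toy judgment list: (grade, tf_idf_title, recency), where grades are
# 2 = exactly relevant, 1 = moderately relevant, 0 = not relevant.
judgments = [
    (2, 9.1, 0.9),
    (2, 7.4, 0.7),
    (1, 4.2, 0.8),
    (0, 0.3, 0.1),
]

y = np.array([g for g, *_ in judgments], dtype=float)
X = np.array([f for _, *f in judgments], dtype=float)

# Step 3, reduced to least squares: relevance ~ w1*tf_idf + w2*recency.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("learned weights:", w)

def score(tf_idf_title: float, recency: float) -> float:
    """Step 4 in miniature: map features to a relevance score."""
    return float(np.dot(w, [tf_idf_title, recency]))

print("fresh, well-matched doc:", score(8.0, 0.95))
```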

A great post as always but of particular interest for topic map fans is this passage:


Many of these features aren’t static properties of the documents in the search engine. Instead they are query dependent – they measure some relationship between the user or their query and a document. And to readers of Relevant Search, this is what we term signals in that book.
… (emphasis in original)

Do you read this as suggesting the merging exhibited to users should depend upon their queries?

That two or more users, with different query histories could (should?) get different merged results from the same topic map?

Now that’s an interesting suggestion!
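To make “query dependent” concrete: in the Learning to Rank plugin such signals become templated queries with parameters supplied at search time. A sketch, with shapes recalled from the plugin docs (treat the details as indicative):

```python
import json

# A query-dependent feature: it measures a relationship between the
# user's query and a document, not a static document property.
query_dependent_feature = {
    "name": "title_match",
    "params": ["keywords"],                  # filled in with each search
    "template_language": "mustache",
    "template": {"match": {"title": "{{keywords}}"}},
}

print(json.dumps(query_dependent_feature, indent=2))
```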

Enjoy this post and follow the blog for more of same.

(I have a copy of Relevant Search waiting to be read so I had better get to it!)

November 12, 2016

10 Reasons to Choose Apache Solr Over Elasticsearch

Filed under: ElasticSearch,Lucene,LucidWorks,Solr — Patrick Durusau @ 9:24 pm

10 Reasons to Choose Apache Solr Over Elasticsearch by Grant Ingersoll.

From the post:

Hey, clickbait title aside, I get it, Elasticsearch has been growing. Kudos to the project for tapping into a new set of search users and use cases like logging, where they are making inroads against the likes of Splunk in the IT logging market. However, there is another open source, Lucene-based search engine out there that is quite mature, more widely deployed and still growing, granted without a huge marketing budget behind it: Apache Solr. Despite what others would have you believe, Solr is quite alive and well, thank you very much. And I’m not just saying that because I make a living off of Solr (which I’m happy to declare up front), but because the facts support it.

For instance, in the Google Trends arena (see below or try the query yourself), Solr continues to hold a steady recurring level of interest even while Elasticsearch has grown. Dissection of these trends (which are admittedly easy to game, so I’ve tried to keep them simple) shows Elasticsearch is strongest in Europe and Russia while Solr is strongest in the US, China, India, Brazil and Australia. On the DB-Engines ranking site, which factors in Google trends and other job/social metrics, you’ll see both Elasticsearch and Solr are top 15 projects, beating out a number of other databases like HBase and Hive. Solr’s mailing list is quite active (~280 msgs per week compared to ~170 per week for Elasticsearch) and it continues to show strong download numbers via Maven repository statistics. Solr as a codebase continues to innovate (which I’ll cover below) as well as provide regular, stable releases. Finally, Lucene/Solr Revolution, the conference my company puts on every year, continues to set record attendance numbers.

Not so much an “us versus them” piece as tantalizing facts about Solr 6 that will leave you wanting to know more.

Grant invites you to explore the Solr Quick Start if one or more of his ten points capture your interest.

Timely because with a new presidential administration about to take over in Washington, D.C., there will be:

  • Data leaks as agencies vie with each other
  • Data leaks due to inexperienced staffers
  • Data leaks to damage one side or in retaliation
  • Data leaks from foundations and corporations
  • others

If 2016 was the year of “false news” then 2017 is going to be the year of the “government data leak.”

Left unexplored, except for headline-suitable quips found with grep, leaks may not be significant.

On the other hand, using Solr 6 can enable you to weave a coherent narrative from diverse resources.

But you will have to learn Solr 6 to know for sure.

Enjoy!

August 9, 2016

ACHE Focused Crawler

Filed under: ElasticSearch,Record Linkage,Webcrawler — Patrick Durusau @ 4:51 pm

ACHE Focused Crawler

From the webpage:

ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects Web pages that satisfy some specific property. ACHE differs from other crawlers in the sense that it includes page classifiers that allow it to distinguish between relevant and irrelevant pages in a given domain. The page classifier can range from a simple regular expression (that matches every page containing a specific word, for example) to a sophisticated machine-learned classification model. ACHE also includes link classifiers, which allow it to decide the best order in which links should be downloaded, finding relevant content on the web as fast as possible while not wasting resources downloading irrelevant content.


The inclusion of machine learning (Weka) and robust indexing (ElasticSearch) means this will take more than a day or two to explore.

Certainly well suited to exploring all the web accessible resources on narrow enough topics.
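If you want a feel for the page classifier spectrum before diving into ACHE itself, here is a toy relevance classifier in the spirit of the description above. This is not ACHE’s actual API, just an illustration of the simplest (regex) end of the spectrum:

```python
import re

class RegexPageClassifier:
    """Relevant if the page contains a specific pattern."""
    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern, re.IGNORECASE)

    def is_relevant(self, html: str) -> bool:
        return bool(self.pattern.search(html))

classifier = RegexPageClassifier(r"\btopic maps?\b")
print(classifier.is_relevant("<p>An intro to Topic Maps</p>"))  # True
print(classifier.is_relevant("<p>Cooking with garlic</p>"))     # False
```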

I was thinking about doing a “9 Million Pages of Donald Trump,” (think Nine Billion Names of God) but a quick sanity check showed there are already more than 230 million such pages.

Perhaps by the election I could produce “9 Million Pages With Favorable Comments About Donald Trump.” Perhaps if I don’t dedupe the pages found by searching it would go that high.

Other topics for comprehensive web searching come to mind?

PS: The many names of record linkage come to mind. I think I have thirty (30) or so.

August 3, 2016

OnionRunner, ElasticSearch & Maltego

Filed under: ElasticSearch,Graphs,OnionRunner,Tor,Visualization — Patrick Durusau @ 2:21 pm

OnionRunner, ElasticSearch & Maltego by Adam Maxwell.

From the post:

Last week Justin Seitz over at automatingosint.com released OnionRunner which is basically a python wrapper (because Python is awesome) for the OnionScan tool (https://github.com/s-rah/onionscan).

At the bottom of Justin’s blog post he wrote this:

For bonus points you can also push those JSON files into Elasticsearch (or modify onionrunner.py to do so on the fly) and analyze the results using Kibana!

Always being up for a challenge, I’ve done just that. The onionrunner.py script outputs each scan result as a JSON file, and you have two options for loading this into ElasticSearch. You can either load your results after you’ve run a scan or you can load them into ElasticSearch as a scan runs. Now this might sound scary but it’s not; let’s tackle each option separately.
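As a sketch of the first option (loading results after a scan finishes), assuming a directory of OnionScan JSON output and a local ElasticSearch instance, with the index and directory names my own:

```python
import json
import pathlib
import requests

ES = "http://localhost:9200"

# Walk the OnionScan output directory and index each JSON result.
# (The URL uses the old index/type/id addressing current when this
# was written; adjust for modern Elasticsearch.)
for path in pathlib.Path("onionscan_results").glob("*.json"):
    with path.open() as f:
        doc = json.load(f)
    requests.put(f"{ES}/onions/scan/{path.stem}", json=doc)
```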

A great enhancement to Justin’s original OnionRunner!

You will need a version of Maltego to perform the visualization as described. Not a bad idea to become familiar with Maltego in general.

Data is just data, until it is analyzed.

Enjoy!

July 6, 2016

The Iraq Inquiry (Chilcot Report) [4.5x longer than War and Peace]

Filed under: ElasticSearch,Lucene,Search Algorithms,Search Interface,Solr,Topic Maps — Patrick Durusau @ 2:41 pm

The Iraq Inquiry

To give a rough sense of the depth of the Chilcot Report, the executive summary runs 150 pages. The report appears in twelve (12) volumes, not including video testimony, witness transcripts, documentary evidence, contributions and the like.

Cory Doctorow reports a Guardian project to crowd source collecting facts from the 2.6 million word report. The Guardian observes the Chilcot report is “…almost four-and-a-half times as long as War and Peace.”

Manual reading of the Chilcot report is doable, but unlikely to yield all of the connections that exist between participants, witnesses, evidence, etc.

How would you go about making the Chilcot report and its supporting evidence more amenable to navigation and analysis?

The Report

The Evidence

Other Material

Unfortunately, sections within volumes were not numbered according to their volume. In other words, volume 2 starts with section 3.3 and ends with 3.5, volume 4 only contains sections beginning with “4.”, and volume 5 starts with section 5 but also contains sections 6.1 and 6.2. Nothing can be done about it except to be aware that section numbers don’t correspond to volume numbers.

April 17, 2016

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling by Greg Brown.

From the post:

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.
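Greg’s recipes are UNIX pipelines; for comparison, here is a rough Python equivalent of a bigram counter over the same one-response-per-line data.txt:

```python
from collections import Counter

# Count bigrams across all responses, one response per line.
bigrams = Counter()
with open("data.txt") as f:
    for line in f:
        tokens = line.lower().split()
        bigrams.update(zip(tokens, tokens[1:]))

# Top 10 bigrams, most frequent first.
for pair, count in bigrams.most_common(10):
    print(count, " ".join(pair))
```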

My favorite comment on this post was a reader who extended the tri-gram generator to build a hexagram!

If that sounds unreasonable, you haven’t read very many government reports. 😉

While you are at Greg’s blog, notice a number of useful posts on Elasticsearch.

June 2, 2015

Side by Side with Elasticsearch and Solr: Performance and Scalability

Filed under: ElasticSearch,Solr — Patrick Durusau @ 3:43 pm

Side by Side with Elasticsearch and Solr: Performance and Scalability by Mick Emmett.

From the post:

Back by popular demand! Sematext engineers Radu Gheorghe and Rafal Kuc returned to Berlin Buzzwords on Tuesday, June 2, with the second installment of their “Side by Side with Elasticsearch and Solr” talk. (You can check out Part 1 here.)

Elasticsearch and Solr Performance and Scalability

This brand new talk — which included a live demo, a video demo and slides — dove deeper into how Elasticsearch and Solr scale and perform. And, of course, they took into account all the goodies that came with these search platforms since last year. Radu and Rafal showed attendees how to tune Elasticsearch and Solr for two common use-cases: logging and product search. Then they showed what numbers they got after tuning. There was also some sharing of best practices for scaling out massive Elasticsearch and Solr clusters; for example, how to divide data into shards and indices/collections that account for growth, when to use routing, and how to make sure that coordinated nodes don’t become unresponsive.

Video is coming soon, and in the meantime please enjoy the slides:

After you see the presentation and slides (parts 1 and 2), you will understand the “popular demand” for these authors.

The best comparison of Elasticsearch and Solr that you will see this year. (Unless the presenters update their presentation before the end of the year.)

Relevant Search

Filed under: ElasticSearch,Relevance,Search Engines,Solr — Patrick Durusau @ 3:21 pm

Relevant Search – With examples using Elasticsearch and Solr by Doug Turnbull and John Berryman.

From the webpage:

Users expect search to be simple: They enter a few terms and expect perfectly-organized, relevant results instantly. But behind this simple user experience, complex machinery is at work. Whether using Solr, Elasticsearch, or another search technology, the solution is never one size fits all. Returning the right search results requires conveying domain knowledge and business rules in the search engine’s data structures, text analytics, and results ranking capabilities.

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. Relevant Search walks through several real-world problems using a cohesive philosophy that combines text analysis, query building, and score shaping to express business ranking rules to the search engine. It outlines how to guide the engineering process by monitoring search user behavior and shifting the enterprise to a search-first culture focused on humans, not computers. You’ll see how the search engine provides a deeply pluggable platform for integrating search ranking with machine learning, ontologies, personalization, domain-specific expertise, and other enriching sources.

  • Creating a foundation for Lucene-based search (Solr, Elasticsearch) relevance internals
  • Bridging the field of Information Retrieval and real-world search problems
  • Building your toolbelt for relevance work
  • Solving search ranking problems by combining text analysis, query building, and score shaping
  • Providing users relevance feedback so that they can better interact with search
  • Integrating test-driven relevance techniques based on A/B testing and content expertise
  • Exploring advanced relevance solutions through custom plug-ins and machine learning

Now imagine relevancy searching where a topic map contains multiple subject identifications for a single subject, from different perspectives.

Relevant Search is in early release but the sooner you participate, the fewer errata there will be in the final version.

May 18, 2015

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic

Filed under: ElasticSearch,MarkLogic,MongoDB — Patrick Durusau @ 2:21 pm

A Virtual Database between MongoDB, ElasticSearch, and MarkLogic by William Candillon.

From the post:

Virtual Databases enable developers to write applications regardless of the underlying database technologies. We recently updated a database infrastructure from MongoDB and ElasticSearch to MarkLogic without touching the codebase.

We just flipped a switch. We updated the database infrastructure of an application (20k LOC) from MongoDB and Elasticsearch to MarkLogic without changing a single line of code.

Earlier this year, we published a tutorial that shows how the 28msec query technology can enable developers to write applications regardless of the underlying database technology. Recently, we had the opportunity to put it to the test on both a real world use case and a substantial codebase.

At 28msec, we have designed [1] and implemented [2] an open source modern data warehouse called CellStore. Whereas traditional data warehousing solutions can only support hundreds of fixed dimensions and thus need to ETL the data to analyze, cell stores support an unbounded number of dimensions. Our implementation of the cell store paradigm is around 20k lines of JSONiq queries. Originally the implementation was running on top of MongoDB and Elasticsearch.
….

[1] http://arxiv.org/pdf/1410.0600.pdf
[2] http://github.com/28msec/cellstore

Impressive work and it merits a separate post on the underlying technology, CellStore.

April 16, 2015

An Inside Look at the Components of a Recommendation Engine

Filed under: ElasticSearch,Mahout,MapR,Recommendation — Patrick Durusau @ 7:01 pm

An Inside Look at the Components of a Recommendation Engine by Carol McDonald.

From the post:

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.
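The heart of that design is item co-occurrence: which movies the same users interacted with. A toy sketch of the idea follows; the real pipeline uses Mahout’s log-likelihood-weighted item similarity rather than raw counts, and indexes the resulting indicators into Elasticsearch:

```python
from collections import defaultdict
from itertools import combinations

ratings = [  # (user, movie) interactions
    ("u1", "alien"), ("u1", "blade_runner"),
    ("u2", "alien"), ("u2", "blade_runner"), ("u2", "up"),
    ("u3", "up"), ("u3", "toy_story"),
]

# Group each user's movies, then count pairwise co-occurrence.
by_user = defaultdict(set)
for user, movie in ratings:
    by_user[user].add(movie)

cooccur = defaultdict(lambda: defaultdict(int))
for movies in by_user.values():
    for a, b in combinations(sorted(movies), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

# "Indicators" per movie: what a search engine would match against
# a user's history at query time to produce recommendations.
for movie, others in cooccur.items():
    indicators = sorted(others, key=others.get, reverse=True)[:2]
    print(movie, "->", indicators)
```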

There are two reasons to read this post:

First, you really don’t know how recommendation engines work. Well, better late than never.

Second, you want an example of how to write an excellent explanation of recommendation engines, hopefully to replicate it for other software.

This is an example of an excellent explanation of recommendation engines but whether you can replicate it for other software remains to be seen. 😉

Still, reading excellent explanations is a first step towards authoring excellent explanations.

Good luck!

November 22, 2014

Solr vs. Elasticsearch – Case by Case

Filed under: ElasticSearch,Solr — Patrick Durusau @ 8:26 pm

Solr vs. Elasticsearch – Case by Case by Alexandre Rafalovitch.

From the description:

A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is coming later.

Just the highlights and those from an admitted ElasticSearch user.

One very telling piece of advice for Solr:

Solr – needs to buckle down and focus on the onboarding experience

Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Just in case you don’t know the term: onboarding.

And SolrCluster podcast of October 24, 2014: Solr Usability with Steve Rowe & Tim Potter

From the description:

In this episode, Lucene/Solr Committers Steve Rowe and Tim Potter join the SolrCluster team to discuss how Lucidworks and the community are making changes and improvements to Solr to increase usability and add ease to the getting started experience. Steve and Tim discuss new features such as data-driven schema, start-up scripts, launching SolrCloud, and more. (length 33:29)

Paraphrasing:

…focusing on the first five minutes of the Solr experience…hard to explore if you can’t get it started…can be a little bit scary at first…has lacked a focus on accessibility by ordinary users…need usability addressed throughout the lifecycle of the product…want to improve kicking the tires on Solr…lowering mental barriers for new users…do now have start scripts…bakes in a lot of best practices…scripts for SolrCloud…hide all the weird stuff…data driven schemas…throw data at Solr and it creates an index without creating a schema…working on improving tutorials and documentation…moving towards consolidating information…will include use cases…walk throughs…will point to different data sets…making it easier to query Solr and understand the query URLs…bringing full collections API support to the admin UI…Rest interface…components report possible configuration…plus a form to interact with it directly…forms that render in the browser…will have a continued focus on usability…not a one time push…new users need to submit any problems they encounter….

Great podcast!

Very encouraging on issues of documentation and accessibility in Solr.

November 21, 2014

Big data in minutes with the ELK Stack

Filed under: ElasticSearch,Kibana,logstash — Patrick Durusau @ 8:36 pm

Big data in minutes with the ELK Stack by Philippe Creux.

From the post:

We’ve built a data analysis and dashboarding infrastructure for one of our clients over the past few weeks. They collect about 10 million data points a day. Yes, that’s big data.

My highest priority was to allow them to browse the data they collect so that they can ensure that the data points are consistent and contain all the attributes required to generate the reports and dashboards they need.

I chose to give the ELK stack a try: ElasticSearch, logstash and Kibana.

Is it just me or does processing “big data” seem to have gotten easier over the past several years?

But however easy or hard the processing, the value-add question is what do we know post data processing that we didn’t know before?

October 25, 2014

Building Scalable Search from Scratch with ElasticSearch

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 5:46 pm

Building Scalable Search from Scratch with ElasticSearch by Ram Viswanadha.

From the post:

1 Introduction

Savvy is an online community for the world’s product enthusiasts. Our communities are the product trendsetters that the rest of the world follows. Across the site, our users are able to compare products, ask and answer product questions, share product reviews, and generally share their product interests with one another. Savvy1.com boasts a vibrant community that saves products on the site at the rate of 1 product every second. We wanted to provide a search bar that can search across various entities in the system – users, products, coupons, collections, etc. – and return the results in a timely fashion.

2 Requirements

The search server should satisfy the following requirements:

  1. Full Text Search: The ability to not only return documents that contain the exact keywords, but also documents that contain words that are related or relevant to the keywords.
  2. Clustering: The ability to distribute data across multiple nodes for load balancing and efficient searching.
  3. Horizontal Scalability: The ability to increase the capacity of the cluster by adding more nodes.
  4. Read and Write Efficiency: Since our application is both read and write heavy, we need a system that allows for high write loads and efficient read times on heavy read loads.
  5. Fault Tolerant: The loss of any node in the cluster should not affect the stability of the cluster.
  6. REST API with JSON: The server should support a REST API using JSON for input and output.

At the time, we looked at Sphinx, Solr and ElasticSearch. The only system that satisfied all of the above requirements was ElasticSearch, and — to sweeten the deal — ElasticSearch provided a way to efficiently ingest and index data in our MongoDB database via the River API so we could get up and running quickly.
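Several of the numbered requirements come down to plain index settings in Elasticsearch: shards spread data across nodes (requirements 2 and 3), replicas keep the cluster stable when a node dies (requirement 5), and it is all driven over the JSON REST API (requirement 6). A minimal sketch, with the index name invented:

```python
import requests

settings = {
    "settings": {
        "number_of_shards": 5,    # data distributed for load balancing
        "number_of_replicas": 1,  # each shard copied once for fault tolerance
    }
}
r = requests.put("http://localhost:9200/savvy_search", json=settings)
print(r.json())
```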

If you need an outline for building a basic ElasticSearch system, this is it!

It has the advantage of introducing you to a number of other web technologies that will be handy with ElasticSearch.

Enjoy!

September 6, 2014

Elastic Search: The Definitive Guide

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 6:52 pm

Elastic Search: The Definitive Guide by Clinton Gormley and Zachary Tong.

From “why we wrote this book:”

We wrote this book because Elasticsearch needs a narrative. The existing reference documentation is excellent… as long as you know what you are looking for. It assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics.

This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters.

We have taken a problem based approach: this is the problem, how do I solve it, and what are the trade-offs of the alternative solutions? We start with the basics and each chapter builds on the preceding ones, providing practical examples and explaining the theory where necessary.

The existing reference documentation explains how to use features. We want this book to explain why and when to use various features.

An important guide/reference for Elastic Search but the “why” for this book is important as well.

Reference documentation is absolutely essential but so is documentation that eases the learning curve in order to promote adoption of software or a technology.

Read this both for Elastic Search as well as one model for writing a “why” and “when” book for other technologies.

August 25, 2014

Introducing Splainer…

Filed under: ElasticSearch,Lucene,Search Analytics,Search Behavior,Solr — Patrick Durusau @ 3:10 pm

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

Splainer is a step towards addressing two problems:

From the post:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query parameters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a development history, and more — give Quepid a spin for free!

Improving the information content of the tokens you are searching is another way to improve search results.

August 3, 2014

Side by side with Elasticsearch and Solr

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 4:26 pm

Side by side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe.

Abstract:

We all know that Solr and Elasticsearch are different, but what those differences are and which solution is the best fit for a particular use case is a frequent question. We will try to make those differences clear, not by showing slides and comparing them, but by showing an online demo of both Elasticsearch and Solr:

  • Set up and start both search servers. See what you need to prepare and launch Solr and Elasticsearch
  • Index data right after the server was started using the “schemaless” mode
  • Create index structure and modify it using the provided API
  • Explore different query use cases
  • Scale by adding and removing nodes from the cluster, creating indices and managing shards. See how that affects data indexing and querying.
  • Monitor and administer clusters. See what metrics can be seen out of the box, how to get them and what tools can provide you with the graphical view of all the goodies that each search server can provide.

Slides

Very impressive split-screen comparison of Elasticsearch and Solr by two presenters on the same data set.

I first saw this at: Side-By-Side with Solr and Elasticsearch : A Comparison by Charles Ditzel.

August 1, 2014

Elasticsearch 1.3.1 released

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 1:50 pm

Elasticsearch 1.3.1 released by Clinton Gormley.

From the post:

Today, we are happy to announce the bugfix release of Elasticsearch 1.3.1, based on Lucene 4.9. You can download it and read the full changes list here: Elasticsearch 1.3.1.

Enjoy!

July 30, 2014

Scrapy and Elasticsearch

Filed under: ElasticSearch,Python,Web Scrapers — Patrick Durusau @ 9:56 am

Scrapy and Elasticsearch by Florian Hopf.

From the post:

On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch, the slides are here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.

Web Crawling

You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.

The Example

In this post we are looking at a rather artificial example: Crawling the meetup.com page for recent meetups to make them available for search. Why artificial? Because meetup.com has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well. (emphasis in original)

Not everything you need to know about Scrapy but enough to get you interested.
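For a taste of what such a spider looks like, a minimal sketch; the URL and selectors are illustrative, not meetup.com’s actual markup:

```python
import scrapy

class MeetupSpider(scrapy.Spider):
    name = "meetups"
    start_urls = ["https://www.meetup.com/find/"]

    def parse(self, response):
        for event in response.css("li.event"):
            # Each yielded dict can be pushed to Elasticsearch by an
            # item pipeline (e.g., a bulk indexer) for search later.
            yield {
                "title": event.css("a::text").get(),
                "url": event.css("a::attr(href)").get(),
            }
```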

APIs for data are on the up swing but web scrapers will be relevant to data mining for decades to come.

June 22, 2014

Call me maybe: Elasticsearch

Filed under: ElasticSearch — Patrick Durusau @ 8:00 pm

Call me maybe: Elasticsearch by Kyle Kingsbury.

Kyle attempts to answer the question: How safe is data in Elasticsearch?.

I say “attempts” because Kyle does a remarkable job of documenting unanswered questions and conditions that can lead to data loss with Elasticsearch. But you will find there is no final answer to the safety question, despite deep analysis and research.

Kyle is an Elasticsearch user and does provide some guidance on making your Elasticsearch installation safer. Not safe, but safer.

Must reading for all serious users of Elasticsearch.

I first saw this in a tweet by Andrew Purtell.

June 16, 2014

You complete me

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 6:50 pm

You complete me by Alexander Reelsen.

From the post:

Effective search is not just about returning relevant results when a user types in a search phrase, it’s also about helping your user to choose the best search phrases. Elasticsearch already has did-you-mean functionality which can correct the user’s spelling after they have searched. Now, we are adding the completion suggester which can make suggestions while-you-type. Giving the user the right search phrase before they have issued their first search makes for happier users and reduced load on your servers.

In the context of search you can suggest search phrases. (Alexander’s post is a bit dated so see: the Elasticsearch documentation as well.)
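For reference, the completion suggester has two halves: a completion field in the mapping, and a suggest request issued per keystroke. A sketch follows; the request shapes here match later Elasticsearch versions and have changed over time, so treat them as indicative:

```python
import requests

ES = "http://localhost:9200"

# Mapping with a completion field.
mapping = {"mappings": {"properties": {
    "suggest": {"type": "completion"}
}}}
requests.put(f"{ES}/phrases", json=mapping)

# Index a phrase and make it searchable.
requests.put(f"{ES}/phrases/_doc/1",
             json={"suggest": "elasticsearch completion"})
requests.post(f"{ES}/phrases/_refresh")

# Suggest-as-you-type: prefix so far -> candidate phrases.
query = {"suggest": {"phrase-suggest": {
    "prefix": "ela",
    "completion": {"field": "suggest"},
}}}
print(requests.post(f"{ES}/phrases/_search", json=query).json())
```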

How much further can you go with suggestions? Search syntax?

June 15, 2014

Analyzing 1.2 Million Network Packets…

Filed under: ElasticSearch,Hadoop,HBase,Hive,Hortonworks,Kafka,Storm — Patrick Durusau @ 4:19 pm

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.

June 12, 2014

SecureGraph

Filed under: Accumulo,Blueprints,ElasticSearch,Graphs,SecureGraph — Patrick Durusau @ 6:56 pm

SecureGraph

From the webpage:

SecureGraph is an API to manipulate graphs, similar to Blueprints. Unlike Blueprints, every SecureGraph method requires authorizations and visibilities. SecureGraph also supports multivalued properties as well as property metadata.

The SecureGraph API was designed to be generic, allowing for multiple implementations. The only implementation provided currently is built on top of Apache Accumulo for data storage and Elasticsearch for indexing.

According to the readme file, definitely “beta” software, but interesting software nonetheless.

Are you using insecure graph software?

Might be time to find out!

I first saw this in a tweet by Marko A. Rodriguez.

June 11, 2014

Elasticsearch, RethinkDB and the Semantic Web

Filed under: Biomedical,ElasticSearch,RethinkDB,Semantic Web — Patrick Durusau @ 1:31 pm

Elasticsearch, RethinkDB and the Semantic Web by Michel Dumontier.

From the post:

Everyone is handling big data nowadays, or at least, so it seems. Hadoop is very popular among the Big Data wranglers and it is often mentioned as the de facto solution. I have dabbled in working with Hadoop over the past years and found that: yes, it is very suitable for certain kinds of data mining/analysis and for those it provides high data crunching throughput, but, no, it cannot answer queries quickly and you cannot port every algorithm into Hadoop’s map/reduce paradigm. I have since turned to Elasticsearch and more recently to RethinkDB. It is a joy to work with the latter and it performs faceting just as well as Elasticsearch for the benchmark data that I used, but still permits me to carry out more complex data mining and analysis too.

The story here describes the data that I am working with a bit, it shows how it can be turned into a data format that both Elasticsearch and RethinkDB understand, how the data is being loaded and indexed, and finally, how to get some facets out of the systems.

Interesting post on biomedical data in RDF N-Quads format which is converted into JSON and then processed with ElasticSearch and RethinkDB.
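The conversion step can be sketched roughly as below. Real N-Quads parsing (escapes, typed literals, blank nodes) needs a proper library such as rdflib; this just shows the N-Quads-to-JSON-document shape, with invented URIs:

```python
import json
from collections import defaultdict

nquads = [
    '<http://ex.org/gene1> <http://ex.org/name> "BRCA1" <http://ex.org/g> .',
    '<http://ex.org/gene1> <http://ex.org/chromosome> "17" <http://ex.org/g> .',
]

# Fold quads into one JSON document per subject, ready to index.
docs = defaultdict(dict)
for line in nquads:
    subj, pred, obj, _graph = line.rstrip(" .").split(" ", 3)
    docs[subj.strip("<>")][pred.strip("<>")] = obj.strip('"')

print(json.dumps(docs, indent=2))
```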

I first saw this in a tweet by Joachim Baran.

May 24, 2014

…Data Analytics Hackathon

Filed under: Analytics,Data Analysis,ElasticSearch — Patrick Durusau @ 4:27 pm

Elasticsearch Teams up with MIT Sloan for Data Analytics Hackathon by Sejal Korenromp.

From the post:

Following from the success and popularity of the Hopper Hackathon we participated in late last year, last week we sponsored the MIT Sloan Data Analytics Club Hackathon for our latest offering to Elasticsearch aficionados. More than 50 software engineers, business students and other open source software enthusiasts signed up to participate, and on a Saturday to boot! The full day’s festivities included access to a huge storage and computing cluster, and everyone was set free to create something awesome using Elasticsearch.

Hacks from the finalists:

  • Quimbly – A Digital Library
  • Brand Sentiment Analysis
  • Conference Data
  • Twitter based sentiment analyzer
  • Statistics on Movies and Wikipedia

See Sejal’s post for the details of each hack and the winner.

I noticed several very good ideas in these hacks, no doubt you will notice even more.

Enjoy!

Elasticsearch 1.2.0 and 1.1.2 released

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 2:59 pm

Elasticsearch 1.2.0 and 1.1.2 released by Clinton Gormley.

From the post:

Today, we are happy to announce the release of Elasticsearch 1.2.0, based on Lucene 4.8.1, along with a bug fix release Elasticsearch 1.1.2.

You can download them and read the full change lists here:

Elasticsearch 1.2.0 is a bumper release, containing over 300 new features, enhancements, and bug fixes. You can see the full changes list in the Elasticsearch 1.2.0 release notes, but we will highlight some of the important ones below:

Highlights of the more important changes for Elasticsearch 1.2.0:

  • Java 7 required
  • dynamic scripting disabled by default
  • field data and filter caches
  • gateways removed
  • indexing and merging
  • aggregations
  • context suggester
  • improved deep scrolling
  • field value factor

See Clinton’s post or the release notes for more complete coverage. (Aggregation looks particularly interesting.)

May 17, 2014

Building a Recipe Search Site…

Filed under: ElasticSearch,Lucene,Search Engines,Solr — Patrick Durusau @ 4:32 pm

Building a Recipe Search Site with Angular and Elasticsearch by Adam Bard.

From the post:

Have you ever wanted to build a search feature into an application? In the old days, you might have found yourself wrangling with Solr, or building your own search service on top of Lucene — if you were lucky. But, since 2010, there’s been an easier way: Elasticsearch.

Elasticsearch is an open-source storage engine built on Lucene. It’s more than a search engine; it’s a true document store, albeit one emphasizing search performance over consistency or durability. This means that, for many applications, you can use Elasticsearch as your entire backend. Applications such as…

Think of this as a snapshot of the capabilities of most search solutions.

Which makes this a great baseline for answering the question: What does your app do that Elasticsearch + Angular cannot?

That’s a serious question.

Responses that don’t count include:

  1. My app is written in the Linear B programming language.
  2. My app uses a Post-Pre-NOSQL DB engine.
  3. My app will bring freedom and health to the WWW.
  4. (insert your reason)

You can say all those things if you like, but the convincing point for users is going to be exceeding their expectations about current solutions.

Do the best you can with Elasticsearch and Angular and use that as your basepoint for comparison.

May 13, 2014

Choosing a fast unique identifier (UUID) for Lucene

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:44 am

Choosing a fast unique identifier (UUID) for Lucene by Michael McCandless.

From the post:

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java’s UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.
….

Excellent tips for creating identifiers for Lucene! Complete with tests and an explanation for the possible choices.
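This is not Lucene’s code, but a sketch of why the post’s winner (Flake-style ids) beats version 4 UUIDs: consecutive ids share a long, ever-increasing prefix, keeping lookups in hot blocks of the terms dictionary, while v4 ids scatter uniformly across it:

```python
import itertools
import os
import time
import uuid

_counter = itertools.count()
_node = os.urandom(6).hex()  # stand-in for a MAC address

def flake_like_id() -> str:
    """Timestamp-prefixed id: roughly sequential across calls."""
    return f"{int(time.time() * 1000):013x}-{_node}-{next(_counter):06x}"

print("flake-like:", flake_like_id())
print("flake-like:", flake_like_id())   # shares a long common prefix
print("uuid v4:  ", uuid.uuid4())       # no shared prefix at all
```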

Enjoy!

May 10, 2014

Parameterizing Queries in Solr and Elasticsearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 6:19 pm

Parameterizing Queries in Solr and Elasticsearch by Rafał Kuć.

From the post:

We all know how good it is to have abstraction layers in software we create. We tend to abstract implementation from the method contracts using interfaces, we use n-tier architectures so that we can abstract and divide different system layers from each other. This is very good – when we change one piece, we don’t need to touch the other parts that only knew about method contracts, API’s, etc. Why not do the same with search queries? Can we even do that in Elasticsearch and Solr? We can and I’ll show you how to do that.

The problem

Imagine that we have a query, a complicated one, with boosts, sorts, facets and so on. However, in most cases the query is pretty static when it comes to its structure, and the only things that change are one of the filters in the query (actually a filter value) and the query entered by the user. I guess such a situation could ring a bell for anyone who has developed a search application. Of course we can include the whole query in the application itself and reuse it. But in that case, changing the boosts, for example, requires us to redeploy the application or a configuration file. And if more than a single application uses the same query, then we need to change them all.

What if we could make the change on the search server side only and let application pass the necessary data only? That would be nice, but it requires us to do some work on the search server side.

For the purpose of the blog post, let’s assume that we want to have a query that:

  • searches for documents with terms entered by the user,
  • limits the searches to a given category,
  • displays facet results for the price ranges

This is a simple example, so that the queries are easy to understand. So, in the perfect world we would only need to provide user query and category identifier to a search engine.
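One way to get that abstraction in Elasticsearch is a stored search template: the static query structure lives server-side and the application passes only the user query and category. A sketch with invented names (Solr’s analogue would be request-handler defaults in solrconfig.xml); endpoint shapes follow later Elasticsearch versions:

```python
import requests

ES = "http://localhost:9200"

# Store the static structure once, server-side.
template = {"script": {"lang": "mustache", "source": {
    "query": {"bool": {
        "must": {"match": {"name": "{{user_query}}"}},
        "filter": {"term": {"category": "{{category}}"}},
    }},
    "aggs": {"prices": {"range": {"field": "price", "ranges": [
        {"to": 50}, {"from": 50, "to": 100}, {"from": 100},
    ]}}},
}}}
requests.post(f"{ES}/_scripts/product_search", json=template)

# The application now sends only the parameters.
search = {"id": "product_search",
          "params": {"user_query": "laptop", "category": "electronics"}}
print(requests.post(f"{ES}/products/_search/template", json=search).json())
```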

It is encouraging to see someone give solutions to the same search problem from Solr and Elasticsearch perspectives.

Not to mention that I think you will find this very useful.

April 9, 2014

Revealing the Uncommonly Common…

Filed under: Algorithms,ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:34 pm

Revealing the Uncommonly Common with Elasticsearch by Mark Harwood.

From the summary:

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly tagged movies and the UK’s most unexpected hotspot for weapon possession.
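Judging from the title, the machinery behind this is Elasticsearch’s significant_terms aggregation, which surfaces terms far more frequent in a foreground set (say, weapon-possession incidents) than in the background index. A sketch with invented index and field names:

```python
import requests

# Foreground: documents matching the query. Background: the whole
# index. significant_terms ranks towns that are "uncommonly common"
# in the foreground relative to the background.
query = {
    "query": {"match": {"crime_type": "weapon possession"}},
    "aggs": {
        "unexpected_hotspots": {
            "significant_terms": {"field": "town"}
        }
    },
    "size": 0,
}
r = requests.post("http://localhost:9200/crimes/_search", json=query)
print(r.json())
```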

Makes me curious about the market for a “Mr./Ms. Normal” service?

A service that enables you to enter your real viewing/buying/entertainment preferences and, for a fee, generates a paper trail that hides your real habits in digital dust.

If you order porn from Netflix, then the “Mr./Ms. Normal” service will order enough PBS and NatGeo material to even out your rental record.

Depending on how extreme your buying habits happen to be, you may need a “Mr./Ms. Abnormal” service that shields you from any paper trail at all.

As data surveillance grows, having a pre-defined Mr./Ms. Normal/Abnormal account may become a popular high school/college graduation or even a wedding present.

The usefulness of data surveillance depends on the cooperation of its victims. Have you ever considered not cooperating? But appearing to?
