Archive for the ‘Hashing’ Category

Optimizing Hash-Array Mapped Tries…

Saturday, November 28th, 2015

Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections by Adrian Colyer.

Adrian’s review of Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections by Steindorfer and Vinju, 2015, starts this way:

You’d think that the collection classes in modern JVM-based languages would be highly efficient at this point in time – and indeed they are. But the wonderful thing is that there always seems to be room for improvement. Today’s paper examines immutable collections on the JVM – in particular, in Scala and Clojure – and highlights a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.

CHAMP stands for Compressed Hash-Array Mapped Prefix-tree.

The use of immutable collections is on the rise…

Immutable collections are a specific area most relevant to functional/object-oriented programming such as practiced by Scala and Clojure programmers. With the advance of functional language constructs in Java 8 and functional APIs such as the stream processing API, immutable collections become more relevant to Java as well. Immutability for collections has a number of benefits: it implies referential transparency without giving up on sharing data; it satisfies safety requirements for having co-variant sub-types; it allows to safely share data in presence of concurrency.

Both Scala and Clojure use a Hash-Array Mapped Trie (HAMT) data structure for immutable collections. The HAMT data structure was originally developed by Bagwell in C/C++. It becomes less efficient when ported to the JVM due to the lack of control over memory layout and the extra indirection caused by arrays also being objects. This paper is all about the quest for an efficient JVM-based derivative of HAMTs.

Fine-tuning data structures for cache locality usually improves their runtime performance. However, HAMTs inherently feature many memory indirections due to their tree-based nature, notably when compared to array-based data structures such as hashtables. Therefore HAMTs present an optimization challenge on the JVM. Our goal is to optimize HAMT-based data structures such that they become a strong competitor of their optimized array-based counterparts in terms of speed and memory footprints.
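The bitmap-and-popcount encoding that makes HAMT nodes compact is easy to sketch. Below is a minimal, mutable Python illustration of the idea (names are mine, not from the paper; real HAMT/CHAMP nodes are immutable and copied on update, and CHAMP further separates payloads from sub-node references):

```python
class HamtNode:
    """One HAMT node: a 32-bit bitmap plus a compressed array that
    holds entries only for the bits that are actually set."""
    __slots__ = ("bitmap", "entries")

    def __init__(self):
        self.bitmap = 0
        self.entries = []  # (key, value) pairs or child HamtNode objects

    @staticmethod
    def _index(bitmap, bit):
        # popcount of the bits below `bit` gives the compact array index
        return bin(bitmap & (bit - 1)).count("1")

def insert(node, key, value, shift=0):
    bit = 1 << ((hash(key) >> shift) & 0x1F)   # 5 hash bits per level
    idx = HamtNode._index(node.bitmap, bit)
    if not node.bitmap & bit:                  # empty slot: add entry
        node.bitmap |= bit
        node.entries.insert(idx, (key, value))
        return
    slot = node.entries[idx]
    if isinstance(slot, HamtNode):             # descend into subtree
        insert(slot, key, value, shift + 5)
    elif slot[0] == key:                       # overwrite existing key
        node.entries[idx] = (key, value)
    else:                                      # collision: push both down
        child = HamtNode()
        insert(child, slot[0], slot[1], shift + 5)
        insert(child, key, value, shift + 5)
        node.entries[idx] = child

def lookup(node, key, shift=0):
    bit = 1 << ((hash(key) >> shift) & 0x1F)
    if not node.bitmap & bit:
        return None
    slot = node.entries[HamtNode._index(node.bitmap, bit)]
    if isinstance(slot, HamtNode):
        return lookup(slot, key, shift + 5)
    return slot[1] if slot[0] == key else None

root = HamtNode()
for i, k in enumerate(["one", "two", "three"]):
    insert(root, k, i)
assert lookup(root, "two") == 1 and lookup(root, "nope") is None
```

Because a node stores entries only for set bitmap bits, empty slots cost nothing; the popcount turns a sparse 32-slot node into a dense array index.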

Adrian had me at: “a new CHAMPion data structure that offers 1.3-6.7x faster iteration, and 3-25.4x faster equality checking.”

If you want experience with the proposed data structures, the authors have implemented them in the Rascal Metaprogramming Language.

I first saw this in a tweet by Atabey Kaygun.

Hash Table Performance in R: Part I + Part II

Tuesday, April 14th, 2015

Hash Table Performance in R: Part I + Part II by Jeffrey Horner.

From part 1:

A hash table, or associative array, is a well known key-value data structure. In R there is no equivalent, but you do have some options. You can use a vector of any type, a list, or an environment.

But as you’ll see with all of these options their performance is compromised in some way. In the average case a lookup in a hash table for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time, n being the number of elements in the hash table.

For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table.

This rocks! Talk about performance increases!

My current Twitter client doesn’t dedupe my home feed and certainly doesn’t dedupe it against search-based feeds. I’m not so concerned with retweets as with authors that repeat the same tweet several times in a row. What I don’t know is what period of uniqueness would be best. I’ll have to experiment with that.

I originally saw this series in Hash Table Performance in R: Part II on R-Bloggers, the source of so much excellent R-related content.
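For readers who want the data structure itself rather than the R workarounds, the separate-chaining hash table being benchmarked can be sketched in a few lines of Python (an illustration of the O(1)-average / O(n)-worst behavior, not Jeffrey's R code):

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table: O(1) average lookup,
    O(n) worst case when every key lands in the same bucket."""
    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # overwrite existing key
                return
        chain.append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

# Mirror the post's test: unique keys in, then search for every key.
table = ChainedHashTable()
keys = [f"key{i}" for i in range(1000)]
for i, k in enumerate(keys):
    table.put(k, i)
assert all(table.get(k) == i for i, k in enumerate(keys))
```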

[O]ne Billion Tweets

Saturday, May 31st, 2014

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing by Narayanan Sundaram, et al.

Abstract:

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of
>1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1–2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7x faster and query times that are 8.3x faster than a basic implementation.

In the introduction, the authors report “…typical queries taking 1-2.5ms. In comparison to other text search schemes, such as inverted indexes, our approach is an order of magnitude faster.”

I looked but did not find any open-source code for PLSH.
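The LSH idea underneath PLSH is simple to sketch, though. Here is a minimal random-hyperplane LSH for cosine similarity in Python (my own illustration; PLSH's cache-conscious tables, two-level merge construction, and streaming inserts are far beyond this sketch):

```python
import random

def make_hyperplane_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH for cosine similarity: each bit records the
    sign of the dot product with one random hyperplane. Similar vectors
    agree on most bits, so they tend to land in the same bucket."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(v):
        bits = 0
        for plane in planes:
            bits = (bits << 1) | (sum(p * x for p, x in zip(plane, v)) >= 0)
        return bits
    return h

h = make_hyperplane_hash(dim=4, n_bits=16)
a = [1.0, 0.9, 0.0, 0.1]
b = [0.9, 1.0, 0.1, 0.0]   # nearly parallel to a
c = [-1.0, 0.0, 1.0, 0.0]  # far from a
# Near neighbors share many hash bits; distant points share few.
agree = lambda x, y: 16 - bin(h(x) ^ h(y)).count("1")
assert agree(a, b) > agree(a, c)
```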

Caution: If you search for other research, the string “PLSH” is unlikely to be helpful. On my first search I found:

  • PL/sh is a procedural language handler for PostgreSQL
  • Partia Liberale Shqiptare (Albanian Liberal Party, Kosovo)
  • Pet Loss Support Hotline
  • Promised Land Spanish Horses
  • Polish courses (Abbreviation at Brown University)
  • Point Loma High School

…Locality Sensitive Hashing for Unstructured Data

Friday, May 9th, 2014

Practical Applications of Locality Sensitive Hashing for Unstructured Data by Jake Drew.

From the post:

The purpose of this article is to demonstrate how the practical Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required in high dimensional spaces when finding similar items. Locality Sensitive Hashing accomplishes this efficiency by exponentially reducing the amount of data required for storage when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing successfully reduces a high dimensional feature space while still retaining a random permutation of relevant features which research has shown can be used between data sets to determine an accurate approximation of Jaccard similarity [2,3].

Complete with code and references no less!

How “similar” do two items need to be to count as the same item?

If two libraries own a physical copy of the same book, for some purposes they are distinct items but for annotations/reviews, you could treat them as one item.

If that sounds like a topic map-like question, you’re right!

What measures of similarity are you applying to what subjects?

HyperDex 1.0 Release

Tuesday, December 10th, 2013

HyperDex 1.0 Release

From the webpage:

We are proud to announce HyperDex 1.0.0. With this official release, we pass the 1.0 development milestone. Key features of this release are:

  • High Performance: HyperDex is fast. It outperforms MongoDB and Cassandra on industry-standard benchmarks by a factor of 2X or more.
  • Advanced Functionality: With the Warp add-on, HyperDex offers multi-key transactions that span multiple objects with ACID guarantees.
  • Strong Consistency: HyperDex ensures that every GET returns the result of the latest PUT.
  • Fault Tolerance: HyperDex automatically replicates data to tolerate a configurable number of failures.
  • Scalable: HyperDex automatically redistributes data to make use of new resources as you add more nodes to your cluster.

HyperDex runs on 64-bit Linux (Ubuntu, Debian, Fedora, CentOS) and OS X. Binary packages for Debian 7, Ubuntu 12.04-13.10, Fedora 18-20, and CentOS 6 are available from the Downloads page[1], as well as source tarballs for other Linux platforms.

This release provides bindings for C, C++, Python, Java, Ruby, and Go.

If that sounds good to you, drop by the Get HyperDex page.

See also: HyperDex Reference Manual v1.0.dev by Robert Escriva, Bernard Wong, and Emin Gün Sirer.

For the real story, see Papers and read HyperDex: A Distributed, Searchable Key-Value Store by Robert Escriva, Bernard Wong and Emin Gün Sirer.

The multidimensional aspects of HyperDex resemble recent efforts to move beyond surface tokens, otherwise known as words.

Learning Hash Functions Using Column Generation

Monday, March 11th, 2013

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.

Abstract:

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.

I first saw this in a tweet by Stefano Bertolo.

GNU C++ hash_set vs STL std::set: my notebook

Tuesday, July 10th, 2012

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares std::set from the C++ Standard Template Library to the GNU non-standard hash-based hash_set, inserting and removing a set of random numbers.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.
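The hash-vs-tree trade-off Pierre measures has a rough Python analogue: a built-in set (hash-based, like hash_set) against binary search on a sorted list (O(log n), standing in for std::set's tree). A sketch of the trade-off, not a reproduction of his C++ benchmark:

```python
import bisect
import random
import timeit

# Build both structures over the same random values.
values = random.sample(range(10_000_000), 100_000)
hash_set = set(values)          # O(1) average membership
sorted_list = sorted(values)    # O(log n) membership via binary search
probes = values[:1000]

t_hash = timeit.timeit(lambda: [v in hash_set for v in probes], number=10)
t_tree = timeit.timeit(
    lambda: [sorted_list[bisect.bisect_left(sorted_list, v)] == v
             for v in probes], number=10)
# Expect the O(1) hash probes to beat O(log n) binary search here.
print(f"hash: {t_hash:.4f}s  tree: {t_tree:.4f}s")
```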

Hash Tables: Introduction

Saturday, July 7th, 2012

Hash Tables: Introduction

Randy Gaul has written a nice introduction to hash tables, in part to learn about hash tables.

In the next iteration of the topic maps course, I should have only a topic map (no papers) as the main project. Require draft maps to be posted on a weekly basis.

So that design choices can be made, discussed and debated as the maps develop.

So that the students are teaching each other about the domains they have chosen as they are constructing their maps.

TH*: Scalable Distributed Trie Hashing

Thursday, May 3rd, 2012

TH*: Scalable Distributed Trie Hashing by Aridj Mohamed and Zegour Djamel Eddine.

In today’s world of computers, dealing with huge amounts of data is not unusual. The need to distribute this data in order to increase its availability and increase the performance of accessing it is more urgent than ever. For these reasons it is necessary to develop scalable distributed data structures. In this paper we propose TH*, a distributed variant of the Trie Hashing data structure. First we propose a new version of TH without Nil nodes in the digital tree (trie); this version is then adapted to a multicomputer environment. The simulation results reveal that TH* is scalable in the sense that it grows gracefully, one bucket at a time, to a large number of servers; TH* also offers good storage space utilization and high query efficiency, especially for ordering operations.
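TH* distributes trie buckets across servers; the single-machine trie it starts from can be sketched with nested dicts (an illustration of the basic structure only, not the TH* scheme):

```python
def trie_insert(root, key):
    """Insert a string into a dict-based trie: each node maps one
    character to a child node, and "$" marks the end of a stored key."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node["$"] = True

def trie_contains(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = {}
for word in ["tea", "ten", "trie"]:
    trie_insert(trie, word)
assert trie_contains(trie, "ten")
assert not trie_contains(trie, "te")   # a prefix only, not a stored key
```

Because keys sharing a prefix share a path, a trie keeps its keys in sorted order for free, which is why the abstract highlights ordering operations.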

I ran across this article today on tries, which dates from 2010 (original publication date).

Can anyone point me to a recent survey of literature on tries?

Thanks!

The Simple Magic of Consistent Hashing

Friday, December 9th, 2011

The Simple Magic of Consistent Hashing by Mathias Meyer.

From the post:

The simplicity of consistent hashing is pretty mind-blowing. Here you have a number of nodes in a cluster of databases, or in a cluster of web caches. How do you figure out where the data for a particular key goes in that cluster?

You apply a hash function to the key. That’s it? Yeah, that’s the whole deal of consistent hashing. It’s in the name, isn’t it?

The same key will always return the same hash code (hopefully), so once you’ve figured out how you spread out a range of keys across the nodes available, you can always find the right node by looking at the hash code for a key.

It’s pretty ingenious, if you ask me. It was cooked up in the lab chambers at Akamai, back in the late nineties. You should go and read the original paper right after we’re done here.
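The whole scheme fits in a few lines. A minimal consistent-hash ring in Python, with virtual nodes to even out the key spread (names and parameter choices are mine):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing: nodes are hashed onto a ring; a key belongs
    to the first node clockwise from its hash. Adding or removing a node
    only remaps the keys adjacent to it; virtual nodes smooth the spread."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []              # sorted list of (point, node)
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def node_for(self, key):
        # first ring point clockwise from the key's hash (with wraparound)
        idx = bisect.bisect(self.ring, (self._hash(key),))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}
ring.add("cache-d")
after = {k: ring.node_for(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
# Only roughly a quarter of the keys move to the new fourth node.
assert 0 < moved < 500
```

That last assertion is the payoff: with a plain `hash(key) % n_nodes` scheme, adding a node would have remapped nearly every key.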

A must read, for a variety of reasons. One of which is to build expandable and robust data structures.

Another is to reach a deeper understanding of hashing, consistent or otherwise.

Question: Does consistency mean within a system or across systems?

What is the “hashing trick”?

Monday, December 5th, 2011

What is the “hashing trick”?

I suspect this question:

I’ve heard people mention the “hashing trick” in machine learning, particularly with regards to machine learning on large data.

What is this trick, and what is it used for? Is it similar to the use of random projections?

(Yes, I know that there’s a brief page about it here. I guess I’m looking for an overview that might be more helpful than reading a bunch of papers.)

comes up fairly often. The answer given is unusually helpful so I wanted to point it out here.
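For the impatient, the trick itself is tiny: hash each token straight to a vector index instead of building a token-to-column dictionary. A minimal sketch (a sign hash is commonly added to reduce the bias from colliding tokens):

```python
def hashed_features(tokens, n_features=2**10):
    """The "hashing trick": map tokens directly to vector indices via a
    hash function, skipping the token->index dictionary entirely."""
    vec = [0.0] * n_features
    for tok in tokens:
        idx = hash(tok) % n_features
        sign = 1.0 if hash(tok + "#sign") % 2 else -1.0
        vec[idx] += sign
    return vec

v = hashed_features("the quick brown fox the".split())
assert len(v) == 1024
assert sum(x != 0 for x in v) <= 4   # 4 distinct tokens (collisions possible)
```

The vector's dimension is fixed up front no matter how large the vocabulary grows, which is exactly what makes it attractive for learning on large data.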

LSH Algorithm and Implementation (E2LSH)

Tuesday, January 25th, 2011

LSH Algorithm and Implementation (E2LSH) by Alexandr Andoni and Piotr Indyk.

Andoni and Indyk aren’t any better with titles than they were in Who’s Your Nearest Neighbor?, but here you will find additional information on their high-dimension nearest neighbor algorithm as well as an implementation of the algorithm.

Who’s Your Nearest Neighbor?

Tuesday, January 25th, 2011

Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions by Alexandr Andoni and Piotr Indyk.

OK, I lied about the title.

You would think there would be short courses on title writing. Maybe I should start offering a one-day seminar in effective title writing.

Anyway, whatever the title issues, this is a deeply fascinating work on detection of nearest neighbors.

The short version is that really close neighbors bump into each other when hashing. So collisions become a way to detect neighbors. Rather clever.
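The “neighbors bump into each other” intuition is literal in the Euclidean (p-stable) LSH family Andoni and Indyk study: project onto a random line, chop the line into width-w bins, and nearby points usually fall into the same bin. A minimal sketch (my own illustration, not their E2LSH code):

```python
import random

def make_euclidean_lsh(dim, w=4.0, seed=1):
    """One p-stable LSH function for Euclidean distance:
    h(v) = floor((a.v + b) / w) with Gaussian a and uniform offset b.
    Points within distance ~w collide far more often than distant ones."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = rng.uniform(0, w)
    return lambda v: int((sum(x * y for x, y in zip(a, v)) + b) // w)

hashes = [make_euclidean_lsh(2, seed=s) for s in range(20)]
p, near, far = (0.0, 0.0), (0.1, -0.1), (8.0, 9.0)
collisions = lambda u, v: sum(h(u) == h(v) for h in hashes)
# The near neighbor collides with p under most of the 20 hash functions;
# the far point under few, so collision counts flag likely neighbors.
assert collisions(p, near) > collisions(p, far)
```

A real index concatenates several such functions per table and keeps several tables, trading memory for a higher chance of catching every near neighbor.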

I think of collisions as a basis for identifying the same subjects.

Works in metric spaces but topic maps apply to metric spaces as well. After all, subjects are what define and occupy metric spaces.

For the longer explanation, read the paper.