Archive for the ‘Memory’ Category

atomic<> Weapons

Saturday, February 16th, 2013

atomic<> Weapons by Herb Sutter.

C++ and Beyond 2012: Herb Sutter – atomic<> Weapons, 1 of 2

C++ and Beyond 2012: Herb Sutter – atomic<> Weapons, 2 of 2

Abstract:

This session in one word: Deep.

It’s a session that includes topics I’ve publicly said for years is Stuff You Shouldn’t Need To Know and I Just Won’t Teach, but it’s becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they’ll reach for the big red levers with the flashing warning lights. Since we can’t keep people from pulling the big red levers, we’d better document the A to Z of what the levers actually do, so that people don’t SCRAM unless they really, really, really meant to.

With all the recent posts about simplicity and user interaction, some readers may be getting bored.

Never fear, something a bit more challenging for you.

The talks cover multicore memory models, and the comments cite even more research.

Plus I liked the line: “…reach for the big red levers with the flashing warning lights.”

Enjoy!

Fast Set Intersection in Memory [Foul! They Peeked!]

Monday, August 20th, 2012

Fast Set Intersection in Memory by Bolin Ding and Arnd Christian König.

Abstract:

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with totally n elements, we will show how to compute their intersection in expected time O(n / sqrt(w) + kr), where r is the intersection size and w is the number of bits in a machine-word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state of the art techniques for both synthetic and real data sets and workloads.
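To make the role of the machine word a little more concrete, here is a minimal sketch of word-level filtering. It is not the algorithm from the paper. It only illustrates the general idea of preprocessing each set into small buckets, giving every bucket a w-bit signature, and using a single AND to skip bucket pairs that cannot share an element. The names `W`, `preprocess`, and `intersect`, the bucket count, and the choice of hash are all assumptions made for illustration.

```python
W = 64  # assumed machine-word size in bits


def preprocess(values, num_buckets):
    """Split non-negative integers into buckets and build one W-bit signature per bucket."""
    buckets = [[] for _ in range(num_buckets)]
    signatures = [0] * num_buckets
    for v in values:
        b = v % num_buckets                             # which bucket the element lands in
        buckets[b].append(v)
        signatures[b] |= 1 << ((v // num_buckets) % W)  # which bit it sets in that bucket
    return buckets, signatures


def intersect(a, b):
    """Intersect two sets preprocessed with the same num_buckets."""
    buckets_a, sigs_a = a
    buckets_b, sigs_b = b
    result = []
    for bucket_a, sig_a, bucket_b, sig_b in zip(buckets_a, sigs_a, buckets_b, sigs_b):
        if sig_a & sig_b == 0:
            continue  # no common bit, so the two buckets share no element
        members = set(bucket_b)
        result.extend(v for v in bucket_a if v in members)
    return sorted(result)


left = preprocess([1, 5, 9, 42, 100], num_buckets=8)
right = preprocess([5, 42, 77, 100, 3], num_buckets=8)
common = intersect(left, right)
```

A shared element always lands in the same bucket and sets the same bit in both signatures, so the AND test never discards a true match. It only saves work on bucket pairs that have nothing in common.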

Important not only for the algorithm but also for how they arrived at it.

They peeked at the data.

Imagine that.

Not trying to solve the set intersection problem in the abstract but looking at data you are likely to encounter.

I am all for the pure theory side of things but there is something to be said for less airy (dare I say windy?) solutions. ;-)

I first saw this at Theoretical Computer Science: Most efficient algorithm to compute set difference?

Introducing Galaxy, a novel in-memory data grid by Parallel Universe

Wednesday, July 11th, 2012

Introducing Galaxy, a novel in-memory data grid by Parallel Universe

Let me jump to the cool part:

Galaxy is a distributed RAM. It is not a key-value store. Rather, it is meant to be used as an infrastructure for building distributed data-structures. In fact, there is no way to query objects stored on Galaxy at all. Instead, Galaxy generates an ID for each item, that you can store in other items just like you’d store a normal reference in a plain object graph.

The application runs on all Galaxy nodes alongside with the portion of the data that is kept (in RAM) at each of the nodes, and when it wishes to read or write a data item, it requests the Galaxy API to fetch it.

At any given time an item is owned by exactly one node, but can be shared by many. Sharers store the item locally, but they can only read it. However, they remember who the owner is, and the owner maintains a list of all sharers. If a sharer (or any node) wants to update the item (a “write”) it requests the current owner for a transfer of ownership, and then receives the item and the list of sharers. Before modifying the item, it invalidates all sharers to ensure consistency. Even when the sharers are invalidated, they remember who the new owner is, so if they’d like to share or own the item again, they can request it from the new owner. If the application requests an item the local node has never seen (or it’s been migrated again after it had been validated), the node multicasts the entire cluster in search of it.

The idea is that when data access is predictable, expensive operations like item migration and a clueless lookup are rare, and more than offset by the common zero-I/O case. In addition, Galaxy uses some nifty hacks to eschew many of the I/O delays even in worst-case scenarios.

In the coming weeks I will post here the exact details of Galaxy’s inner-workings. What messages are transferred, how Galaxy deals with failures, and what tricks it employs to reduce latencies. In the meantime, I encourage you to read Galaxy’s documentation and take it for a spin.
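The owner/sharer scheme in the quoted passage can be pictured with a toy, single-process model. This is only a sketch of the protocol as described above, not Galaxy's API or implementation. The `Node` class, its method names, and the plain Python list standing in for the cluster are hypothetical. The sketch also assumes that the old owner gives up its copy on transfer, which the passage does not specify.

```python
class Node:
    """One cluster member in a toy model of the owner/sharer protocol."""

    def __init__(self, name):
        self.name = name
        self.owned = {}       # item_id -> [value, set of sharer nodes]
        self.shared = {}      # item_id -> local read-only copy
        self.owner_hint = {}  # item_id -> node last known to own the item

    def create(self, item_id, value):
        self.owned[item_id] = [value, set()]

    def _find_owner(self, item_id, cluster):
        hint = self.owner_hint.get(item_id)
        if hint is not None and item_id in hint.owned:
            return hint            # the remembered owner is still correct
        for node in cluster:       # stands in for the "multicast the cluster" lookup
            if item_id in node.owned:
                return node
        raise KeyError(item_id)

    def read(self, item_id, cluster):
        if item_id in self.owned:
            return self.owned[item_id][0]
        if item_id in self.shared:
            return self.shared[item_id]  # the common case: no communication needed
        owner = self._find_owner(item_id, cluster)
        value, sharers = owner.owned[item_id]
        sharers.add(self)                # the owner records the new sharer
        self.shared[item_id] = value
        self.owner_hint[item_id] = owner
        return value

    def write(self, item_id, value, cluster):
        if item_id not in self.owned:
            # Ask the current owner to transfer the item and its sharer list.
            old_owner = self._find_owner(item_id, cluster)
            _, sharers = old_owner.owned.pop(item_id)
            old_owner.owner_hint[item_id] = self
            sharers.discard(self)
            self.shared.pop(item_id, None)
            self.owned[item_id] = [None, sharers]
        entry = self.owned[item_id]
        # Invalidate every sharer before modifying; each remembers the new owner.
        for sharer in entry[1]:
            sharer.shared.pop(item_id, None)
            sharer.owner_hint[item_id] = self
        entry[1].clear()
        entry[0] = value


a, b, c = Node("a"), Node("b"), Node("c")
cluster = [a, b, c]
a.create("x", 1)
first_read = b.read("x", cluster)    # b becomes a sharer of a's item
c.write("x", 2, cluster)             # ownership moves to c, b is invalidated
second_read = b.read("x", cluster)   # b follows its hint to the new owner
```

When reads dominate and writers are stable, almost every call takes the local-copy path, which is the "common zero-I/O case" the passage counts on.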

It may not fit your use case but, as the man says, “take it for a spin.”

Jack Park brought this to my attention.

nessDB v1.8 with LSM-Tree

Saturday, March 3rd, 2012

nessDB v1.8 with LSM-Tree

From the webpage:

nessDB is a fast Key-Value database(embedded), supports Redis-Protocol(PING,SET,MSET,GET,MGET,DEL,EXISTS,INFO,SHUTDOWN).

Which is written in ANSI C with BSD LICENSE and works in most POSIX systems without external dependencies.

nessDB is very efficient on disk-based random access, since it’s using log-structured-merge (LSM) trees.

V1.8 FEATURES
=============
a. Better performances on Random-Read/Random-Write
b. Log recovery
c. Using LSM-Tree as storage engine
d. Background detached-thread merging
e. Level LRU
f. Support billion data
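For readers new to LSM trees, here is a minimal in-memory sketch of the idea: writes go to a small mutable table, full tables are frozen into sorted runs, reads check the newest data first, and a merge step folds the runs together. It is not nessDB's implementation. The class `TinyLSM`, its methods, and the `memtable_limit` parameter are hypothetical, and real engines write the runs to disk rather than keeping them in lists.

```python
import bisect

TOMBSTONE = object()  # marker that records a deletion until the next merge


class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, mutable
        self.runs = []       # immutable sorted runs, newest first: (keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def get(self, key, default=None):
        if key in self.memtable:
            value = self.memtable[key]
            return default if value is TOMBSTONE else value
        for keys, values in self.runs:            # newest run wins
            i = bisect.bisect_left(keys, key)     # binary search inside a sorted run
            if i < len(keys) and keys[i] == key:
                return default if values[i] is TOMBSTONE else values[i]
        return default

    def _flush(self):
        # Freeze the memtable into a sorted run (sequential output, no random writes).
        keys = sorted(self.memtable)
        self.runs.insert(0, (keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def compact(self):
        # Merge all runs, letting newer entries overwrite older ones, and drop tombstones.
        merged = {}
        for keys, values in reversed(self.runs):  # oldest first
            merged.update(zip(keys, values))
        live = sorted(k for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [(live, [merged[k] for k in live])] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # second write fills the memtable and triggers a flush
db.put("a", 3)
db.delete("b")    # the tombstone is flushed together with the new value of "a"
```

The point of the design is that random writes become appends plus occasional sequential merges, at the cost of reads having to consult more than one run.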

This came in over the nosql mailing list.

Does anyone have pointers to literature on how “disk-based random access” has shaped our thinking and technology for processing? Or on how going “off cache” for random access is going to shape the next mind-set about processing?

Translation Memory

Tuesday, December 6th, 2011

Translation Memory

As we mentioned in Teaching Etsy to Speak a Second Language, developers need to tag English content so it can be extracted and then translated. Since we are a company with a continuous deployment development process, we do this on a daily basis and as a result get a significant number of new messages to be translated along with changes or deletions of existing ones that have already been translated. Therefore we needed some kind of recollection system to easily reuse or follow the style of existing translations.

A translation memory is an organized collection of text extracted from a source language with one or more matching translations. A translation memory system stores this data and makes it easily accessible to human translators in order to assist with their tasks. There’s a variety of translation memory systems and related standards in the language industry. Yet, the nature of our extracted messages (containing relevant PHP, Smarty, and JavaScript placeholders) and our desire to maintain a translation style curated by a human language manager made us develop an in-house solution.
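The definition in the quoted passage maps onto a very small data structure: source text keyed to its translations, plus a fuzzy lookup so that a changed message can reuse the nearest existing translation. The sketch below is an assumption-laden illustration, not Etsy's in-house system. The `TranslationMemory` class, its methods, and the `cutoff` default are hypothetical, and the fuzzy matching simply uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib


class TranslationMemory:
    def __init__(self):
        self.entries = {}  # source text -> {language code: translation}

    def add(self, source, language, translation):
        self.entries.setdefault(source, {})[language] = translation

    def lookup(self, source, language, cutoff=0.75):
        """Return (matched_source, translation) for the closest stored source, or None."""
        candidates = [s for s, by_lang in self.entries.items() if language in by_lang]
        matches = difflib.get_close_matches(source, candidates, n=1, cutoff=cutoff)
        if not matches:
            return None
        best = matches[0]
        return best, self.entries[best][language]


tm = TranslationMemory()
tm.add("Add to cart", "fr", "Ajouter au panier")
exact = tm.lookup("Add to cart", "fr")
fuzzy = tm.lookup("Add to your cart", "fr")   # a reworded message finds the old entry
```

A fuzzy hit is a suggestion for a human translator to review, which fits the passage's emphasis on a style curated by a human language manager.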

Go ahead, read the rest of the post, I’ll wait.

Interesting, yes?

What if the title of my post were identification memory?

There is not really much difference between translating from language to language and mapping from identification to identification, when both refer to the same subject.

Hardly any difference at all when you think about it.

I am sure your current vendors will assure you their methods of identification are the best and they may be right. But on the other hand, they may also be wrong.

And there is always the issue of other data sources that have chosen to identify the same subjects differently. Like your own company down the road, say five years from now. Preparing now for that “translation” project in the not too distant future may save you from losing critical information down the road.

Preserving access to critical data is a form of translation memory. Yes?