Archive for the ‘Distributed RAM’ Category

The Titan Informatics Toolkit

Wednesday, October 17th, 2012

From the webpage:

A collaborative effort between Sandia National Laboratories and Kitware Inc., the Titan™ Informatics Toolkit is a collection of scalable algorithms for data ingestion and analysis that share a common set of data structures and a flexible, component-based pipeline architecture. The algorithms in Titan span a broad range of structured and unstructured analysis techniques, and are particularly suited to parallel computation on distributed memory supercomputers.

Titan components may be used by application developers using their native C++ API on all popular platforms, or using a broad set of language bindings that include Python, Java, TCL, and more. Developers will combine Titan components with their own application-specific business logic and user interface code to address problems in a specific domain. Titan is used in applications varying from command-line utilities and straightforward graphical user interface tools to sophisticated client-server applications and web services, on platforms ranging from individual workstations to some of the most powerful supercomputers in the world.

I stumbled across this while searching for the Titan (as in graph database) project.

The Parallel Latent Semantic Analysis component is available now. I did not see release dates on other modules, such as Advanced Graph Algorithms.

Source (C++) for the Titan Informatics Toolkit is available.

Introducing Galaxy, a novel in-memory data grid by Parallel Universe

Wednesday, July 11th, 2012

Let me jump to the cool part:

Galaxy is a distributed RAM. It is not a key-value store. Rather, it is meant to be used as an infrastructure for building distributed data structures. In fact, there is no way to query objects stored on Galaxy at all. Instead, Galaxy generates an ID for each item, which you can store in other items just like you’d store a normal reference in a plain object graph.
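
To make the "ID as reference" idea concrete, here is a minimal single-process sketch. It is not Galaxy's API: the `ToyGrid` class and the `build_list` and `walk` helpers are hypothetical names invented for this illustration. It only shows a linked list whose "next" pointers are grid-generated IDs, on a store that offers fetch-by-ID and nothing resembling a query.

```python
import itertools


class ToyGrid:
    """Hypothetical stand-in for a distributed RAM; not Galaxy's real API."""

    def __init__(self):
        self._items = {}
        self._next_id = itertools.count(1)

    def put(self, value):
        """Store a value and return the grid-generated ID."""
        item_id = next(self._next_id)
        self._items[item_id] = value
        return item_id

    def get(self, item_id):
        """Fetch by ID only; there is deliberately no query method."""
        return self._items[item_id]


def build_list(grid, values):
    """Build a singly linked list whose 'next' pointers are grid IDs."""
    next_id = None
    for value in reversed(values):
        next_id = grid.put({"value": value, "next": next_id})
    return next_id  # ID of the head node (None for an empty list)


def walk(grid, head_id):
    """Follow ID references from the head, collecting values in order."""
    out = []
    current = head_id
    while current is not None:
        node = grid.get(current)
        out.append(node["value"])
        current = node["next"]
    return out
```

The only way to reach an item is to hold its ID or to follow an ID stored in another item, which is the object-graph style the quote describes.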

The application runs on all Galaxy nodes alongside the portion of the data that is kept (in RAM) at each of the nodes, and when it wishes to read or write a data item, it asks the Galaxy API to fetch it.

At any given time an item is owned by exactly one node, but can be shared by many. Sharers store the item locally, but they can only read it. However, they remember who the owner is, and the owner maintains a list of all sharers. If a sharer (or any node) wants to update the item (a “write”), it asks the current owner for a transfer of ownership, and then receives the item and the list of sharers. Before modifying the item, it invalidates all sharers to ensure consistency. Even when the sharers are invalidated, they remember who the new owner is, so if they’d like to share or own the item again, they can request it from the new owner. If the application requests an item the local node has never seen (or one that has migrated again after the local copy was invalidated), the node multicasts to the entire cluster in search of it.
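
The ownership rules in that paragraph can be sketched as a toy, single-process simulation. This is an assumption-laden illustration, not Galaxy's implementation or API: the `Node` class and its methods are hypothetical, "multicast" is modeled as a loop over an in-memory list of nodes, and the previous owner is assumed to drop its copy on transfer (the quote does not say what it keeps).

```python
class Node:
    """One cluster member in a toy model; hypothetical, not Galaxy's API."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster   # shared list of all nodes
        self.cache = {}          # item_id -> locally held value
        self.owner_hint = {}     # item_id -> node last known to own the item
        self.sharers = {}        # item_id -> set of sharers (owned items only)
        cluster.append(self)

    def create(self, item_id, value):
        """Create an item; the creating node starts as its owner."""
        self.cache[item_id] = value
        self.owner_hint[item_id] = self
        self.sharers[item_id] = set()

    def owns(self, item_id):
        return item_id in self.sharers

    def _find_owner(self, item_id):
        # Use the remembered owner if that hint is still correct...
        hint = self.owner_hint.get(item_id)
        if hint is not None and hint.owns(item_id):
            return hint
        # ...otherwise ask every node (stand-in for the cluster multicast).
        for node in self.cluster:
            if node.owns(item_id):
                return node
        raise KeyError(item_id)

    def read(self, item_id):
        if item_id in self.cache:
            return self.cache[item_id]       # local hit: the zero-I/O case
        owner = self._find_owner(item_id)
        owner.sharers[item_id].add(self)     # owner records the new sharer
        self.owner_hint[item_id] = owner
        self.cache[item_id] = owner.cache[item_id]
        return self.cache[item_id]

    def write(self, item_id, value):
        if not self.owns(item_id):
            old = self._find_owner(item_id)
            # Ownership transfer: receive the item and the sharer list.
            sharers = old.sharers.pop(item_id)
            self.cache[item_id] = old.cache.pop(item_id)  # assumed: old owner drops its copy
            old.owner_hint[item_id] = self
            sharers.discard(self)
            self.sharers[item_id] = sharers
            self.owner_hint[item_id] = self
        # Invalidate every sharer before modifying; each remembers the new owner.
        for sharer in self.sharers[item_id]:
            sharer.cache.pop(item_id, None)
            sharer.owner_hint[item_id] = self
        self.sharers[item_id] = set()
        self.cache[item_id] = value
```

In this model a read that hits the local cache touches no other node, while a node holding a stale owner hint (the item moved again after it was invalidated) falls through to the loop over the whole cluster, mirroring the "clueless lookup" case.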

The idea is that when data access is predictable, expensive operations like item migration and a clueless lookup are rare, and more than offset by the common zero-I/O case. In addition, Galaxy uses some nifty hacks to eschew many of the I/O delays even in worst-case scenarios.

In the coming weeks I will post here the exact details of Galaxy’s inner workings: what messages are transferred, how Galaxy deals with failures, and what tricks it employs to reduce latencies. In the meantime, I encourage you to read Galaxy’s documentation and take it for a spin.

It may not fit your use case but, as the man says, “take it for a spin.”

Jack Park brought this to my attention.