Archive for the ‘Distributed Systems’ Category

Distributed Systems Seminar [Accounting For Hostile Environments]

Saturday, February 17th, 2018

Distributed Systems Seminar by Peter Alvaro.

From the webpage:


This graduate seminar will explore distributed systems research, both current and historical, with a particular focus on storage systems and programming models.

Due to fundamental uncertainty in their executions arising from asynchronous communication and partial failure, distributed systems present unique challenges to programmers and users. Moreover, distributed systems are increasingly ubiquitous: nearly all non-trivial systems are now physically distributed. It is no longer possible to relegate responsibility for managing the complexity of distributed systems to a group of expert library or infrastructure writers: all programmers must now be distributed programmers. This is both a crisis and an opportunity.

A great deal of theoretical work in distributed systems establishes important impossibility results, including the famous FLP result, the CAP Theorem, the two generals problem and the impossibility of establishing common knowledge via protocol. These results tell us what we cannot achieve in a distributed system, or more constructively, they tell us about the properties we must trade off for the properties we require when designing or using large-scale systems. But what can we achieve? The history of applied distributed systems work is largely the history of infrastructures — storage systems as well as programming models — that attempt to manage the fundamental complexity of the domain with a variety of abstractions.

This course focuses on these systems, models and languages. We will cover the following topics:

  • Consistency models
  • Large-scale storage systems and data processing frameworks
  • Commit, consensus and synchronization protocols
  • Data replication and partitioning
  • Fault-tolerant design
  • Programming models
  • Distributed programming languages and program analysis
  • Seminal theoretical results in distributed systems


This course is a research seminar: we will focus primarily on reading and discussing conference papers. We will read 1-2 papers (typically 2) per session; for each paper, you will provide a brief summary (about 1 page). The summary should answer some or all of the following questions:

  • What problem does the paper solve? Is is important?
  • How does it solve the problem?
  • What alternative approaches are there? Are they adequately discussed in the reading?
  • How does this work relate to other research, whether covered in this course or not?
  • What specific research questions, if any, does the paper raise for you?

What a great list of readings!

An additional question of each paper: Does It Account For Hostile Environments?

As Alvaro says: “…nearly all non-trivial systems are now physically distributed.”

That’s a rather large attack surface to leave for unknown others, by unknown means, to secure to an unknown degree, on your behalf.

If you make that choice, add “cyber-victim” to your business cards.

If you aren’t already, you will be soon enough.

Fun with ddR: Using Distributed Data Structures in R [Your Holiday Quiet Spot]

Saturday, December 12th, 2015

Fun with ddR: Using Distributed Data Structures in R by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise).

From the post:

A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.

In designing ddR, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package here. In this post, we show how to use ddR functions.

Imagine that you are trapped after an indeterminate holiday meal in the TV room where A Christmas Story is playing for the fourth time that day.

You are at the point of saying/doing something that will offend the living members of your spouses family and generations to come.

What can you do?

Surely your powers of concentration exceed those of bridge players who claim to not see naked people cavorting about during bridge games.

Pull up the ddR post on your smartphone, read it and jump to the documentation and/or example programs.

You will have to be woken out of your reverie and handed your coat when it is time to go.

Well, maybe not exactly but it beats the hell out of biting one of your smaller relatives.

Learning from Distributed Data:… [Beating the Bounds]

Sunday, December 6th, 2015

Learning from Distributed Data: Mathematical and Computational Methods to Analyze De-centralized Information.

From the post:

Scientific advances typically produce massive amounts of data, which is, of course, a good thing. But when many of these datasets are at multiple locations, instead of all in one place, it becomes difficult and costly for researchers to extract meaningful information from them.

So, the question becomes: “How do we learn from these datasets if they cannot be shared or placed in a central location?” says Trilce Estrada-Piedra.

Estrada-Piedra, an assistant professor of computer sciences at the University of New Mexico (UNM) is working to find the solution. She designs software that will enable researchers to collaborate with one another, using decentralized data, without jeopardizing privacy or raising infrastructure concerns.

“Our contributions will help speed research in a variety of sciences like health informatics, astronomy, high energy physics, climate simulations and drug design,” Estrada-Piedra says. “It will be relevant for problems where data is spread out in many different locations.”

The aim of the National Science Foundation (NSF)-funded scientist’s project is to build mathematical models from each of the “local” data banks — those at each distributed site. These models will capture data patterns, rather than specific data points.

“Researchers then can share only the models, instead of sharing the actual data,” she says, citing a medical database as an example. “The original data, for example, would have the patient’s name, age, gender and particular metrics like blood pressure, heart rate, etcetera, and that one patient would be a data point. But the models will project his or her information and extract knowledge from the data. It would just be math. The idea is to build these local models that don’t have personal information, and then share the models without compromising privacy.”

Estrada-Piedra is designing algorithms for data projections and middleware: software that acts as a bridge between an operating system or database and applications, especially on a network. This will allow distributed data to be analyzed effectively.

I’m looking forward to hearing more about Estrada-Piedra’s work, although we all know there are more than data projection and middleware issues involved. Those are very real and very large problems, but as with all human endeavors, the last mile is defined by local semantics.

Efficiently managing local semantics, that is enabling others to seamlessly navigate your local semantics and to in turn navigate the local semantics of others, isn’t a technical task, or at least not primarily.

The primary obstacle to such a task is captured by John D. Cook in Medieval software project management.

The post isn’t long so I will quite it here:

Centuries ago, English communities would walk the little boys around the perimeter of their parish as a way of preserving land records. This was called “beating the bounds.” The idea was that by teaching the boundaries to someone young, the knowledge would be preserved for the lifespan of that person. Of course modern geological survey techniques make beating the bounds unnecessary.

Software development hasn’t reached the sophistication of geographic survey. Many software shops use a knowledge management system remarkably similar to beating the bounds. They hire a new developer to work on a new project. That developer will remain tied to that project for the rest of his or her career, like a serf tied to the land. The knowledge essential to maintaining that project resides only in the brain of its developer. There are no useful written records or reliable maps, just like medieval property boundaries.

Does that sound familiar? That only you or another person “know” the semantics of your datastores? Are you still “beating the bounds” to document your data semantics?

Or as John puts it:

There are no useful written records or reliable maps, just like medieval property boundaries.

It doesn’t have to be that way. You could have reliable maps, reliable maps that are updated when your data is mapped for yet another project. Another ETL is the acronym.

You can, as a manager, of course, simply allow data knowledge to evaporate from your projects but that seems like a very poor business practice.

Johanna Rothman responded to John’s post in Breaking Free of Legacy Projects with the suggestion that every project should have several young boys and girls “beating the bounds” for every major project.

The equivalent of avoiding a single point of failure in medieval software project management.

Better than relying on a single programmer but using more modern information management/retention techniques would be a better option.

I guess the question is do you like using medieval project management techniques for your data or not?

If you do, you won’t be any worse off than any of your competitors with a similar policy.

On the other hand, should one of your competitors break ranks, start using topic maps for example for mission critical data, well, you have been warned.

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

Spinning up a Spark Cluster on Spot Instances: Step by Step by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

This following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down you cluster after you are finished using it.

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

The Back-to-Basics Readings of 2012

Tuesday, May 19th, 2015

The Back-to-Basics Readings of 2012 by Werner Vogels (CTO –

From the post:

After the AWS re: Invent conference I spent two weeks in Europe for the last customer visits of the year. I have since returned and am now in New York City enjoying a few days of winding down the last activities of the year before spending the holidays here with family. Do not expect too many blog posts or twitter updates. Although there are still a few very exciting AWS news updates to happen this year.

I thought this was a good moment to collect all the readings I suggested this year in one summary post. It was not until later in the year that I started to recording the readings here on the blog, so I hope this is indeed the complete list. I am pretty sure some if not all of these papers deserved to be elected to the hall of fame of best papers in distributed systems.

My count is twenty-four (24) papers. More than enough for a weekend at the beach! 😉

I first saw this in a tweet by Computer Science.

Wouldn’t it be fun to build your own Google?

Thursday, December 11th, 2014

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such a Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding the technological issues would be many, it is contended that the payoff would be immense, but in ways we won’t know until it is available. I suspect Martin is right but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies those problems, it doesn’t make them easier to solve.

On the whole I think the idea of a distributed crawl of the web is a great idea, but while that develops, we best hone our skills at extracting value from the partial crawls that already exist.

Native Actors – A Scalable Software Platform for Distributed, Heterogeneous Environments

Saturday, September 27th, 2014

Native Actors – A Scalable Software Platform for Distributed, Heterogeneous Environments by Dominik Charousset, Thomas C. Schmidt, Raphael Hiesgen, and Matthias Wählisch.


Writing concurrent software is challenging, especially with low-level synchronization primitives such as threads or locks in shared memory environments. The actor model replaces implicit communication by an explicit message passing in a ‘shared-nothing’ paradigm. It applies to concurrency as well as distribution, but has not yet entered the native programming domain. This paper contributes the design of a native actor extension for C++, and the report on a software platform that implements our design for (a)concurrent, (b) distributed, and (c) heterogeneous hardware environments. GPGPU and embedded hardware components are integrated in a transparent way. Our software platform supports the development of scalable and efficient parallel software. It includes a lock-free mailbox algorithm with pattern matching facility for message processing. Thorough performance evaluations reveal an extraordinary small memory footprint in realistic application scenarios, while runtime performance not only outperforms existing mature actor implementations, but exceeds the scaling behavior of low-level message passing libraries such as OpenMPI.

When I read Stroustrup: Why the 35-year-old C++ still dominates ‘real’ dev I started to post a comment asking why there were no questions about functional programming languages? But, the interview is a “puff” piece and not a serious commentary on programming.

Then I ran across this work on implementing actors in C++. Maybe Stroustrup was correct without being aware of it.

Bundled with the C++ library libcppa, available at:

Onyx: Distributed Workflows….

Thursday, August 28th, 2014

Onyx: Distributed Workflows for Dynamic Systems by Michael Drogalis.

From the post:

If you’ve ever jumped heads down into a codebase maintaining complex distributed activity and tried to simplify or change the processing workflow, not only will you scratch your head for 7 sleepless nights before you can get anywhere, but you’ll come to realize that workflows are often deeply complected with their mechanism of execution.

In this talk, we’ll survey contemporary frameworks such as Storm and Cascading. We’ll identify the pain points that seem to crop up time and time again: workflow specification, stateful lifecycle management, and developer testing – to name a few.

Onyx is a new distributed computation system written in Clojure that addresses these problems head-on. Hardware advancements in the last 10 years have enabled new designs that leverage fast networks and SSDs. Onyx takes advantage and adapts to this new environment. The concepts and tools discussed remove the incidental complexity that plagues modern frameworks.

Attendees will come away with new perspective on leveraging immutability, persistent data structures, queues, and transactions to tackle increasingly complex problem spaces.

This and much more at Strangeloop, St. Louis, Sept. 17-19th, 2014.

CRDTs: Consistency without consensus

Tuesday, August 19th, 2014

CRDTs: Consistency without consensus by Peter Bourgon.


When you think of distributed systems, you probably think in terms of consistency via consensus. That is, enabling a heterogeneous group of systems to agree on facts, while remaining robust in the face of failure. But, as any distributed systems developer can attest, it’s harder than it sounds. Failure happens in myriad, byzantine ways, and failure modes interact unpredictably. Reliable distributed systems need more than competent engineering: they need a robust theoretical foundation. CRDTs, or Convergent Replicated Data Types, are a set of properties or behaviors, discovered more than invented, which enable a distributed system to achieve consistency without consensus, and sidestep entire classes of problems altogether. This talk provides a practical introduction to CRDTs, and describes a production CRDT system built at SoundCloud to serve high-volume time-series data.

Slides: bbuzz14-peter_bourgon_0.pdf

This is very much worth your time!

Great discussion of data models after time mark 23:00 (approximately).

BTW, the system discussed is open source and in production:

400 GTEPS on 4096 GPUs

Saturday, August 9th, 2014

Breadth-First Graph Search Uses 2D Domain Decomposition – 400 GTEPS on 4096 GPUs by Rob Farber.

From the post:

Parallel Breadth-First Search is a standard benchmark and the basis of many other graph algorithms. The challenge li[]es in partitioning the graph across multiple nodes in a cluster while avoiding load-imbalance and communications delays. The authors of the paper, “Parallel Breadth First Search on the Kepler Architecture” utilize an interesting 2D decomposition of the graph adjacency matrix. Tests on R-MAT graphs shows large graph performance ranging from 1.1 GTEP on a single K20 to 396 GTEP using 4096 GPUs. The tests also compared performance against the method of Beamer (10 GTEP single SMP device and 240 GTEP on 115k cores).

See Rob’s post for background on the distributed DFS problem and additional references.

Graph processing continues to improve at an impressive rate but I wonder how applicable some techniques are to intersections of graphs?

The optimization of using a bitmap to mark vertices visited (Scalable Graph Exploration on Multicore Processors, Agarwal, et al., 2010), cited by authors of Parallel Distributed Breadth First Search on the Kepler Architecture saying:

Then, to reduce the work, we used an integer map to keep track of visited vertices. Agarwal et al., first introduced this optimization using a bitmap that has been used in almost all subsequent works.

appears to be stumbling block to tracking a vertex that appears in intersecting graphs.

Or would you track visited vertices in each intersecting graph separately? And communicate results from each intersecting graph?

DSL for Distributed Heterogeneous Systems

Saturday, August 9th, 2014

A Domain-Specific Language for Volume Processing and Visualization on Distributed Heterogeneous Systems

From the webpage:

As the size of image data from microscopes and telescopes increases, the need for high-throughput processing and visualization of large volumetric data has become more pressing. At the same time, many-core processors and GPU accelerators are commonplace, making high-performance distributed heterogeneous computing systems affordable. However, effectively utilizing GPU clusters is difficult for novice programmers, and even experienced programmers often fail to fully leverage the computing power of new parallel architectures due their steep learning curve and programming complexity.

In this research, we propose a new domain-specific language for volume processing and visualization on distributed heterogeneous computing systems, called Vivaldi (VIsualization LAnguage for DIstributed sytstems). Vivaldi’s Python-like grammar and parallel processing abstractions provide flexible programming tools for non-experts to easily write high-performance parallel computing code. Vivaldi provides commonly used functions and numerical operators for customized visualization and high-throughput image processing applications. We demonstrate the performance and usability of Vivaldi on several examples ranging from volume rendering to image segmentation.

A paper has been accepted for presentation at VIS2014. (9-14 November 2014, Paris)

I don’t have any other details but will keep looking.

I first saw this in a tweet by Albert Swart.

A Distributed Systems Reading List

Friday, May 16th, 2014

A Distributed Systems Reading List by

From the introduction:

I often argue that the toughest thing about distributed systems is changing the way you think. The below is a collection of material I’ve found useful for motivating these changes.

Categories include:

  • Thought Provokers
  • Amazon
  • Google
  • eBay
  • Consistency Models
  • Theory
  • Languages and Tools
  • Infrastructure
  • Storage
  • Paxos Consensus
  • Other Consensus Papers
  • Gossip Protocols (Epidemic Behaviors)
  • P2P

Unless you think the knowledge in your domain is small enough to fit into a single system, I suggest you start reading about distributed systems this weekend.


I first saw this in a tweet by FoundationDB.

Distributed Environments and VirtualBox

Thursday, May 15th, 2014

While writing about Distributed LIBLINEAR: I discovered two guides to creating distributed environments with VirtualBox.

I mention that fact in the other post but thought the use of VirtualBox to create distributed environments needed more visibility than a mention.

The guides are:

MPI LIBLINEAR – VirtualBox Guide

Spark LIBLINEAR – VirtualBox Guide

and you will need to refer to the original site: Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments for information on using those environments with “Distributed LIBLINEAR.”

VirtualBox brings research on and using distributed systems within the reach of anyone with reasonable computing resources.

Please drop me a note if you are using VirtualBox to create distributed systems for topic map processing.

Distributed Systems and the End of the API

Monday, May 12th, 2014

Distributed Systems and the End of the API by Chas Emerick.

From the post:

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it.

I have two claims of which I would like to convince you today:

  1. The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.
  2. There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

A bit further into the paper, distributed systems are defined as:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.

A lot of heavy sledding but important for the future development of robust distributed systems.

It is important that people interested in semantics and XML participate in these discussions.

For example, Chas says of XML (and JSON):

the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention.

I’m am curious what is meant by “fundametally opaque,” at least insofar as Chas is talking about XML. If he means that independent changes impact the tree structure and make reconciliation of concurrent changes challenging, ok, but that’s not being opaque. And even that is an artifact of a processing model for XML, not XML proper.

I am even more concerned about the “semantics” to be addressed in distributed systems. At this point I will have to take Chas’ word for the distributed systems preserving machine to machine semantics (I have much reading left to do) but correct machine processing doesn’t warrant correct semantics for a human consumer of the same data.

I first saw this in a tweet by Tom Santero.

Idempotence Is Not a Medical Condition

Saturday, January 4th, 2014

Idempotence Is Not a Medical Condition by Pat Helland.

From the post:

The definition of distributed computing can be confusing. Sometimes, it refers to a tightly coupled cluster of computers working together to look like one larger computer. More often, however, it refers to a bunch of loosely related applications chattering together without a lot of system-level support.

This lack of support in distributed computing environments makes it difficult to write applications that work together. Messages sent between systems do not have crisp guarantees for delivery. They can get lost, and so, after a timeout, they are retried. The application on the other side of the communication may see multiple messages arrive where one was intended. These messages may be reordered and interleaved with different messages. Ensuring that the application behaves as intended can be very hard to design and implement. It is even harder to test.

In a world full of retried messages, idempotence is an essential property for reliable systems. Idempotence is a mathematical term meaning that performing an operation multiple times will have the same effect as performing it exactly one time. The challenges occur when messages are related to each other and may have ordering constraints. How are messages associated? What can go wrong? How can an application developer build a correctly functioning app without losing his or her mojo?

A very good discussion of idempotence in the context of distributed (message passing) systems. You may recall the TMRM defining merging operators to be idempotent. (Section 8 )

Pat’s examples on idempotence include:

  1. Sweeping the floor is idempotent. If you sweep it multiple times, you still get a clean floor.
  2. Baking a cake is not idempotent.
  3. Baking a cake starting from a shopping list (if you don’t care about money) is idempotent.

As an aside, #2 is not idempotent because “a cake” means a particular cake. It can only be baked once, at least if you want to have an edible result. In #3, the act of baking from a shopping list (I prefer a recipe) and not the cake, is idempotent.

The post is quite good, particularly if you are interested in a reliable messaging based system.

I first saw this in Stuff The Internet Says On Scalability For January 3rd, 2014, which had the following note:

Pat Helland with a classically great article on Idempotence. Fortunately the article is not idempotent. Everytime you read it your brain updates with something new.


Saturday, November 23rd, 2013

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.


SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?

Apache Aurora

Tuesday, October 1st, 2013

Apache Aurora

Apache Aurora entered incubation today!

From the webpage:

Aurora is a service scheduler used to schedule jobs onto Apache Mesos.

Oh, Apache Mesos?

From the webpage:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

All the wiring is still pretty close to the surface but that’s not going to last long.

Better to learn it now while people still think it is hard. 😉

Design Patterns for Distributed…

Sunday, September 29th, 2013

Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.

A bit dated (2009) but true design patterns should find refinement, not retirement.


  • Consistent Hashing
  • Consistency Models
  • Data Models
  • Storage Layouts
  • Log-Structured Merge Trees

Curious if you would suggest substantial changes to these patterns some four (4) years later?

Ceph: A Scalable, High-Performance Distributed File System

Saturday, May 4th, 2013

Ceph: A Scalable, High-Performance Distributed File System by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn.


We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

I have just started reading this paper but it strikes me as deeply important.


Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.

The ability to scale “metadata,” in this case inodes and directory entries (file names), bodes well for scaling topic map based information about files.

Not to mention that experience with generating functions may free us from the overhead of URI based addressing.

For some purposes, I may wish to act as though only files exist but in a separate operation, I may wish to address discrete tokens or even characters in one such file.

Interesting work and worth a deep read.

The source code for Ceph:

A Distributed Graph Engine…

Friday, March 22nd, 2013

A Distributed Graph Engine for Web Scale RDF Data by Kai Zeng, Jiacheng Yang, Haixum Wang, Bin Shao and Zhongyuan Wang.


Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data e ffectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the e ffectiveness of our approach.

From the conclusion:

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.

A result that is:

  • scalable
  • goes beyond SPARQL
  • can be ported to any in-memory key-value store

Merits a very close read.

Makes me curious what other data models would work better if cast as graphs?

I first saw this in a tweet by Juan Sequeda.

Distributed Graph Computing with Gremlin

Thursday, March 7th, 2013

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

Building a highly scaleable distributed… [Webinar, MySQL/Shard-Query]

Friday, February 8th, 2013

Webinar: Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query by Justin Swanhart.

From the post:

On Friday, February 15, 2013 10:00am Pacific Standard Time, I will be delivering a webinar entitled “Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query”

The first part of this webinar will focus on why distributed databases are needed, and on the techniques employed by Shard-Query to implement a distributed MySQL database. The focus will then proceed to the types of distributed (massively parallel processing) database applications which can be deployed with Shard-Query and the performance aspects of each.

The following types of implementations will be described:

  • Distributed row store using XtraDB cluster
  • Distributed append-only column store using Infobright Community Edition
  • Distributed “document store” using XtraDB cluster and Flexviews

If you are using (or planning on using) MySQL as a topic map backend, this could be the webinar for you!

Logic and Lattices for Distributed Programming

Wednesday, January 30th, 2013

Logic and Lattices for Distributed Programming

From the post:

Neil Conway from Berkeley CS is giving an advanced level talk at a meetup today in San Francisco on a new paper: Logic and Lattices for Distributed Programming – extending set logic to support CRDT-style lattices.

The description of the meetup is probably the clearest introduction to the paper:

Developers are increasingly choosing datastores that sacrifice strong consistency guarantees in exchange for improved performance and availability. Unfortunately, writing reliable distributed programs without the benefit of strong consistency can be very challenging.


In this talk, I’ll discuss work from our group at UC Berkeley that aims to make it easier to write distributed programs without relying on strong consistency. Bloom is a declarative programming language for distributed computing, while CALM is an analysis technique that identifies programs that are guaranteed to be eventually consistent. I’ll then discuss our recent work on extending CALM to support a broader range of programs, drawing upon ideas from CRDTs (A Commutative Replicated Data Type).

If you have an eye towards understanding the future then this is for you.

Do note that the Bloom language is treated more extensively in Datalog Reloaded. You may recall that the basis for tolog (a topic map query language) was Datalog.

Tombstones in Topic Map Future?

Wednesday, January 16th, 2013

Watching the What’s New in Cassandra 1.2 (Notes) webcast and encountered an unfamiliar term: “tombstones.”

If you are already familiar with the concept, skip to another post.

If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by the nodes replicating their content. Which works if all nodes are available but what if you delete data and a node is unavailable? When it comes back, the other nodes are “missing” data that needs to be replicated.

From the description at the Cassandra wiki, DistributedDeletes, not an easy problem to solve.

So, Cassandra turns it into a solvable problem.

Deletes are implemented with a special value known as a tombstone. The tombstone is propogated to nodes that missed the initial delete.

Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.

Distributed topic maps will face the same issue.

Complicated by imperative programming models of merging that make changes in properties that alter merging difficult to manage.

Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.

Piccolo: Distributed Computing via Shared Tables

Saturday, December 8th, 2012

Piccolo: Distributed Computing via Shared Tables

From the homepage:

Piccolo is a framework designed to make it easy to develop efficient distributed applications.

In contrast to traditional data-centric models (such as Hadoop) which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simulataneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.

Piccolo includes a number of optimizations to ensure that using this table interface is not just easy, but also fast:

To ensure locality of execution, tables are explicitly partitioned across machines. User code that interacts with the tables can specify a locality preference: this ensures that the code is executed locally with the data it is accessing.
Not all load is created equal – often some partition of a computation will take much longer then others. Waiting idly for this task to finish wastes valuable time and resources. To address this Piccolo can migrate tasks away from busy machines to take advantage of otherwise idle workers, all while preserving the locality preferences and the correctness of the program.
Failure Handling
Machines failures are inevitable, and generally occur when you’re at the most critical time in your computation. Piccolo makes checkpointing and restoration easy and fast, allowing for quick recovery in case of failures.
Managing the correct synchronization and update across a distributed system can be complicated and slow. Piccolo addresses this by allowing users to defer synchronization logic to the system. Instead of explicitly locking tables in order to perform updates, users can attach accumulation functions to a table: these are used automatically by the framework to correctly combine concurrent updates to a table entry.

The closer you are to the metal, the more aware you will be of the distributed nature of processing and data.

Will the success of distributed processing/storage be when all but systems architects are unaware of its nature?

Netflix open sources Hystrix resilience library [Component for Distributed TMs]

Wednesday, November 28th, 2012

Netflix open sources Hystrix resilience library

From the post:

Netflix has moved on from just releasing the tools it uses to test the resilience of the cloud services that power the video streaming company, and has now open sourced a library that it uses to engineer in that resilience. Hystrix is an Apache 2 licensed library which Netflix engineers have been developing over the course of 2012 and which has been adopted by many teams within the company. It is designed to manage how distributed services interact and give more tolerance to latency within those connections and the inevitable failures that can occur.

The library isolates access points between services and then stops any failures from cascading between those access points. Hystrix uses a Command pattern to execute or queue Command objects and evaluate whether the circuit to the service for which the command is destined for is in operation. This may not be the case where what Hystrix calls a circuit breaker has triggered leaving the circuit “open”. Circuit breakers can be placed into a system to make it easier to trigger a coordinated failover. The library also checks for other issues which may prevent the execution of the command.

Does your distributed TM have the resilience of Netflix?

Is that the new “normal” for resilience?

The post goes on to say that a dashboard is forthcoming to monitor Hystrix.

RICON 2012 [videos, slides, resources]

Friday, November 2nd, 2012

RICON 2012 [videos, slides, resources]

From the webpage:

Basho Technologies, along with our sponsors, proudly presented RICON 2012, a two day conference dedicated to Riak, developers, and the future of distributed systems in production. This page is dedicated to post-conference consumption. Here you will find slidedecks, resources, and much more.

Videos for the weekend (for those of you without NetFlix accounts):

  • Joseph Blomstedt, Bringing Consistency to Riak
  • Sean Cribbs, Data Structures in Riak
  • Selena Deckelmann, Rapid Data Prototyping With Postgres
  • Dietrich Featherston, Modern Radiology for Distributed Systems
  • Gary Flake, Building a Social Application on Riak
  • Theo Schlossnagle, Next Generation Monitoring of Large Scale Riak Applications
  • Ines Sombra and Michael Brodhead, Riak in the Cloud
  • Andrew Thompson, Cloning the Cloud – Riak and Multi Data Center Replication

It is hard to decide what to watch first.

What do you think?

Metamarkets open sources distributed database Druid

Friday, October 26th, 2012

Metamarkets open sources distributed database Druid by Elliot Bentley.

From the post:

It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.

Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.

The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.

It was exciting to read about Druid last year.

Now to see how exciting Druid is in fact!

Source code:

Service-Oriented Distributed Knowledge Discovery

Thursday, October 25th, 2012

Service-Oriented Distributed Knowledge Discovery by Domenico Talia, University of Calabria, Rende, Italy; Paolo Trunfio.

The publisher’s summary reads:

A new approach to distributed large-scale data mining, service-oriented knowledge discovery extracts useful knowledge from today’s often unmanageable volumes of data by exploiting data mining and machine learning distributed models and techniques in service-oriented infrastructures. Service-Oriented Distributed Knowledge Discovery presents techniques, algorithms, and systems based on the service-oriented paradigm. Through detailed descriptions of real software systems, it shows how the techniques, models, and architectures can be implemented.

The book covers key areas in data mining and service-oriented computing. It presents the concepts and principles of distributed knowledge discovery and service-oriented data mining. The authors illustrate how to design services for data analytics, describe real systems for implementing distributed knowledge discovery applications, and explore mobile data mining models. They also discuss the future role of service-oriented knowledge discovery in ubiquitous discovery processes and large-scale data analytics.

Highlighting the latest achievements in the field, the book gives many examples of the state of the art in service-oriented knowledge discovery. Both novices and more seasoned researchers will learn useful concepts related to distributed data mining and service-oriented data analysis. Developers will also gain insight on how to successfully use service-oriented knowledge discovery in databases (KDD) frameworks.

The idea of service-oriented data mining/analysis is very compatible with topic maps as marketable information sets.

It is not available through any of my usual channels, yet, but I would be cautious at $89.95 for 230 pages of text.

More comments to follow when I have a chance to review the text.

I first saw this at KDNuggets.

Distributed Algorithms in NoSQL Databases

Wednesday, October 10th, 2012

Distributed Algorithms in NoSQL Databases by Ilya Katsov.

From the post:

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities like replication of failure detection that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

  • Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted data replication and data repair.
  • Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resource like RAM or disk space are used evenly throughout the cluster.
  • System Coordination. Coordination techniques like leader election are used in many databases to implements fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.

Slow going but well worth the effort.

Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”

But you already knew that if you read this far! Enjoy!

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012