## Archive for the ‘Distributed Computing’ Category

### The Computer Science behind a modern distributed data store

Thursday, December 7th, 2017

From the description:

What we see in the modern data store world is a race between different approaches to achieve a distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are at least three necessary ingredients which are everything else than trivial to combine and of course even more challenging when heading for an acceptable performance.

Over the past years there has been significant progress in respect in both the science and practical implementations of such data stores. In his talk Max Neunhöffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches of distributed open-source data stores.

Topics are:

• Challenges in developing a distributed, resilient data store
• Consensus, distributed transactions, distributed query optimization and execution
• The inner workings of ArangoDB, Cassandra, Cockroach and RethinkDB

The talk will touch complex and difficult computer science, but will at the same time be accessible to and enjoyable by a wide range of developers.

I haven’t found the slides for this presentation but did stumble across ArangoDB Tech Talks and Slides.

Neunhöffer’s presentation will make you look at ArangoDB more closely.

### ARGUS

Tuesday, August 9th, 2016

From the post:

This is one post in a series about programming models and languages for distributed computing that I’m writing as part of my history of distributed programming techniques.

• Abstraction Mechanisms in CLU, Liskov, Barbara and Snyder, Alan and Atkinson, Russell and Schaffert, Craig, CACM 1977 (Liskov et al. 1977).
• Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Liskov, Barbara and Scheifler, Robert, TOPLAS 1982 (Liskov and Scheifler 1983).
• Orphan Detection in the Argus System, Walker, Edward Franklin, DTIC 1984 (Walker 1984).
• Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifer, Robert, SIGOPS 1987 (Liskov et al. 1987).
• Distributed Programming in Argus, Liskov, Barbara CACM 1988 (Liskov 1988).

I’m thinking about how to fix an XFCE trackpad problem and while I think about that, wanted to touch up the references from Christopher’s post.

Apologies but I was unable to find a public version of: Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifer, Robert, SIGOPS 1987 (Liskov et al. 1987).

Enjoy!

### Fun with ddR: Using Distributed Data Structures in R [Your Holiday Quiet Spot]

Saturday, December 12th, 2015

Fun with ddR: Using Distributed Data Structures in R by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise).

From the post:

A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.

In designing ddR, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package here. In this post, we show how to use ddR functions.

Imagine that you are trapped after an indeterminate holiday meal in the TV room where A Christmas Story is playing for the fourth time that day.

You are at the point of saying/doing something that will offend the living members of your spouses family and generations to come.

What can you do?

Surely your powers of concentration exceed those of bridge players who claim to not see naked people cavorting about during bridge games.

You will have to be woken out of your reverie and handed your coat when it is time to go.

Well, maybe not exactly but it beats the hell out of biting one of your smaller relatives.

### Learning from Distributed Data:… [Beating the Bounds]

Sunday, December 6th, 2015

From the post:

Scientific advances typically produce massive amounts of data, which is, of course, a good thing. But when many of these datasets are at multiple locations, instead of all in one place, it becomes difficult and costly for researchers to extract meaningful information from them.

So, the question becomes: “How do we learn from these datasets if they cannot be shared or placed in a central location?” says Trilce Estrada-Piedra.

Estrada-Piedra, an assistant professor of computer sciences at the University of New Mexico (UNM) is working to find the solution. She designs software that will enable researchers to collaborate with one another, using decentralized data, without jeopardizing privacy or raising infrastructure concerns.

“Our contributions will help speed research in a variety of sciences like health informatics, astronomy, high energy physics, climate simulations and drug design,” Estrada-Piedra says. “It will be relevant for problems where data is spread out in many different locations.”

The aim of the National Science Foundation (NSF)-funded scientist’s project is to build mathematical models from each of the “local” data banks — those at each distributed site. These models will capture data patterns, rather than specific data points.

“Researchers then can share only the models, instead of sharing the actual data,” she says, citing a medical database as an example. “The original data, for example, would have the patient’s name, age, gender and particular metrics like blood pressure, heart rate, etcetera, and that one patient would be a data point. But the models will project his or her information and extract knowledge from the data. It would just be math. The idea is to build these local models that don’t have personal information, and then share the models without compromising privacy.”

Estrada-Piedra is designing algorithms for data projections and middleware: software that acts as a bridge between an operating system or database and applications, especially on a network. This will allow distributed data to be analyzed effectively.
….

I’m looking forward to hearing more about Estrada-Piedra’s work, although we all know there are more than data projection and middleware issues involved. Those are very real and very large problems, but as with all human endeavors, the last mile is defined by local semantics.

Efficiently managing local semantics, that is enabling others to seamlessly navigate your local semantics and to in turn navigate the local semantics of others, isn’t a technical task, or at least not primarily.

The primary obstacle to such a task is captured by John D. Cook in Medieval software project management.

The post isn’t long so I will quite it here:

Centuries ago, English communities would walk the little boys around the perimeter of their parish as a way of preserving land records. This was called “beating the bounds.” The idea was that by teaching the boundaries to someone young, the knowledge would be preserved for the lifespan of that person. Of course modern geological survey techniques make beating the bounds unnecessary.

Software development hasn’t reached the sophistication of geographic survey. Many software shops use a knowledge management system remarkably similar to beating the bounds. They hire a new developer to work on a new project. That developer will remain tied to that project for the rest of his or her career, like a serf tied to the land. The knowledge essential to maintaining that project resides only in the brain of its developer. There are no useful written records or reliable maps, just like medieval property boundaries.

Does that sound familiar? That only you or another person “know” the semantics of your datastores? Are you still “beating the bounds” to document your data semantics?

Or as John puts it:

There are no useful written records or reliable maps, just like medieval property boundaries.

It doesn’t have to be that way. You could have reliable maps, reliable maps that are updated when your data is mapped for yet another project. Another ETL is the acronym.

You can, as a manager, of course, simply allow data knowledge to evaporate from your projects but that seems like a very poor business practice.

Johanna Rothman responded to John’s post in Breaking Free of Legacy Projects with the suggestion that every project should have several young boys and girls “beating the bounds” for every major project.

The equivalent of avoiding a single point of failure in medieval software project management.

Better than relying on a single programmer but using more modern information management/retention techniques would be a better option.

I guess the question is do you like using medieval project management techniques for your data or not?

If you do, you won’t be any worse off than any of your competitors with a similar policy.

On the other hand, should one of your competitors break ranks, start using topic maps for example for mission critical data, well, you have been warned.

### Christopher Meiklejohn – Doctoral Thesis Proposal

Wednesday, November 18th, 2015

From the proposal:

The goal of this research is to provide a declarative way to design distributed, fault-tolerant applications that do not contain observable nondeterminism. These applications should be able to be placed at arbitrary locations in the network: mobile devices, “Internet of Things” hardware, or personal computers. Applications should be tolerant to arbitrary message delays, duplication and reordering: these are first-class requirements of distributed computations over unreliable networks. When writing these applications, developers should not have to use traditional concurrency control or synchronization mechanisms such as mutexes, semaphores, or monitors: the primitive operations for composition in the language should yield “deterministic-by-construction” applications.

Christopher is looking for comments on his doctoral thesis proposal.

His proposal is dated November 11, 2015, so time remains for you to review the proposal and make comments.

It would be really nice if the community that will benefit from Christopher’s work would contribute some comments on it.

### DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Tuesday, November 17th, 2015

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

• Owned by private companies with no incentive to share.
• Public databases are used by few people with no incentive to contribute.
• Large databases can’t fit on one machine and are expensive to traverse.
• Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours. That makes you feel like you are almost robbing the publisher at that price. (NOT!) Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay$35.00 (U.S.) to have access for 24 hours. Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny), a very low micro-payment would be more suitable. Say $00.01 per paragraph or something on that order. Payable out of a deposit with the publisher. I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data and if you choose to buy from a private seller with no online history, that’s just your bad. Imagine that for the Paris incident (hypothetical, none of the following is true), I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language. I could cast that data (I’m not fond of the poverty of triples) into a graph format and make it know as part of a distributed graph system. Some of the data, such as the identities of the people for who I had records, would appear in the graphs of others as “new” data. Up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them. Data could even carry a public price tag. That is if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it. I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all. The choices we face now of “free” (read unpaid for research, writing and publication, which excludes many) versus the “pay-per-view” model that supports early 20th century models of sloth, cronyism and gate-keeping, aren’t the only ones. We need to actively seek out better and more nuanced choices. ### Multiagent Systems Monday, November 16th, 2015 Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations by Yoav Shoham and Kevin Leyton-Brown. From the webpage: Multiagent systems consist of multiple autonomous entities having different information and/or diverging interests. This comprehensive introduction to the field offers a computer science perspective, but also draws on ideas from game theory, economics, operations research, logic, philosophy and linguistics. It will serve as a reference for researchers in each of these fields, and be used as a text for advanced undergraduate and graduate courses. Emphasizing foundations, the authors offer a broad and rigorous treatment of their subject, with thorough presentations of distributed problem solving, non-cooperative game theory, multiagent communication and learning, social choice, mechanism design, auctions, coalitional game theory, and logical theories of knowledge, belief, and other aspects of rational agency. For each topic, basic concepts are introduced, examples are given, proofs of key results are offered, and algorithmic considerations are examined. An appendix covers background material in probability theory, classical logic, Markov decision processes, and mathematical programming. Even better from the introduction: Imagine a personal software agent engaging in electronic commerce on your behalf. Say the task of this agent is to track goods available for sale in various online venues over time, and to purchase some of them on your behalf for an attractive price. In order to be successful, your agent will need to embody your preferences for products, your budget, and in general your knowledge about the environment in which it will operate. Moreover, the agent will need to embody your knowledge of other similar agents with which it will interact (e.g., agents who might compete with it in an auction, or agents representing store owners)—including their own preferences and knowledge. A collection of such agents forms a multiagent system. The goal of this book is to bring under one roof a variety of ideas and techniques that provide foundations for modeling, reasoning about, and building multiagent systems. Somewhat strangely for a book that purports to be rigorous, we will not give a precise definition of a multiagent system. The reason is that many competing, mutually inconsistent answers have been offered in the past. Indeed, even the seemingly simpler question—What is a (single) agent?—has resisted a definitive answer. For our purposes, the following loose definition will suffice: Multiagent systems are those systems that include multiple autonomous entities with either diverging information or diverging interests, or both. This looks like a great item for a wish list this close to the holidays. Broad enough to keep your interest up and relevant enough to argue you are “working” and not just reading. 😉 ### Microsoft open sources Distributed Machine Learning Toolkit… Friday, November 13th, 2015 From the post: Researchers at the Microsoft Asia research lab this week made the Microsoft Distributed Machine Learning Toolkit openly available to the developer community. The toolkit, available now on GitHub, is designed for distributed machine learning — using multiple computers in parallel to solve a complex problem. It contains a parameter server-based programing framework, which makes machine learning tasks on big data highly scalable, efficient and flexible. It also contains two distributed machine learning algorithms, which can be used to train the fastest and largest topic model and the largest word-embedding model in the world. The toolkit offers rich and easy-to-use APIs to reduce the barrier of distributed machine learning, so researchers and developers can focus on core machine learning tasks like data, model and training. The toolkit is unique because its features transcend system innovations by also offering machine learning advances, the researchers said. With the toolkit, the researchers said developers can tackle big-data, big-model machine learning problems much faster and with smaller clusters of computers than previously required. For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines. This has been a banner week for machine learning! On November 9th, Google open sourced TensorFlow. On November 12th, Single Artificial Neuron Taught to Recognize Hundreds of Patterns (why neurons have thousands of synapses) is published. On November 12th, Microsoft open sources its Distributed Machine Learning Toolkit. Not every week is like that for machine learning but it is impressive when that many major stories drop in a week! I do like the line from the Microsoft announcement: For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. (emphasis added) Prices are falling all the time and a 24 machine cluster should be within the reach of most startups if not most individuals now. Next year? Possibly within the reach of a large number of individuals. What are your machine learning plans for 2016? ### Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3. One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about. This following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each. m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around$0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down you cluster after you are finished using it. How does$0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

### Notes on Theory of Distributed Systems

Monday, May 4th, 2015

Notes on Theory of Distributed Systems by James Aspnes.

From the preface:

These are notes for the Spring 2014 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems. This document also incorporates the lecture schedule and assignments, as well as some sample assignments from previous semesters. Because this is a work in progress, it will be updated frequently over the course of the semester.

Notes from Fall 2011 can be found at http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.pdf.

Notes from earlier semesters can be found at http://pine.cs.yale.edu/pinewiki/465/.

Much of the structure of the course follows the textbook, Attiya and Welch’s Distributed Computing [AW04], with some topics based on Lynch’s Distributed Algorithms [Lyn96] and additional readings from the research literature. In most cases you’ll find these materials contain much more detail than what is presented here, so it is better to consider this document a supplement to them than to treat it as your primary source of information.

When something exceeds three hundred (> 300) pages, I have trouble calling it “notes.” 😉

A treasure trove of information on distributed computing.

I first saw this in a tweet by Henry Robinson.

### The Morning Paper [computing papers selected by Adrian Colyer]

Sunday, February 22nd, 2015

The Morning Paper [computing papers selected by Adrian Colyer]

The Morning Paper: a short summary of an important, influential, topical or otherwise interesting paper in the field of CS every weekday. The Morning Paper started out as a twitter project (#themorningpaper), then it became clear a longer form was also necessary because some papers just have too much good content to get across in a small batch of 140-character tweets!

The daily selection will still be tweeted on my twitter account (adriancolyer), with a quote or two to whet your appetite. Any longer excerpts or commentary will live here.

Why ‘The Morning Paper?’ (a) it’s a habit I enjoy, and (b) if one or two papers catch your attention and lead you to discover (or rediscover) something of interest then I’m happy.

Adrian’s 100th post was January 7, 2015 so you have some catching up to do. 😉

Very impressive and far more useful than the recent “newspaper” formats that automatically capture content from a variety of sources.

The Morning Paper is curated content, which makes all the difference in the world.

There is an emphasis on distributed computing making The Morning Paper a must read for anyone interested in the present and future of computing services.

Enjoy!

I first saw this in a tweet by Tyler Treat.

Thursday, December 11th, 2014

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such a Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding the technological issues would be many, it is contended that the payoff would be immense, but in ways we won’t know until it is available. I suspect Martin is right but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies those problems, it doesn’t make them easier to solve.

On the whole I think the idea of a distributed crawl of the web is a great idea, but while that develops, we best hone our skills at extracting value from the partial crawls that already exist.

### World Community Grid

Saturday, December 6th, 2014

World Community Grid

World Community Grid enables anyone with a computer, smartphone or tablet to donate their unused computing power to advance cutting-edge scientific research on topics related to health, poverty and sustainability. Through the contributions of over 650,000 individuals and 460 organizations, World Community Grid has supported 24 research projects to date, including searches for more effective treatments for cancer, HIV/AIDS and neglected tropical diseases. Other projects are looking for low-cost water filtration systems and new materials for capturing solar energy efficiently.

How World Community Grid Works

World Community Grid has enabled important scientific advances in cancer treatment and clean energy. Our research partners have published over 35 peer-reviewed papers in scientific journals and have completed the equivalent of hundreds of thousands of years of research in less than a decade. World Community Grid is the biggest volunteer computing initiative devoted to humanitarian science, and is as powerful as some of the world’s fastest supercomputers. Learn More

On the cusp of current trends

World Community Grid brings together volunteers and researchers at the intersection of computational chemistry, open science and citizen science – three trends that are transforming the way scientific research is conducted. In 2013, World Community Grid also became one of the first major volunteer computing initiatives to enable mobile computing on Android smartphones and tablets. Learn More

An award-winning program

The pioneering work done on World Community Grid has been recognized internationally with awards including the Computerworld Data+ Editors Choice Award, Business in the Community Coffey International Award, and the Asian Forum on Corporate Social Responsibility’s Asian CSR Award.

Who are we?

Started in 2004, World Community Grid is a philanthropic initiative of IBM Corporate Citizenship, the corporate social responsibility and philanthropy division of IBM. Through Corporate Citizenship, IBM donates its technology and talent to address some of the world’s most pressing social and environmental issues.

Meet Our Team

One current focus is on Ebola vaccine research.

I saw this in a tweet by IBM Research. The tweet pointed to: IBM Helps You Donate Computer Power to Fight Ebola, where the only link to IBM wasn’t a hyperlink, just text. Thought you might prefer a link to the actual site rather than prose about the site. 😉

Enjoy!

### Tupleware: Redefining Modern Analytics

Saturday, October 18th, 2014

Tupleware: Redefining Modern Analytics by Andrew Crotty and Alexander Galakatos.

From the post:

Up until a decade ago, most companies sufficed with simple statistics and offline reporting, relying on traditional database management systems (DBMSs) to meet their basic business intelligence needs. This model prevailed in a time when data was small and analysis was simple.

But data has gone from being scarce to superabundant, and now companies want to leverage this wealth of information in order to make smarter business decisions. This data explosion has given rise to a host of new analytics platforms aimed at flexible processing in the cloud. Well-known systems like Hadoop and Spark are built upon the MapReduce paradigm and fulfill a role beyond the capabilities of traditional DBMSs. However, these systems are engineered for deployment on hundreds or thousands of cheap commodity machines, but non-tech companies like banks or retailers rarely operate clusters larger than a few dozen nodes. Analytics platforms, then, should no longer be built specifically to accommodate the bottlenecks of large cloud deployments, focusing instead on small clusters with more reliable hardware.

Furthermore, computational complexity is rapidly increasing, as companies seek to incorporate advanced data mining and probabilistic models into their business intelligence repertoire. Users commonly express these types of tasks as a workflow of user-defined functions (UDFs), and they want the ability to compose jobs in their favorite programming language. Yet, existing analytics systems fail to adequately serve this new generation of highly complex, UDF-centric jobs, especially when companies have limited resources or require sub-second response times. So what is the next logical step?

It’s time for a new breed of systems. In particular, a platform geared toward modern analytics needs the ability to (1) concisely express complex workflows, (2) optimize specifically for UDFs, and (3) leverage the characteristics of the underlying hardware. To meet these requirements, the Database Group at Brown University is developing Tupleware, a parallel high-performance UDF processing system that considers the data, computations, and hardware together to produce results as efficiently as possible.

The article is the “lite” introduction to Tuppleware. You may be more interested in:

Abstract:

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world—petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to several terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems.

This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware’s architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.

Subject to the “in memory” limitation, speedups of 10 – 6,000x over other systems are nothing to dismiss without further consideration.

Interesting to see that “medium” data now reaches into the terabyte range. 😉

Are “mini-clouds” in the offing that provide specialized processing models?

The Tuppleware website.

I first saw this in a post by Danny Bickson, Tuppleware.

### Simple Testing Can Prevent Most Critical Failures:…

Thursday, October 9th, 2014

Abstract:

Large, production quality distributed systems still fail periodically,and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop Map Reduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code–the last line of defense–even with out an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

If you aren’t already convinced you need to read this paper, consider one more quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. (emphasis added)

How will catastrophic system failure reflect on your product or service? Hint: It doesn’t reflect well on topic maps or any other service or technology.

I say “read” this paper, perhaps putting it on a 90-day reading rotation would be better.

### GraphX: Graph Processing in a Distributed Dataflow Framework

Monday, September 15th, 2014

GraphX: Graph Processing in a Distributed Dataflow Framework by Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica.

Abstract:

In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.

The “other” systems for comparison were GraphLab and Giraph. Those systems were tuned in cooperation with experts in their use. These are some of the “fairest” benchmarks you are likely to see this year. Quite different from “shiny graph engine” versus lame or misconfigured system benchmarks.

Definitely the slow-read paper for this week!

I first saw this in a tweet by Arnon Rotem-Gal-Oz.

### …Loosely Consistent Distributed Programming

Thursday, August 21st, 2014

Abstract:

Driven by the widespread adoption of both cloud computing and mobile devices, distributed computing is increasingly commonplace. As a result, a growing proportion of developers must tackle the complexity of distributed programming—that is, they must ensure correct application behavior in the face of asynchrony, concurrency, and partial failure.

To help address these difficulties, developers have traditionally relied upon system infrastructure that provides strong consistency guarantees (e.g., consensus protocols and distributed transactions). These mechanisms hide much of the complexity of distributed computing—for example, by allowing programmers to assume that all nodes observe the same set of events in the same order. Unfortunately, providing such strong guarantees becomes increasingly expensive as the scale of the system grows, resulting in availability and latency costs that are unacceptable for many modern applications.

Hence, many developers have explored building applications that only require loose consistency guarantees—for example, storage systems that only guarantee that all replicas eventually converge to the same state, meaning that a replica might exhibit an arbitrary state at any particular time. Adopting loose consistency involves making a well-known tradeoff: developers can avoid paying the latency and availability costs incurred by mechanisms for achieving strong consistency, but inexchange they must deal with the full complexity of distributed computing. As a result, achieving correct application behavior in this environment is very difficult.

This thesis explores how to aid developers of loosely consistent applications by providing programming language support for the difficulties they face. The language level is a natural place to tackle this problem: because developers that use loose consistency have fewer system facilities that they can depend on, consistency concerns are naturally pushed into application logic. In part, our goal has been to recognize, formalize, and automate application-level consistency patterns.

We describe three language variants that each tackle a different challenge in distributed programming. Each variant is a modification of Bloom, a declarative language for distributed programming we have developed at UC Berkeley. The first variant of Bloom, BloomL, enables deterministic distributed programming without the need for distributed coordination. Second, Edelweiss allows distributed storage reclamation protocols to be generated in a safe and automatic fashion. Finally, BloomPO adds sophisticated ordering constraints that we use to develop a declarative, high-level implementation of concurrent editing, a particularly difficult class of loosely consistent programs.

Unless you think of topic maps as static files, recent developments in “loosely consistent distributed programming” should be high on your reading list.

It’s entirely possible to have a topic map that is a static file, even one that has been printed out to paper. But that seems like a poor target for development. Captured information begins progressing towards staleness from the moment of its capture.

I first saw this in a tweet by Peter Bailis.

### A Distributed Systems Reading List

Friday, May 16th, 2014

From the introduction:

I often argue that the toughest thing about distributed systems is changing the way you think. The below is a collection of material I’ve found useful for motivating these changes.

Categories include:

• Thought Provokers
• Amazon
• eBay
• Consistency Models
• Theory
• Languages and Tools
• Infrastructure
• Storage
• Paxos Consensus
• Other Consensus Papers
• Gossip Protocols (Epidemic Behaviors)
• P2P

Unless you think the knowledge in your domain is small enough to fit into a single system, I suggest you start reading about distributed systems this weekend.

Enjoy!

I first saw this in a tweet by FoundationDB.

### Distributed Environments and VirtualBox

Thursday, May 15th, 2014

While writing about Distributed LIBLINEAR: I discovered two guides to creating distributed environments with VirtualBox.

I mention that fact in the other post but thought the use of VirtualBox to create distributed environments needed more visibility than a mention.

The guides are:

MPI LIBLINEAR – VirtualBox Guide

Spark LIBLINEAR – VirtualBox Guide

and you will need to refer to the original site: Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments for information on using those environments with “Distributed LIBLINEAR.”

VirtualBox brings research on and using distributed systems within the reach of anyone with reasonable computing resources.

Please drop me a note if you are using VirtualBox to create distributed systems for topic map processing.

### Distributed Systems and the End of the API

Monday, May 12th, 2014

Distributed Systems and the End of the API by Chas Emerick.

From the post:

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it.

I have two claims of which I would like to convince you today:

1. The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.
2. There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

A bit further into the paper, distributed systems are defined as:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.

A lot of heavy sledding but important for the future development of robust distributed systems.

It is important that people interested in semantics and XML participate in these discussions.

For example, Chas says of XML (and JSON):

the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention.

I’m am curious what is meant by “fundametally opaque,” at least insofar as Chas is talking about XML. If he means that independent changes impact the tree structure and make reconciliation of concurrent changes challenging, ok, but that’s not being opaque. And even that is an artifact of a processing model for XML, not XML proper.

I am even more concerned about the “semantics” to be addressed in distributed systems. At this point I will have to take Chas’ word for the distributed systems preserving machine to machine semantics (I have much reading left to do) but correct machine processing doesn’t warrant correct semantics for a human consumer of the same data.

I first saw this in a tweet by Tom Santero.

### LVars:…

Thursday, March 27th, 2014

At 144 slides and no sound, you probably need to pick up some background to really appreciate the slides.

I would start with: A ten-minute talk about my research, continue with later post under LVars and then onto:

LVars project repo: http://github.com/iu-parfunc/lvars

Code from this talk: http://github.com/lkuper/lvar-examples

Research blog: http://composition.al

Take up the slides when you feel comfortable with the nomenclature and basic concepts.

### ethereum

Monday, March 17th, 2014

ethereum

From the webpage:

Ethereum is a platform and a programming language that makes it possible for any developer to build and publish next-generation distributed applications.

Ethereum can be used to codify, decentralize, secure and trade just about anything: voting, domain names, financial exchanges, crowdfunding, company governance, contracts and agreements of most kind, intellectual property, and even smart property thanks to hardware integration.

Ethereum borrows the concept of decentralized consensus that makes bitcoin so resilient, yet makes it trivial to build on its foundation. To find out more about how Ethereum works, consult the whitepaper
.

…will you build out of the Ether?

Distributed systems are a great idea but most governments won’t tolerate parallel monetary systems.

Primarily because it interferes with the ability of a central bank to simply declare there is more money in the national treasury than anyone suspected.

Adjust the decimal place a bit and suddenly the government is solvent again.

Having a parallel monetary system, like Kong bucks, would interfere with that capability.

### Erlang Handbook

Wednesday, November 27th, 2013

Erlang Handbook: A concise reference for Erlang

From the webpage:

Originally written by Bjarne Däcker and later revised by Robert Virding, the Erlang Handbook is a summary of the language features and the runtime system. It is aimed at people with some programming experience, serving as a quick introduction to the Erlang domain.

Erlang Handbook (current release, pdf)

The handbook is just that, a handbook. At forty-six pages, it is a highly useful but also highly condensed view of Erlang.

I have been reminded of Erlang twice this week already.

The first time was by The Distributed Complexity of Large-scale Graph Processing research paper with its emphasis on message passing between graph nodes as a processing model.

The other reminder was Jans Aasman’s How to Use Graph Databases… [Topic Maps as Graph++?].

Jans was extolling the use of graphs to manage data about telecom customers, with an emphasis on “near real-time.”

Something kept nagging at me when I was watching the video but it was only afterwards that I remembered Ericsson’s development and use of Erlang for exactly that use case.

By way of excuse, I was watching Jans’ video at the end of a long day. 😉

Suggestions on where I can look for anyone using Erlang-based message passing for distributed processing of graphs?

With a truthful description like this one:

Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability. Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging. Erlang’s runtime system has built-in support for concurrency, distribution and fault tolerance. (from http://www.erlang.org/)

are there any contraindications for Erlang?

### The Distributed Complexity of Large-scale Graph Processing

Tuesday, November 26th, 2013

The Distributed Complexity of Large-scale Graph Processing by Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, Peter Robinson.

Abstract:

Motivated by the increasing need for fast distributed processing of large-scale graphs such as the Web graph and various social networks, we study a message-passing distributed computing model for graph processing and present lower bounds and algorithms for several graph problems. This work is inspired by recent large-scale graph processing systems (e.g., Pregel and Giraph) which are designed based on the message-passing model of distributed computing.

Our model consists of a point-to-point communication network of $k$ machines interconnected by bandwidth-restricted links. Communicating data between the machines is the costly operation (as opposed to local computation). The network is used to process an arbitrary $n$-node input graph (typically $k$ machines (a common implementation in many real world systems). Our goal is to study fundamental complexity bounds for solving graph problems in this model.

We present techniques for obtaining lower bounds on the distributed time complexity. Our lower bounds develop and use new bounds in random-partition communication complexity. We first show a lower bound of $\Omega(n/k)$ rounds for computing a spanning tree (ST) of the input graph. This result also implies the same bound for other fundamental problems such as computing a minimum spanning tree (MST). We also show an $\Omega(n/k^2)$ lower bound for connectivity, ST verification and other related problems.

We give algorithms for various fundamental graph problems in our model. We show that problems such as PageRank, MST, connectivity, and graph covering can be solved in $\tilde{O}(n/k)$ time, whereas for shortest paths, we present algorithms that run in $\tilde{O}(n/\sqrt{k})$ time (for $(1+\epsilon)$-factor approx.) and in $\tilde{O}(n/k)$ time (for $O(\log n)$-factor approx.) respectively.

The author’s state their main goal is:

…is to investigate the distributed time complexity, i.e., the number of distributed “rounds”, for solving various fundamental graph problems. The time complexity not only captures the (potential) speed up possible for a problem, but it also implicitly captures the communication cost of the algorithm as well, since links can transmit only a limited amount of bits per round; equivalently, we can view our model where instead of links, machines can send/receive only a limited amount of bits per round (cf. Section 1.1).

How would you investigate the number of “rounds,” to perform merging in a message passing topic map system?

With no one order of merging to reach a particular state, would you measure it statistically for some merging criteria N?

I first saw this in a tweet by Stefano Bertolo.

### SAMOA

Saturday, November 23rd, 2013

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.

Enter SAMOA

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?

### On Demand Memory Specialization for Distributed Graph Databases

Thursday, October 24th, 2013

On Demand Memory Specialization for Distributed Graph Databases by Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep Lluís Larriba-Pey.

Abstract:

In this paper, we propose the DN-tree that is a data structure to build lossy summaries of the frequent data access patterns of the queries in a distributed graph data management system. These compact representations allow us an efficient communication of the data structure in distributed systems. We exploit this data structure with a new Dynamic Data Partitioning strategy (DYDAP) that assigns the portions of the graph according to historical data access patterns, and guarantees a small network communication and a computational load balance in distributed graph queries. This method is able to adapt dynamically to new workloads and evolve when the query distribution changes. Our experiments show that DYDAP yields a throughput up to an order of magnitude higher than previous methods based on cache specialization, in a variety of scenarios, and the average response time of the system is divided by two.

Graph partitioning based on evolving and summarized query experience. Results in a min cut on far less than the whole graph.

A heavy read but promising approach.

Makes me curious about non-uniform merging priorities.

For example, there could be some topics that merge frequently, say those representing financial information. While others, say for historical figures, may merge infrequently.

Rather than treating all topics equally in terms of computational resources, some topics could be processed at higher priority than others. (I am assuming an appropriately defined SLA.)

Although the partitioning approach could be applicable to distributed topic maps as well.

### Apache Aurora

Tuesday, October 1st, 2013

Apache Aurora

Apache Aurora entered incubation today!

From the webpage:

Aurora is a service scheduler used to schedule jobs onto Apache Mesos.

Oh, Apache Mesos?

From the webpage:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

All the wiring is still pretty close to the surface but that’s not going to last long.

Better to learn it now while people still think it is hard. 😉

### Design Patterns for Distributed…

Sunday, September 29th, 2013

Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.

A bit dated (2009) but true design patterns should find refinement, not retirement.

Covers:

• Consistent Hashing
• Consistency Models
• Data Models
• Storage Layouts
• Log-Structured Merge Trees

Curious if you would suggest substantial changes to these patterns some four (4) years later?

### …Conceptual Model For Evolving Graphs

Wednesday, September 11th, 2013

An Analytics-Aware Conceptual Model For Evolving Graphs by Amine Ghrab, Sabri Skhiri, Salim Jouili, and Esteban Zimanyi.

Abstract:

Graphs are ubiquitous data structures commonly used to represent highly connected data. Many real-world applications, such as social and biological networks, are modeled as graphs. To answer the surge for graph data management, many graph database solutions were developed. These databases are commonly classified as NoSQL graph databases, and they provide better support for graph data management than their relational counterparts. However, each of these databases implement their own operational graph data model, which differ among the products. Further, there is no commonly agreed conceptual model for graph databases.

In this paper, we introduce a novel conceptual model for graph databases. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable conceptual components to perform rich analysis tasks. These components take into account the evolving aspect of the graph. Our model is analytics-oriented, flexible and incremental, enabling analysis over evolving graph data. The proposed model provides a typing mechanism for the underlying graph, and formally defines the minimal set of data structures and operators needed to analyze the graph.

The authors concede that much work remains to be done, both theoretical and practical on their proposal.

With the rise of distributed computing, every “fact” depends upon a calculated moment of now. What was a “fact” five minutes ago may not longer be considered as a “fact” but as an “error.”

Who is responsible for changes in “facts,” warranties for “facts,” who gives and gets notices about changes in “facts,” all remain to be determined.

Models for evolving graphs may assist in untangling the rights, obligations and relationships that are nearly upon us with distributed computing.

### Building a distributed search system

Wednesday, August 28th, 2013

Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.

From the preface:

This work analyses the problem coming from the so called Big Data scenario, which can be defined as the technological challenge to manage and administer quantity of information with global dimension in the order of Terabyte (1012bytes) or Petabyte (1015bytes) and with an exponential growth rate. We’ll explore a technological and algorithmic approach to handle and calculate theses amounts of data that exceed the limit of computation of a traditional architecture based on real-time request processing:in particular we’ll analyze a singular open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.

We’ll describe also how to distribute a cluster of common server to create a Virtual File System and use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation will be a web based application which offers to the user a unified searching interface against a collection of technical papers. The scope is to demonstrate that a performant search system can be obtained pre-processing the data using the Map and Reduce paradigm, in order to obtain a real time response, which is independent to the underlying amount of data. Finally we’ll compare this solutions to different approaches based on clusterization or No SQL solutions, with the scope to describe the characteristics of concrete scenarios, which suggest the adoption of those technologies.

Fairly complete (75 pages) report on a project indexing academic papers with Lucene and Hadoop.

I would like to see treatment of the voiced demand for “real-time processing” versus the need for “real-time processing.”

When I started using research tools, indexes, like the Readers Guide to Periodical Literature were at a minimum two (2) weeks behind popular journals.

Academic indexes ran that far behind if not a good bit longer.

The timeliness of indexing journal articles is now nearly simultaneous with publication.

Has the quality of our research improved due to faster access?

I can imagine use cases, drug interactions for example, the discovery of which should be streamed out as soon as practical.

But drug interactions are not the average case.

It would be very helpful to see research on what factors favor “real-time” solutions and which are quite sufficient with “non-real-time” solutions.