Archive for the ‘CRDT’ Category

A Conflict-Free Replicated JSON Datatype

Tuesday, August 16th, 2016

A Conflict-Free Replicated JSON Datatype by Martin Kleppmann, Alastair R. Beresford.

Abstract:

Many applications model their data in a general-purpose storage format such as JSON. This data structure is modified by the application as a result of user input. Such modifications are well understood if performed sequentially on a single copy of the data, but if the data is replicated and modified concurrently on multiple devices, it is unclear what the semantics should be. In this paper we present an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state. It supports arbitrarily nested list and map types, which can be modified by insertion, deletion and assignment. The algorithm performs all merging client-side and does not depend on ordering guarantees from the network, making it suitable for deployment on mobile devices with poor network connectivity, in peer-to-peer networks, and in messaging systems with end-to-end encryption.

Not a fast read and I need to think about its claim that JSON supports more complexity than XML. 😉

Enjoy!

CRDT: Datatype for the Apocalypse

Wednesday, November 18th, 2015

CRDT: Datatype for the Apocalypse by Alexander Songe.

From the description:

Conflict-free Replicated Data Types (CRDTs) are a hot new datatype in distributed programming that promise coordination-free and always eventually-consistent writes. Most of the stuff you will read about them is limited to the context of high-end distributed databases, but CRDT’s are more flexible and shouldn’t be limited to this field. In the talk, I will demonstrate how CRDT’s are great for applications that have partial connectivity (including websites): updates can be applied locally, and when communication is possible, you can send the data back up, and the data will remain consistent even in the most byzantine (or apocalyptic) scenarios. There are even scenarios that can support multiple simultaneous editors.

Beyond that, I will also demonstrate how Elixir’s metaprogramming can be used to compose complex models out of CRDT’s that themselves exhibit the same exact features. I will also exhibit some newer CRDT features, such as shared-context and delta-operation CRDT’s to overcome some of the shortcomings of older CRDT’s.

I plan to keep the talk light on theory (the academic literature is sufficient for that).

Great presentation on CRDTs!

I hope we are closer to use of CRDTs with documents than Alexander appears to think. Starting at time mark 5:27, Alexander says that document CRDTs can be ten times the size of the document for collaborative editing. (rough paraphrase)

I have written to Alexander to inquire if it would be possible to have more granular CRDTs, such as <p> element CRDTs which would address changes only to a particular paragraph and a separate document level CRDT that covers changes at the document (insertion/deletion of paragraphs)?

Alexander is the author of the loom CRDT library and the one link I didn’t see in Alexander’s presentation was for his GitHub page: https://github.com/asonge.

Additional resources cited in the presentation:

Lindsey Kuper – http://composition.al/

Christopher Meiklejohn – http://christophermeiklejohn.com

Carlos Bquero

riak_dt

Summary Paper: Marc Sharpiro, Nuno Pregui,ca, Carlos Baquero, Markek Zawirksi http://hal.upmc.fr/inria-00555588/document

(I supplied the links for Carlos Bquero and riak_dt.)

If you think about it, static data is an edge case of the sources of data always being in flux. (Date arising from the measurement or recording of those sources.)

Lasp

Saturday, August 1st, 2015

Lasp: A Language for Distributed, Eventually Consistent Computations by Christopher S. Meiklejohn and Peter Van Roy.

From the webpage:

Why Lasp?

Lasp is a new programming model designed to simplify large scale, fault-tolerant, distributed programming. Lasp is being developed as part of the SyncFree European research project. It leverages ideas from distributed dataflow extended with convergent replicated data types, or CRDTs. This supports computations where not all participants are online together at a given moment. The initial design supports synchronization-free programming by combining CRDTs together with primitives for composing them inspired by functional programming. This lets us write long-lived fault-tolerant distributed applications, including ones with nonmonotonic behavior, in a functional paradigm. The initial prototype is implemented as an Erlang library built on top of the Riak Core distributed systems infrastructure.

Interested?

Other resources include:

Lasp-dev, the mailing list for Lasp developers.

Lasp at Github.

I was reminded to post about Lasp by this post from Christopher Meiklejohn:

This post is a continuation of my first post about leaving Basho Technologies after almost 4 years.

It has been quite a long time in the making, but I’m finally happy to announce that I am the recipient of a Erasmus Mundus fellowship in their Joint Doctorate in Distribute Computing program. I will be pursuing a full-time Ph.D., with my thesis devoted to developing the Lasp programming language for distributed computing with the goals of simplifying deterministic, distributed, edge computation.

Starting in February 2016, I will be moving to Belgium to begin my first year of studies at the Université catholique de Louvain supervised by Peter Van Roy followed by a second year in Lisbon at IST supervised by Luís Rodrigues.

If you like this article, please consider supporting my writing on gittip.

Looks like exciting developments are ahead for Lash!

Congratulations to Christopher Meiklejohn!

A Comprehensive study of Convergent and Commutative Replicated Data Types

Wednesday, March 18th, 2015

A Comprehensive study of Convergent and Commutative Replicated Data Types reviewed by Adrian Colyer.

From the post:

This paper introduces the concept of a CRDT, a “simple, theoretically sound approach to eventual consistency.” Let’s adddress one of the pressing distributed systems questions of our time right here: “what does CRDT stand for?” We’ve seen over the last couple of weeks that there are two fundamental approaches to replication: you can execute operations at a primary and replicate the resulting state, or you can replicate the operations themselves. If you’re replicating state, then given some convergence rules for state, you can create Convergent Replicated Data Types. If you’re replicating operations, then given operations carefully designed to commute , you can create Commutative Replicated Data Types. Conveniently both ‘convergent’ and ‘commutative’ begin with C, so we can call both of these CRDTs. In both cases, the higher order goal is to avoid the need for coordination by ensuring that actions taken independently can’t conflict with each other (and thus can be composed at a later point in time). Thus we might also call them Conflict-free Replicated Data Types.

Think of it a bit like this: early on languages gave us standard data type implementations for set, list, map, and so on. Then we saw the introduction of concurrent versions of collections and related data types. With CRDTs, we are seeing the birth of distributed collections and related data types. Eventually any self-respecting language/framework will come with a distributed collections library – Riak already supports CRDTs and Jonas has an Akka CRDT library in github at least. As you read through the paper, it’s tempting to think “oh, these are pretty straightforward to implement,” but pay attention to the section on garbage collection – a bit like we saw with Edelweiss, making production implementations with state that doesn’t grow unbounded makes things more difficult.

If you haven’t read A comprehensive study of Convergent and Commutative Replicated Data Types by Marc Shapiro, Nuno Pregui¸ca, Carlos Baquero, and Marek Zawirski, this is a very useful and approachable introduction.

Enjoy!

CRDTs: Consistency without consensus

Tuesday, August 19th, 2014

CRDTs: Consistency without consensus by Peter Bourgon.

Abstract:

When you think of distributed systems, you probably think in terms of consistency via consensus. That is, enabling a heterogeneous group of systems to agree on facts, while remaining robust in the face of failure. But, as any distributed systems developer can attest, it’s harder than it sounds. Failure happens in myriad, byzantine ways, and failure modes interact unpredictably. Reliable distributed systems need more than competent engineering: they need a robust theoretical foundation. CRDTs, or Convergent Replicated Data Types, are a set of properties or behaviors, discovered more than invented, which enable a distributed system to achieve consistency without consensus, and sidestep entire classes of problems altogether. This talk provides a practical introduction to CRDTs, and describes a production CRDT system built at SoundCloud to serve high-volume time-series data.

Slides: bbuzz14-peter_bourgon_0.pdf

This is very much worth your time!

Great discussion of data models after time mark 23:00 (approximately).

BTW, the system discussed is open source and in production: http://github.com/soundcloud/roshi

Readings in conflict-free replicated data types

Tuesday, July 22nd, 2014

Readings in conflict-free replicated data types by Christopher Meiklejohn.

From the post:

This is a work in progress post outlining research topics related to conflict-free replicated data types, or CRDTs.

Yesterday, Basho announced the release of Riak 2.0.0 RC1, which contains a comprehensive set of “data types” that can be used for building more robust distributed applications. For an overview of how to use these data types in Riak to avoid custom, and error prone, merge functions, see the Basho documentation site.

You’re probably more familiar with another name for these data types: conflict-free replicated data types (CRDTs). Simply put, CRDTs are data structures which capture some aspect of causality, along with providing interfaces for safely operating over the value and correctly merging state with diverged and concurrently edited structures.

This provides a very useful property when combined with an eventual consistency, or AP-focused, data store: Strong Eventual Consistency (SEC). Strong Eventual Consistency is an even stronger convergence property than eventual consistency: given that all updates are delivered to all replicas, there is no need for conflict resolution, given the conflict-free merge properties of the data structure. Simply put, correct replicas which have received all updates have the same state.

Here’s a great overview by one of the inventors of CRDTs, Marc Shapiro, where he discusses conflict-free replicated data types and their relation to strong eventual consistency.

In this Hacker News thread, there was an interesting discussion about why one might want to implement these on the server, why implementing them is non-trivial, and what the most recent research related to them consists of.

This post serves as a reading guide on the the various areas of conflict-free replicated data types. Papers are broken down into various areas and sorted in reverse chronologically.

Relevant to me because the new change tracking in ODF is likely to be informed by CRDTs and because eventually consistent merging is important for distributed topic maps.

Confusion would be the result if the order of merging topics results in different topic maps.

CRDTs are an approach to avoid that unhappy outcome.

Enjoy!

PS: Remember to grab a copy of Riak 2.0.0 RC1.

Fun with CRDTs

Tuesday, May 20th, 2014

Fun with CRDTs by Richard Dallaway.

From the post:

At the end of last year I had some fun implementing a CRDT. These are data structures designed to combine together when you have no control over order of changes, timing of changes, or the number of participants in the data structure. The example I looked at was a sequential datatype, namely the WOOT CRDT for collaborative text editing.

Doesn’t:

combine together when you have no control over order of changes, timing of changes, or the number of participants in the data structure.

sound familiar? 😉

Richard points to:

Slides.

Video.

He also recommends that you watch: Reconciling Eventually-Consistent Data with CRDTs by Noel Welsh, before viewing his video.

Great stuff!

Chas Emerick on CRDT’s

Friday, May 16th, 2014

Several resources for Chas Emerick’s “A comprehensive study of Convergent and Communtative Replicated Data Types” at Papers We Love #4 May 15, 2014.

Slides.

Tom LaGatta’s Notes.

When the video of Chas’ presentation is posted I will update this post.

See also: Christopher Meiklejohn’s Time Clocks and the Ordering of Events in a Distributed System.

Abstract:

Whether you realize it or not, if you’ve built a rich web application in Ember.js, and you’re sending data between clients and a server, you’ve build a distributed system. This talk will discuss the challenges of building such a system, specifically the challenges related to preserving consistency when dealing with concurrent actors. We will begin with a primer on the various types of consistency, covering topics such as eventual consistency and causal consistency, and then move on to discuss recent industrial and academic research that aims to solve some of these problems without synchronization, specifically discussing operational transformations and convergent and commutative replicated data types.

CRDTs in New York (May 15, 2014)

Monday, May 12th, 2014

Chas Emerick – A comp study of Convergent & Commutative Replicated Data Types (May 15, 2014)

Thursday, May 15, 2014 7:00 PM

Viggle Inc. 902 Broadway 11, New York, NY (map)

From the meeting notice:

A comprehensive study of Convergent and Commutative Replicated Data Types‘ by Shapiro et al.

Commutative Replicated Data Types (CRDTs) are a formalism for providing practical data and programming primitives for use in distributed systems applications without necessitating expensive (and sometimes impractical) consensus mechanisms. Their key characteristic is that they provide conflict-free “merging” of distributed concurrent updates given only the weak guarantees of eventual consistency.

While this paper did not coin the term ‘CRDT’, it was the first to provide a comprehensive treatment of their definition, semantics, and possible construction separate from and beyond previous implementations of distributable datatypes that happened to provide CRDT-like semantics.

A “papers we love” meetup later this week in New York.

If you have or can make the time, RSVP and join the discussion!

Distributed Systems and the End of the API

Monday, May 12th, 2014

Distributed Systems and the End of the API by Chas Emerick.

From the post:

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it.

I have two claims of which I would like to convince you today:

  1. The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.
  2. There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

A bit further into the paper, distributed systems are defined as:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.

A lot of heavy sledding but important for the future development of robust distributed systems.

It is important that people interested in semantics and XML participate in these discussions.

For example, Chas says of XML (and JSON):

the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention.

I’m am curious what is meant by “fundametally opaque,” at least insofar as Chas is talking about XML. If he means that independent changes impact the tree structure and make reconciliation of concurrent changes challenging, ok, but that’s not being opaque. And even that is an artifact of a processing model for XML, not XML proper.

I am even more concerned about the “semantics” to be addressed in distributed systems. At this point I will have to take Chas’ word for the distributed systems preserving machine to machine semantics (I have much reading left to do) but correct machine processing doesn’t warrant correct semantics for a human consumer of the same data.

I first saw this in a tweet by Tom Santero.

Convergent and Commutative Replicated Data Types [Warning: Heavy Sledding Ahead]

Thursday, October 11th, 2012

A comprehensive study of Convergent and Commutative Replicated Data Types (PDF file) Marc Shapiro, Nuno M. Preguiça, Carlos Baquero, Marek Zawirski.

Abstract:

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, montonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs.

I found this following a link in the readme for riak dt which said:

WHAT?

Currently under initial development, riak_dt is a platform for convergent data types. It’s built on riak core and deployed with riak. All of our current work is around supporting fast, replicated, eventually consistent counters (though more data types are in the repo, and on the way.) This work is based on the paper – A Comprehensive study of Convergent and Commutative Replicated Data Types – which you may find an interesting read.

WHY?

Riak’s current model for handling concurrent writes is to store sibling values and present them to the client for resolution on read. The client must encode the logic to merge these into a single, meaningful value, and then inform Riak by doing a further write. Convergent data types remove this burden from the client, as their structure guarantees they will deterministically converge to a single value. The simplest of these data types is a counter.

I haven’t thought of merging of subject representatives as a quest for “consistency” but that is one way to think about it.

The paper is forty-seven pages long and has forty-four references, most of which I suspect are necessary to fully appreciate the work.

Having said that, I suspect it will be well worth the effort.