Archive for the ‘Distributed Systems’ Category
Saturday, May 4th, 2013
Ceph: A Scalable, High-Performance Distributed File System by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn.
Abstract:
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
I have just started reading this paper but it strikes me as deeply important.
Consider:
Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.
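To make "replacing allocation tables with a function" concrete, here is a minimal sketch in Python. It uses rendezvous (highest-random-weight) hashing, which is not CRUSH, only a simpler stand-in for the same idea: any client can compute where an object lives from its name and the device list, with no table lookup. The function name `place` and the OSD names are made up for illustration.

```python
import hashlib

def place(object_name, osds, replicas=2):
    """Return the OSDs that should hold object_name.

    Every (object, osd) pair gets a pseudo-random score and the
    highest-scoring OSDs win. Any client holding the same OSD list
    computes the same answer, so no allocation table is consulted.
    """
    def score(osd):
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd0", "osd1", "osd2", "osd3"]  # hypothetical device names
print(place("inode-42.chunk-0", osds))
```

A useful property of this stand-in: removing a device that was not chosen for an object leaves that object's placement unchanged, so little data moves when the cluster changes.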
The ability to scale “metadata,” in this case inodes and directory entries (file names), bodes well for scaling topic map based information about files.
Not to mention that experience with generating functions may free us from the overhead of URI based addressing.
For some purposes, I may wish to act as though only files exist but in a separate operation, I may wish to address discrete tokens or even characters in one such file.
Interesting work and worth a deep read.
The source code for Ceph: http://ceph.sourceforge.net/.
Posted in Distributed Systems, Files, Storage | No Comments »
Friday, March 22nd, 2013
A Distributed Graph Engine for Web Scale RDF Data by Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao and Zhongyuan Wang.
Abstract:
Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.
From the conclusion:
We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.
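As a rough illustration of storing RDF as a graph on top of nothing but key-value get and put, here is a minimal Python sketch. A plain dict stands in for the distributed in-memory store, and the key layout (`("out", node)` and `("in", node)`) is my assumption for illustration, not the layout used by Trinity.RDF.

```python
from collections import deque

# A plain dict stands in for the distributed in-memory key-value store.
store = {}

def add_triple(subject, predicate, obj):
    """Store one triple as an outgoing and an incoming adjacency entry."""
    store.setdefault(("out", subject), []).append((predicate, obj))
    store.setdefault(("in", obj), []).append((predicate, subject))

def reachable(start, goal):
    """Breadth-first search over outgoing edges, ignoring predicates."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for _predicate, neighbor in store.get(("out", node), []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

add_triple("alice", "knows", "bob")
add_triple("bob", "worksFor", "acme")
print(reachable("alice", "acme"))
print(reachable("acme", "alice"))
```

Reachability here is a graph walk over adjacency lists rather than a series of joins, which is the kind of operation the abstract says triple stores make awkward.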
A result that is:
- scalable
- goes beyond SPARQL
- can be ported to any in-memory key-value store
Merits a very close read.
Makes me curious what other data models would work better if cast as graphs?
I first saw this in a tweet by Juan Sequeda.
Posted in Distributed Systems, Graphs, RDF, Trinity | No Comments »
Thursday, March 7th, 2013
Distributed Graph Computing with Gremlin by Marko A. Rodriguez.
From the post:
The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible to evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.
- Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
- Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
Another must read post from Marko A. Rodriguez!
Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.
Posted in Distributed Systems, Faunus, Graph Databases, Graphs, Gremlin, Titan | No Comments »
Friday, February 8th, 2013
Webinar: Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query by Justin Swanhart.
From the post:
On Friday, February 15, 2013 10:00am Pacific Standard Time, I will be delivering a webinar entitled “Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query”
The first part of this webinar will focus on why distributed databases are needed, and on the techniques employed by Shard-Query to implement a distributed MySQL database. The focus will then proceed to the types of distributed (massively parallel processing) database applications which can be deployed with Shard-Query and the performance aspects of each.
The following types of implementations will be described:
- Distributed row store using XtraDB cluster
- Distributed append-only column store using Infobright Community Edition
- Distributed “document store” using XtraDB cluster and Flexviews
If you are using (or planning on using) MySQL as a topic map backend, this could be the webinar for you!
Posted in Distributed Systems, MySQL | 1 Comment »
Wednesday, January 30th, 2013
Logic and Lattices for Distributed Programming
From the post:
Neil Conway from Berkeley CS is giving an advanced level talk at a meetup today in San Francisco on a new paper: Logic and Lattices for Distributed Programming – extending set logic to support CRDT-style lattices.
The description of the meetup is probably the clearest introduction to the paper:
Developers are increasingly choosing datastores that sacrifice strong consistency guarantees in exchange for improved performance and availability. Unfortunately, writing reliable distributed programs without the benefit of strong consistency can be very challenging.
….
In this talk, I’ll discuss work from our group at UC Berkeley that aims to make it easier to write distributed programs without relying on strong consistency. Bloom is a declarative programming language for distributed computing, while CALM is an analysis technique that identifies programs that are guaranteed to be eventually consistent. I’ll then discuss our recent work on extending CALM to support a broader range of programs, drawing upon ideas from CRDTs (A Commutative Replicated Data Type).
If you have an eye towards understanding the future then this is for you.
Do note that the Bloom language is treated more extensively in Datalog Reloaded. You may recall that the basis for tolog (a topic map query language) was Datalog.
Posted in Bloom Language, Datalog, Distributed Systems, Logic, tolog | No Comments »
Wednesday, January 16th, 2013
Watching the What’s New in Cassandra 1.2 (Notes) webcast and encountered an unfamiliar term: “tombstones.”
If you are already familiar with the concept, skip to another post.
If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by having nodes replicate their content to one another. That works if all nodes are available, but what if you delete data while a node is unavailable? When it comes back, it still holds the deleted data, the other nodes look as though they are “missing” it, and replication would restore it.
Judging from the DistributedDeletes page at the Cassandra wiki, that is not an easy problem to solve.
So, Cassandra turns it into a solvable problem.
Deletes are implemented with a special value known as a tombstone. The tombstone is propagated to nodes that missed the initial delete.
Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
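The tombstone idea can be sketched in a few lines of Python. This is a toy model under my own assumptions (last-write-wins on integer timestamps, a `Replica` class, and a `purge` method for the grace period), not Cassandra's implementation.

```python
TOMBSTONE = object()  # marker value meaning "this key was deleted"

class Replica:
    """One node's table of key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        # Last write wins: keep the entry with the newest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def delete(self, key, ts):
        # A delete is just a write of the tombstone marker.
        self.write(key, TOMBSTONE, ts)

    def read(self, key):
        entry = self.data.get(key)
        if entry is None or entry[1] is TOMBSTONE:
            return None
        return entry[1]

    def sync_from(self, other):
        # Replication: pull every entry, tombstones included.
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)

    def purge(self, now, grace):
        # Drop tombstones older than the grace period.
        expired = [k for k, (ts, v) in self.data.items()
                   if v is TOMBSTONE and now - ts > grace]
        for k in expired:
            del self.data[k]

a, b = Replica(), Replica()
a.write("k", "v", ts=1)
b.sync_from(a)          # both nodes now hold k
a.delete("k", ts=2)     # b is "down" and misses the delete
a.sync_from(b)          # b returns; its stale value loses to the tombstone
b.sync_from(a)          # and the tombstone reaches b
print(a.read("k"), b.read("k"))
```

If `delete` simply removed the key from `a`, the first `sync_from` would copy the stale value back from `b`. The tombstone is what lets the delete win.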
Distributed topic maps will face the same issue.
Complicated by imperative programming models of merging, in which changes to the properties that drive merging are difficult to manage.
Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.
Posted in Cassandra, Distributed Consistency, Distributed Systems, Functional Programming, Merging | No Comments »
Saturday, December 8th, 2012
Piccolo: Distributed Computing via Shared Tables
From the homepage:
Piccolo is a framework designed to make it easy to develop efficient distributed applications.
In contrast to traditional data-centric models (such as Hadoop) which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.
Piccolo includes a number of optimizations to ensure that using this table interface is not just easy, but also fast:
- Locality
- To ensure locality of execution, tables are explicitly partitioned across machines. User code that interacts with the tables can specify a locality preference: this ensures that the code is executed locally with the data it is accessing.
- Load-balancing
- Not all load is created equal – often some partition of a computation will take much longer than others. Waiting idly for this task to finish wastes valuable time and resources. To address this Piccolo can migrate tasks away from busy machines to take advantage of otherwise idle workers, all while preserving the locality preferences and the correctness of the program.
- Failure Handling
- Machine failures are inevitable, and generally occur when you’re at the most critical time in your computation. Piccolo makes checkpointing and restoration easy and fast, allowing for quick recovery in case of failures.
- Synchronization
- Managing the correct synchronization and update across a distributed system can be complicated and slow. Piccolo addresses this by allowing users to defer synchronization logic to the system. Instead of explicitly locking tables in order to perform updates, users can attach accumulation functions to a table: these are used automatically by the framework to correctly combine concurrent updates to a table entry.
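The accumulation-function idea in the Synchronization item can be sketched as follows. This is a single-process toy in Python. The `Table` class and its methods are hypothetical names of mine, not Piccolo's API.

```python
import operator

class Table:
    """Toy stand-in for a shared table with an accumulation function."""

    def __init__(self, accumulate):
        self.accumulate = accumulate  # how to combine two updates to one key
        self.entries = {}

    def update(self, key, value):
        # No caller-side locking: the table itself combines the updates.
        if key in self.entries:
            self.entries[key] = self.accumulate(self.entries[key], value)
        else:
            self.entries[key] = value

    def get(self, key):
        return self.entries[key]

ranks = Table(accumulate=operator.add)

# Two "workers" contribute partial values for the same key.
worker_1 = [("page_a", 0.25), ("page_b", 0.5)]
worker_2 = [("page_a", 0.5)]
for updates in (worker_1, worker_2):
    for key, delta in updates:
        ranks.update(key, delta)

print(ranks.get("page_a"))
```

Because addition is commutative and associative, the result does not depend on which worker's update arrives first, which is why the framework can apply such updates without the user locking the table.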
The closer you are to the metal, the more aware you will be of the distributed nature of processing and data.
Will the success of distributed processing/storage be when all but systems architects are unaware of its nature?
Posted in Annotation, Distributed Systems, Piccolo | No Comments »
Wednesday, November 28th, 2012
Netflix open sources Hystrix resilience library
From the post:
Netflix has moved on from just releasing the tools it uses to test the resilience of the cloud services that power the video streaming company, and has now open sourced a library that it uses to engineer in that resilience. Hystrix is an Apache 2 licensed library which Netflix engineers have been developing over the course of 2012 and which has been adopted by many teams within the company. It is designed to manage how distributed services interact and give more tolerance to latency within those connections and the inevitable failures that can occur.
The library isolates access points between services and then stops any failures from cascading between those access points. Hystrix uses a Command pattern to execute or queue Command objects and evaluate whether the circuit to the service for which the command is destined is in operation. This may not be the case where what Hystrix calls a circuit breaker has triggered leaving the circuit “open”. Circuit breakers can be placed into a system to make it easier to trigger a coordinated failover. The library also checks for other issues which may prevent the execution of the command.
Does your distributed TM have the resilience of Netflix?
Is that the new “normal” for resilience?
The post goes on to say that a dashboard is forthcoming to monitor Hystrix.
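For readers who have not met the circuit breaker pattern, here is a minimal Python sketch of the general idea. It is not Hystrix (a Java library with its own API). The class name, the `failure_threshold` and `reset_after` parameters, and the injectable clock are all assumptions of mine.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling requests onto a sick service.
                raise CircuitOpenError("circuit is open")
            # Otherwise the reset period has passed: allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0)

def flaky_service():
    raise ConnectionError("service unavailable")

for attempt in range(3):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError:
        print(attempt, "rejected without calling the service")
    except ConnectionError:
        print(attempt, "called the service and it failed")
```

The third attempt never reaches the service, which is the point: a failing dependency stops consuming the caller's time and threads, so the failure does not cascade.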
Posted in Distributed Systems, Hystrix | No Comments »
Friday, November 2nd, 2012
RICON 2012 [videos, slides, resources]
From the webpage:
Basho Technologies, along with our sponsors, proudly presented RICON 2012, a two day conference dedicated to Riak, developers, and the future of distributed systems in production. This page is dedicated to post-conference consumption. Here you will find slidedecks, resources, and much more.
Videos for the weekend (for those of you without Netflix accounts):
- Joseph Blomstedt, Bringing Consistency to Riak
- Sean Cribbs, Data Structures in Riak
- Selena Deckelmann, Rapid Data Prototyping With Postgres
- Dietrich Featherston, Modern Radiology for Distributed Systems
- Gary Flake, Building a Social Application on Riak
- Theo Schlossnagle, Next Generation Monitoring of Large Scale Riak Applications
- Ines Sombra and Michael Brodhead, Riak in the Cloud
- Andrew Thompson, Cloning the Cloud – Riak and Multi Data Center Replication
It is hard to decide what to watch first.
What do you think?
Posted in Distributed Systems, Erlang, Riak | No Comments »
Friday, October 26th, 2012
Metamarkets open sources distributed database Druid by Elliot Bentley.
From the post:
It’s no secret that the latest challenge for the ‘big data’ movement is moving from batch processing to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as an open source project.
Druid was designed to solve the problem of a database which allows multi-dimensional queries on data as and when it arrives. The company originally experimented with both relational and NoSQL databases, but concluded they were not fast enough for their needs and so rolled out their own.
The company claims that Druid’s scan speed is “33M rows per second per core”, able to ingest “up to 10K incoming records per second per node”. An earlier blog post outlines how the company managed to achieve scan speeds of 26B records per second using horizontal scaling. It does this via a distributed architecture, column orientation and bitmap indices.
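Column orientation and bitmap indices are easier to appreciate with a toy example. The Python sketch below is my own illustration of the two techniques, not Druid's storage format. It keeps each column in its own list and uses a Python integer as the bitmap for each (dimension, value) pair. It assumes every row carries the same columns.

```python
from collections import defaultdict

class ColumnStore:
    def __init__(self, dimensions):
        self.dimensions = set(dimensions)  # columns that get a bitmap index
        self.columns = defaultdict(list)   # column name -> list of values
        self.bitmaps = defaultdict(int)    # (column, value) -> int used as a bitset
        self.row_count = 0

    def insert(self, row):
        # Assumes every row has the same columns, so lists stay aligned.
        for name, value in row.items():
            self.columns[name].append(value)
            if name in self.dimensions:
                self.bitmaps[(name, value)] |= 1 << self.row_count
        self.row_count += 1

    def rows_matching(self, **filters):
        # Start with all rows set, then AND in one bitmap per filter.
        mask = (1 << self.row_count) - 1
        for name, value in filters.items():
            mask &= self.bitmaps.get((name, value), 0)
        return [i for i in range(self.row_count) if (mask >> i) & 1]

    def total(self, column, **filters):
        # Only the one metric column is read, not whole rows.
        values = self.columns[column]
        return sum(values[i] for i in self.rows_matching(**filters))

events = ColumnStore(dimensions=("country", "device"))
events.insert({"country": "US", "device": "mobile", "clicks": 3})
events.insert({"country": "US", "device": "desktop", "clicks": 5})
events.insert({"country": "DE", "device": "mobile", "clicks": 2})
events.insert({"country": "US", "device": "mobile", "clicks": 4})

print(events.rows_matching(country="US", device="mobile"))
print(events.total("clicks", country="US", device="mobile"))
```

A multi-dimensional filter becomes a bitwise AND of precomputed bitmaps, and the aggregation touches a single column, which is roughly why the combination suits the kind of queries described in the post.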
It was exciting to read about Druid last year.
Now to see how exciting Druid is in fact!
Source code: https://github.com/metamx/druid
Posted in Distributed Systems, Druid, NoSQL | No Comments »
Thursday, October 25th, 2012
Service-Oriented Distributed Knowledge Discovery by Domenico Talia (University of Calabria, Rende, Italy) and Paolo Trunfio.
The publisher’s summary reads:
A new approach to distributed large-scale data mining, service-oriented knowledge discovery extracts useful knowledge from today’s often unmanageable volumes of data by exploiting data mining and machine learning distributed models and techniques in service-oriented infrastructures. Service-Oriented Distributed Knowledge Discovery presents techniques, algorithms, and systems based on the service-oriented paradigm. Through detailed descriptions of real software systems, it shows how the techniques, models, and architectures can be implemented.
The book covers key areas in data mining and service-oriented computing. It presents the concepts and principles of distributed knowledge discovery and service-oriented data mining. The authors illustrate how to design services for data analytics, describe real systems for implementing distributed knowledge discovery applications, and explore mobile data mining models. They also discuss the future role of service-oriented knowledge discovery in ubiquitous discovery processes and large-scale data analytics.
Highlighting the latest achievements in the field, the book gives many examples of the state of the art in service-oriented knowledge discovery. Both novices and more seasoned researchers will learn useful concepts related to distributed data mining and service-oriented data analysis. Developers will also gain insight on how to successfully use service-oriented knowledge discovery in databases (KDD) frameworks.
The idea of service-oriented data mining/analysis is very compatible with topic maps as marketable information sets.
It is not available through any of my usual channels, yet, but I would be cautious at $89.95 for 230 pages of text.
More comments to follow when I have a chance to review the text.
I first saw this at KDNuggets.
Posted in Distributed Systems, Knowledge Discovery | No Comments »
Wednesday, October 10th, 2012
Distributed Algorithms in NoSQL Databases by Ilya Katsov.
From the post:
Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide more or less systematic description of techniques related to distributed operations in NoSQL databases.
In the rest of this article we study a number of distributed activities like replication or failure detection that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:
- Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
- Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
- System Coordination. Coordination techniques like leader election are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state, detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.
Slow going but well worth the effort.
Not the issues discussed in the puff-piece webinars extolling NoSQL solutions to “big data.”
But you already knew that if you read this far! Enjoy!
I first saw this at Christophe Lalanne’s A bag of tweets / September 2012.
Posted in Algorithms, Distributed Systems, NoSQL | No Comments »
Wednesday, September 19th, 2012
Process group in erlang: some thoughts about the pg module by Paolo D’Incau.
From the post:
One of the most common ways to achieve fault tolerance in distributed systems, consists in organizing several identical processes into a group, that can be accessed by a common name. The key concept here is that whenever a message is sent to the group, all members of the group receive it. This is a really nice feature, since if one process in the group fails, some other process can take over for it and handle the message, doing all the operations required.
Process groups allow also abstraction: when we send a message to a group, we don’t need to know who are the members and where they are. In fact process groups are all but static. Any process can join an existing group or leave one at runtime, moreover a process can be part of more groups at the same time.
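The two properties in the quote (every member receives the message, and the sender does not need to know who the members are) can be mimicked in a few lines of Python. This is only an analogy: the `Member` and `ProcessGroups` classes are invented for illustration and are not the Erlang pg module or its API.

```python
class Member:
    """Stand-in for a process: it has a mailbox and can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.inbox = []
        self.alive = True

class ProcessGroups:
    def __init__(self):
        self.groups = {}  # group name -> list of members

    def join(self, group, member):
        self.groups.setdefault(group, []).append(member)

    def leave(self, group, member):
        self.groups[group].remove(member)

    def send(self, group, message):
        """Deliver message to every live member; return the receivers' names."""
        receivers = []
        for member in self.groups.get(group, []):
            if member.alive:
                member.inbox.append(message)
                receivers.append(member.name)
        return receivers

pg = ProcessGroups()
a, b = Member("a"), Member("b")
pg.join("workers", a)
pg.join("workers", b)
a.alive = False                      # simulate a crashed process
print(pg.send("workers", "job-1"))   # the sender only names the group
```

The message still gets handled after `a` fails because the sender addressed the group, not a particular process. Members can also join several groups or leave at any time without the sender changing.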
Fault tolerance is going to be an issue if you are using topic maps and/or social media in an operational context.
Having really “cool” semantic capabilities isn’t worth much if the system fails at a critical point.
Posted in Distributed Systems, Erlang | No Comments »
Sunday, September 16th, 2012
Spanner : Google’s globally distributed database
From the post:
This paper, whose co-authors include Jeff Dean and Sanjay Ghemawat of MapReduce fame, describes Spanner. Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. Finally the paper comes out! Really exciting stuff!
Abstract from the paper:
Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
Spanner: Google’s Globally Distributed Database (PDF File)
Facing user requirements, Google did not say: Suck it up and use tools already provided.
Google engineered new tools to meet their requirements.
Is there a lesson there for other software projects?
Posted in Database, Distributed Systems | No Comments »
Monday, August 6th, 2012
What’s the Difference? Efficient Set Reconciliation without Prior Context by David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese.
Abstract:
We describe a synopsis structure, the Difference Digest, that allows two nodes to compute the elements belonging to the set difference in a single round with communication overhead proportional to the size of the difference times the logarithm of the keyspace. While set reconciliation can be done efficiently using logs, logs require overhead for every update and scale poorly when multiple users are to be reconciled. By contrast, our abstraction assumes no prior context and is useful in networking and distributed systems applications such as trading blocks in a peer-to-peer network, and synchronizing link-state databases after a partition.
Our basic set-reconciliation method has a similarity with the peeling algorithm used in Tornado codes [6], which is not surprising, as there is an intimate connection between set difference and coding. Beyond set reconciliation, an essential component in our Difference Digest is a new estimator for the size of the set difference that outperforms min-wise sketches [3] for small set differences.
Our experiments show that the Difference Digest is more efficient than prior approaches such as Approximate Reconciliation Trees [5] and Characteristic Polynomial Interpolation [17]. We use Difference Digests to implement a generic KeyDiff service in Linux that runs over TCP and returns the sets of keys that differ between machines.
Distributed topic maps anyone?
Posted in Distributed Systems, P2P, Set Reconciliation, Sets, Topic Map Software | 1 Comment »
Thursday, June 21st, 2012
Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012)
From the website:
Paper Submission August 10, 2012
Acceptance Notice October 01, 2012
Camera-Ready Copy October 15, 2012
Workshop December 10, 2012 Brussels, Belgium
Collocated with the IEEE International Conference on Data Mining, ICDM 2012
From the website:
The 3rd International Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012) provides an international platform to share and discuss recent research results in adopting cloud and distributed computing resources for data mining and knowledge discovery tasks.
Synopsis: Processing large datasets using dedicated supercomputers alone is not an economical solution. Recent trends show that distributed computing is becoming a more practical and economical solution for many organizations. Cloud computing, which is a large-scale distributed computing, has attracted significant attention of both industry and academia in recent years. Cloud computing is fast becoming a cheaper alternative to costly centralized systems. Many recent studies have shown the utility of cloud computing in data mining, machine learning and knowledge discovery. This workshop intends to bring together researchers, developers, and practitioners from academia, government, and industry to discuss new and emerging trends in cloud computing technologies, programming models, and software services and outline the data mining and knowledge discovery approaches that can efficiently exploit this modern computing infrastructures. This workshop also seeks to identify the greatest challenges in embracing cloud computing infrastructure for scaling algorithms to petabyte sized datasets. Thus, we invite all researchers, developers, and users to participate in this event and share, contribute, and discuss the emerging challenges in developing data mining and knowledge discovery solutions and frameworks around cloud and distributed computing platforms.
Topics: The major topics of interest to the workshop include but are not limited to:
- Programming models and tools needed for data mining, machine learning, and knowledge discovery
- Scalability and complexity issues
- Security and privacy issues relevant to KD community
- Best use cases: is there a class of algorithms best suited to cloud and distributed computing platforms
- Performance studies comparing clouds, grids, and clusters
- Performance studies comparing various distributed file systems for data intensive applications
- Customizations and extensions of existing software infrastructures such as Hadoop for streaming, spatial, and spatiotemporal data mining
- Applications: Earth science, climate, energy, business, text, web and performance logs, medical, biology, image and video.
It’s December, Belgium and an interesting workshop. Can’t ask for much more than that!
Posted in Cloud Computing, Conferences, Distributed Systems, Knowledge Discovery | No Comments »
Saturday, June 9th, 2012
Distributed Systems Tracing with Zipkin
From the post:
Zipkin is a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. As an analogy, think of it as a performance profiler, like Firebug, but tailored for a website backend instead of a browser. In short, it makes Twitter faster. Today we’re open sourcing Zipkin under the APLv2 license to share a useful piece of our infrastructure with the open source community and gather feedback.
Hmmm, tracing based on the Dapper paper that comes with a web-based UI for a number of requests. Hard to beat that!
Thinking more about the sampling issue, what if I were to sample a very large stream of proxies and decided to only merge a certain percentage and pipe the rest to /dev/null?
For example, I have a UPI feed and that is my base set of “news” proxies. I have feeds from the various newspaper, radio and TV outlets around the United States. If the proxies from the non-UPI feeds are within some distance of the UPI feed proxies, they are simply discarded.
True, I am losing the information of which newspapers carried the stories, whose bylines consisted of changing the order of the words or dumbing them down, but those may not fall under my requirements.
I would rather have a few dozen very good sources than say 70,000 sources that say the same thing.
If you were testing for news coverage or the spread of news stories, your requirements might be different.
I first saw this at Alex Popescu’s myNoSQL.
Posted in BigData, Distributed Systems, Sampling, Systems Research, Tracing | No Comments »
Saturday, June 9th, 2012
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure by Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.
Abstract:
Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.
Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.
The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.
A very important paper for anyone working with large and complex systems.
With lessons on data sampling as well:
… we have found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.
You have to wonder in “data in the petabyte range” cases, how many of them could be reduced to gigabyte (or smaller) size with no loss in accuracy?
Which would reduce storage requirements, increase analysis speed, allow more complex analysis, etc.
Have you sampled your “big data” recently?
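One simple way to sample "one out of thousands" is to decide per trace identifier, so that every record belonging to a kept trace is kept together. The Python sketch below is my own illustration of that idea, not Dapper's code. The function name `keep_trace` and the default rate of 1 in 1024 are assumptions.

```python
import hashlib

def keep_trace(trace_id, one_in=1024):
    """Decide from the trace id alone whether to keep a trace.

    Hashing the id (rather than rolling a die per record) means every
    machine that sees the same trace id makes the same decision.
    """
    digest = hashlib.sha256(str(trace_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0

total = 200_000
kept = sum(keep_trace(trace_id) for trace_id in range(total))
print(f"kept {kept} of {total} traces")
```

With these settings roughly 1 trace in 1024 survives, so the stored volume shrinks by about three orders of magnitude while each surviving trace stays complete.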
I first saw this at Alex Popescu’s myNoSQL.
Posted in BigData, Distributed Systems, Sampling, Systems Research, Tracing | 1 Comment »
Saturday, April 21st, 2012
Distributed Temporal Graph Database Using Datomic
Post by Alex Popescu calling out construction of a “distributed temporal graph database.”
Temporal is used here in the sense of timestamping entries in the database.
Beyond such uses, beware, there be dragons.
Temporal modeling isn’t for the faint of heart.
Posted in Datomic, Distributed Systems, Graphs, Temporal Graph Database | No Comments »
Sunday, April 1st, 2012
Intro to Distributed Erlang (screencast) by Bryan Hunter.
From the description:
Here’s an introduction to distribution in Erlang. This screencast demonstrates creating three Erlang nodes on a Windows box and one on a Linux box and then connecting them using the one-liner “net_adm:ping” to form a mighty compute cluster.
Topics covered:
- Using erl to start an Erlang node (an instance of the Erlang runtime system).
- How to use net_adm:ping to connect four Erlang nodes (three on Windows, one on Linux).
- Using rpc:call to RickRoll a Linux box from an Erlang node running on a Windows box.
- Using nl to load (deploy) a module from one node to all connected nodes.
Not the most powerful cluster but a good way to learn distributed Erlang.
Posted in Distributed Systems, Erlang | No Comments »
Thursday, March 15th, 2012
A Distributed C Compiler System on MapReduce: Mrcc
Alex Popescu of myNoSQL points to software and a paper on distributed compilation of C code.
Changing to distributed architectures may uncover undocumented decisions made long ago and far away. Decisions that we may choose to make differently this time. Hopefully we will do a better job of documenting them. (Not that it will happen but there is no law against hoping.)
Posted in Compilers, Distributed Systems | No Comments »
Friday, February 3rd, 2012
Building Distributed Systems with Riak Core by Steve Vinoski (Basho).
From the description:
Riak Core is the distributed systems foundation for the Riak distributed database and the Riak Search full-text indexing system. Riak Core provides a proven architecture and key functionality required to quickly build scalable, distributed applications. This talk will cover the origins of Riak Core, the abstractions and functionality it provides, and some guidance on building distributed systems.
Rest assured or be forewarned that there is no Erlang code in this presentation.
For all that, it is still a very informative presentation on building scalable, distributed applications.
Posted in Distributed Systems, Riak | No Comments »
Friday, December 30th, 2011
Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises by James L. McClelland.
From Chapter 1, Introduction:
Several years ago, Dave Rumelhart and I first developed a handbook to introduce others to the parallel distributed processing (PDP) framework for modeling human cognition. When it was first introduced, this framework represented a new way of thinking about perception, memory, learning, and thought, as well as a new way of characterizing the computational mechanisms for intelligent information processing in general. Since it was first introduced, the framework has continued to evolve, and it is still under active development and use in modeling many aspects of cognition and behavior.
Our own understanding of parallel distributed processing came about largely through hands-on experimentation with these models. And, in teaching PDP to others, we discovered that their understanding was enhanced through the same kind of hands-on simulation experience. The original edition of the handbook was intended to help a wider audience gain this kind of experience. It made many of the simulation models discussed in the two PDP volumes (Rumelhart et al., 1986; McClelland et al., 1986) available in a form that is intended to be easy to use. The handbook also provided what we hoped were accessible expositions of some of the main mathematical ideas that underlie the simulation models. And it provided a number of prepared exercises to help the reader begin exploring the simulation programs.
The current version of the handbook attempts to bring the older handbook up to date. Most of the original material has been kept, and a good deal of new material has been added. All of the simulation programs have been implemented or re-implemented within the MATLAB programming environment. In keeping with other MATLAB projects, we call the suite of programs we have implemented the PDPTool software.
Latest revision (Sept. 2011) is online for your perusal. A good way to develop an understanding of parallel processing.
Apologies for not seeing this before Christmas. Please consider it an early birthday present for your birthday in 2012!
Posted in Distributed Systems, Parallel Programming | No Comments »
Sunday, November 27th, 2011
6th International Symposium on Intelligent Distributed Computing – IDC 2012
Important Dates:
Full paper submission: April 10, 2012
Notification of acceptance: May 10, 2012
Final (camera ready) paper due: June 1, 2012
Symposium: September 24-26, 2012
From the call for papers:
Intelligent computing covers a hybrid palette of methods and techniques derived from classical artificial intelligence, computational intelligence, multi-agent systems a.o. Distributed computing studies systems that contain loosely-coupled components running on different networked computers and that communicate and coordinate their actions by message transfer. The emergent field of intelligent distributed computing is expected to pose special challenges of adaptation and fruitful combination of results of both areas with a great impact on the development of new generation intelligent distributed information systems. The aim of this symposium is to bring together researchers involved in intelligent distributed computing to allow cross-fertilization and synergy of ideas and to enable advancement of researches in the field.
The symposium welcomes submissions of original papers concerning all aspects of intelligent distributed computing ranging from concepts and theoretical developments to advanced technologies and innovative applications. Papers acceptance and publication will be judged based on their relevance to the symposium theme, clarity of presentation, originality and accuracy of results and proposed solutions.
Posted in Artificial Intelligence, Conferences, Distributed Systems | No Comments »
Wednesday, November 2nd, 2011
Systems We Make: Curating complex distributed systems of our times, by Srihari Srinivasan.
About:
These are indeed great times for Distributed Systems enthusiasts. The boom in the number and variety of systems being built in both academia and the industry has created a strong need to curate interesting creations under one roof.
Systems We Make was conceived to fill this void. Although Systems We Make is still in its infancy I hope to shape it into something more than just a catalog. So stay tuned as we evolve this site and do write to me about how you feel!
Systems We Make may still be in its “infancy” but I am certainly going to both watch this site for news and mine the resources it already offers!
I don’t have any predictions for when it will happen but it isn’t hard to foresee a time when “distributed computing” is as archaic as “my computer.” Computing will be a service much like electricity or water, based on a computing fabric, the details of which matter only to those charged with its maintenance.
Posted in Distributed Systems | No Comments »