Distributed Computing « Another Word For It

February 17, 2018

Distributed Systems Seminar [Accounting For Hostile Environments]

Filed under: Distributed Computing,Distributed Consistency,Distributed Systems — Patrick Durusau @ 8:22 pm

Distributed Systems Seminar by Peter Alvaro.

From the webpage:

Description

This graduate seminar will explore distributed systems research, both current and historical, with a particular focus on storage systems and programming models.

Due to fundamental uncertainty in their executions arising from asynchronous communication and partial failure, distributed systems present unique challenges to programmers and users. Moreover, distributed systems are increasingly ubiquitous: nearly all non-trivial systems are now physically distributed. It is no longer possible to relegate responsibility for managing the complexity of distributed systems to a group of expert library or infrastructure writers: all programmers must now be distributed programmers. This is both a crisis and an opportunity.

A great deal of theoretical work in distributed systems establishes important impossibility results, including the famous FLP result, the CAP Theorem, the two generals problem and the impossibility of establishing common knowledge via protocol. These results tell us what we cannot achieve in a distributed system, or more constructively, they tell us about the properties we must trade off for the properties we require when designing or using large-scale systems. But what can we achieve? The history of applied distributed systems work is largely the history of infrastructures — storage systems as well as programming models — that attempt to manage the fundamental complexity of the domain with a variety of abstractions.

This course focuses on these systems, models and languages. We will cover the following topics:

Consistency models

Large-scale storage systems and data processing frameworks

Commit, consensus and synchronization protocols

Data replication and partitioning

Fault-tolerant design

Programming models

Distributed programming languages and program analysis

Seminal theoretical results in distributed systems

Readings

This course is a research seminar: we will focus primarily on reading and discussing conference papers. We will read 1-2 papers (typically 2) per session; for each paper, you will provide a brief summary (about 1 page). The summary should answer some or all of the following questions:

What problem does the paper solve? Is it important?

How does it solve the problem?

What alternative approaches are there? Are they adequately discussed in the reading?

How does this work relate to other research, whether covered in this course or not?

What specific research questions, if any, does the paper raise for you?

…

What a great list of readings!

An additional question of each paper: Does It Account For Hostile Environments?

As Alvaro says: “…nearly all non-trivial systems are now physically distributed.”

That’s a rather large attack surface to leave for unknown others, by unknown means, to secure to an unknown degree, on your behalf.

If you make that choice, add “cyber-victim” to your business cards.

If you aren’t already, you will be soon enough.

Comments Off

December 7, 2017

The Computer Science behind a modern distributed data store

Filed under: ArangoDB,Computer Science,Distributed Computing,Distributed Consistency — Patrick Durusau @ 1:34 pm

From the description:

What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are at least three necessary ingredients, which are anything but trivial to combine and, of course, even more challenging when aiming for acceptable performance.

Over the past years there has been significant progress in both the science and practical implementations of such data stores. In his talk, Max Neunhöffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches to distributed open-source data stores.

Topics are:

Challenges in developing a distributed, resilient data store

Consensus, distributed transactions, distributed query optimization and execution

The inner workings of ArangoDB, Cassandra, Cockroach and RethinkDB

The talk will touch complex and difficult computer science, but will at the same time be accessible to and enjoyable by a wide range of developers.

I haven’t found the slides for this presentation but did stumble across ArangoDB Tech Talks and Slides.

Neunhöffer’s presentation will make you look at ArangoDB more closely.

August 9, 2016

ARGUS

Filed under: Distributed Computing,Functional Programming,Programming — Patrick Durusau @ 7:01 pm

ARGUS by Christopher Meiklejohn.

From the post:

This is one post in a series about programming models and languages for distributed computing that I’m writing as part of my history of distributed programming techniques.

Relevant Reading

Abstraction Mechanisms in CLU, Liskov, Barbara and Snyder, Alan and Atkinson, Russell and Schaffert, Craig, CACM 1977 (Liskov et al. 1977).

Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Liskov, Barbara and Scheifler, Robert, TOPLAS 1982 (Liskov and Scheifler 1983).

Orphan Detection in the Argus System, Walker, Edward Franklin, DTIC 1984 (Walker 1984).

Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifler, Robert, SIGOPS 1987 (Liskov et al. 1987).

Distributed Programming in Argus, Liskov, Barbara CACM 1988 (Liskov 1988).

…

I’m thinking about how to fix an XFCE trackpad problem and while I think about that, wanted to touch up the references from Christopher’s post.

Abstraction Mechanisms in CLU, Liskov, Barbara and Snyder, Alan and Atkinson, Russell and Schaffert, Craig, CACM 1977 (Liskov et al. 1977).
Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Liskov, Barbara and Scheifler, Robert, TOPLAS 1982 (Liskov and Scheifler 1983).
Orphan Detection in the Argus System, Walker, Edward Franklin, DTIC 1984 (Walker 1984).
Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifler, Robert, SIGOPS 1987 (Liskov et al. 1987). (ACM Digital Library)
Distributed Programming in Argus, Liskov, Barbara CACM 1988 (Liskov 1988).

Apologies but I was unable to find a public version of: Implementation of Argus, Liskov, Barbara and Curtis, Dorothy and Johnson, Paul and Scheifler, Robert, SIGOPS 1987 (Liskov et al. 1987).

Hoping that easier access to most of the relevant reading will increase your enjoyment of Christopher’s post.

Enjoy!

December 12, 2015

Fun with ddR: Using Distributed Data Structures in R [Your Holiday Quiet Spot]

Filed under: Distributed Computing,Distributed Systems,R — Patrick Durusau @ 5:52 pm

Fun with ddR: Using Distributed Data Structures in R by Edward Ma and Vishrut Gupta (Hewlett Packard Enterprise).

From the post:

A few weeks ago, we revealed ddR (Distributed Data-structures in R), an exciting new project started by R-Core, Hewlett Packard Enterprise, and others that provides a fresh new set of computational primitives for distributed and parallel computing in R. The package sets the seed for what may become a standardized and easy way to write parallel algorithms in R, regardless of the computational engine of choice.

In designing ddR, we wanted to keep things simple and familiar. We expose only a small number of new user functions that are very close in semantics and API to their R counterparts. You can read the introductory material about the package here. In this post, we show how to use ddR functions.

…

Imagine that you are trapped after an indeterminate holiday meal in the TV room where A Christmas Story is playing for the fourth time that day.

You are at the point of saying/doing something that will offend the living members of your spouse’s family and generations to come.

What can you do?

Surely your powers of concentration exceed those of bridge players who claim to not see naked people cavorting about during bridge games.

Pull up the ddR post on your smartphone, read it and jump to the documentation and/or example programs.

You will have to be woken out of your reverie and handed your coat when it is time to go.

Well, maybe not exactly but it beats the hell out of biting one of your smaller relatives.

December 6, 2015

Learning from Distributed Data:… [Beating the Bounds]

Filed under: Distributed Computing,Distributed Systems,Topic Maps — Patrick Durusau @ 10:35 pm

Learning from Distributed Data: Mathematical and Computational Methods to Analyze De-centralized Information.

From the post:

Scientific advances typically produce massive amounts of data, which is, of course, a good thing. But when many of these datasets are at multiple locations, instead of all in one place, it becomes difficult and costly for researchers to extract meaningful information from them.

So, the question becomes: “How do we learn from these datasets if they cannot be shared or placed in a central location?” says Trilce Estrada-Piedra.

Estrada-Piedra, an assistant professor of computer sciences at the University of New Mexico (UNM) is working to find the solution. She designs software that will enable researchers to collaborate with one another, using decentralized data, without jeopardizing privacy or raising infrastructure concerns.

“Our contributions will help speed research in a variety of sciences like health informatics, astronomy, high energy physics, climate simulations and drug design,” Estrada-Piedra says. “It will be relevant for problems where data is spread out in many different locations.”

The aim of the National Science Foundation (NSF)-funded scientist’s project is to build mathematical models from each of the “local” data banks — those at each distributed site. These models will capture data patterns, rather than specific data points.

“Researchers then can share only the models, instead of sharing the actual data,” she says, citing a medical database as an example. “The original data, for example, would have the patient’s name, age, gender and particular metrics like blood pressure, heart rate, etcetera, and that one patient would be a data point. But the models will project his or her information and extract knowledge from the data. It would just be math. The idea is to build these local models that don’t have personal information, and then share the models without compromising privacy.”

Estrada-Piedra is designing algorithms for data projections and middleware: software that acts as a bridge between an operating system or database and applications, especially on a network. This will allow distributed data to be analyzed effectively.
….
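To make the “share the models, not the data” idea concrete, here is a minimal Python sketch. It is my own illustration, not Estrada-Piedra’s algorithm: the “model” at each site is nothing more than a count, a mean and a sum of squared deviations, and the names LocalModel, fit_local and combine, along with the blood-pressure readings, are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LocalModel:
    """Summary of one site's data: aggregates only, no individual records."""
    n: int        # number of records at the site
    mean: float   # mean of the metric at the site
    m2: float     # sum of squared deviations from the site mean


def fit_local(values: Iterable[float]) -> LocalModel:
    """Runs at each site; assumes at least one record is present."""
    data = list(values)
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data)
    return LocalModel(n, mean, m2)


def combine(models: List[LocalModel]) -> LocalModel:
    """Runs centrally; merges the site summaries one pair at a time."""
    total = models[0]
    for m in models[1:]:
        n = total.n + m.n
        delta = m.mean - total.mean
        mean = total.mean + delta * m.n / n
        m2 = total.m2 + m.m2 + delta ** 2 * total.n * m.n / n
        total = LocalModel(n, mean, m2)
    return total


# Hypothetical blood-pressure readings held at three separate hospitals.
site_a = fit_local([120.0, 130.0, 125.0])
site_b = fit_local([140.0, 135.0])
site_c = fit_local([118.0, 122.0, 121.0, 119.0])

pooled = combine([site_a, site_b, site_c])
print(pooled.n, pooled.mean, pooled.m2 / pooled.n)  # count, mean, variance
```

Only the three LocalModel summaries cross site boundaries, yet the pooled mean and variance match what you would get from the combined raw data. Real projects would need far richer models than this, and would have to ask what the summaries themselves leak.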

I’m looking forward to hearing more about Estrada-Piedra’s work, although we all know there are more than data projection and middleware issues involved. Those are very real and very large problems, but as with all human endeavors, the last mile is defined by local semantics.

Efficiently managing local semantics, that is enabling others to seamlessly navigate your local semantics and to in turn navigate the local semantics of others, isn’t a technical task, or at least not primarily.

The primary obstacle to such a task is captured by John D. Cook in Medieval software project management.

The post isn’t long so I will quote it here:

Centuries ago, English communities would walk the little boys around the perimeter of their parish as a way of preserving land records. This was called “beating the bounds.” The idea was that by teaching the boundaries to someone young, the knowledge would be preserved for the lifespan of that person. Of course modern geological survey techniques make beating the bounds unnecessary.

Software development hasn’t reached the sophistication of geographic survey. Many software shops use a knowledge management system remarkably similar to beating the bounds. They hire a new developer to work on a new project. That developer will remain tied to that project for the rest of his or her career, like a serf tied to the land. The knowledge essential to maintaining that project resides only in the brain of its developer. There are no useful written records or reliable maps, just like medieval property boundaries.

Does that sound familiar? That only you or another person “know” the semantics of your datastores? Are you still “beating the bounds” to document your data semantics?

Or as John puts it:

There are no useful written records or reliable maps, just like medieval property boundaries.

It doesn’t have to be that way. You could have reliable maps, reliable maps that are updated when your data is mapped for yet another project. Another ETL is the acronym.

You can, as a manager, of course, simply allow data knowledge to evaporate from your projects but that seems like a very poor business practice.

Johanna Rothman responded to John’s post in Breaking Free of Legacy Projects with the suggestion that every major project should have several young boys and girls “beating the bounds.”

The equivalent of avoiding a single point of failure in medieval software project management.

That is better than relying on a single programmer, but using more modern information management/retention techniques would be better still.

I guess the question is: do you like using medieval project management techniques for your data or not?

If you do, you won’t be any worse off than any of your competitors with a similar policy.

On the other hand, should one of your competitors break ranks, start using topic maps for example for mission critical data, well, you have been warned.

November 18, 2015

Christopher Meiklejohn – Doctoral Thesis Proposal

Filed under: Distributed Computing,Functional Programming — Patrick Durusau @ 5:49 pm

Christopher Meiklejohn – Doctoral Thesis Proposal.

From the proposal:

…
The goal of this research is to provide a declarative way to design distributed, fault-tolerant applications that do not contain observable nondeterminism. These applications should be able to be placed at arbitrary locations in the network: mobile devices, “Internet of Things” hardware, or personal computers. Applications should be tolerant to arbitrary message delays, duplication and reordering: these are first-class requirements of distributed computations over unreliable networks. When writing these applications, developers should not have to use traditional concurrency control or synchronization mechanisms such as mutexes, semaphores, or monitors: the primitive operations for composition in the language should yield “deterministic-by-construction” applications.
…
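For a feel of what “tolerant to arbitrary message delays, duplication and reordering” can look like in practice, here is a minimal Python sketch of a grow-only counter, one of the simplest replicated data types with that property. It is my illustration only, not the language or API from Christopher’s proposal; the names GCounter, merge and value and the node ids are all hypothetical.

```python
import itertools
from typing import Dict

GCounter = Dict[str, int]  # node id -> highest count seen from that node


def merge(a: GCounter, b: GCounter) -> GCounter:
    """Join two states by taking the per-node maximum.

    Taking the maximum is commutative, associative and idempotent, so the
    order and number of deliveries cannot change the final state.
    """
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(c: GCounter) -> int:
    """The counter's value is the sum of the per-node counts."""
    return sum(c.values())


# Three state messages in flight from hypothetical nodes.
messages = [{"phone": 2}, {"sensor": 5}, {"phone": 3, "laptop": 1}]

results = set()
for order in itertools.permutations(messages):
    state: GCounter = {}
    for msg in order:
        state = merge(state, msg)
        state = merge(state, msg)  # deliver every message twice
    results.add(value(state))

print(results)  # one value only, whatever the delivery order
```

No mutexes, semaphores or monitors appear anywhere: the determinism comes from the merge operation itself, which is the flavor of “deterministic-by-construction” I take the proposal to be after.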

Christopher is looking for comments on his doctoral thesis proposal.

His proposal is dated November 11, 2015, so time remains for you to review the proposal and make comments.

It would be really nice if the community that will benefit from Christopher’s work would contribute some comments on it.

November 17, 2015

DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Filed under: Distributed Computing,Graph Databases,Graphs,Open Source — Patrick Durusau @ 4:52 pm

DegDB (Open Source Distributed Graph Database) (GitHub)

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

Owned by private companies with no incentive to share.

Public databases are used by few people with no incentive to contribute.

Large databases can’t fit on one machine and are expensive to traverse.

Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.
…
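The debit-style payment check and the bid split described in the design doc can be sketched in a few lines of Python. This is a guess at the shape of the logic, not DegDB’s code: the Request class, the cost constants, the satoshi unit and the even split are all assumptions of mine.

```python
from dataclasses import dataclass
from typing import Dict, List

# All prices below are invented for illustration.
LOCAL_COST = 10   # fee this node charges to serve data it holds
HOP_COST = 25     # extra fee per hop when the data lives on another node


@dataclass
class Request:
    key: str
    attached_satoshi: int  # payment attached to the request (debit style)


def estimate_cost(hops: int) -> int:
    """Estimate the price to serve a request that needs `hops` hops."""
    return LOCAL_COST + HOP_COST * hops


def should_serve(req: Request, hops: int) -> bool:
    """Drop the request unless the attached payment covers the estimate."""
    return req.attached_satoshi >= estimate_cost(hops)


def split_bid(bid: int, verified_sources: List[str]) -> Dict[str, int]:
    """Split a curation bid evenly among the sources that verified the data.

    Integer division means any remainder is simply left unpaid here.
    """
    share = bid // len(verified_sources)
    return {source: share for source in verified_sources}


print(should_serve(Request("popular", 10), hops=0))  # data held locally
print(should_serve(Request("rare", 10), hops=3))     # data three hops away
print(split_bid(100, ["a", "b", "c"]))
```

Even at this size the incentive shows up: the same payment that buys popular, locally held data is not enough for rarely used data several hops away.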

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours.

That makes you feel like you are almost robbing the publisher at that price. (NOT!)

Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay $35.00 (U.S.) to have access for 24 hours. Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny), a very low micro-payment would be more suitable. Say $0.01 per paragraph or something on that order. Payable out of a deposit with the publisher.

I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data and if you choose to buy from a private seller with no online history, that’s just your bad.

Imagine that for the Paris incident (hypothetical, none of the following is true), I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language. I could cast that data (I’m not fond of the poverty of triples) into a graph format and make it known as part of a distributed graph system.

Some of the data, such as the identities of the people for whom I had records, would appear in the graphs of others as “new” data. It is up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them.

Data could even carry a public price tag. That is if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it.
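A rough Python sketch of what I have in mind, with content unlocked only upon payment and an optional public price tag. Everything here is hypothetical (the PricedData class, its field names, prices in cents), and a real system would need actual payment handling rather than an integer argument.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PricedData:
    payload: str
    reader_price: int                   # set by the creator, per reader
    public_price: Optional[int] = None  # pay this to open it for everyone
    paid_readers: Set[str] = field(default_factory=set)
    public: bool = False

    def pay(self, reader: str, amount: int) -> None:
        """Record a payment; amounts below the reader price unlock nothing."""
        if self.public_price is not None and amount >= self.public_price:
            self.public = True
        elif amount >= self.reader_price:
            self.paid_readers.add(reader)

    def read(self, reader: str) -> Optional[str]:
        """Return the payload only to paid readers, or to anyone once public."""
        if self.public or reader in self.paid_readers:
            return self.payload
        return None


records = PricedData("school records", reader_price=100, public_price=100_000)
print(records.read("alice"))   # locked: nothing paid yet
records.pay("alice", 100)
print(records.read("alice"))   # unlocked for alice only
print(records.read("bob"))     # still locked for bob
records.pay("carol", 100_000)
print(records.read("bob"))     # carol paid the public price, open to all
```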

I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all.

The choices we face now of “free” (read unpaid for research, writing and publication, which excludes many) versus the “pay-per-view” model that supports early 20th century models of sloth, cronyism and gate-keeping, aren’t the only ones. We need to actively seek out better and more nuanced choices.

November 16, 2015

Multiagent Systems

Filed under: Agents,Distributed Computing — Patrick Durusau @ 8:13 pm

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations by Yoav Shoham and Kevin Leyton-Brown.

From the webpage:

Multiagent systems consist of multiple autonomous entities having different information and/or diverging interests. This comprehensive introduction to the field offers a computer science perspective, but also draws on ideas from game theory, economics, operations research, logic, philosophy and linguistics. It will serve as a reference for researchers in each of these fields, and be used as a text for advanced undergraduate and graduate courses.

Emphasizing foundations, the authors offer a broad and rigorous treatment of their subject, with thorough presentations of distributed problem solving, non-cooperative game theory, multiagent communication and learning, social choice, mechanism design, auctions, coalitional game theory, and logical theories of knowledge, belief, and other aspects of rational agency. For each topic, basic concepts are introduced, examples are given, proofs of key results are offered, and algorithmic considerations are examined. An appendix covers background material in probability theory, classical logic, Markov decision processes, and mathematical programming.

Even better from the introduction:

Imagine a personal software agent engaging in electronic commerce on your behalf. Say the task of this agent is to track goods available for sale in various online venues over time, and to purchase some of them on your behalf for an attractive price. In order to be successful, your agent will need to embody your preferences for products, your budget, and in general your knowledge about the environment in which it will operate. Moreover, the agent will need to embody your knowledge of other similar agents with which it will interact (e.g., agents who might compete with it in an auction, or agents representing store owners)—including their own preferences and knowledge. A collection of such agents forms a multiagent system. The goal of this book is to bring under one roof a variety of ideas and techniques that provide foundations for modeling, reasoning about, and building multiagent systems.

Somewhat strangely for a book that purports to be rigorous, we will not give a precise definition of a multiagent system. The reason is that many competing, mutually inconsistent answers have been offered in the past. Indeed, even the seemingly simpler question—What is a (single) agent?—has resisted a definitive answer. For our purposes, the following loose definition will suffice: Multiagent systems are those systems that include multiple autonomous entities with either diverging information or diverging interests, or both.

This looks like a great item for a wish list this close to the holidays. Broad enough to keep your interest up and relevant enough to argue you are “working” and not just reading.

November 13, 2015

Microsoft open sources Distributed Machine Learning Toolkit…

Filed under: Distributed Computing,Machine Learning,Microsoft,Open Source — Patrick Durusau @ 2:12 pm

Microsoft open sources Distributed Machine Learning Toolkit for more efficient big data research by George Thomas Jr.

From the post:

Researchers at the Microsoft Asia research lab this week made the Microsoft Distributed Machine Learning Toolkit openly available to the developer community.

The toolkit, available now on GitHub, is designed for distributed machine learning — using multiple computers in parallel to solve a complex problem. It contains a parameter server-based programming framework, which makes machine learning tasks on big data highly scalable, efficient and flexible. It also contains two distributed machine learning algorithms, which can be used to train the fastest and largest topic model and the largest word-embedding model in the world.

The toolkit offers rich and easy-to-use APIs to reduce the barrier of distributed machine learning, so researchers and developers can focus on core machine learning tasks like data, model and training.

The toolkit is unique because its features transcend system innovations by also offering machine learning advances, the researchers said. With the toolkit, the researchers said developers can tackle big-data, big-model machine learning problems much faster and with smaller clusters of computers than previously required.

For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.
…

This has been a banner week for machine learning!

On November 9th, Google open sourced TensorFlow.

On November 12th, Single Artificial Neuron Taught to Recognize Hundreds of Patterns (why neurons have thousands of synapses) was published.

On November 12th, Microsoft open sources its Distributed Machine Learning Toolkit.

Not every week is like that for machine learning but it is impressive when that many major stories drop in a week!

I do like the line from the Microsoft announcement:

For example, using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. (emphasis added)

Prices are falling all the time and a 24 machine cluster should be within the reach of most startups if not most individuals now. Next year? Possibly within the reach of a large number of individuals.
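To get a feel for why those numbers are impressive, here is a back-of-the-envelope calculation in Python. The announcement gives the vocabulary, dimensions, topic count and machine count; the 4 bytes per value, the dense layout and the even split across machines are my assumptions, not anything Microsoft states.

```python
VOCAB = 20_000_000        # from the announcement
EMBEDDING_DIMS = 1_000    # from the announcement
TOPICS = 1_000_000        # from the announcement
MACHINES = 24             # from the announcement
BYTES_PER_VALUE = 4       # assumption: 32-bit values
GB = 10 ** 9


def dense_size_bytes(rows: int, cols: int) -> int:
    """Size of a dense rows x cols table at BYTES_PER_VALUE per cell."""
    return rows * cols * BYTES_PER_VALUE


embedding = dense_size_bytes(VOCAB, EMBEDDING_DIMS)
topic_model = dense_size_bytes(VOCAB, TOPICS)

print("embedding GB, total and per machine:",
      embedding / GB, embedding / MACHINES / GB)
print("topic model GB, total and per machine:",
      topic_model / GB, topic_model / MACHINES / GB)
```

Under those assumptions the word-embedding table is tens of gigabytes, which splits comfortably across 24 machines, while a dense topic-word table would run to tens of terabytes. That gap suggests the topic model cannot be stored densely, though how the toolkit actually stores it is not something the quoted text tells us.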

What are your machine learning plans for 2016?

More DMTK information.

October 29, 2015

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Filed under: Amazon Web Services AWS,Distributed Computing,Distributed Systems,Hadoop,Spark — Patrick Durusau @ 4:55 pm

Spinning up a Spark Cluster on Spot Instances: Step by Step by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

The following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down your cluster after you are finished using it.
…
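The two figures in the post are consistent with each other, which a few lines of Python can confirm. The hourly rate below is inferred from the quoted $0.69, not taken from an AWS price list, and spot prices move, so treat it as a rough check only. The 30-day month is also my assumption.

```python
NODES = 4               # from the tutorial
TUTORIAL_HOURS = 6      # from the tutorial
TUTORIAL_COST = 0.69    # dollars, from the tutorial

# Implied price per instance-hour (an inference, not a published price).
RATE = TUTORIAL_COST / (NODES * TUTORIAL_HOURS)


def cluster_cost(hours: float, nodes: int = NODES) -> float:
    """Cost in dollars of running `nodes` instances for `hours` hours."""
    return nodes * hours * RATE


monthly = cluster_cost(24 * 30)  # assumption: a 30-day month
print(f"implied rate: ${RATE:.5f} per instance-hour")
print(f"one month, {NODES} nodes: ${monthly:.2f}")
```

At just under three cents per instance-hour, a forgotten cluster lands close to the “around $80” a month the post warns about.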

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard).

I first saw this in a tweet by Kirk Borne.

May 4, 2015

Notes on Theory of Distributed Systems

Filed under: CS Lectures,Distributed Computing — Patrick Durusau @ 8:06 pm

Notes on Theory of Distributed Systems by James Aspnes.

From the preface:

These are notes for the Spring 2014 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems. This document also incorporates the lecture schedule and assignments, as well as some sample assignments from previous semesters. Because this is a work in progress, it will be updated frequently over the course of the semester.

Notes from Fall 2011 can be found at http://www.cs.yale.edu/homes/aspnes/classes/469/notes-2011.pdf.

Notes from earlier semesters can be found at http://pine.cs.yale.edu/pinewiki/465/.

Much of the structure of the course follows the textbook, Attiya and Welch’s Distributed Computing [AW04], with some topics based on Lynch’s Distributed Algorithms [Lyn96] and additional readings from the research literature. In most cases you’ll find these materials contain much more detail than what is presented here, so it is better to consider this document a supplement to them than to treat it as your primary source of information.
…

When something exceeds three hundred (> 300) pages, I have trouble calling it “notes.”

A treasure trove of information on distributed computing.

I first saw this in a tweet by Henry Robinson.

February 22, 2015

The Morning Paper [computing papers selected by Adrian Colyer]

Filed under: Computer Science,Distributed Computing,Programming — Patrick Durusau @ 10:58 am

The Morning Paper [computing papers selected by Adrian Colyer]

From the about page:

The Morning Paper: a short summary of an important, influential, topical or otherwise interesting paper in the field of CS every weekday. The Morning Paper started out as a twitter project (#themorningpaper), then it became clear a longer form was also necessary because some papers just have too much good content to get across in a small batch of 140-character tweets!

The daily selection will still be tweeted on my twitter account (adriancolyer), with a quote or two to whet your appetite. Any longer excerpts or commentary will live here.

Why ‘The Morning Paper?’ (a) it’s a habit I enjoy, and (b) if one or two papers catch your attention and lead you to discover (or rediscover) something of interest then I’m happy.

Adrian’s 100th post was January 7, 2015 so you have some catching up to do.

Very impressive and far more useful than the recent “newspaper” formats that automatically capture content from a variety of sources.

The Morning Paper is curated content, which makes all the difference in the world.

There is an emphasis on distributed computing, making The Morning Paper a must read for anyone interested in the present and future of computing services.

Enjoy!

I first saw this in a tweet by Tyler Treat.

December 11, 2014

Wouldn’t it be fun to build your own Google?

Filed under: Common Crawl,Distributed Computing,Distributed Systems,Search Data,Search Engines,Webcrawler,WWW — Patrick Durusau @ 8:52 am

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

…

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such as Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding that the technological issues would be many, Martin contends that the payoff would be immense, but in ways we won’t know until it is available. I suspect Martin is right, but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies the problems of extracting value; it doesn’t make them easier to solve.

On the whole I think a distributed crawl of the web is a great idea, but while that develops, we had best hone our skills at extracting value from the partial crawls that already exist.

Comments Off

December 6, 2014

World Community Grid

Filed under: Distributed Computing — Patrick Durusau @ 5:39 pm

World Community Grid

From the about page:

World Community Grid enables anyone with a computer, smartphone or tablet to donate their unused computing power to advance cutting-edge scientific research on topics related to health, poverty and sustainability. Through the contributions of over 650,000 individuals and 460 organizations, World Community Grid has supported 24 research projects to date, including searches for more effective treatments for cancer, HIV/AIDS and neglected tropical diseases. Other projects are looking for low-cost water filtration systems and new materials for capturing solar energy efficiently.

How World Community Grid Works

Advancing scientific discovery

World Community Grid has enabled important scientific advances in cancer treatment and clean energy. Our research partners have published over 35 peer-reviewed papers in scientific journals and have completed the equivalent of hundreds of thousands of years of research in less than a decade. World Community Grid is the biggest volunteer computing initiative devoted to humanitarian science, and is as powerful as some of the world’s fastest supercomputers. Learn More

On the cusp of current trends

World Community Grid brings together volunteers and researchers at the intersection of computational chemistry, open science and citizen science – three trends that are transforming the way scientific research is conducted. In 2013, World Community Grid also became one of the first major volunteer computing initiatives to enable mobile computing on Android smartphones and tablets. Learn More

An award-winning program

The pioneering work done on World Community Grid has been recognized internationally with awards including the Computerworld Data+ Editors Choice Award, Business in the Community Coffey International Award, and the Asian Forum on Corporate Social Responsibility’s Asian CSR Award.

Who are we?

Started in 2004, World Community Grid is a philanthropic initiative of IBM Corporate Citizenship, the corporate social responsibility and philanthropy division of IBM. Through Corporate Citizenship, IBM donates its technology and talent to address some of the world’s most pressing social and environmental issues.

Meet Our Team

One current focus is on Ebola vaccine research.

I saw this in a tweet by IBM Research. The tweet pointed to: IBM Helps You Donate Computer Power to Fight Ebola, where the only link to IBM wasn’t a hyperlink, just text. Thought you might prefer a link to the actual site rather than prose about the site.

Enjoy!

Comments Off

October 18, 2014

Tupleware: Redefining Modern Analytics

Filed under: Distributed Computing,Functional Programming — Patrick Durusau @ 8:09 pm

Tupleware: Redefining Modern Analytics by Andrew Crotty and Alexander Galakatos.

From the post:

Up until a decade ago, most companies sufficed with simple statistics and offline reporting, relying on traditional database management systems (DBMSs) to meet their basic business intelligence needs. This model prevailed in a time when data was small and analysis was simple.

But data has gone from being scarce to superabundant, and now companies want to leverage this wealth of information in order to make smarter business decisions. This data explosion has given rise to a host of new analytics platforms aimed at flexible processing in the cloud. Well-known systems like Hadoop and Spark are built upon the MapReduce paradigm and fulfill a role beyond the capabilities of traditional DBMSs. However, these systems are engineered for deployment on hundreds or thousands of cheap commodity machines, but non-tech companies like banks or retailers rarely operate clusters larger than a few dozen nodes. Analytics platforms, then, should no longer be built specifically to accommodate the bottlenecks of large cloud deployments, focusing instead on small clusters with more reliable hardware.

Furthermore, computational complexity is rapidly increasing, as companies seek to incorporate advanced data mining and probabilistic models into their business intelligence repertoire. Users commonly express these types of tasks as a workflow of user-defined functions (UDFs), and they want the ability to compose jobs in their favorite programming language. Yet, existing analytics systems fail to adequately serve this new generation of highly complex, UDF-centric jobs, especially when companies have limited resources or require sub-second response times. So what is the next logical step?

It’s time for a new breed of systems. In particular, a platform geared toward modern analytics needs the ability to (1) concisely express complex workflows, (2) optimize specifically for UDFs, and (3) leverage the characteristics of the underlying hardware. To meet these requirements, the Database Group at Brown University is developing Tupleware, a parallel high-performance UDF processing system that considers the data, computations, and hardware together to produce results as efficiently as possible.
…

The article is the “lite” introduction to Tupleware. You may be more interested in:

Tupleware: Redefining Modern Analytics (the paper):

Abstract:

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world—petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to several terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems.

This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware’s architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.

Subject to the “in memory” limitation, speedups of 10 – 6,000x over other systems are nothing to dismiss without further consideration.

Interesting to see that “medium” data now reaches into the terabyte range.

Are “mini-clouds” in the offing that provide specialized processing models?

The Tupleware website.

I first saw this in a post by Danny Bickson, Tuppleware.

Comments Off

October 9, 2014

Simple Testing Can Prevent Most Critical Failures:…

Filed under: Distributed Computing,Programming,Systems Administration — Patrick Durusau @ 6:28 pm

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, et al.

Abstract:

Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop Map Reduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code–the last line of defense–even without an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

If you aren’t already convinced you need to read this paper, consider one more quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. (emphasis added)

How will catastrophic system failure reflect on your product or service? Hint: It doesn’t reflect well on topic maps or any other service or technology.

I say “read” this paper, but putting it on a 90-day reading rotation would perhaps be better.
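To make "simple testing on error handling code" concrete, here is a minimal sketch of a checker that flags exception handlers that do nothing. It is not Aspirator (which targets Java code) and it covers only one simplistic rule, assumed here for illustration: a handler whose body is just `pass` or a bare constant. The function name `find_empty_handlers` and the sample source are hypothetical.

```python
import ast

def find_empty_handlers(source):
    """Return the line numbers of except blocks whose body does nothing."""
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            # Treat a handler as "empty" if every statement is `pass` or a
            # bare constant expression such as `...` or a string.
            if all(
                isinstance(stmt, ast.Pass)
                or (isinstance(stmt, ast.Expr)
                    and isinstance(stmt.value, ast.Constant))
                for stmt in node.body
            ):
                flagged.append(node.lineno)
    return sorted(flagged)

# Hypothetical code under inspection: the first handler swallows the error,
# the second at least reports it.
SAMPLE = '''
def flush(store):
    try:
        store.sync()
    except IOError:
        pass

def close(store):
    try:
        store.close()
    except IOError as err:
        print("close failed:", err)
'''

print(find_empty_handlers(SAMPLE))
```

A check this crude will miss plenty, but it shows how little machinery is needed to find handlers that silently discard a non-fatal error.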

Comments Off

September 15, 2014

GraphX: Graph Processing in a Distributed Dataflow Framework

Filed under: Distributed Computing,Graphs,GraphX — Patrick Durusau @ 7:25 pm

GraphX: Graph Processing in a Distributed Dataflow Framework by Joseph Gonzalez, Reynold Xin, Ankur Dave, Dan Crankshaw, Michael Franklin, Ion Stoica.

Abstract:

In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distributed dataflow frameworks, GraphX brings low-cost fault tolerance to graph processing. We evaluate GraphX on real workloads and demonstrate that GraphX achieves an order of magnitude performance gain over the base dataflow framework and matches the performance of specialized graph processing systems while enabling a wider range of computation.

GraphX: Graph Processing in a Distributed Dataflow Framework (as PDF file)
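The abstract's claim that graph computation can be built from a few dataflow operators is easier to see with a toy example. The sketch below is plain Python, not the GraphX or Spark API; it assumes a tiny in-memory edge list and expresses one PageRank iteration as a join, a map and a group-by. The function name `pagerank_step`, the damping value and the sample graph are all illustrative assumptions.

```python
from collections import defaultdict

def pagerank_step(ranks, edges, damping=0.85):
    """One PageRank iteration written as join -> map -> group-by."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    # join: attach each source vertex's current rank to its outgoing edges
    joined = [(src, dst, ranks[src]) for src, dst in edges]
    # map: turn each joined row into a contribution sent to the destination
    messages = [(dst, rank / out_degree[src]) for src, dst, rank in joined]
    # group-by: sum the contributions arriving at each destination vertex
    incoming = defaultdict(float)
    for dst, contribution in messages:
        incoming[dst] += contribution

    n = len(ranks)
    return {v: (1 - damping) / n + damping * incoming[v] for v in ranks}

# Hypothetical three-vertex graph in which every vertex has an outgoing edge.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = {v: 1 / 3 for v in "abc"}
for _ in range(20):
    ranks = pagerank_step(ranks, edges)
print(ranks)
```

In a real dataflow system each of the three steps would be a distributed operator, which is where the join optimizations mentioned in the abstract come in.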

The “other” systems for comparison were GraphLab and Giraph. Those systems were tuned in cooperation with experts in their use. These are some of the “fairest” benchmarks you are likely to see this year. Quite different from “shiny graph engine” versus lame or misconfigured system benchmarks.

Definitely the slow-read paper for this week!

I first saw this in a tweet by Arnon Rotem-Gal-Oz.

Comments Off

August 21, 2014

…Loosely Consistent Distributed Programming

Filed under: Consistency,Distributed Computing,Programming — Patrick Durusau @ 3:15 pm

Language Support for Loosely Consistent Distributed Programming by Neil Conway.

Abstract:

Driven by the widespread adoption of both cloud computing and mobile devices, distributed computing is increasingly commonplace. As a result, a growing proportion of developers must tackle the complexity of distributed programming—that is, they must ensure correct application behavior in the face of asynchrony, concurrency, and partial failure.

To help address these difficulties, developers have traditionally relied upon system infrastructure that provides strong consistency guarantees (e.g., consensus protocols and distributed transactions). These mechanisms hide much of the complexity of distributed computing—for example, by allowing programmers to assume that all nodes observe the same set of events in the same order. Unfortunately, providing such strong guarantees becomes increasingly expensive as the scale of the system grows, resulting in availability and latency costs that are unacceptable for many modern applications.

Hence, many developers have explored building applications that only require loose consistency guarantees—for example, storage systems that only guarantee that all replicas eventually converge to the same state, meaning that a replica might exhibit an arbitrary state at any particular time. Adopting loose consistency involves making a well-known tradeoff: developers can avoid paying the latency and availability costs incurred by mechanisms for achieving strong consistency, but in exchange they must deal with the full complexity of distributed computing. As a result, achieving correct application behavior in this environment is very difficult.

This thesis explores how to aid developers of loosely consistent applications by providing programming language support for the difficulties they face. The language level is a natural place to tackle this problem: because developers that use loose consistency have fewer system facilities that they can depend on, consistency concerns are naturally pushed into application logic. In part, our goal has been to recognize, formalize, and automate application-level consistency patterns.

We describe three language variants that each tackle a different challenge in distributed programming. Each variant is a modification of Bloom, a declarative language for distributed programming we have developed at UC Berkeley. The first variant of Bloom, Bloom^L, enables deterministic distributed programming without the need for distributed coordination. Second, Edelweiss allows distributed storage reclamation protocols to be generated in a safe and automatic fashion. Finally, Bloom^PO adds sophisticated ordering constraints that we use to develop a declarative, high-level implementation of concurrent editing, a particularly difficult class of loosely consistent programs.
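Deterministic results without coordination depend on state that only grows under a merge that is associative, commutative and idempotent. The sketch below is plain Python rather than Bloom, and the class `GrowOnlySet` and helper `merge_all` are hypothetical names; it simply checks that every delivery order of the same updates reaches the same state.

```python
from functools import reduce
from itertools import permutations

class GrowOnlySet:
    """A set lattice: merge is union, so state can only grow."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def merge(self, other):
        # Union is associative, commutative and idempotent.
        return GrowOnlySet(self.items | other.items)

def merge_all(updates):
    """Fold a sequence of updates into one state, starting from empty."""
    return reduce(GrowOnlySet.merge, updates, GrowOnlySet())

# Three hypothetical updates that may reach a replica in any order.
updates = [GrowOnlySet({"x"}), GrowOnlySet({"y", "z"}), GrowOnlySet({"x", "z"})]

# Collect the final state for every possible delivery order.
results = {merge_all(order).items for order in permutations(updates)}
print(results)
```

Because duplicates and reordering cannot change the outcome, replicas using such a merge need no agreement on message order.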

Unless you think of topic maps as static files, recent developments in “loosely consistent distributed programming” should be high on your reading list.

It’s entirely possible to have a topic map that is a static file, even one that has been printed out to paper. But that seems like a poor target for development. Captured information begins progressing towards staleness from the moment of its capture.

I first saw this in a tweet by Peter Bailis.

Comments Off

May 16, 2014

A Distributed Systems Reading List

Filed under: Distributed Computing,Distributed Consistency,Distributed Sensemaking,Distributed Systems — Patrick Durusau @ 6:56 pm

A Distributed Systems Reading List by

From the introduction:

I often argue that the toughest thing about distributed systems is changing the way you think. The below is a collection of material I’ve found useful for motivating these changes.

Categories include:

Thought Provokers
Amazon
Google
eBay
Consistency Models
Theory
Languages and Tools
Infrastructure
Storage
Paxos Consensus
Other Consensus Papers
Gossip Protocols (Epidemic Behaviors)
P2P

Unless you think the knowledge in your domain is small enough to fit into a single system, I suggest you start reading about distributed systems this weekend.

Enjoy!

I first saw this in a tweet by FoundationDB.

Comments Off

May 15, 2014

Distributed Environments and VirtualBox

Filed under: Distributed Computing,Distributed Systems,Virtual Machines — Patrick Durusau @ 10:35 am

While writing about Distributed LIBLINEAR, I discovered two guides to creating distributed environments with VirtualBox.

I mention that fact in the other post but thought the use of VirtualBox to create distributed environments needed more visibility than a mention.

The guides are:

MPI LIBLINEAR – VirtualBox Guide

Spark LIBLINEAR – VirtualBox Guide

and you will need to refer to the original site: Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments for information on using those environments with “Distributed LIBLINEAR.”

VirtualBox brings research on and using distributed systems within the reach of anyone with reasonable computing resources.

Please drop me a note if you are using VirtualBox to create distributed systems for topic map processing.

Comments Off

May 12, 2014

Distributed Systems and the End of the API

Filed under: CRDT,Distributed Computing,Distributed Systems — Patrick Durusau @ 3:53 pm

Distributed Systems and the End of the API by Chas Emerick.

From the post:

This is a written (expanded) narrative of the content from a talk I first gave at PhillyETE on April 23rd, 2014. It mostly follows the flow of the presentation given then, but with a level of detail that I hope enhances clarity of the ideas therein. The talk’s original slides are available, though the key illustrations and bullet points contained therein are replicated (and somewhat enhanced) below. When audio/video of the talk is published, I will update this page to link to it.

I have two claims of which I would like to convince you today:

The notion of the networked application API is an unsalvageable anachronism that fails to account for the necessary complexities of distributed systems.

There exist a set of formalisms that do account for these complexities, but which are effectively absent from modern programming practice.

A bit further into the paper, distributed systems are defined as:

A distributed system is one that is comprised of multiple processes that must communicate to perform work.

The bottom line is that, given the ambient nature of the networks that surround us and the dependence we have upon those networks for so many of the tasks our programs, clients, customers, and users take for granted, nearly every system we build is a distributed system. Unless your software runs in a totally isolated environment — e.g. on an air-gapped computer — you are building a distributed system.

This is problematic in that distributed systems exhibit a set of uniformly unintuitive behaviours related to causality, consistency, and availability. These behaviours are largely emergent, and spring from the equally unintuitive semantics of the non-locality of the parts of those distributed systems and the networks that connect them. None of these behaviours or semantics are related at all to those which we — as programmers and engineers — are typically trained and acclimated to expect and reason about.

Note that even if you are doing something small, or “normal”, or common, you are not immune to these challenges. Even the most vanilla web application is definitionally a distributed system. By sending data from one computer (e.g. a server) to another (e.g. your customer’s web browser), you end up having to contemplate and address all sorts of problems that simply don’t exist when you run a program in a single process on a single machine that doesn’t touch the network: consistency, coping with non-availability (i.e. latency, services being down, timing-related bugs caused by long-running computations or things as simple as garbage collection), dealing with repeated messages from clients with spotty connections, and more. If you’ve not been bitten by these things, that is evidence of luck (or, of your not having noticed the problems yet!), not of your being immune, or otherwise that what you’ve built is somehow not a distributed system and so isn’t subject to these challenges.
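One of the problems listed above, repeated messages from clients with spotty connections, has a common remedy: make the operation idempotent by keying it on a client-chosen request ID. The sketch below is a single-process illustration under that assumption, not anything taken from Chas's talk; `AccountService` and the request IDs are hypothetical.

```python
class AccountService:
    """Applies each request at most once, keyed by a client-chosen request ID."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> result of the first application

    def deposit(self, request_id, amount):
        # A retry of an already-applied request returns the recorded result
        # instead of applying the deposit a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance

service = AccountService()
first = service.deposit("req-1", 50)
retry = service.deposit("req-1", 50)   # client resent after a timeout
second = service.deposit("req-2", 25)
print(first, retry, second, service.balance)
```

A real service would also have to persist and eventually expire the table of seen IDs, which is exactly the sort of detail a simple request/response API tends to hide.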

A lot of heavy sledding but important for the future development of robust distributed systems.

It is important that people interested in semantics and XML participate in these discussions.

For example, Chas says of XML (and JSON):

the “richer” data representations that are favoured by most API services and clients (again, JSON, XML, etc) are fundamentally opaque and in general make reconciling independent changes impossible in a consistent way without special, often domain-specific intervention.

I am curious what is meant by “fundamentally opaque,” at least insofar as Chas is talking about XML. If he means that independent changes impact the tree structure and make reconciliation of concurrent changes challenging, ok, but that’s not being opaque. And even that is an artifact of a processing model for XML, not XML proper.

I am even more concerned about the “semantics” to be addressed in distributed systems. At this point I will have to take Chas’ word for the distributed systems preserving machine to machine semantics (I have much reading left to do) but correct machine processing doesn’t warrant correct semantics for a human consumer of the same data.

I first saw this in a tweet by Tom Santero.

Comments Off

March 27, 2014

LVars:…

Filed under: Distributed Computing,Functional Programming,Haskell,LVars,Parallel Programming,Parallelism — Patrick Durusau @ 6:14 pm

LVars: Lattice-based Data Structures for Deterministic Parallel and Distributed Programming by Lindsey Kuper.

At 144 slides and no sound, you probably need to pick up some background to really appreciate the slides.

I would start with: A ten-minute talk about my research, continue with the later posts under LVars, and then move on to:

LVars project repo: http://github.com/iu-parfunc/lvars

Code from this talk: http://github.com/lkuper/lvar-examples

Research blog: http://composition.al

Take up the slides when you feel comfortable with the nomenclature and basic concepts.

Comments Off

March 17, 2014

ethereum

Filed under: Distributed Computing — Patrick Durusau @ 4:37 pm

ethereum

From the webpage:

Ethereum is a platform and a programming language that makes it possible for any developer to build and publish next-generation distributed applications.

Ethereum can be used to codify, decentralize, secure and trade just about anything: voting, domain names, financial exchanges, crowdfunding, company governance, contracts and agreements of most kind, intellectual property, and even smart property thanks to hardware integration.

Ethereum borrows the concept of decentralized consensus that makes bitcoin so resilient, yet makes it trivial to build on its foundation. To find out more about how Ethereum works, consult the whitepaper.

…will you build out of the Ether?

Distributed systems are a great idea but most governments won’t tolerate parallel monetary systems.

Primarily because it interferes with the ability of a central bank to simply declare there is more money in the national treasury than anyone suspected.

Adjust the decimal place a bit and suddenly the government is solvent again.

Having a parallel monetary system, like Kong bucks, would interfere with that capability.

Comments Off

November 27, 2013

Erlang Handbook

Filed under: Distributed Computing,Erlang,Graphs — Patrick Durusau @ 12:57 pm

Erlang Handbook: A concise reference for Erlang

From the webpage:

Originally written by Bjarne Däcker and later revised by Robert Virding, the Erlang Handbook is a summary of the language features and the runtime system. It is aimed at people with some programming experience, serving as a quick introduction to the Erlang domain.

Erlang Handbook (current release, pdf)

The handbook is just that, a handbook. At forty-six pages, it is a highly useful but also highly condensed view of Erlang.

I have been reminded of Erlang twice this week already.

The first time was by The Distributed Complexity of Large-scale Graph Processing research paper with its emphasis on message passing between graph nodes as a processing model.

The other reminder was Jans Aasman’s How to Use Graph Databases… [Topic Maps as Graph++?].

Jans was extolling the use of graphs to manage data about telecom customers, with an emphasis on “near real-time.”

Something kept nagging at me when I was watching the video but it was only afterwards that I remembered Ericsson’s development and use of Erlang for exactly that use case.

By way of excuse, I was watching Jans’ video at the end of a long day.

Suggestions on where I can look for anyone using Erlang-based message passing for distributed processing of graphs?

With a truthful description like this one:

Erlang is a programming language used to build massively scalable soft real-time systems with requirements on high availability. Some of its uses are in telecoms, banking, e-commerce, computer telephony and instant messaging. Erlang’s runtime system has built-in support for concurrency, distribution and fault tolerance. (from http://www.erlang.org/)

are there any contraindications for Erlang?

Comments Off

November 26, 2013

The Distributed Complexity of Large-scale Graph Processing

Filed under: Distributed Computing,Graphs,Merging,Messaging,Topic Maps — Patrick Durusau @ 2:52 pm

The Distributed Complexity of Large-scale Graph Processing by Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, Peter Robinson.

Abstract:

Motivated by the increasing need for fast distributed processing of large-scale graphs such as the Web graph and various social networks, we study a message-passing distributed computing model for graph processing and present lower bounds and algorithms for several graph problems. This work is inspired by recent large-scale graph processing systems (e.g., Pregel and Giraph) which are designed based on the message-passing model of distributed computing.

Our model consists of a point-to-point communication network of $$k$$ machines interconnected by bandwidth-restricted links. Communicating data between the machines is the costly operation (as opposed to local computation). The network is used to process an arbitrary $$n$$-node input graph (typically $$n >> k > 1$$) that is randomly partitioned among the $$k$$ machines (a common implementation in many real world systems). Our goal is to study fundamental complexity bounds for solving graph problems in this model.

We present techniques for obtaining lower bounds on the distributed time complexity. Our lower bounds develop and use new bounds in random-partition communication complexity. We first show a lower bound of $$\Omega(n/k)$$ rounds for computing a spanning tree (ST) of the input graph. This result also implies the same bound for other fundamental problems such as computing a minimum spanning tree (MST). We also show an $$\Omega(n/k^2)$$ lower bound for connectivity, ST verification and other related problems.

We give algorithms for various fundamental graph problems in our model. We show that problems such as PageRank, MST, connectivity, and graph covering can be solved in $$\tilde{O}(n/k)$$ time, whereas for shortest paths, we present algorithms that run in $$\tilde{O}(n/\sqrt{k})$$ time (for $$(1+\epsilon)$$-factor approx.) and in $$\tilde{O}(n/k)$$ time (for $$O(\log n)$$-factor approx.) respectively.
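To get a feel for the model, the sketch below randomly partitions the vertices of an n-node graph among k machines and measures how many edges cross machine boundaries, which is the traffic that must use the bandwidth-restricted links. It is an illustration only, not code from the paper; the function names, the cycle graph and the values of n and k are assumptions. The last line prints n/k and n/k^2, the quantities in the lower bounds, ignoring the constants hidden by the asymptotic notation.

```python
import random

def random_partition(n, k, seed=0):
    """Assign each of n vertices to one of k machines uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n)]

def cross_machine_fraction(edges, owner):
    """Fraction of edges whose endpoints live on different machines."""
    crossing = sum(1 for u, v in edges if owner[u] != owner[v])
    return crossing / len(edges)

n, k = 10_000, 10
owner = random_partition(n, k)
ring = [(i, (i + 1) % n) for i in range(n)]  # a simple n-node cycle

# Each edge crosses machines with probability 1 - 1/k, so the measured
# fraction should be close to that value.
fraction = cross_machine_fraction(ring, owner)
print(fraction)
print(n / k, n / k**2)
```

With a random partition nearly every edge is remote once k is moderately large, which is why communication rather than local computation is treated as the cost.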

The authors state that their main goal is:

…is to investigate the distributed time complexity, i.e., the number of distributed “rounds”, for solving various fundamental graph problems. The time complexity not only captures the (potential) speed up possible for a problem, but it also implicitly captures the communication cost of the algorithm as well, since links can transmit only a limited amount of bits per round; equivalently, we can view our model where instead of links, machines can send/receive only a limited amount of bits per round (cf. Section 1.1).

How would you investigate the number of “rounds” needed to perform merging in a message passing topic map system?

With no one order of merging to reach a particular state, would you measure it statistically for some merging criteria N?

I first saw this in a tweet by Stefano Bertolo.

Comments (1)

November 23, 2013

SAMOA

Filed under: Data Mining,Distributed Computing,Distributed Systems,Machine Learning,Stream Analytics — Patrick Durusau @ 7:55 pm

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

https://github.com/yahoo/samoa

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.

Enter SAMOA

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

…

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?
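To make the streaming constraint concrete, here is a minimal sketch of a learner that sees each example once and updates immediately, which is the general shape of the spam example quoted above. It is plain Python, not SAMOA's API: the class name `OnlinePerceptron`, the function `run_stream`, and the feature dictionaries are all assumptions made for illustration.

```python
from collections import defaultdict


class OnlinePerceptron:
    """Binary classifier updated one example at a time (labels are +1 or -1)."""

    def __init__(self):
        # Hypothetical sparse weight vector keyed by feature name.
        self.weights = defaultdict(float)

    def predict(self, features):
        score = sum(self.weights[name] * value for name, value in features.items())
        return 1 if score >= 0 else -1

    def learn(self, features, label):
        # Update only on mistakes; the example is not stored afterwards.
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] += label * value


def run_stream(model, stream):
    """Test-then-train: predict each example first, then learn from it."""
    correct = 0
    total = 0
    for features, label in stream:
        correct += model.predict(features) == label
        model.learn(features, label)
        total += 1
    return correct / total if total else 0.0
```

The model is usable after every single example, so there is no window in which new reports sit idle waiting for the next batch retrain. The price is that algorithms needing several passes over the data do not fit this shape.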

October 24, 2013

On Demand Memory Specialization for Distributed Graph Databases

Filed under: Distributed Computing,Graphs,Topic Maps — Patrick Durusau @ 4:36 pm

On Demand Memory Specialization for Distributed Graph Databases by Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep Lluís Larriba-Pey.

Abstract:

In this paper, we propose the DN-tree that is a data structure to build lossy summaries of the frequent data access patterns of the queries in a distributed graph data management system. These compact representations allow us an efficient communication of the data structure in distributed systems. We exploit this data structure with a new Dynamic Data Partitioning strategy (DYDAP) that assigns the portions of the graph according to historical data access patterns, and guarantees a small network communication and a computational load balance in distributed graph queries. This method is able to adapt dynamically to new workloads and evolve when the query distribution changes. Our experiments show that DYDAP yields a throughput up to an order of magnitude higher than previous methods based on cache specialization, in a variety of scenarios, and the average response time of the system is divided by two.

Graph partitioning based on evolving and summarized query experience. Results in a min cut on far less than the whole graph.

A heavy read but promising approach.

Makes me curious about non-uniform merging priorities.

For example, there could be some topics that merge frequently, say those representing financial information. While others, say for historical figures, may merge infrequently.

Rather than treating all topics equally in terms of computational resources, some topics could be processed at higher priority than others. (I am assuming an appropriately defined SLA.)

The partitioning approach could be applicable to distributed topic maps as well.
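The idea of non-uniform merging priorities can be sketched with an ordinary priority queue. This is only an illustration of the scheduling idea, not anything from the paper: `MergeQueue`, the topic identifiers, and the numeric priority levels are hypothetical, and a real system would derive the priorities from whatever SLA applies.

```python
import heapq
import itertools


class MergeQueue:
    """Hands out topics to merge, most urgent first (lower number = more urgent)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so topics with equal priority come out in submission order.
        self._counter = itertools.count()

    def submit(self, topic_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), topic_id))

    def next_topic(self):
        """Return the most urgent topic id, or None when nothing is waiting."""
        if not self._heap:
            return None
        _priority, _order, topic_id = heapq.heappop(self._heap)
        return topic_id


queue = MergeQueue()
queue.submit("historical-figure", priority=5)  # merges rarely, can wait
queue.submit("stock-quote", priority=1)        # financial data, merge first
print(queue.next_topic())  # prints "stock-quote"
```

A plain heap like this can starve low-priority topics if urgent ones keep arriving, so a production version would likely need aging or per-class quotas.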

October 1, 2013

Apache Aurora

Filed under: Distributed Computing,Distributed Systems,Mesos — Patrick Durusau @ 6:26 pm

Apache Aurora

Apache Aurora entered incubation today!

From the webpage:

Aurora is a service scheduler used to schedule jobs onto Apache Mesos.

Oh, Apache Mesos?

From the webpage:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

All the wiring is still pretty close to the surface but that’s not going to last long.

Better to learn it now while people still think it is hard.

September 29, 2013

Design Patterns for Distributed…

Filed under: Distributed Computing,Distributed Systems,NoSQL — Patrick Durusau @ 4:45 pm

Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.

A bit dated (2009) but true design patterns should find refinement, not retirement.

Covers:

Consistent Hashing
Consistency Models
Data Models
Storage Layouts
Log-Structured Merge Trees

Curious if you would suggest substantial changes to these patterns some four (4) years later?
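For readers who have not met the first pattern on that list, here is a minimal sketch of consistent hashing with virtual nodes. It is a generic textbook version, not code from Lipcon's slides: the `HashRing` class, the choice of SHA-256, and the default of 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._hashes = []         # sorted ring positions
        self._owners = {}         # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            position = _hash(f"{node}#{i}")
            self._owners[position] = node
            bisect.insort(self._hashes, position)

    def remove_node(self, node):
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def get_node(self, key):
        """Return the node owning the first ring position after the key's hash."""
        if not self._hashes:
            raise KeyError("ring is empty")
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]
```

The property that matters is that removing a node only reassigns the keys that node owned, and adding a node only moves keys onto the new node. Everything else stays put, which is what keeps rebalancing cheap.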

September 11, 2013

…Conceptual Model For Evolving Graphs

Filed under: Distributed Computing,Evoluntionary,Graphs,Networks — Patrick Durusau @ 5:17 pm

An Analytics-Aware Conceptual Model For Evolving Graphs by Amine Ghrab, Sabri Skhiri, Salim Jouili, and Esteban Zimanyi.

Abstract:

Graphs are ubiquitous data structures commonly used to represent highly connected data. Many real-world applications, such as social and biological networks, are modeled as graphs. To answer the surge for graph data management, many graph database solutions were developed. These databases are commonly classified as NoSQL graph databases, and they provide better support for graph data management than their relational counterparts. However, each of these databases implement their own operational graph data model, which differ among the products. Further, there is no commonly agreed conceptual model for graph databases.

In this paper, we introduce a novel conceptual model for graph databases. The aim of our model is to provide analysts with a set of simple, well-defined, and adaptable conceptual components to perform rich analysis tasks. These components take into account the evolving aspect of the graph. Our model is analytics-oriented, flexible and incremental, enabling analysis over evolving graph data. The proposed model provides a typing mechanism for the underlying graph, and formally defines the minimal set of data structures and operators needed to analyze the graph.

The authors concede that much work, both theoretical and practical, remains to be done on their proposal.

With the rise of distributed computing, every “fact” depends upon a calculated moment of now. What was a “fact” five minutes ago may no longer be considered a “fact” but an “error.”

Who is responsible for changes in “facts,” warranties for “facts,” who gives and gets notices about changes in “facts,” all remain to be determined.

Models for evolving graphs may assist in untangling the rights, obligations and relationships that are nearly upon us with distributed computing.
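The point about a “fact” depending on a moment of now can be made concrete with a small sketch: a store that never overwrites a value but answers every question “as of” a given time. This is not the paper's model, only an illustration, and `FactHistory`, its method names, and the integer timestamps are hypothetical.

```python
import bisect


class FactHistory:
    """Keeps every recorded value of a fact and answers queries as of a timestamp."""

    def __init__(self):
        self._times = {}   # key -> sorted list of timestamps
        self._values = {}  # key -> values, parallel to the timestamps

    def record(self, key, timestamp, value):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        # Insert in timestamp order so late-arriving updates land in the right place.
        idx = bisect.bisect_right(times, timestamp)
        times.insert(idx, timestamp)
        values.insert(idx, value)

    def as_of(self, key, timestamp):
        """Return the value that was current at `timestamp`, or None if unknown then."""
        times = self._times.get(key, [])
        idx = bisect.bisect_right(times, timestamp)
        if idx == 0:
            return None
        return self._values[key][idx - 1]
```

Two readers asking about the same key at different moments can both be right, which is why the questions of notice and warranty raised above do not go away.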
