Archive for the ‘Clustering (servers)’ Category

Meet Fenton (my data crunching machine)

Saturday, February 25th, 2017

Meet Fenton (my data crunching machine) by Alex Staravoitau.

From the post:

As you might be aware, I have been experimenting with AWS as a remote GPU-enabled machine for a while, configuring Jupyter Notebook to use it as a backend. It seemed to work fine, although costs did build over time, and I had to always keep in mind to shut it off, alongside with a couple of other limitations. Long story short, around 3 months ago I decided to build my own machine learning rig.

My idea in a nutshell was to build a machine that would only act as a server, being accessible from anywhere to me, always ready to unleash its computational powers on whichever task I’d be working on. Although this setup did take some time to assess, assemble and configure, it has been working flawlessly ever since, and I am very happy with it.

This is the most crucial part. After serious consideration and leveraging the budget I decided to invest into EVGA GeForce GTX 1080 8GB card backed by Nvidia GTX 1080 GPU. It is really snappy (and expensive), and in this particular case it only takes 15 minutes to run — 3 times faster than a g2.2xlarge AWS machine! If you still feel hesitant, think of it this way: the faster your model runs, the more experiments you can carry out over the same period of time.
… (emphasis in original)

Total for this GPU rig? £1562.26

You now know the fate of your next big advance. 😉

If you are interested in comparing the performance of a Beowulf cluster, see: A Homemade Beowulf Cluster: Part 1, Hardware Assembly and A Homemade Beowulf Cluster: Part 2, Machine Configuration.

Either way, you are going to have enough processing power that your skill and not hardware limits are going to be the limiting factor.

DIY OpenShift Cluster

Saturday, April 30th, 2016


No videos are planned for the Neo4j cluster I mentioned in Visual Searching with Google – One Example – Neo4j – Raspberry Pi but that’s all right, Marek Jelen has started a series on building an OpenShift cluster.

Deploying embedded OpenShift cluster (part 1) introduces the project:

In this series we are going to discuss deploying OpenShift cluster on development boards, specifically MinnowBoards. You might be asking, why the hell would I do that? Well, there are some benefits. First, they do have much lower power consumption. In my case, I am using Minnowboards, with light demand, one board takes approximately 3W-4W. Running a cluster of 4 boards including a switch takes 17W, deploying and starting 10 containers adds 1W. But yeah, that does not include fast disks. But that will come as well. Next benefit is the form factor. My cluster of four boards has dimensions of 7.5cm x 10.0cm x 8cm, about the size of a pack of credit cards. Quite a powerful cluster that can fit pretty much anywhere. The small size bring another benefit – mobility. Do you need computer power on the go? Well, this kind of boards can help solve your problem. Anyway, let’s get on with it.

Lower power consumption and form factor aren’t high on my list as reasons to pursue this project.

Security while learning about OpenShift clusters would be my top reason.

Anything without an air gap between it and outside networks is by definition insecure. Even with air gaps systems can be insecure but air gaps reduce the attack surfaces.

I appreciate Marek’s preference for MinnowBoards but there is a vibrant community around the Raspberry Pi.

Looking forward to the next post in this series!

PS: Physical security is rarely accorded the priority it deserves. Using a MinnowBoard or Raspberry Pi, a very small form factor computer could be installed behind any external firewalls. “PC call home.”

Erlang/OTP [New Homepage]

Monday, June 16th, 2014

Erlang/OTP [New Homepage]

I saw a tweet advising that the Erlang/OTP homepage had been re-written.

This shot from the Wayback Machine, dated October 11, 2011, Erlang/OTP homepage 2011, is how I remember the old homepage.

Today, the page seems a bit deep to me but includes details like the top three reasons to use Erlang/OTP for a cluster system (C/S):

  • Cost cheaper to use an open source C/S than write or rent one
  • Speed To Market quicker to use an C/S than write one
  • Availability and Reliability Erlang/OTP systems have been measured at 99.9999999% uptime (31ms a year downtime) (emphasis added)

That would be a good question to ask at the next big data conference: What is the measured reliability of system X?

Elastic Mesos

Wednesday, January 8th, 2014

Mesosphere Launches Elastic Mesos, Makes Setting Up A Mesos Cluster A 3-Step Process by Frederic Lardinois.

From the post:

Mesosphere, a startup that focuses on developing Mesos, a technology that makes running complex distributed applications easier, is launching Elastic Mesos today. This new product makes setting up a Mesos cluster on Amazon Web Services a basic three-step process that asks you for the size of the cluster you want to set up, your AWS credentials and an email where you want to get notifications about your cluster’s state.

Given the complexity of setting up a regular Mesos cluster, this new project will make it easier for developers to experiment with Mesos and the frameworks Mesosphere and others have created around it.

As Mesosphere’s founder Florian Leibert describes it, for many applications, the data center is now the computer. Most applications now run on distributed systems, but connecting all of the distributed parts is often still a manual process. Mesos’ job is to abstract away all of these complexities and to ensure that an application can treat the data center and all your nodes as a single computer. Instead of setting up various server clusters for different parts of your application, Mesos creates a shared pool of servers where resources can be allocated dynamically as needed.

Remote computing isn’t as secure as my NATO SDIP-27 Level A (formerly AMSG 720B) and USA NSTISSAM Level I conformant office but there is a trade-off between maintenance/upgrade of local equipment and the convenience of remote computing.

In the near future, all forms of digital communication will be secure from the NSA and others. Before Snowden, it was widely known in a vague sense that the NSA and others were spying on U.S. citizens and others. Post-Snowden, user demand will result in vendors developing secure communications with two settings, secure and very secure.

Ironic that overreaching by the NSA will result in greater privacy for everyone of interest to the NSA.

PS: See Learn how to use Apache Mesos as well.

Installing Distributed Solr 4 with Fabric

Thursday, May 23rd, 2013

Installing Distributed Solr 4 with Fabric by Martijn Koster

From the post:

Solr 4 has a subset of features that allow it be run as a distributed fault-tolerant cluster, referred to as “SolrCloud”. Installing and configuring Solr on a multi-node cluster can seem daunting when you’re a developer who just wants to give the latest release a try. The wiki page is long and complex, and configuring nodes manually is laborious and error-prone. And while your OS has ZooKeeper/Solr packages, they are probably outdated. But it doesn’t have to be a lot of work: in this post I will show you how to deploy and test a Solr 4 cluster using just a few commands, using mechanisms you can easily adjust for your own deployments.

I am using a cluster consisting of a virtual machines running Ubuntu 12.04 64bit and I am controlling them from my MacBook Pro. The Solr configuration will mimic the Two shard cluster with shard replicas and zookeeper ensemble example from the wiki.

You can run this on AWS EC2, but some special considerations apply, see the footnote.

We’ll use Fabric, a light-weight deployment tool that is basically a Python library to easily execute commands on remote nodes over ssh. Compared to Chef/Puppet it is simpler to learn and use, and because it’s an imperative approach it makes sequential orchestration of dependencies more explicit. Most importantly, it does not require a separate server or separate node-side software installation.

DISCLAIMER: these instructions and associated scripts are released under the Apache License; use at your own risk.

I strongly recommend you use disposable virtual machines to experiment with.

Something to get you excited about the upcoming weekend!


What’s New in Cassandra 1.2 (Notes)

Saturday, January 12th, 2013

What’s New in Cassandra 1.2

From the description:

Apache Cassandra Project Chair, Jonathan Ellis, looks at all the great improvements in Cassandra 1.2, including Vnodes, Parallel Leveled Compaction, Collections, Atomic Batches and CQL3.

There is only so much you can cover in an hour but Jonathan did a good job of hitting the high points of virtual nodes (rebuild failed drives/nodes faster), atomic batches (fewer requirements on clients, new default btw), CQL improvements, and tracing.

Enough to make you interested in running (not watching) the examples plus your own.

The slides:

Cassandra homepage.

CQL 3 Language Reference.

The Cooperative Computing Lab

Monday, December 17th, 2012

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

Deploying a GraphLab/Spark/Mesos cluster on EC2

Tuesday, October 23rd, 2012

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu) who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

This tutorial is very new beta release. Please contact me if you are brave enough to try it out..

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

Meet the Committer, Part Two: Matt Foley [Ambari herein]

Sunday, September 23rd, 2012

Meet the Committer, Part Two: Matt Foley by Kim Truong

From the post:

For the next installation of “Future of Apache Hadoop” webinar series, I would like to introduce to you Matt Foley and Ambari. Matt is a member of Hortonworks technical staff, Committer and PMC member for Apache Hadoop core project and will be our guest speaker on September 26, 2012 @10am PDT / 1pm EDT webinar: Deployment and Management of Hadoop Clusters with AMBARI.

Get to know Matt in this second installment of our “Meet the Committer” series.

No pressure but I do hope this compares well to the Alan Gates webinar on Pig. No pressure. 😉

In case you want to investigate/learn/brush up on Ambari.

Automating Your Cluster with Cloudera Manager API

Monday, September 10th, 2012

Automating Your Cluster with Cloudera Manager API

From the post:

API access was a new feature introduced in Cloudera Manager 4.0 (download free edition here.). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.

Cloudera Manager API Basics

The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.

You can read the full API documentation here.

We are nearing mid-September so the holiday season will be here before long. It isn’t too early to start planning on price/hardware break points.

This will help configure a HDFS and MapReduce cluster on your holiday hardware.

Understanding Apache Hadoop’s Capacity Scheduler

Thursday, July 26th, 2012

Understanding Apache Hadoop’s Capacity Scheduler by Arun Murthy

From the post:

As organizations continue to ramp the number of MapReduce jobs processed in their Hadoop clusters, we often get questions about how best to share clusters. I wanted to take the opportunity to explain the role of Capacity Scheduler, including covering a few common use cases.

Let me start by stating the underlying challenge that led to the development of Capacity Scheduler and similar approaches.

As organizations become more savvy with Apache Hadoop MapReduce and as their deployments mature, there is a significant pull towards consolidation of Hadoop clusters into a small number of decently sized, shared clusters. This is driven by the urge to consolidate data in HDFS, allow ever-larger processing via MapReduce and reduce operational costs & complexity of managing multiple small clusters. It is quite common today for multiple sub-organizations within a single parent organization to pool together Hadoop/IT budgets to deploy and manage shared Hadoop clusters.

Initially, Apache Hadoop MapReduce supported a simple first-in-first-out (FIFO) job scheduler that was insufficient to address the above use case.

Enter the Capacity Scheduler.

Shared Hadoop clusters?

So long as we don’t have to drop off our punch cards at the shared Hadoop cluster computing center I suppose that’s ok.


Just teasing.

Shared Hadoop clusters are more cost effective and makes better use of your Hadoop specialists.

The Data-Scope Project

Sunday, February 5th, 2012

The Data-Scope Project – 6PB storage, 500GBytes/sec sequential IO, 20M IOPS, 130TFlops

From the post:

While Galileo played life and death doctrinal games over the mysteries revealed by the telescope, another revolution went unnoticed, the microscope gave up mystery after mystery and nobody yet understood how subversive would be what it revealed. For the first time these new tools of perceptual augmentation allowed humans to peek behind the veil of appearance. A new new eye driving human invention and discovery for hundreds of years.

Data is another material that hides, revealing itself only when we look at different scales and investigate its underlying patterns. If the universe is truly made of information, then we are looking into truly primal stuff. A new eye is needed for Data and an ambitious project called Data-scope aims to be the lens.

A detailed paper on the Data-Scope tells more about what it is:

The Data-Scope is a new scientific instrument, capable of ‘observing’ immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB. There is a vacuum today in data-intensive scientific computations, similar to the one that lead to the development of the BeoWulf cluster: an inexpensive yet efficient template for data intensive computing in academic environments based on commodity components. The proposed Data-Scope aims to fill this gap.

A very accessible interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available at The New Era of Computing: An Interview with “Dr. Data”. Roberto Zicari also has a good interview with Dr. Szalay in Objects in Space vs. Friends in Facebook.

I am not altogether convinced that the data/computing center model is the best one but the lessons learned here may hasten more sophisticated architectures.

Subject identity issues abound in any environment but some are easier to see in a complex one.

For example, what if the choices of researchers are captured as subject identifications and associations are created to other data set (or data within those sets) based on those choices?

Perhaps to power recommendations of additional data or notices of when additional data becomes available.

Spark: Cluster Computing with Working Sets

Sunday, January 8th, 2012

Spark: Cluster Computing with Working Sets

From the post:

One of the aspects you can’t miss even as you just begin reading this paper is the strong scent of functional programming that the design of Spark bears. The use of FP idioms is quite widespread across the architecture of Spark such as the ability to restore a partition from by applying a closure block, operations such as reduce and map/collect, distributed accumulators etc. It would suffice to say that it is a very functional system. Pun intended!

Spark is written in Scala and is well suited for the class of applications that reuse a working set of data across multiple parallel operations. It claims to outperform Hadoop by 10x in iterative machine learning jobs, and has been tried successfully to interactively query a 39 GB dataset with sub-second response time!

Its is built on top of Mesos, a resource management infrastructure, that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster.

Developers write a driving program that orchestrates various parallel operations. Spark’s programming model provides two abstractions to work with large datasets : resilient distributed datasets and parallel operations. In addition it supports two kinds of shared variables.

If more technical papers had previews like this one, more technical papers would be read!

Interesting approach on first blush. Not sure I make that much out of sub-second queries on 39 GB dataset as that is a physical memory issue these days. I do like the idea of sets of data, subject to repeated operations.

New: Spark Project Homepage.

New: Spark technical report published.

Apache Whirr 0.7.0 has been released

Wednesday, December 28th, 2011

Apache Whirr 0.7.0 has been released

From Patrick Hunt at Cloudera:

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.

  • WHIRR-384 – Add Mahout as a service
  • WHIRR-49 – Allow Whirr to use Chef for configuration management
  • WHIRR-258 – Add Ganglia as a service
  • WHIRR-385 – Implement support for using nodeless, masterless Puppet to provision and run scripts

Whirr 0.7.0 will be included in a scheduled update to CDH4.

Getting Involved

The Apache Whirr project is working on a number of new features. The How To Contribute page is a great place to start if you’re interested in getting involved as a developer.

Cluster management or even the “cloud” in your topic map future?

You could do worse than learning one of the most recent top level Apache top level projects to prepare for a future that may arrive sooner than you think!

Apache Zookeeper 3.3.4

Wednesday, November 30th, 2011

Apache Zookeeper 3.3.4

From the post:

Apache ZooKeeper release 3.3.4 is now available: this is a fix release covering 22 issues, 9 of which were considered blockers. Some of the more serious issues include:

  • ZOOKEEPER-1208 Ephemeral nodes may not be removed after the client session is invalidated
  • ZOOKEEPER-961 Watch recovery fails after disconnection when using chroot connection option
  • ZOOKEEPER-1049 Session expire/close flooding renders heartbeats to delay significantly
  • ZOOKEEPER-1156 Log truncation truncating log too much – can cause data loss
  • ZOOKEEPER-1046 Creating a new sequential node incorrectly results in a ZNODEEXISTS error
  • ZOOKEEPER-1097 Quota is not correctly rehydrated on snapshot reload
  • ZOOKEEPER-1117 zookeeper 3.3.3 fails to build with gcc >= 4.6.1 on Debian/Ubuntu

In case you are unfamiliar with Zookeeper:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. (from Apache Zookeeper)

More Google Cluster Data

Wednesday, November 30th, 2011

More Google Cluster Data

From the post:

Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.

In support of this we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (research blog on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.

Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:

I remember Robert Barta describing the use of topic maps for systems administration. This data set could give some insight into the design of a topic map for cluster management.

What subjects and relationships would you recognize, how and why?

If you are looking for employment, this might be a good way to attract Google’s attention. (Hint to Google: Releasing interesting data sets could be a way to vet potential applicants in realistic situations.)

ParLearning 2012 (silos or maps?)

Friday, September 23rd, 2011

ParLearning 2012 : Workshop on Parallel and Distributed Computing for Machine Learning and Inference Problems


When May 25, 2012 – May 25, 2012
Where Shanghai, China
Submission Deadline Dec 19, 2011
Notification Due Feb 1, 2012
Final Version Due Feb 21, 2012

From the notice:


  • Foster collaboration between HPC community and AI community
  • Applying HPC techniques for learning problems
  • Identifying HPC challenges from learning and inference
  • Explore a critical emerging area with strong industry interest without overlapping with existing IPDPS workshops
  • Great opportunity for researchers worldwide for collaborating with Chinese Academia and Industry


Authors are invited to submit manuscripts of original unpublished research that demonstrate a strong interplay between parallel/distributed computing techniques and learning/inference applications, such as algorithm design and libraries/framework development on multicore/ manycore architectures, GPUs, clusters, supercomputers, cloud computing platforms that target applications including but not limited to:

  • Learning and inference using large scale Bayesian Networks
  • Large scale inference algorithms using parallel TPIC models, clustering and SVM etc.
  • Parallel natural language processing (NLP).
  • Semantic inference for disambiguation of content on web or social media
  • Discovering and searching for patterns in audio or video content
  • On-line analytics for streaming text and multimedia content
  • Comparison of various HPC infrastructures for learning
  • Large scale learning applications in search engine and social networks
  • Distributed machine learning tools (e.g., Mahout and IBM parallel tool)
  • Real-time solutions for learning algorithms on parallel platforms

If you are wondering what role topic maps have to play in this arena, ask yourself the following question:

Will the systems and techniques demonstrated at this conference use the same means to identify the same subjects?*

If your answer is no, what would you suggest is the solution for mapping different identifications of the same subjects together?

My answer to that question is to use topic maps.

*Whatever your ascribe as its origin, semantic diversity is part and parcel of the human condition. We can either develop silos or maps across silos. Which do you prefer?

Spark – Lighting-Fast Cluster Computing

Monday, June 27th, 2011

Spark – Lighting-Fast Cluster Computing

From the webpage:

What is Spark?

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark integrates into the Scala language, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala interpreter.

What can it do?

Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can outperform Hadoop by 30x. However, you can use Spark’s convenient API to for general data processing too. Check out our example jobs.

Spark runs on the Mesos cluster manager, so it can coexist with Hadoop and other systems. It can read any data source supported by Hadoop.

Who uses it?

Spark was developed in the UC Berkeley AMP Lab. It’s used by several groups of researchers at Berkeley to run large-scale applications such as spam filtering, natural language processing and road traffic prediction. It’s also used to accelerate data analytics at Conviva. Spark is open source under a BSD license, so download it to check it out!

Hadoop must be doing something right to be treated as the solution to beat.

Still, depending on your requirements, Spark definitely merits your consideration.


Saturday, February 12th, 2011

BigCouch 0.3 release.

From the website:

BigCouch is our open-source flavor of CouchDB with built-in clustering capability.

The main difference between BigCouch and standalone Couch is the inclusion of an OTP application that ‘clusters’ CouchDB across multiple servers.

For now, BigCouch is a stand-alone fork of CouchDB. In the future, we believe (and hope!) that many of the upgrades we’ve made will be incorporated back into CouchDB proper.

Many worthwhile topic map applications can be written without clustering, but “clustering” is one of those buzz words to include your response to an RFP, grant proposal, etc.

Good to have some background on what clustering means/requires in general and beating on several of the clustering solutions will develop that background.

Not to mention that you will know when it makes sense to actually implement clustering.