Archive for the ‘Parallel Programming’ Category

Raspberry Pi Zero — The $5 Tiny Computer is Here [Paper Thoughts?]

Saturday, November 28th, 2015

Raspberry Pi Zero — The $5 Tiny Computer is Here by Swati Khandelwal.

From the post:

Get ready for a ThanksGiving celebration from the Raspberry Pi Foundation.

Raspberry Pi, the charitable foundation behind the United Kingdom’s best-selling computer, has just unveiled its latest wonder – the Raspberry Pi Zero.

Raspberry Pi Zero is a programmable computer that costs just $5 (or £4), may rank as the world’s cheapest computer.

The Raspberry Pi Zero is on sale from today and is also given away with this month’s copy of the Raspberry Pi own magazine MagPi (available at Barnes & Noble and Microcenter).

Do you intend to use your Raspberry Pi Zero, which far exceeds anything available during the early years of atomic bomb development “…as [a] really fast paper emulator?”

The quote is from:

…how the media in which we choose to represent our ideas shape (and too often, limit) what ideas we can have. “We have these things called computers, and we’re basically just using them as really fast paper emulators,” he says. “With the invention of the printing press, we invented a form of knowledge work which meant sitting at a desk, staring at little tiny rectangles and moving your hand a little bit. It used to be those tiny rectangles were papers or books and you’re moving your hand with a pen.

Now we’re staring at computer screens and moving our hands on a keyboard, but it’s basically the same thing. We’re computer users thinking paper thoughts.”

(The Utopian UI Architect)

Is the Raspberry Pi Zero going to be where you or your child steps beyond “…paper thoughts?”

Or doing the same activities of yesteryear, only faster?


FORTH For A Supercomputer

Monday, June 15th, 2015

Raspberry Pi JonesFORTH O/S

From the post:

A bare-metal operating system for Raspberry Pi, based on Jonesforth-ARM.

Jonesforth-ARM is an ARM port, by M2IHP’13 class members listed in AUTHORS, of x86 JonesForth.

x86 JonesForth is a Linux-hosted FORTH presented in a Literate Programming style by Richard W.M. Jones originally at Comments embedded in the original provide an excellent FORTH implementation tutorial. See the /annexia/ directory for a copy of this original source.

The algorithm for our unsigned DIVMOD instruction is extracted from ‘ARM Software Development Toolkit User Guide v2.50’ published by ARM in 1997-1998

Firmware files to make bootable images are maintained at See the /firmware/ directory for local copies used in the build process.

The Raspberry Pi or something very similar will be commonplace in the IoT.

Will those machines be doing your bidding or someone else’s?


Sunday, June 14th, 2015


From the getting started page:

ClojureCL is a Clojure library for High Performance Computing with OpenCL, which supports:

  • GPUs from AMD, nVidia, Intel;
  • CPUs from Intel, AMD, ARM etc;
  • Computing accelerators and embedded devices (Intel Xeon Phi, Parallella, etc.).
  • BTW, the page is asking for native English speakers to help edit the pages. Recent enough effort that the documentation may not be set in stone, unlike some projects.

    You have to smile at the comment found under: Making sense of OpenCL:

    Once you get past the beginner’s steep learning curve, it makes sense, and opens a whole new world of high-performance computing – you practically have a supercomputer on your desktop. (emphasis added)

    I’m not accustomed to that much honesty in documentation. 😉

    Surprising enough to make it worthwhile to read further.


    Fast parallel computing with Intel Phi coprocessors

    Tuesday, May 19th, 2015

    Fast parallel computing with Intel Phi coprocessors by Andrew Ekstrom.

    Andrew tells a tale of going from more than a week processing a 10,000×10,000 matrix raised to 10^17 to 6-8 hours and then substantially shorter times. Sigh, using Windows but still an impressive feat! As you might expect, using Revolution Analytics RRO, Intel’s Math Kernel Library (MKL), Intel Phi coprocessors, etc.

    There’s enough detail (I suspect) for you to duplicate this feat on your own Windows box, or perhaps more easily on Linux.


    I first saw this in a tweet by David Smith.

    Breaking the Similarity Bottleneck

    Saturday, May 9th, 2015

    Ultra-Fast Data-Mining Hardware Architecture Based on Stochastic Computing by Antoni Morro, et al.


    Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area.

    I haven’t included the hyperlinks, but:

    In this work we present a highly efficient methodology for data mining based on probabilistic processing. High dimensional data is inherently complex in clustering, classification and similarity search [15]. The proposed approach is evaluated showing its application to a similarity search over a huge database. Most data mining algorithms use similarity search as a subroutine core [16–18], and thus the time taken for this task is the bottleneck of virtually all data mining algorithms [19]. Similarity search plays a fundamental role in many data mining and machine learning problems, e.g. text categorization [20], collaborative filtering [21], time-series analysis [22,23], protein sequencing [24] or any application-specific task as petroglyphs comparison [25]. At the same time, the mining of huge datasets implies the use of large computer clusters [26,27]. The proposed approach based on the use of probabilistic processing shows large improvements in terms of hardware resources when compared with conventional solutions.

    Sorry they omitted topic maps but what is a merging criteria if it isn’t a type of “similarity?”

    From the conclusion:

    This implementation uses less hardware resources than conventional digital methodologies (based on binary and not probabilistic logic) and is able to process the order of 13GBytes of information per second (in contrast to the estimated 2GBytes/s of speed that could be achieved by the conventional implementation using the same hardware area). With the 12-dimensional space used to allocate each vector in the example shown in this paper we obtain the order of 1 billion of comparisons per second. A patent application has been done for this new mining methodology [32].

    The patent was filed in Spanish but English and French auto-translations are available.

    Hopefully the patent will be used in such a way as to promote widespread implementation of this technique.

    I could stand 1 billion comparisons a second, quite easily. Interactive development of merging algorithms anyone?

    I first saw this in a tweet by Stefano Bertolo.

    How to install RethinkDB on Raspberry PI 2

    Sunday, March 8th, 2015

    How to install RethinkDB on Raspberry PI 2 by Zaher Ghaibeh.

    A short video to help you install RethinkDB on your Raspberry PI 2.

    Install when your Raspberry PI 2 is not part of a larger cluster of Raspberry PI 2’s that is tracking social data on your intelligence services.

    Your advantage in that regard is that you aren’t (shouldn’t be) piling up bigger haystacks to investigate for needles using a pitchfork.

    Focused intelligence (beginning with HUMINT and incorporating SIGINT and other types of intelligence, can result in much higher quality intelligence at lower operational cost when compared to the data vacuum approach.

    For one reason, knowing the signal you are seeking boosts the chances of it being detected. Searching for an unknown signal adrift in a sea of data is a low percentage proposition.

    How do you plan to use your RethinkDB to track intelligence on local or state government?

    Raspberry Pi gets 6x the power, 2x the memory…

    Tuesday, February 3rd, 2015

    Raspberry Pi gets 6x the power, 2x the memory and still costs $35 by Stacey Higginbotham.

    From the post:

    Makers, academics and generally anyone who likes to play with computers: get ready for some awesomesauce. Raspberry Pis, the tiny Linux computers that currently sell for $35 are getting a makeover that will give a tremendous boost to their compute power and double their memory while still keeping their price the same.

    The Pi 2 boards will be available today, and Pi creator and CEO of Raspberry Pi (Trading) Ltd. Eben Upton says the organization has already built 100,000 units, so buyers shouldn’t have to wait like they did at the original Pi launch. The Pi 2 will have the following specs:

    • SoC : Broadcom BCM2836 (CPU, GPU, DSP, SDRAM, and single USB port)
    • CPU: 900 MHz quad-core ARM Cortex A7 (ARMv7 instruction set)
    • GPU: Broadcom VideoCore IV @ 250 MHz, OpenGL ES 2.0 (24 GFLOPS), 1080p30 MPEG-2 and VC-1 decoder (with license), 1080p30 h.264/MPEG-4 AVC high-profile decoder and encoder
    • Memory: 1 GB (shared with GPU)
    • Total backwards compatibility (in terms of multimedia, form-factor and interfacing) with Pi 1

    Why order a new Raspberry Pi?

    Well, Kevin Trainor had Ontopia running on the first version: Ontopia Runs on Raspberry Pi [This Rocks!]

    Hadoop on a Raspberry Pi

    And bear in mind my post: 5,000 operations per second – Computations for Hydrogen Bomb. What are you going to design with your new Raspberry Pi?

    If 5,000 operations per second could design a Hydrogen Bomb, what can you do with a 24 GFLOPS video chip? Faster Pac-Man, more detailed WarCraft, or Call of Duty, Future Warfare 4?

    Money makers no doubt but at the end of the day, still substitutes for changing the world.

    Parallel Programming with GPUs and R

    Sunday, February 1st, 2015

    Parallel Programming with GPUs and R by Norman Matloff.

    From the post:

    You’ve heard that graphics processing units — GPUs — can bring big increases in computational speed. While GPUs cannot speed up work in every application, the fact is that in many cases it can indeed provide very rapid computation. In this tutorial, we’ll see how this is done, both in passive ways (you write only R), and in more direct ways, where you write C/C++ code and interface it to R.

    Norman provides as readable an introduction to GPUs as I have seen in a while, a quick overview of R packages for accessing GPUs and then a quick look at writing CUDA code and problems you may have compiling it.

    Of particular note is Norman’s reference to a draft of his new book, Parallel Computation for Data Science, which introduces parallel computing with R.

    It won’t be long before parallel or not processing will be a detail hidden from the average programmer. Until then, see this post and Norman’s book for parallel processing and R.

    Dissertation draft readers wanted!

    Monday, August 18th, 2014

    Dissertation draft readers wanted!

    From the post:

    Inspired by Brent Yorgey, I’m finally going public with a draft of my dissertation!

    My thesis is that a certain kind of data structures, which I call “lattice-based data structures” or “LVars” for short, lend themselves well to guaranteed-deterministic parallel programming. My dissertation combines material from various alreadypublished papers, making it a three-papers-stapled-together dissertation in some sense, but I’m also retconning a lot of my work to make it tell the story I want to tell now.

    When people ask what the best introduction to LVars is, I have trouble recommending the first LVars paper; even though it was only published a year ago, my thinking has changed quite a lot as my collaborators and I have figured things out since then, and the paper doesn’t match the way I like to present things now. So I’m hoping that my dissertation will be something I can point to as the definitive introduction to LVars.1

    The latest draft is here; it’s automatically updated every time I commit to the repo.2 Because I thought it might be useful to my committee to see my thought process, I left my “peanut gallery” comments in there: notes to myself begin with “LK: ” and are in red, and TODOs are in a brighter red and begin with “TODO: ”. And, as you can see, there are still many TODOs — but it’s more or less starting to look like a dissertation. (Unlike Brent, I’m doing this a little late in the game: I’ve actually already sent a draft to my committee, and my defense is in only three weeks, on September 8. Still, I’m happy for any feedback, even at this late date; I probably won’t turn in a final version until some time after my defense, so there’s no rush.)

    I’ll echo Brent in saying that if you notice typos or grammatical errors, feel free to put in a pull request. However, if you have any more substantial comments or suggestions, please send me an email (lindsey at this domain) instead.

    Thanks so much for reading!

    What do you say?

    Ready to offer some eyes on a proposal for guaranteed-deterministic parallel programming?

    I’m interested both from the change tracking perspective of ODF as well as parallel processing of topic maps.

    Supercomputing frontiers and innovations

    Saturday, August 9th, 2014

    Supercomputing frontiers and innovations (New Journal)

    From the homepage:

    Parallel scientific computing has entered a new era. Multicore processors on desktop computers make parallel computing a fundamental skill required by all computer scientists. High-end systems have surpassed the Petaflop barrier, and significant efforts are devoted to the development of the next generation of hardware and software technologies towards Exascale systems. This is an exciting time for computing as we begin the journey on the road to exascale computing. ‘Going to the exascale’ will mean radical changes in computing architecture, software, and algorithms – basically, vastly increasing the levels of parallelism to the point of billions of threads working in tandem – which will force radical changes in how hardware is designed and how we go about solving problems. There are many computational and technical challenges ahead that must be overcome. The challenges are great, different than the current set of challenges, and exciting research problems await us.

    This journal, Supercomputing Frontiers and Innovations, gives an introduction to the area of innovative supercomputing technologies, prospective architectures, scalable and highly parallel algorithms, languages, data analytics, issues related to computational co-design, and cross-cutting HPC issues as well as papers on supercomputing education and massively parallel computing applications in science and industry.

    This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge. We hope you find this journal timely, interesting, and informative. We welcome your contributions, suggestions, and improvements to this new journal. Please join us in making this exciting new venture a success. We hope you will find Supercomputing Frontiers and Innovations an ideal venue for the publication of your team’s next exciting results.

    Becoming “massively parallel” isn’t going to free “computing applications in science and industry” from semantics. If anything, the more complex applications become, the easier it will be to mislay semantics, to the user’s peril.

    Semantic efforts that did not scale for applications in the last decade face even dimmer prospects in the face of “big data” and massively parallel applications.

    I suggest we move the declaration of semantics closer to or at the authors of content/data. At least as a starting point for discussion/research.

    Current issue.

    MapGraph:… [3 billion Traversed Edges Per Second (TEPS) on a GPU]

    Saturday, August 2nd, 2014

    MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs by Zhisong Fu, Michael Personick, and Bryan Thompson.


    High performance graph analytics are critical for a long list of application domains. In recent years, the rapid advancement of many-core processors, in particular graphical processing units (GPUs), has sparked a broad interest in developing high performance parallel graph programs on these architectures. However, the SIMT architecture used in GPUs places particular constraints on both the design and implementation of the algorithms and data structures, making the development of such programs difficult and time-consuming.

    We present MapGraph, a high performance parallel graph programming framework that delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU. MapGraph provides a high-level abstraction that makes it easy to write graph programs and obtain good parallel speedups on GPUs. To deliver high performance, MapGraph dynamically chooses among different scheduling strategies depending on the size of the frontier and the size of the adjacency lists for the vertices in the frontier. In addition, a Structure Of Arrays (SOA) pattern is used to ensure coalesced memory access. Our experiments show that, for many graph analytics algorithms, an implementation, with our abstraction, is up to two orders of magnitude faster than a parallel CPU implementation and is comparable to state-of-the-art, manually optimized GPU implementations. In addition, with our abstraction, new graph analytics can be developed with relatively little effort.

    Those of us who remember Bryan Thompson from the early days of topic maps are not surprised to see his name on a paper with phrases like: “…delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU,” and “…is up to two orders of magnitude faster than a parallel CPU implementation….”

    Heavy sledding but definitely worth the effort.

    Oh, btw, did I mention this is an open source project?

    I first saw this in MapGraph: speeding up graph processing with GPUs by Danny Bickson.

    COSMOS: Python library for massively parallel workflows

    Friday, August 1st, 2014

    COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385 )


    Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

    Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at and

    Contact: or

    Supplementary information: Supplementary data are available at Bioinformatics online.

    A very good abstract but for pitching purposes, I would have chosen the first paragraph of the introduction:

    The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.

    That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ details but the need to efficiently and effectively define them is a common problem.

    Improving RRB-Tree Performance through Transience

    Wednesday, July 23rd, 2014

    Improving RRB-Tree Performance through Transience by Jean Niklas L’orange.


    The RRB-tree is a confluently persistent data structure based on the persistent vector, with efficient concatenation and slicing, and effectively constant time indexing, updates and iteration. Although efficient appends have been discussed, they have not been properly studied.

    This thesis formally describes the persistent vector and the RRB-tree, and presents three optimisations for the RRB-tree which have been successfully used in the persistent vector. The differences between the implementations are discussed, and the performance is measured. To measure the performance, the C library librrb is implemented with the proposed optimisations.

    Results shows that the optimisations improves the append performance of the RRB-tree considerably, and suggests that its performance is comparable to mutable array lists in certain situations.

    Jean’s thesis is available at:

    Although immutable data structures are obviously better suited for parallel programming, years of hacks on mutable data structures have set a high bar for performance. Unreasonably, parallel programmers want the same level of performance from immutable data structures as from their current mutable ones. 😉

    Research such as Jean’s moves functional languages one step closer to being the default for parallel processing.


    Thursday, June 19th, 2014

    Rth: a Flexible Parallel Computation Package for R by Norm Matloff.

    From the post:

    The key feature of Rth is in the word flexible in the title of this post, which refers to the fact that Rth can be used on two different kinds of platforms for parallel computation: multicore systems and Graphics Processing Units (GPUs). You all know about the former–it’s hard to buy a PC these days that is not at least dual-core–and many of you know about the latter. If your PC or laptop has a somewhat high-end graphics card, this enables extremely fast computation on certain kinds of problems. So, whether have, say, a quad-core PC or a good NVIDIA graphics card, you can run Rth for fast computation, again for certain types of applications. And both multicore and GPUs are available in the Amazon EC2 cloud service.

    Rth Quick Start

    Our Rth home page tells you the GitHub site at which you can obtain the package, and how to install it. (We plan to place it on CRAN later on.) Usage is simple, as in this example:

    Rth is an example of what I call Pretty Good Parallelism (an allusion to Pretty Good Privacy). For certain applications it can get you good speedup on two different kinds of common platforms (multicore, GPU). Like most parallel computation systems, it works best on very regular, “embarrassingly parallel” problems. For very irregular, complex apps, one may need to resort to very detailed C code to get a good speedup.

    Rth has not been tested on Windows so I am sure the authors would appreciate reports on your use of Rth with Windows.

    Contributions of new Rth functions are solicited. At least if you don’t mind making parallel processing easier for everyone. 😉

    I first saw this in a tweet by Christopher Lalanne.

    Modern GPU

    Friday, June 6th, 2014

    Modern GPU by Sean Baxter.

    From the webpage:

    Modern GPU is code and commentary intended to promote new and productive ways of thinking about GPU computing.

    This project is a library, an algorithms book, a tutorial, and a best-practices guide. If you are new to CUDA, start here. If you’re already familiar with CUDA, are ready for a challenge, and want to learn design patterns for parallel programming, enjoy this series. (emphasis in original)

    Just in time for the weekend! And you don’t need an iOS device to read the text. 😉

    Skimming the FAQ this is serious work but will return serious dividends as well.


    Parallel Graph Partitioning for Complex Networks

    Monday, April 21st, 2014

    Parallel Graph Partitioning for Complex Networks by Henning Meyerhenke, Peter Sanders, and, Christian Schulz.


    Processing large complex networks like social networks or web graphs has recently attracted considerable interest. In order to do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners originally developed for more regular mesh-like networks do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsened graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very big. For example, our algorithm can partition a web graph with 3.3 billion edges in less than sixteen seconds using 512 cores of a high performance cluster while producing a high quality partition — none of the competing systems can handle this graph on our system.

    Clustering in this article is defined by a node’s “neighborhood,” I am curious if defining a “neighborhood” based on multi-part (hierarchical?) identifiers might enable parallel processing of merging conditions?

    While looking for resources on graph contraction, I encountered a series of lectures by Kanat Tangwongsan from: Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) (link to the course schedule with numerous resources):

    Lecture 17 — Graph Contraction I: Tree Contraction

    Lecture 18 — Graph Contraction II: Connectivity and MSTs

    Lecture 19 — Graph Contraction III: Parallel MST and MIS



    Thursday, March 27th, 2014

    LVars: Lattice-based Data Structures for Deterministic Parallel and Distributed Programming by Lindsey Kuper.

    At 144 slides and no sound, you probably need to pick up some background to really appreciate the slides.

    I would start with: A ten-minute talk about my research, continue with later post under LVars and then onto:

    LVars project repo:

    Code from this talk:

    Research blog:

    Take up the slides when you feel comfortable with the nomenclature and basic concepts.

    Beyond Monads: Practical Category Theory

    Wednesday, February 26th, 2014

    Beyond Monads: Practical Category Theory by Jim Duey.

    From the description:

    Category Theory is a rich toolbox of techniques to deal with complexity in building software. Unfortunately, the learning curve is steep with many abstract concepts that must be grasped before their utility is seen. This, coupled with an impenetrable vocabulary, are barriers for the average programmer to level up their capabilities, so that many of these benefits are not realized in industry.

    This talk gives the working programmer an avenue to begin using the tools Category Theory gives us by building a series of small steps starting from pure functions. The stage is set by describing how complexity arises from functional side effects. Then moves on to various forms of composition, gradually adding more structure to slice the accidental complexity into manageable pieces. Some of the vocabulary used in Category Theory will be explained to give listeners some signposts in their future study of these concepts.

    Listeners will leave this talk with a new perspective on the problems they face in their day-to-day programming tasks and some tools on how to address them, regardless of their chosen programming language.

    I would dearly love to see the presentation that went along with these slides!

    Jim also lists further references for which I have supplied links:

    Monads – Wadler “Monads for Functional Programming

    Applicative Functors – McBride/Paterson “Applicative Programming with Effects

    Comonads – Kieburtz “Codata and Comonads in Haskell

    Arrows – Hughes “Generalizing Monads to Arrows

    Jim does supply the link for: Haskell Typeclassopedia

    I guess it comes down to this question: Are you now or do you plan to be a parallel programmer?

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Thursday, February 13th, 2014

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics by Reynold S Xin, et. al.


    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model.

    To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.

    Contributions of the paper:

    1. a data model that unifies graphs and collections as composable first-class objects and enables both data-parallel and graph-parallel operations.

    2. identifying a “narrow-waist” for graph computation, consisting of a small, core set of graph-operators cast in classic relational algebra; we believe these operators can express all graph computations in previous graph parallel systems, including Pregel and GraphLab.

    3. an efficient distributed graph representation embedded in horizontally partitioned collections and indices, and a collection of execution strategies that achieve efficient graph computations by exploiting properties of graph computations.

    UPDATE GraphX merged into Spark 0.9.0 release:

    You will want to be familiar with Spark.

    I first saw this in a tweet by Stefano Bertolo.


    Tuesday, January 28th, 2014

    Large scale data analysis made easier with SparkR by Shivaram Venkataraman.

    From the post:

    R is a widely used statistical programming language and supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited as the runtime is single threaded and can only process data sets that fit in a single machine.

    In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a light-weight frontend to use Spark from R. SparkR allows users to create and transform RDDs in R and interactively run jobs from the R shell on a Spark cluster. You can can try out SparkR today by installing it from our github repo.

    Be mindful of the closing caveat:

    Right now, SparkR works well for algorithms like gradient descent that are parallelizable but requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLLib. More examples and details about SparkR can be found at

    Early days for SparkR but it has a lot of promise.

    I first saw this in a tweet by Jason Trost.

    MPGraph: [GPU = 3 Billion Traversed Edges Per Second]

    Wednesday, January 15th, 2014

    mpgraph Beta: Massively Parallel Graph processing on GPUs

    From the webpage:

    MPGraph is Massively Parallel Graph processing on GPUs.

    The MPGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MPGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.

    MPGraph is up to two order of magnitude faster than parallel CPU implementations on up 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation.

    New algorithms can be implemented in a few hours that fully exploit the data-level parallelism of the GPU and offer throughput of up to 3 billion traversed edges per second on a single GPU.

    Before some wag blows off the “3 billion traversed edges per second on a single GPU” by calling MPGraph a “graph compute engine,” consider this performance graphic:

    MPGraph performance

    Screenshot 1 / 1 MPGraph showing BFS speedup over graphlab. Comparison is a single NVIDIA K20 verus up to 24 CPU cores using an 3.33 GHz X5680 CPU chipset.

    Don’t let name calling keep you from seeking the graph performance you need.

    Flying an F-16 requires more user skill than a VW. But when you need an F-16, don’t settle for a VW because its easier.

    Supercomputing on the cheap with Parallella

    Tuesday, December 10th, 2013

    Supercomputing on the cheap with Parallella by Federico Lucifredi.

    From the post:

    Packing impressive supercomputing power inside a small credit card-sized board running Ubuntu, Adapteva‘s $99 ARM-based Parallella system includes the unique Ephiphany numerical accelerator that promises to unleash industrial strength parallel processing on the desktop at a rock-bottom price. The Massachusetts-based startup recently ran a successfully funded Kickstarter campaign and gained widespread attention only to run into a few roadblocks along the way. Now, with their setbacks behind them, Adapteva is slated to deliver its first units mid-December 2013, with volume shipping in the following months.

    What makes the Parallella board so exciting is that it breaks new ground: imagine an Open Source Hardware board, powered by just a few Watts of juice, delivering 90 GFLOPS of number crunching. Combine this with the possibility of clustering multiple boards, and suddenly the picture of an exceedingly affordable desktop supercomputer emerges.

    This review looks in-depth at a pre-release prototype board (so-called Generation Zero, a development run of 50 units), giving you a pretty complete overview of what the finished board will look like.

    Whether you participate in this aspect of the computing revolution or not, you will be impacted by it.

    The more successful Parallela and similar efforts become in bringing desktop supercomputing, the more pressure there will be on cloud computing providers to match those capabilities at lower prices.

    Another point of impact will be non-production experimentation with parallel processing. Which may, like Thomas Edison, discover (or re-discover) 10,000 ways that don’t work but discover 1 that far exceeds anyone’s expectations.

    That is to say that supercomputing will become cheap enough to tolerate frequent failure while experimenting with it.

    What would you like to invent for supercomputing?

    ParLearning 2014

    Friday, November 8th, 2013

    ParLearning 2014 The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics.


    Workshop Paper Due: December 30, 2013
    Author Notification: February 14, 2014
    Camera-ready Paper Due: March 14, 2014
    Workshop: May 23, 2014 Phoenix, AZ, USA

    From the webpage:

    Data-driven computing needs no introduction today. The case for using data for strategic advantages is exemplified by web search engines, online translation tools and many more examples. The past decade has seen 1) the emergence of multicore architectures and accelerators as GPGPUs, 2) widespread adoption of distributed computing via the map-reduce/hadoop eco-system and 3) democratization of the infrastructure for processing massive datasets ranging into petabytes by cloud computing. The complexity of the technological stack has grown to an extent where it is imperative to provide frameworks to abstract away the system architecture and orchestration of components for massive-scale processing. However, the growth in volume and heterogeneity in data seems to outpace the growth in computing power. A “collect everything” culture stimulated by cheap storage and ubiquitous sensing capabilities contribute to increasing the noise-to-signal ratio in all collected data. Thus, as soon as the data hits the processing infrastructure, determining the value of information, finding its rightful place in a knowledge representation and determining subsequent actions are of paramount importance. To use this data deluge to our advantage, a convergence between the field of Parallel and Distributed Computing and the interdisciplinary science of Artificial Intelligence seems critical. From application domains of national importance as cyber-security, health-care or smart-grid to providing real-time situational awareness via natural interface based smartphones, the fundamental AI tasks of Learning and Inference need to be enabled for large-scale computing across this broad spectrum of application domains.

    Many of the prominent algorithms for learning and inference are notorious for their complexity. Adopting parallel and distributed computing appears as an obvious path forward, but the mileage varies depending on how amenable the algorithms are to parallel processing and secondly, the availability of rapid prototyping capabilities with low cost of entry. The first issue represents a wider gap as we continue to think in a sequential paradigm. The second issue is increasingly recognized at the level of programming models, and building robust libraries for various machine-learning and inferencing tasks will be a natural progression. As an example, scalable versions of many prominent graph algorithms written for distributed shared memory architectures or clusters look distinctly different from the textbook versions that generations of programmers have grown with. This reformulation is difficult to accomplish for an interdisciplinary field like Artificial Intelligence for the sheer breadth of the knowledge spectrum involved. The primary motivation of the proposed workshop is to invite leading minds from AI and Parallel & Distributed Computing communities for identifying research areas that require most convergence and assess their impact on the broader technical landscape.

    Taking full advantage of parallel processing remains a distant goal. This workshop looks like a good concrete step towards that goal.

    DynamiTE: Parallel Materialization of Dynamic RDF Data

    Wednesday, October 23rd, 2013

    DynamiTE: Parallel Materialization of Dynamic RDF Data by Jacopo Urbani, Alessandro Margara, Ceriel Jacobs, Frank van Harmelen, Henri Bal.


    One of the main advantages of using semantically annotated data is that machines can reason on it, deriving implicit knowledge from explicit information. In this context, materializing every possible implicit derivation from a given input can be computationally expensive, especially when considering large data volumes.

    Most of the solutions that address this problem rely on the assumption that the information is static, i.e., that it does not change, or changes very infrequently. However, the Web is extremely dynamic: online newspapers, blogs, social networks, etc., are frequently changed so that outdated information is removed and replaced with fresh data. This demands for a materialization that is not only scalable, but also reactive to changes.

    In this paper, we consider the problem of incremental materialization, that is, how to update the materialized derivations when new data is added or removed. To this purpose, we consider the df RDFS fragment [12], and present a parallel system that implements a number of algorithms to quickly recalculate the derivation. In case new data is added, our system uses a parallel version of the well-known semi-naive evaluation of Datalog. In case of removals, we have implemented two algorithms, one based on previous theoretical work, and another one that is more efficient since it does not require a complete scan of the input.

    We have evaluated the performance using a prototype system called DynamiTE, which organizes the knowledge bases with a number of indices to facilitate the query process and exploits parallelism to improve the performance. The results show that our methods are indeed capable to recalculate the derivation in a short time, opening the door to reasoning on much more dynamic data than is currently possible.

    Not “lite” reading but refreshing to see the dynamic nature of information being taken as a starting point.

    Echoes of re-merging on the entry or deletion of information in a topic map. Yes?

    Source code online: (Java)

    I first saw this in a tweet by Marin Dimitrov.

    Work Efficient Parallel Algorithms for Large Graph Exploration

    Monday, October 21st, 2013

    Work Efficient Parallel Algorithms for Large Graph Exploration by Dip Sankar Banerjee, Shashank Sharma, Kishore Kothapalli.


    Graph algorithms play a prominent role in several fields of sciences and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real world data sets has been increasing. Parallelism offers only a limited succor to this situation as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique on three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.

    This paper makes me curious if pruning could be a viable solution for processing very large topic maps?

    I may have a map that includes all of Wikipedia but for some purposes, I have no interest in any of the Wikipedia data.

    Or I may have different accounting entries that display depending upon the user accessing my accounting map. Totals, audit trails should be generated from one consistent set of entries.

    Any experience or thoughts in that regard?

    Programming on Parallel Machines

    Tuesday, October 8th, 2013

    Programming on Parallel Machines by Norm Matloff.

    From “About This Book:”

    Why is this book di fferent from all other parallel programming books? It is aimed more on the practical end of things, in that:

    • There is very little theoretical content, such as O() analysis, maximum theoretical speedup, PRAMs, directed acyclic graphs (DAGs) and so on.
    • Real code is featured throughout.
    • We use the main parallel platforms OpenMP, CUDA and MPI rather than languages that at this stage are largely experimental or arcane.
    • The running performance themes communications latency, memory/network contention, load balancing and so on|are interleaved throughout the book, discussed in the context of specifi c platforms or applications.
    • Considerable attention is paid to techniques for debugging.

    The main programming language used is C/C++, but some of the code is in R, which has become the pre-eminent language for data analysis. As a scripting language, R can be used for rapid prototyping. In our case here, it enables me to write examples which are much less less cluttered than they would be in C/C++, thus easier for students to discern the fundamental parallelixation principles involved. For the same reason, it makes it easier for students to write their own parallel code, focusing on those principles. And R has a rich set of parallel libraries.

    It is assumed that the student is reasonably adept in programming, and has math background through linear algebra. An appendix reviews the parts of the latter needed for this book. Another appendix presents an overview of various systems issues that arise, such as process scheduling and virtual memory.

    It should be note that most of the code examples in the book are NOT optimized. The primary emphasis is on simplicity and clarity of the techniques and languages used. However, there is plenty of discussion on factors that aff ect speed, such cache coherency issues, network delays, GPU memory structures and so on.

    Here’s how to get the code fi les you’ll see in this book: The book is set in LaTeX, and the raw .tex files are available in Simply download the relevant fi le (the fi le names should be clear), then use a text editor to trim to the program code of interest.

    In order to illustrate for students the fact that research and teaching (should) enhance each other, I occasionally will make brief references here to some of my research work.

    Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples and so on, and of course x bugs and improve the exposition. For that reason, it is better to link to the latest version, which will always be at, rather than to copy it.

    If you don’t mind a bit of practical writing this early in the week, this could be the book for you!

    If you read/use the book, please give feedback to the author about any bug or improvements that can be made.

    That is a good way to encourage the production of open-source textbooks.

    Parallel Astronomical Data Processing with Python:…

    Saturday, August 17th, 2013

    Parallel Astronomical Data Processing with Python: Recipes for multicore machines by Bruce Berriman.

    From the post:

    Most astronomers (myself included) have a high performance compute engine on their desktops. Modern computers now contain multicore processors, whose development was prompted by the need to reduce heat dissipation and power consumption but which give users a powerful processing machine at their fingertips. Singh, Browne and Butler have recently posted a preprint on astro-ph, submitted to Astronomy and Computing, that offers recipes in Python for running data parallel processing on multicore machines. Such machines offer an alternative to grids, clouds and clusters for many tasks, and the authors give examples based on commonly used astronomy toolkits.

    The paper restricts itself to the use of CPython’s native multiprocessing module, for two reasons: much astronomical software is written in it, and it places sufficiently strong restrictions on managing threads launched by the OS that it can make parallel jobs run slower than serial jobs (not so for other flavors of Python, though, such as PyPy and Jython). The authors also chose to study data parallel applications, which are common in astronomy, rather than task parallel applications. The heart of the paper is a comparison of three approaches to multiprocessing in Python, with sample code snippets for each:

    Bruce’s quick overview will give you the motivation to read this paper.

    Astronomical data is easier to process in parallel than some data.

    Suggestions on how to transform other data to make it easier to process in parallel?

    Parallella is Shipping!

    Thursday, August 1st, 2013

    Creating a $99 parallel computing machine is just as hard as it sounds by Jon Brodkin.

    From the post:

    Ten months ago, the chipmaker Adapteva unveiled a bold quest—to create a Raspberry Pi-sized computer that can perform the same types of tasks typically reserved for supercomputers. And… they wanted to sell it for only $99. A successful Kickstarter project raised nearly $900,000 for the so-called “Parallella,” and the company got to work with a goal of shipping the first devices by February 2013 and the rest by May 2013.

    As so often happens, the deadlines slipped, but Adapteva has done what it set out to do. Last week, the company shipped the first 40 Parallellas and says it will ship all 6,300 computers ordered through the Kickstarter by the end of August. Anyone who didn’t back the Kickstarter can now pre-order for delivery in October.

    The first version of the board was finished in January, but it cost $150 to produce. “After that it was iterating time after time to get the bill of materials down to something we wouldn’t be losing $50 per board on,” Adapteva CEO and founder Andreas Olofsson told Ars.

    Adapteva called Parallella “A Supercomputer For Everyone” in the title of its Kickstarter. Of course, each individual Parallella computer is not a supercomputer in and of itself. But each is capable of the types of parallel computing tasks performed by supercomputers—on a smaller scale—and they can be joined together in Ethernet-connected clusters to create something resembling a supercomputer, Adapteva says.


    Parallella is open source hardware, meaning Adapteva released all the details about the components to the public. Board design files are on GitHub, for example. Theoretically, companies besides Adapteva could build Parallella computers.

    See the post for details but there is a higher priced 64-bit chip as well.

    Thoughts about parallel processing of topic maps?

    Medusa: Simplified Graph Processing on GPUs

    Saturday, May 11th, 2013

    Medusa: Simplified Graph Processing on GPUs by Jianlong Zhong, Bingsheng He.


    Graphs are the de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest path. However, it is difficult to write correct and efficient GPU programs and even more difficult for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs, and embraces a runtime system to automatically execute those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architecture features of GPU for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that (1) Medusa greatly simplifies implementation of GPGPU programs for graph processing, with much fewer lines of source code written by developers; (2) The optimization techniques significantly improve the performance of the runtime system, making its performance comparable with or better than the manually tuned GPU graph operations.

    Just in case you are interested in high performance graph processing. 😉

    Massively Parallel Suffix Array Queries…

    Saturday, May 11th, 2013

    Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.


    Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

    If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

    You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

    Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.