Archive for the ‘Parallelism’ Category

ClojureCL

Sunday, June 14th, 2015

ClojureCL

From the getting started page:

ClojureCL is a Clojure library for High Performance Computing with OpenCL, which supports:

  • GPUs from AMD, nVidia, Intel;
  • CPUs from Intel, AMD, ARM etc;
  • Computing accelerators and embedded devices (Intel Xeon Phi, Parallella, etc.).
  • BTW, the page is asking for native English speakers to help edit the pages. Recent enough effort that the documentation may not be set in stone, unlike some projects.

    You have to smile at the comment found under: Making sense of OpenCL:

    Once you get past the beginner’s steep learning curve, it makes sense, and opens a whole new world of high-performance computing – you practically have a supercomputer on your desktop. (emphasis added)

    I’m not accustomed to that much honesty in documentation. 😉

    Surprising enough to make it worthwhile to read further.

    Enjoy!

    ArrayFire: A Portable Open-Source Accelerated Computing Library

    Wednesday, December 10th, 2014

    ArrayFire: A Portable Open-Source Accelerated Computing Library by Pavan Yalamanchilli.

    From the post:

    The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

    ArrayFire Capabilities

    ArrayFire is an Fortran. ArrayFire has a range of functionality, including

    ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.

    In a past post about ArrayFire we demonstrated the ArrayFire capabilities and how you can increase your productivity by using ArrayFire. In this post I will tell you how you can use ArrayFire to exploit various kind of parallelism on NVIDIA GPUs.

    Just in case you get a box full of GPUs during the holidays and/or need better performance from ones you already have!

    Enjoy!

    Understanding and Expressing Scalable Concurrency

    Saturday, July 5th, 2014

    Understanding and Expressing Scalable Concurrency by Aaron Turon.

    Abstract
    

    The Holy Grail of parallel programming is to provide good speedup while hiding or avoiding the pitfalls of concurrency. But some level in the tower of abstraction must face facts: parallel processors execute code concurrently, and the interplay between concurrent code, synchronization, and the memory subsystem is a major determiner of performance. Effšective parallel programming must ultimately be supported by scalable concurrent algorithms—algorithms that tolerate (or even embrace) concurrency for the sake of scaling with available parallelism. This dissertation makes several contributions to the understanding and expression of such algorithms:

    • It shows how to understand scalable algorithms in terms of local protocols governing each part of their hidden state. These protocols are visual artifacts that can be used to informally explain an algorithm at the whiteboard. But they also play a formal role in a new logic for verifying concurrent algorithms, enabling correctness proofs that are local in space, time, and thread execution. Correctness is stated in terms of refinement: clients of an algorithm can reason as if they were using the much simpler specification code it refines.
    • It shows how to express synchronization in a declarative but scalable way, based on a new library providing join patterns. By declarative, we mean that the programmer needs only to write down the constraints of a synchronization problem, and the library will automatically derive a correct solution.By scalable, we mean that the derived solutions deliver robust performance with increasing processor count and problem complexity. The library’s performance on common synchronization problems is competitive with specialized algorithms from the literature.
    • It shows how to express scalable algorithms through reagents, a new monadic abstraction. With reagents, concurrent algorithms no longer need to be constructed from “wholecloth,” i.e., by using system-level primitives directly. Instead, they are built using a mixture of shared-state and message-passing combinators. Concurrency experts benefit, because they can write libraries at a higher level, with more reuse, without sacrificing scalability. Their clients benefit, because composition empowers them to extend and tailor a library without knowing the details of its underlying algorithms.

    Not for the faint of heart! 😉

    But if you are interested in algorithms for when processing is always parallel by default, best dig in.

    I like the author’s imagery of “Go Fish” when he says:

    A scalable hashtable is useful not just for concurrent systems; it can also be a boon for explicit parallel programming. A simple but vivid example is the problem of duplicate removal: given a vector of items, return the items in any order, but without any duplicates. Since the input is unstructured, any way of dividing it amongst parallel threads appears to require global coordination to discover duplicate items. The key to avoiding a multiprocessor game of “Go Fish” is to focus on producing the output rather than dividing the input. If threads share a scalable hashtable that allows parallel insertion of distinct elements, they can construct the correct output with (on average) very little coordination, by simply each inserting a segment of the input into the table, one element at a time.

    Now that I think about it, topic map processing does a lot of duplicate removal.

    Topic maps in a parallel processing environment anyone?

    I first saw this in a tweet by Alex Clemmer.

    LVars:…

    Thursday, March 27th, 2014

    LVars: Lattice-based Data Structures for Deterministic Parallel and Distributed Programming by Lindsey Kuper.

    At 144 slides and no sound, you probably need to pick up some background to really appreciate the slides.

    I would start with: A ten-minute talk about my research, continue with later post under LVars and then onto:

    LVars project repo: http://github.com/iu-parfunc/lvars

    Code from this talk: http://github.com/lkuper/lvar-examples

    Research blog: http://composition.al

    Take up the slides when you feel comfortable with the nomenclature and basic concepts.

    Is Parallel Programming Hard,…

    Tuesday, March 25th, 2014

    Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney.

    From Chapter 1 How To Use This Book:

    The purpose of this book is to help you understand how to program shared-memory parallel machines without risking your sanity.[1] By describing the algorithms and designs that have worked well in the past, we hope to help you avoid at least some of the pitfalls that have beset parallel-programming projects. But you should think of this book as a foundation on which to build, rather than as a completed cathedral. Your mission, if you choose to accept, is to help make further progress in the exciting field of parallel programming—progress that should in time render this book obsolete. Parallel programming is not as hard as some say, and we hope that this book makes your parallel-programming projects easier and more fun.

    In short, where parallel programming once focused on science, research, and grand-challenge projects, it is quickly becoming an engineering discipline. We therefore examine the specific tasks required for parallel programming and describe how they may be most effectively handled. In some surprisingly common special cases, they can even be automated.

    This book is written in the hope that presenting the engineering discipline underlying successful parallel-programming projects will free a new generation of parallel hackers from the need to slowly and painstakingly reinvent old wheels, enabling them to instead focus their energy and creativity on new frontiers. We sincerely hope that parallel programming brings you at least as much fun, excitement, and challenge that it has brought to us!

    I should not have been surprised by:

    16.4 Functional Programming for Parallelism

    When I took my first-ever functional-programming class in the early 1980s, the professor asserted that the side- effect-free functional-programming style was well-suited to trivial parallelization and analysis. Thirty years later, this assertion remains, but mainstream production use of parallel functional languages is minimal, a state of affairs that might well stem from this professor’s additional assertion that programs should neither maintain state nor do I/O. There is niche use of functional languages such as Erlang, and multithreaded support has been added to several other functional languages, but mainstream production usage remains the province of procedural languages such as C, C++, Java, and FORTRAN (usually augmented with OpenMP or MPI).

    The state of software vulnerability is testimony enough to the predominance of C, C++, and Java.

    I’m not real sure I would characterize Erlang as a “niche” language. Niche languages aren’t often found running telecommunications networks, or at least that is my impression.

    I would take McKenney’s comments as a challenge to use functional languages such as Clojure and Erlang to make in-roads into mainstream production.

    While you use this work to learn the procedural approach to parallelism, you can be building contrasts to a functional one.

    I first saw this in Nat Torkington’s Four short links: 13 March 2014.

    Beyond Monads: Practical Category Theory

    Wednesday, February 26th, 2014

    Beyond Monads: Practical Category Theory by Jim Duey.

    From the description:

    Category Theory is a rich toolbox of techniques to deal with complexity in building software. Unfortunately, the learning curve is steep with many abstract concepts that must be grasped before their utility is seen. This, coupled with an impenetrable vocabulary, are barriers for the average programmer to level up their capabilities, so that many of these benefits are not realized in industry.

    This talk gives the working programmer an avenue to begin using the tools Category Theory gives us by building a series of small steps starting from pure functions. The stage is set by describing how complexity arises from functional side effects. Then moves on to various forms of composition, gradually adding more structure to slice the accidental complexity into manageable pieces. Some of the vocabulary used in Category Theory will be explained to give listeners some signposts in their future study of these concepts.

    Listeners will leave this talk with a new perspective on the problems they face in their day-to-day programming tasks and some tools on how to address them, regardless of their chosen programming language.

    I would dearly love to see the presentation that went along with these slides!

    Jim also lists further references for which I have supplied links:

    Monads – Wadler “Monads for Functional Programming

    Applicative Functors – McBride/Paterson “Applicative Programming with Effects

    Comonads – Kieburtz “Codata and Comonads in Haskell

    Arrows – Hughes “Generalizing Monads to Arrows

    Jim does supply the link for: Haskell Typeclassopedia http://www.haskell.org/haskellwiki/Typeclassopedia

    I guess it comes down to this question: Are you now or do you plan to be a parallel programmer?

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Thursday, February 13th, 2014

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics by Reynold S Xin, et. al.

    Abstract:

    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model.

    To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.

    Contributions of the paper:

    1. a data model that unifies graphs and collections as composable first-class objects and enables both data-parallel and graph-parallel operations.

    2. identifying a “narrow-waist” for graph computation, consisting of a small, core set of graph-operators cast in classic relational algebra; we believe these operators can express all graph computations in previous graph parallel systems, including Pregel and GraphLab.

    3. an efficient distributed graph representation embedded in horizontally partitioned collections and indices, and a collection of execution strategies that achieve efficient graph computations by exploiting properties of graph computations.

    UPDATE GraphX merged into Spark 0.9.0 release: http://spark.incubator.apache.org/releases/spark-release-0-9-0.html

    You will want to be familiar with Spark.

    I first saw this in a tweet by Stefano Bertolo.

    Exploiting Parallelism and Scalability (XPS)

    Monday, January 13th, 2014

    Exploiting Parallelism and Scalability (XPS) NSF

    Full Proposal Window: February 10, 2014 – February 24, 2014

    Synopsis:

    Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism is abundant at many levels. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies alone has ended. Thus, parallelism has become critically important.

    The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from services and applications down to the micro-architecture– and will be built on new concepts, theories, and foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, new programming models and languages, new hardware architectures, compilers, operating systems and run-time systems, and must exploit domain and application-specific knowledge. Research is also needed on energy efficiency, communication efficiency, and on enabling the division of effort between edge devices and clouds.

    The January 10th webinar for this activity hasn’t been posted yet.

    Without semantics, XPS will establish a new metric:

    GFS: Garbage per Femtosecond.

    Supercomputing on the cheap with Parallella

    Tuesday, December 10th, 2013

    Supercomputing on the cheap with Parallella by Federico Lucifredi.

    From the post:

    Packing impressive supercomputing power inside a small credit card-sized board running Ubuntu, Adapteva‘s $99 ARM-based Parallella system includes the unique Ephiphany numerical accelerator that promises to unleash industrial strength parallel processing on the desktop at a rock-bottom price. The Massachusetts-based startup recently ran a successfully funded Kickstarter campaign and gained widespread attention only to run into a few roadblocks along the way. Now, with their setbacks behind them, Adapteva is slated to deliver its first units mid-December 2013, with volume shipping in the following months.

    What makes the Parallella board so exciting is that it breaks new ground: imagine an Open Source Hardware board, powered by just a few Watts of juice, delivering 90 GFLOPS of number crunching. Combine this with the possibility of clustering multiple boards, and suddenly the picture of an exceedingly affordable desktop supercomputer emerges.

    This review looks in-depth at a pre-release prototype board (so-called Generation Zero, a development run of 50 units), giving you a pretty complete overview of what the finished board will look like.

    Whether you participate in this aspect of the computing revolution or not, you will be impacted by it.

    The more successful Parallela and similar efforts become in bringing desktop supercomputing, the more pressure there will be on cloud computing providers to match those capabilities at lower prices.

    Another point of impact will be non-production experimentation with parallel processing. Which may, like Thomas Edison, discover (or re-discover) 10,000 ways that don’t work but discover 1 that far exceeds anyone’s expectations.

    That is to say that supercomputing will become cheap enough to tolerate frequent failure while experimenting with it.

    What would you like to invent for supercomputing?

    Use multiple CPU Cores with your Linux commands…

    Saturday, November 2nd, 2013

    Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.

    From the post:

    Here’s a common problem: You ever want to add up a very large list (hundreds of megabytes) or grep through it, or other kind of operation that is embarrassingly parallel? Data scientists, I am talking to you. You probably have about four cores or more, but our tried and true tools like grep, bzip2, wc, awk, sed and so forth are singly-threaded and will just use one CPU core. To paraphrase Cartman, “How do I reach these cores”? Let’s use all of our CPU cores on our Linux box by using GNU Parallel and doing a little in-machine map-reduce magic by using all of our cores and using the little-known parameter –pipes (otherwise known as –spreadstdin). Your pleasure is proportional to the number of CPUs, I promise. BZIP2 So, bzip2 is better compression than gzip, but it’s so slow! Put down the razor, we have the technology to solve this.

    A very interesting post, particularly if you explore data with traditional Unix tools.

    A comment to the post mentions that SSD is being presumed by the article.

    Perhaps but learning the technique will be useful for when SSD is standard.

    I first saw this in Pete Warden’s Five short links for Thursday, October 31, 2013.

    DynamiTE: Parallel Materialization of Dynamic RDF Data

    Wednesday, October 23rd, 2013

    DynamiTE: Parallel Materialization of Dynamic RDF Data by Jacopo Urbani, Alessandro Margara, Ceriel Jacobs, Frank van Harmelen, Henri Bal.

    Abstract:

    One of the main advantages of using semantically annotated data is that machines can reason on it, deriving implicit knowledge from explicit information. In this context, materializing every possible implicit derivation from a given input can be computationally expensive, especially when considering large data volumes.

    Most of the solutions that address this problem rely on the assumption that the information is static, i.e., that it does not change, or changes very infrequently. However, the Web is extremely dynamic: online newspapers, blogs, social networks, etc., are frequently changed so that outdated information is removed and replaced with fresh data. This demands for a materialization that is not only scalable, but also reactive to changes.

    In this paper, we consider the problem of incremental materialization, that is, how to update the materialized derivations when new data is added or removed. To this purpose, we consider the df RDFS fragment [12], and present a parallel system that implements a number of algorithms to quickly recalculate the derivation. In case new data is added, our system uses a parallel version of the well-known semi-naive evaluation of Datalog. In case of removals, we have implemented two algorithms, one based on previous theoretical work, and another one that is more efficient since it does not require a complete scan of the input.

    We have evaluated the performance using a prototype system called DynamiTE, which organizes the knowledge bases with a number of indices to facilitate the query process and exploits parallelism to improve the performance. The results show that our methods are indeed capable to recalculate the derivation in a short time, opening the door to reasoning on much more dynamic data than is currently possible.

    Not “lite” reading but refreshing to see the dynamic nature of information being taken as a starting point.

    Echoes of re-merging on the entry or deletion of information in a topic map. Yes?

    Source code online: https://github.com/jrbn/dynamite (Java)

    I first saw this in a tweet by Marin Dimitrov.

    Programming on Parallel Machines

    Tuesday, October 8th, 2013

    Programming on Parallel Machines by Norm Matloff.

    From “About This Book:”

    Why is this book di fferent from all other parallel programming books? It is aimed more on the practical end of things, in that:

    • There is very little theoretical content, such as O() analysis, maximum theoretical speedup, PRAMs, directed acyclic graphs (DAGs) and so on.
    • Real code is featured throughout.
    • We use the main parallel platforms OpenMP, CUDA and MPI rather than languages that at this stage are largely experimental or arcane.
    • The running performance themes communications latency, memory/network contention, load balancing and so on|are interleaved throughout the book, discussed in the context of specifi c platforms or applications.
    • Considerable attention is paid to techniques for debugging.

    The main programming language used is C/C++, but some of the code is in R, which has become the pre-eminent language for data analysis. As a scripting language, R can be used for rapid prototyping. In our case here, it enables me to write examples which are much less less cluttered than they would be in C/C++, thus easier for students to discern the fundamental parallelixation principles involved. For the same reason, it makes it easier for students to write their own parallel code, focusing on those principles. And R has a rich set of parallel libraries.

    It is assumed that the student is reasonably adept in programming, and has math background through linear algebra. An appendix reviews the parts of the latter needed for this book. Another appendix presents an overview of various systems issues that arise, such as process scheduling and virtual memory.

    It should be note that most of the code examples in the book are NOT optimized. The primary emphasis is on simplicity and clarity of the techniques and languages used. However, there is plenty of discussion on factors that aff ect speed, such cache coherency issues, network delays, GPU memory structures and so on.

    Here’s how to get the code fi les you’ll see in this book: The book is set in LaTeX, and the raw .tex files are available in http://heather.cs.ucdavis.edu/~matloff/158/PLN. Simply download the relevant fi le (the fi le names should be clear), then use a text editor to trim to the program code of interest.

    In order to illustrate for students the fact that research and teaching (should) enhance each other, I occasionally will make brief references here to some of my research work.

    Like all my open source textbooks, this one is constantly evolving. I continue to add new topics, new examples and so on, and of course x bugs and improve the exposition. For that reason, it is better to link to the latest version, which will always be at http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf, rather than to copy it.

    If you don’t mind a bit of practical writing this early in the week, this could be the book for you!

    If you read/use the book, please give feedback to the author about any bug or improvements that can be made.

    That is a good way to encourage the production of open-source textbooks.

    PODC and SPAA 2013 Accepted Papers

    Thursday, April 25th, 2013

    ACM Symposium on Principles of Distributed Computing [PODC] accepted papers. (Montréal, Québec, Canada, July 22-24, 2013) Main PODC page.

    Symposium on Parallelism in Algorithms and Architectures [SPAA] accepted papers. (Montréal, Québec, Canada, July 23 – 25, 2013) Main SPAA page.

    Just scanning the titles reveals a number of very interesting papers.

    Suggest you schedule a couple of weeks of vacation in Canada following SPAA before attending the Balisage Conference, August 6-9, 2013.

    The weather is quite temperate and the outdoor dining superb.

    I first saw this at: PODC AND SPAA 2013 ACCEPTED PAPERS.

    Parallella: The $99 Linux supercomputer

    Thursday, April 18th, 2013

    Parallella: The $99 Linux supercomputer by Steven J. Vaughan-Nichols.

    From the post:

    What Adapteva has done is create a credit-card sized parallel-processing board. This comes with a dual-core ARM A9 processor and a 64-core Epiphany Multicore Accelerator chip, along with 1GB of RAM, a microSD card, two USB 2.0 ports, 10/100/1000 Ethernet, and an HDMI connection. If all goes well, by itself, this board should deliver about 90 GFLOPS of performance, or — in terms PC users understand — about the same horse-power as a 45GHz CPU.

    This board will use Ubuntu Linux 12.04 for its operating system. To put all this to work, the platform reference design and drivers are now available.

    From Adapteva.

    I wonder which will come first:

    A really kick-ass 12 dimensional version of Asteroids?

    or

    New approaches to graph processing?

    What do you think?

    Exploiting Parallelism and Scalability (XPS) (NSF)

    Thursday, October 25th, 2012

    Exploiting Parallelism and Scalability (XPS) (NSF)

    From the announcement:

    Synopsis of Program:

    Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism has become ubiquitous at many levels. The proliferation of multi- and many-core processors, ever-increasing numbers of interconnected high performance and data intensive edge devices, and the data centers servicing them, is enabling a new set of global applications with large economic and social impact. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies has ended.

    The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. XPS seeks research re-evaluating, and possibly re-designing, the traditional computer hardware and software stack for today’s heterogeneous parallel and distributed systems and exploring new holistic approaches to parallelism and scalability. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from the application layer down to the micro-architecture– and will be built on new concepts and new foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, programming models and languages, hardware architectures, compilers, operating systems and run-time systems, and exploit domain and application-specific knowledge. Research should also focus on energy- and communication-efficiency and on enabling the division of effort between edge devices and clouds.

    Full proposals due: February 20, 2013, (due by 5 p.m. proposer’s local time).

    I see the next wave of parallelism and scalability being based on language and semantics. Less so on more cores and better designs in silicon.

    Not surprising since I work in languages and semantics every day.

    Even so, consider a go-cart that exceeds 160 miles per hour (260 km/h) remains a go-cart.

    Go beyond building a faster go-cart.

    Consider language and semantics when writing your proposal for this program.

    Parallella: A Supercomputer For Everyone

    Friday, October 5th, 2012

    Parallella: A Supercomputer For Everyone

    For a $99 pledge you help make the Parallella computer a reality (and get one when produced).

    • Dual-core ARM A9 CPU
    • Epiphany Multicore Accelerator (16 or 64 cores)
    • 1GB RAM
    • MicroSD Card
    • USB 2.0 (two)
    • Two general purpose expansion connectors
    • Ethernet 10/100/1000
    • HDMI connection
    • Ships with Ubuntu OS
    • Ships with free open source Epiphany development tools that include C compiler, multicore debugger, Eclipse IDE, OpenCL SDK/compiler, and run time libraries.
    • Dimensions are 3.4” x 2.1”

    Once completed, the Parallella computer should deliver up to 45 GHz of equivalent CPU performance on a board the size of a credit card while consuming only 5 Watts under typical work loads. Counting GHz, this is more horsepower than a high end server costing thousands of dollars and consuming 400W.

    $99 to take a flyer on changing the fabric of supercomputing?

    I’ll take that chance. How about you?

    PS: Higher pledge amounts carry extra benefits, such as projected delivery of a beta version by December of 2012. ($5,000) Got a hard core geek on your holiday shopping list?

    PPS: I first saw this at: Adapteva Launches Crowd-Source Funding for Its Floating Point Accelerator by Michael Feldman (HPC).

    Cell Architectures (adding dashes of heterogeneity)

    Saturday, May 12th, 2012

    Cell Architectures

    From the post:

    A consequence of Service Oriented Architectures is the burning need to provide services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

    A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

    Cell Architectures have several advantages:

    • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
    • Cell are added in an incremental fashion as more capacity is required.
    • Cells isolate failures. One cell failure does not impact other cells.
    • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
    • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
    • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.

    The intersection of semantic heterogeneity and scaling remains largely unexplored.

    I suggest scaling in a homogeneous environment and then adding dashes of heterogeneity to see what breaks.

    Adjust and try again.

    Ohio State University Researcher Compares Parallel Systems

    Tuesday, April 3rd, 2012

    Ohio State University Researcher Compares Parallel Systems

    From the post:

    Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

    The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

    “We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

    Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

    What I found fascinating about this approach was the comparison of:

    the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

    True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

    Thinking that for subject identity and “merging” in a “big data” context, that we will need a deeper understanding of specific identity and merging requirements. So that the result of that study is one or more well-defined combinatorial problems.

    That is to say that understanding one or more combinatorial problems precedes proposing a solution.

    You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study

    Or view the article (assuming you have access):

    Parallel solution of the subset-sum problem: an empirical study

    Abstract (of the article):

    The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 1010. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 1011 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 1012 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.

    Retrofitting Programming Languages for a Parallel World

    Thursday, March 1st, 2012

    Retrofitting Programming Languages for a Parallel World by James Reinders.

    From the post:

    The most widely used computer programming languages today were not designed as parallel programming languages. But retrofitting existing programming languages for parallel programming is underway. We can compare and contrast retrofits by looking at four key features, five key qualities, and the various implementation approaches.

    In this article, I focus on the features and qualities, leaving the furious debates over best approaches (language vs. library vs. directives, and abstract and portable vs. low-level with lots of controls) for another day.

    Four key features:

    • Memory model
    • Synchronization
    • Tasks, not threads
    • Data, parallel support

    Five qualities to desire:

    • Composability
    • Sequential reasoning
    • Communication minimization
    • Performance portability
    • Safety

    Parallel processing as the default isn’t that far in the future.

    Do you see any of these issues as not being relevant for the processing of topic maps?

    And unlike programming languages, topic maps by definition can operate in semantically heterogeneous environments.

    How’s that for icing on the cake of parallel processing?

    The time to address the issues of parallel processing of topic maps is now.

    Suggestions?

    Java, Python, Ruby, Linux, Windows, are all doomed

    Friday, February 3rd, 2012

    Java, Python, Ruby, Linux, Windows, are all doomed by Russell Winder.

    From the description:

    The Multicore Revolution gathers pace. Moore’s Law remains in force — chips are getting more and more transistors on a quarterly basis. Intel are now out and about touting the “many core chip”. The 80-core chip continues its role as research tool. The 48-core chip is now actively driving production engineering. Heterogeneity not homogeneity is the new “in” architecture.

    Where Intel research, AMD and others cannot be far behind.

    The virtual machine based architectures of the 1990s, Python, Ruby and Java, currently cannot cope with the new hardware architectures. Indeed Linux and Windows cannot cope with the new hardware architectures either. So either we will have lots of hardware which the software cannot cope with, or . . . . . . well you’ll just have to come to the session.

    The slides are very hard to see so grab a copy at: http://www.russel.org.uk/Presentations/accu_london_2010-11-18.pdf

    From the description: Heterogeneity not homogeneity is the new “in” architecture.

    Is greater heterogeneity in programming languages coming?

    Parallel approaches in next-generation sequencing analysis pipelines

    Tuesday, November 1st, 2011

    Parallel approaches in next-generation sequencing analysis pipelines

    From the post:

    My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.

    Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single Hi-Seq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.

    The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.

    The message passing based parallelism sounds a lot like Storm doesn’t it? Will message passing be what frees us from the constraints of architecture? Wondering what sort of performance “hit” we will take when not working really close to the metal? But, then the “metal” may become the basis for such message passing systems. Not quite yet but perhaps not so far away either.

    Deterministic Parallel Programming in Haskell

    Friday, October 7th, 2011

    Deterministic Parallel Programming in Haskell by Andres Löaut;h (pdf)

    The bundle for the tutorial with slides, source code and exercises: http://www.well-typed.com/Hal6/Hal6.zip.

    Very interesting slide deck on parallel programming in Haskell but you will also learn about parallelism in general.

    Whether needed or not, you have to admit it is a current marketing phrase.

    Parallel Haskell Tutorial: The Par Monad [Topic Map Profiling?]

    Monday, October 3rd, 2011

    Parallel Haskell Tutorial: The Par Monad

    Parallel programming will become largely transparent at some point but not today. 😉

    Walk through parallel processing of Sudoku and k-means, as well as measuring performance and debugging. Code is available.

    I think the debugging aspects of this tutorial stand out the most for me. Understanding a performance issue as opposed to throwing resources at it seems like the better approach to me.

    I know that a lot of time has been spent by the vendors of topic maps software profiling their code, but I wonder if anyone has profiled a topic map?

    That is we make choices in terms of topic map construction, some of which may result in more or less processing demands, to reach the same ends.

    As topic maps grow in size, the “how” a topic map is written may be as important as the “why” certain subjects were represented and merged.

    Have you profiled the construction of your topic maps? Comments appreciated.

    Python for brain mining:…

    Saturday, July 16th, 2011

    Python for brain mining: (neuro)science with state of the art machine learning and data visualization by Gaël Varoquaux.

    Brief slide deck on three tools:

    Mayavi: For 3-D visualizations.

    scikit-learn, which we reported on at: scikits.learn machine learning in Python.

    Joblib: running Python function as pipeline jobs.

    All three look useful, although I suspect Joblib may be the one of more immediate interest.

    Depends on your interests. Comments?

    Parasail

    Tuesday, May 10th, 2011

    Parasail (Parallel, Specification and Implementation Language) is a programming language with inherent parallelism.

    More correctly it represents the development of a programming language with inherent parallelism.

    Extending parallel processing to be a more general solution is all the rage and it is hard to choose winners and losers.

    It isn’t too early to start thinking about parallel processing of semantics.

    Some resources on Parasail to get you started in that direction:

    Designing ParaSail, a new programming language: A blog on the development of Parasail.

    Map/Reduce in ParaSail; Parameterized operations: A Map/Reduce post that may peak your interesting in Parasail.

    ParaSail Programming Language: Google Discussion Group on Parasail.

    An Introduction to ParaSail: Parallel Specification and Implementation Language

    Parasail Reference Manual — First Complete Draft

    Parallel prototyping leads to better design results, more divergence, and increased self-efficacy

    Wednesday, January 5th, 2011

    Parallel prototyping leads to better design results, more divergence, and increased self-efficacy Authors: Steven P. Dow, Alana Glassco, Jonathan Kass, Melissa Schwarz, Daniel L. Schwartz, and Scott R. Klemmer

    Abstract:

    Iteration can help people improve ideas. It can also give rise to fixation, continuously refining one option without considering others. Does creating and receiving feedback on multiple prototypes in parallel, as opposed to serially, affect learning, self-efficacy, and design exploration? An experiment manipulated whether independent novice designers created graphic Web advertisements in parallel or in series. Serial participants received descriptive critique directly after each prototype. Parallel participants created multiple prototypes before receiving feedback. As measured by click-through data and expert ratings, ads created in the Parallel condition significantly outperformed those from the Serial condition. Moreover, independent raters found Parallel prototypes to be more diverse. Parallel participants also reported a larger increase in task-specific self-confidence. This article outlines a theoretical foundation for why parallel prototyping produces better design results and discusses the implications for design education.

    Interesting that I should stumble on this article after posting about parallel processing.

    A useful heuristic for the development of subject identifications in an organization.

    Is Parallel Programming Hard, And, If So, What Can You Do About It?

    Wednesday, January 5th, 2011

    Is Parallel Programming Hard, And, If So, What Can You Do About It? Editor Paul E. McKenney

    Kirk Lowery forwarded this link to my attention.

    Just skimming the first couple of chapters, I have to say it has some of the most amusing graphics I have seen in any CS book.

    Performance, productivity and generality are all goals of parallel programming, as cited by this book.

    I have to wonder though if subject recognition tasks, analogous to computer vision, that are inherently parallel.

    Doing them in parallel does not make them easier but not doing them in parallel certainly makes them harder.

    For example, consider the last time you failed to recognize someone who wasn’t in the location or context where you normally see them.

    Do you recognize the context in addition or in parallel to your recognition of the person’s face?

    Questions:

    1. What benefits/drawbacks do you see in parallel processing of TMDM instances? (3-5 pages, citations)
    2. How would you design subject identifications for processing in a parallel environment? (3-5 pages, citations)
    3. How would you evaluate the need for parallel processing of subject identifications? (3-5 pages, citations)