Archive for the ‘GPU’ Category

Meet Fenton (my data crunching machine)

Saturday, February 25th, 2017

Meet Fenton (my data crunching machine) by Alex Staravoitau.

From the post:

As you might be aware, I have been experimenting with AWS as a remote GPU-enabled machine for a while, configuring Jupyter Notebook to use it as a backend. It seemed to work fine, although costs did build over time, and I had to always keep in mind to shut it off, alongside with a couple of other limitations. Long story short, around 3 months ago I decided to build my own machine learning rig.

My idea in a nutshell was to build a machine that would only act as a server, being accessible from anywhere to me, always ready to unleash its computational powers on whichever task I’d be working on. Although this setup did take some time to assess, assemble and configure, it has been working flawlessly ever since, and I am very happy with it.

This is the most crucial part. After serious consideration and leveraging the budget I decided to invest into EVGA GeForce GTX 1080 8GB card backed by Nvidia GTX 1080 GPU. It is really snappy (and expensive), and in this particular case it only takes 15 minutes to run — 3 times faster than a g2.2xlarge AWS machine! If you still feel hesitant, think of it this way: the faster your model runs, the more experiments you can carry out over the same period of time.
… (emphasis in original)

Total for this GPU rig? £1562.26

You now know the fate of your next big advance. 😉

If you are interested in comparing the performance of a Beowulf cluster, see: A Homemade Beowulf Cluster: Part 1, Hardware Assembly and A Homemade Beowulf Cluster: Part 2, Machine Configuration.

Either way, you are going to have enough processing power that your skill and not hardware limits are going to be the limiting factor.

GPU + Russian Algorithm Bests Supercomputer

Thursday, June 30th, 2016

No need for supercomputers

From the post:

Senior researchers Vladimir Pomerantcev and Olga Rubtsova, working under the guidance of Professor Vladimir Kukulin (SINP MSU), were able to use on an ordinary desktop PC with GPU to solve complicated integral equations of quantum mechanics — previously solved only with the powerful, expensive supercomputers. According to Vladimir Kukulin, the personal computer does the job much faster: in 15 minutes it is doing the work requiring normally 2-3 days of the supercomputer time.

The main problem in solving the scattering equations of multiple quantum particles was the calculation of the integral kernel — a huge two-dimensional table, consisting of tens or hundreds of thousands of rows and columns, with each element of such a huge matrix being the result of extremely complex calculations. But this table appeared to look like a monitor screen with tens of billions of pixels, and with a good GPU it was quite possible to calculate all of these. Using the software developed in Nvidia and having written their own programs, the researchers split their calculations on the many thousands of streams and were able to solve the problem brilliantly.

“We reached the speed we couldn’t even dream of,” Vladimir Kukulin said. “The program computes 260 million of complex double integrals on a desktop computer within three seconds only. No comparison with supercomputers! My colleague from the University of Bochum in Germany (recently deceased, mournfully), whose lab did the same, carried out the calculations by one of the largest supercomputers in Germany with the famous blue gene architecture that is actually very expensive. And what his group is seeking for two or three days, we do in 15 minutes without spending a dime.”

The most amazing thing is that the desired quality of graphics processors and a huge amount of software to them exist for ten years already, but no one used them for such calculations, preferring supercomputers. Anyway, our physicists surprised their Western counterparts pretty much.

One of the principal beneficiaries of the US restricting the export of the latest generation of computer technology to the former USSR, was of course Russia.

Deprived of the latest hardware, Russian mathematicians and computer scientists were forced to be more efficient with equipment that was one or two generations off the latest mark for computing.

Parity between the USSR and the USA in nuclear weapons is testimony to their success and the failure of US export restriction policies.

For the technical details: V.N. Pomerantsev, V.I. Kukulin, O.A. Rubtsova, S.K. Sakhiev. Fast GPU-based calculations in few-body quantum scattering. Computer Physics Communications, 2016; 204: 121 DOI: 10.1016/j.cpc.2016.03.018.

Will a GPU help you startle your colleagues in the near future?

New Nvidia Resources – Data Science Bowl [Topology and Aligning Heart Images?]

Thursday, January 28th, 2016

New Resources Available to Help Participants by Pauline Essalou.

From the post:

Hungry for more help? NVIDIA can feed your passion and fuel your progress.

The free course includes lecture recordings and hands-on exercises. You’ll learn how to design, train, and integrate neural network-powered artificial intelligence into your applications using widely-used open source frameworks and NVIDIA software.

Visit NVIDIA at:

For access to the hands-on labs for free, you’ll need to register, using the promo code KAGGLE, at:

With weeks to go until the March 7 stage one deadline and stage two data release deadline, there’s still plenty of time for participants to take advantage of these tools and continue to submit solutions. Visit the Data Science Bowl Resources page for a complete listing of free resources.

If you aren’t already competing, the challenge in brief:

Declining cardiac function is a key indicator of heart disease. Doctors determine cardiac function by measuring end-systolic and end-diastolic volumes (i.e., the size of one chamber of the heart at the beginning and middle of each heartbeat), which are then used to derive the ejection fraction (EF). EF is the percentage of blood ejected from the left ventricle with each heartbeat. Both the volumes and the ejection fraction are predictive of heart disease. While a number of technologies can measure volumes or EF, Magnetic Resonance Imaging (MRI) is considered the gold standard test to accurately assess the heart’s squeezing ability.

The challenge with using MRI to measure cardiac volumes and derive ejection fraction, however, is that the process is manual and slow. A skilled cardiologist must analyze MRI scans to determine EF. The process can take up to 20 minutes to complete—time the cardiologist could be spending with his or her patients. Making this measurement process more efficient will enhance doctors’ ability to diagnose heart conditions early, and carries broad implications for advancing the science of heart disease treatment.

The 2015 Data Science Bowl challenges you to create an algorithm to automatically measure end-systolic and end-diastolic volumes in cardiac MRIs. You will examine MRI images from more than 1,000 patients. This data set was compiled by the National Institutes of Health and Children’s National Medical Center and is an order of magnitude larger than any cardiac MRI data set released previously. With it comes the opportunity for the data science community to take action to transform how we diagnose heart disease.

This is not an easy task, but together we can push the limits of what’s possible. We can give people the opportunity to spend more time with the ones they love, for longer than ever before. (From:

Unlike the servant with the one talent, Nvidia isn’t burying its talent under a basket. It is spreading access to its information as far as possible, in contrast to editorial writers at the New England Journal of Medicine.

Care to guess who is going to have the greater impact on cardiology and medicine?

I forgot to mention that Nietzsche described the editorial page writers of the New England Journal of Medicine quite well when he said, “…they tell the proper time and make a modest noise when doing so….” (Of Scholars).

I first saw this in a tweet by Kirk D. Borne.

PS: Kirk pointed to Image Preprocessing: The Challenges and Approach by Peter VanMaasdam today.

Are you surprised that the data is dirty? 😉

I’m not a professional mathematicians but what if you created a common topology for hearts and then treated the different measurements for each one as dimensions?

I say that having recently read: Quantum algorithms for topological and geometric analysis of data by Seth Lloyd, Silvano Garnerone & Paolo Zanardi. Nature Communications 7, Article number: 10138 doi:10.1038/ncomms10138, Published 25 January 2016.

Whether you have a quantum computer or not, given the small size of the heart data set, some of those methods might be applicable.

Unless my memory fails me, the entire GPU Gems series in online at Nvidia and has several chapters on topological methods.

Good luck!

Better Beer Through GPUs:…

Friday, September 4th, 2015

Better Beer Through GPUs: How GPUs and Deep Learning Help Brewers Improve Their Suds By Brian Caulfield.

From the post:

Jason Cohen isnʼt the first man to look for the solution to his problems at the bottom of a beer glass. But the 24-year-old entrepreneur might be the first to have found it.

Cohenʼs tale would make a great episode of HBOʼs “Silicon Valley” if only his epiphany had taken place in sun-dappled Palo Alto, Calif., rather than blustery State College, Pa. That Cohen has involved GPUs in this sudsy story should surprise no one.

This is the tale of a man who didnʼt master marketing to sell his product — quality control software for beer makers. He had to master it to make his product. The answer, of course, turned out to be free beer. And thatʼs put Cohen right in the middle of the fizzy business of craft brewing, a business that moves so fast heʼs enlisted GPUs to help his software keep up.

If you like micro-brew beer, suggest they contact Jason Cohen at Analytical Flavor Systems.

They will have a more consistent brew and you will too. A win-win situation.

I first saw this in a tweet by Lars Marius Garshol. (of course)

How IP Stifles Innovation

Monday, August 31st, 2015

What Will Become of the World’s First Open Source GPU? by Nicole Hemsoth.

From the post:

Open source hardware and microprocessor projects are certainly nothing new, and while there has been great momentum on the CPU front, there have not been efforts to release an open source GPU into the wild. However, at the Hot Chips conference this week, a team of researchers revealed their plans for MIAOW, a unique take on open source hardware that leverages a subset of AMD’s Southern Islands ISA that is used for AMD’s own GPU and can run OpenCL codes at what appears to be an impressive performance point that is comparable to existing single-precision GPU results.

As an open source project, it is reasonable to think that once it is further refined, some clever startup might decide to take the chip into full production. However, as one might imagine, there are likely going to be some serious IP infringement issues to address. Since the entire scope of the project is based on a pared-down variant of the AMD ISA for its own GPUs, the team will either need to work within AMD’s confines to continue pushing such a project or the effort, no matter how well proven it is in FPGA prototyping or actual silicon, could be a series of lawsuits waiting to happen.

Dr. Karu Sankaralingam, who led the team’s effort at the University of Wisconsin, where the project is based, says that building an open source or any other hardware project is bound to incur legal wrangling, in part because the IP almost has to be reused in one form or another. Generally, he says that for open source hardware projects like this one, the best defense is to use anything existing as a base but focus innovation on building on top of that. He says that to date, AMD has not been involved in the project beyond a few individuals offering some insight on various architectural elements. In other words, if the team is able to roll this beyond research and into any kind of volume, AMD will likely have words.

See Nicole’s post for technical details.

Sad that the future of such a project must depend on the largess of an IP holder.

If you were a venture capitalist, would you invest in an IP minefield waiting to explode? Likely not.

Patents on chips should require a production version of the chip. And then a patent only for three (3) years. If you haven’t captured the market in three (3) years, you are doing something wrong or there not much IP in your patent.

IP protection should apply to new ideas and inventions, not traps laid for the unwary by the non-productive.

Network motif discovery: A GPU approach [subgraph isomorphism and topics]

Sunday, August 30th, 2015

Network motif discovery: A GPU approach by Wenqing Lin ; Xiaokui Xiao ; Xing Xie ; Xiao-Li Li.


The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

For those of you who need to dodge the paywall: Network motif discovery: A GPU approach

From the introduction (topic map comments interspersed):

Given a graph G, a network motif in G is a subgraph g of G, such that g appears much more frequently in G than in random graphs whose degree distributions are similar to that of G [1]. The identification of network motifs finds important applications in numerous domains. For example, network motifs are used (i) in system biology to predict protein interactions in biological networks and discover functional sub-units [2], (ii) in electronic engineering to understand the characteristics of circuits [3], and (iii) in brain science to study the functionalities of brain networks [4].

Unlike a topic map considered as graph G, the network motif problem is to discover subgraphs of G. In a topic map, the subgraph constituting a topic and its defined isomophism with other topics, is declarable.

…Roughly speaking, all existing techniques adopt a common two-phase framework as follows:

Subgraph Enumeration: Given a graph G and a parameter k, enumerate the subgraphs g of G with k vertices each;

Frequency Estimation:

Rather than enumerating subgraph g of G (a topic map), we need only collect those subgraphs for the isomorphism test.

…To compute the frequency of g in a random graph G, however, we need to derive the number of subgraphs of G that are isomorphic to g – this requires a large number of subgraph isomorphism tests [14], which are known to be computationally expensive.

Footnote 14 is a reference to: Practical graph isomorphism, II by Brendan D. McKay, Adolfo Piperno.

See also: nauty and Traces by Brendan McKay and Adolfo Piperno.

nauty and Traces are programs for computing automorphism groups of graphs and digraphs [*]. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.

This is where topic map subgraph g isomorphism issue intersects with the more general subgraph isomorphism case.

Any topic can have properties (nee internal occurrences) in addition to those thought to make it “isomorphic” to another topic. Or play roles in associations that are not played by other topics representing the same subject.

Motivated by the deficiency of existing work, we present an in-depth study on efficient solutions for network motif discovery. Instead of focusing on the efficiency of individual subgraph isomorphism tests, we propose to utilize Graphics Processing Units (GPUs) to parallelize a large number of isomorphism tests, in order to reduce the computation time of the frequency estimation phase. This idea is intuitive, and yet, it presents a research challenge since there is no existing algorithm for testing subgraph isomorphisms on GPUs. Furthermore, as shown in Section III, existing CPU-based algorithms for subgraph isomorphism tests cannot be translated into efficient solutions on GPUs, as the characteristics of GPUs make them inherently unsuitable for several key procedures used in CPU-based algorithms.

The punch line is that the authors present a solution that:

…is up to two orders of magnitude faster than the best CPU-based approach, and is around 20 times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used.

Since topic isomophism is defined as opposed to discovered, this looks like a fruitful avenue to explore for topic map engine performance.

GPU Linux Rootkit/Keylogger

Saturday, May 9th, 2015

New GPU-based Linux Rootkit and Keylogger with Excellent Stealth and Computing Power by Swati Khandelwal.

From the post:

The world of hacking has become more organized and reliable over recent years and so the techniques of hackers.

Nowadays, attackers use highly sophisticated tactics and often go to extraordinary lengths in order to mount an attack.

And there is something new to the list:

A team of developers has created not one, but two pieces of malware that run on an infected computer’s graphics processor unit (GPU) instead of its central processor unit (CPU), in order to enhance their stealthiness and computational efficiency.

The two pieces of malware:

The source code of both the Jellyfish Rootkit and the Demon keylogger, which are described as proof-of-concepts malware, have been published on Github.

Until now, security researchers have discovered nasty malware running on the CPU and exploiting the GPU capabilities in an attempt to mine cryptocurrencies such as Bitcoins.

However, these two malware could operate without exploiting or modifying the processes in the operating system kernel, and this is why they do not trigger any suspicion that a system is infected and remain hidden.

As Swati says, proof-of-concept, but the distance between proof-of-concept and in the wild isn’t predictable.

After an overview of the rootkit and keylogger, Swati asks:

However, if exploited in future, What could be the area of attack vectors? Hit the comments below.

No comments as of 12:28 EST on Saturday, 9 May 2015.

One resource that may spur comments on your part:

Vulnerability analysis of GPU computing by Michael James Patterson.

Comments for Swati?

DIGITS: Deep Learning GPU Training System

Friday, March 20th, 2015

DIGITS: Deep Learning GPU Training System by Allison Gray.

From the post:

The hottest area in machine learning today is Deep Learning, which uses Deep Neural Networks (DNNs) to teach computers to detect recognizable concepts in data. Researchers and industry practitioners are using DNNs in image and video classification, computer vision, speech recognition, natural language processing, and audio recognition, among other applications.

The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training these large, complex DNNs, reducing training time from months to only a few days. The major deep learning software frameworks have incorporated GPU acceleration, including Caffe, Torch7, Theano, and CUDA-Convnet2. Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, last year NVIDIA introduced cuDNN, a library of primitives for deep neural networks.

Today at the GPU Technology Conference, NVIDIA CEO and co-founder Jen-Hsun Huang introduced DIGITS, the first interactive Deep Learning GPU Training System. DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization. DIGITS is open-source software, available on GitHub, so developers can extend or customize it or contribute to the project.

Apologies for the delay in seeing Allison’s post but at least I saw it before the weekend!

In addition to a great write-up, Allison walks through how she has used DIGITS. In terms of “onboarding” to software, it doesn’t get any better than this.

What are you going to apply DIGITS to?

I first saw this in a tweet by Christian Rosnes.

GPU-Accelerated Graph Analytics in Python with Numba

Thursday, March 19th, 2015

GPU-Accelerated Graph Analytics in Python with Numba Siu Kwan Lam.


Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for X86 CPU and CUDA GPU from annotated Python Code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

Python enthusiasts, I would not take the “…from a glue language to a complete programming environment…” comment to heart.

The author also says:

Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.

and then summarizes the results of code in the post:

Our GPU PageRank implementation completed in just 163 seconds on the full graph of 623 million edges and 43 million nodes using a single NVIDIA Tesla K20 GPU accelerator. Our equivalent Numba CPU-JIT version took at least 5 times longer on a smaller graph.

plus points out techniques for optimizing the code.

I’d say no hard feelings. Yes? 😉

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

Tuesday, March 10th, 2015

MapGraph [Graphs, GPUs, 30 GTEPS (30 billion traversed edges per second)]

From the post:

MapGraph is Massively Parallel Graph processing on GPUs. (Previously known as “MPGraph”).

  • The MapGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MapGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.
  • New algorithms can be implemented in a few hours that fully exploit the data-level parallelism of the GPU and offer throughput of up to 3 billion traversed edges per second on a single GPU.
  • Preliminary results for the multi-GPU version of MapGraph have traversal rates of nearly 30 GTEPS (30 billion traversed edges per second) on a scale-free random graph with 4.3 billion directed edges using a 64 GPU cluster. See the multi-GPU paper referenced below for details.
  • The MapGraph API also comes in a CPU-only version that is currently packaged and distributed with the bigdata open-source graph database. GAS programs operate over the graph data loaded into the database and are accessed via either a Java API or a SPARQL 1.1 Service Call . Packaging the GPU version inside bigdata will be in a future release.

MapGraph is under the Apache 2 license. You can download MapGraph from . For the lastest version of this documentation, see You can subscribe to receive notice for future updates on the project home page. For open source support, please ask a question on the MapGraph mailing lists or file a ticket. To inquire about commercial support, please email us at You can follow MapGraph and the bigdata graph database platform at

This work was (partially) funded by the DARPA XDATA program under AFRL Contract #FA8750-13-C-0002.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D14PC00029.

MapGraph Publications

You do have to wonder when the folks at Systap sleep. 😉 This is the same group that produced BlazeGraph, recently adopted by WikiData. Granting WikiData only has 13.6 million data items as of today but it isn’t “small” data.

The rest of the page has additional pointers and explanations for MapGraph.


Understanding Natural Language with Deep Neural Networks Using Torch

Tuesday, March 3rd, 2015

Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.

This is a deeply impressive article and a good introduction to Torch (scientific computing package with neural network, optimization, etc.)

In the preliminary materials, the authors illustrate one of the difficulties of natural language processing by machine:

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

You can already see the problem but just to call it out, the language usage in Wikipedia, for example, may or may not match the domain of interest. You could certainly use it as a general case but it will produce very odd results when the text to be “understood” in a regional version of a language where common words have meanings other than you will find in Wikipedia.

Slang is a good example. In the 17th century for example, “cab” was a term used for a brothel. To take a “hit” has a different meaning than being struck by a boxer, would be a more recent example.

“Understanding” natural language with machines is a great leap forward but one should never leap without looking.

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Wednesday, February 18th, 2015

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars by Hua He, Jimmy Lin, Adam Lopez. (Transactions of the Association for Computational Linguistics, vol. 3, pp. 87–100, 2015.)


Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

If you are interested in cross-language search, DNA sequence alignment or other pattern matching problems, you need to watch the progress of this work.

This article and other important research is freely accessible at: Transactions of the Association for Computational Linguistics

Parallel Programming with GPUs and R

Sunday, February 1st, 2015

Parallel Programming with GPUs and R by Norman Matloff.

From the post:

You’ve heard that graphics processing units — GPUs — can bring big increases in computational speed. While GPUs cannot speed up work in every application, the fact is that in many cases it can indeed provide very rapid computation. In this tutorial, we’ll see how this is done, both in passive ways (you write only R), and in more direct ways, where you write C/C++ code and interface it to R.

Norman provides as readable an introduction to GPUs as I have seen in a while, a quick overview of R packages for accessing GPUs and then a quick look at writing CUDA code and problems you may have compiling it.

Of particular note is Norman’s reference to a draft of his new book, Parallel Computation for Data Science, which introduces parallel computing with R.

It won’t be long before parallel or not processing will be a detail hidden from the average programmer. Until then, see this post and Norman’s book for parallel processing and R.

Minerva: a fast and flexible system for deep learning

Saturday, January 3rd, 2015

Minerva: a fast and flexible system for deep learning

From the post:

Minerva is a fast and flexible tool for deep learning. It provides NDarray programming interface, just like Numpy. Python bindings and C++ bindings are both available. The resulting code can be run on CPU or GPU. Multi-GPU support is very easy. Please refer to the examples to see how multi-GPU setting is used.Minerva is a fast and flexible tool for deep learning. It provides NDarray programming interface, just like Numpy. Python bindings and C++ bindings are both available. The resulting code can be run on CPU or GPU. Multi-GPU support is very easy. Please refer to the examples to see how multi-GPU setting is used.


  • Matrix programming interface
  • Easy interaction with NumPy
  • Multi-GPU, multi-CPU support
  • Good performance: ImageNet AlexNet training achieves 213 and 403 images/s with one and two Titan GPU, respectivly. Four GPU cards number will be coming soon.

I first saw this in a blog post by Danny Bickson, Minerva: open source deep learning on GPU software from MS.

Deep learning is gaining traction fast. Fast enough that when government contractors convince the FBI wire tapping is no long a matter of plugging into the local junction box, they may start working on deep learning.

Before deep learning gets to that point, defensive measures against deep learning need to be developed. Given the variety of deep learning approaches and algorithms that is going to be a real challenge.

Perhaps immutable data structures where copying enables real time performance in presenting results that are expected? While maintaining a copy of the unexpected results?

I think there is a presumption that on querying, information systems repeat the information they have stored. That’s a fairly naive view of data storage. We know it is a matter of permissions to “see” data. Why shouldn’t the answer you see also depend upon permissions?

Defenses against deep learning and reactive data storage may become very relevant in the not too distant future. Give it some thought.

Deep learning for… chess

Monday, December 15th, 2014

Deep learning for… chess by Erik Bernhardsson.

From the post:

I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

Chess sets are a common holiday gift so why not do something different this year?

Pretty print a copy of this post and include a gift certificate from AWS for a GPU instance for say a week to ten days.

I don’t think AWS sells gift certificates, but they certainly should. Great stocking stuffer, anniversary/birthday/graduation present, etc. Not so great for Valentines Day.

If you ask AWS for a gift certificate, mention my name. They don’t know who I am so I could use the publicity. 😉

I first saw this in a tweet by Onepaperperday.

ArrayFire: A Portable Open-Source Accelerated Computing Library

Wednesday, December 10th, 2014

ArrayFire: A Portable Open-Source Accelerated Computing Library by Pavan Yalamanchilli.

From the post:

The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Capabilities

ArrayFire is an Fortran. ArrayFire has a range of functionality, including

ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.

In a past post about ArrayFire we demonstrated the ArrayFire capabilities and how you can increase your productivity by using ArrayFire. In this post I will tell you how you can use ArrayFire to exploit various kind of parallelism on NVIDIA GPUs.

Just in case you get a box full of GPUs during the holidays and/or need better performance from ones you already have!


How to run the Caffe deep learning vision library…

Wednesday, October 29th, 2014

How to run the Caffe deep learning vision library on Nvidia’s Jetson mobile GPU board by Pete Warden.

From the post:

Jetson boardPhoto by Gareth Halfacree

My colleague Yangqing Jia, creator of Caffe, recently spent some free time getting the framework running on Nvidia’s Jetson board. If you haven’t heard of the Jetson, it’s a small development board that includes Nvidia’s TK1 mobile GPU chip. The TK1 is starting to appear in high-end tablets, and has 192 cores so it’s great for running computational tasks like deep learning. The Jetson’s a great way to get a taste of what we’ll be able to do on mobile devices in the future, and it runs Ubuntu so it’s also an easy environment to develop for.

Caffe comes with a pre-built ‘Alexnet’ model, a version of the Imagenet-winning architecture that recognizes 1,000 different kinds of objects. Using this as a benchmark, the Jetson can analyze an image in just 34ms! Based on this table I’m estimating it’s drawing somewhere around 10 or 11 watts, so it’s power-intensive for a mobile device but not too crazy.

Yangqing passed along his instructions, and I’ve checked them on my own Jetson, so here’s what you need to do to get Caffe up and running.

Hardware fun for the middle of your week!

192 cores for under $200, plus GPU experience.

Accelerate Machine Learning with cuDNN Deep Neural Network Library

Monday, September 8th, 2014

Accelerate Machine Learning with the cuDNN Deep Neural Network Library by Larry Brown.

From the post:

Introducing cuDNN

NVIDIA cuDNN is a GPU-accelerated library of primitives for DNNs. It provides tuned implementations of routines that arise frequently in DNN applications, such as:

  • convolution
  • pooling
  • softmax
  • neuron activations, including:
    • Sigmoid
    • Rectified linear (ReLU)
    • Hyperbolic tangent (TANH)

Of course these functions all support the usual forward and backward passes. cuDNN’s convolution routines aim for performance competitive with the fastest GEMM-based (matrix multiply) implementations of such routines while using significantly less memory.

cuDNN features customizable data layouts, supporting flexible dimension ordering, striding and subregions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility allows easy integration into any neural net implementation and avoids the input/output transposition steps sometimes necessary with GEMM-based convolutions.

cuDNN is thread safe, and offers a context-based API that allows for easy multithreading and (optional) interoperability with CUDA streams. This allows the developer to explicitly control the library setup when using multiple host threads and multiple GPUs, and ensure that a particular GPU device is always used in a particular host thread (for example).

cuDNN allows DNN developers to easily harness state-of-the-art performance and focus on their application and the machine learning questions, without having to write custom code. cuDNN works on Windows or Linux OSes, and across the full range of NVIDIA GPUs, from low-power embedded GPUs like Tegra K1 to high-end server GPUs like Tesla K40. When a developer leverages cuDNN, they can rest assured of reliable high performance on current and future NVIDIA GPUs, and benefit from new GPU features and capabilities in the future.

I didn’t quote the background and promotional material on machine learning or deep neural networks (DNN’s), assuming that if you are interested at all, you will read the original post to pick up that material. Attention has been paid to making cuDNN “easy” to use. “Easy” is a relative term but I think you will appreciate the effort.

BTW, cuDNN is free for any purpose but does require you to have a registered CUDA developer account. If you are already a registered CUDA developer or after you are, see:

Caffe, a deep learning framework, has support for cuDNN in its current development branch.

I first saw this in a tweet by Mark Harris.

Archive of Formal Proofs

Monday, August 11th, 2014

Archive of Formal Proofs

From the webpage:

The Archive of Formal Proofs is a collection of proof libraries, examples, and larger scientific developments, mechanically checked in the theorem prover Isabelle. It is organized in the way of a scientific journal, is indexed by dblp and has an ISSN: 2150-914x. Submissions are refereed. The preferred citation style is available [here].

It may not be tomorrow but if I don’t capture this site today, I will need to find it in the near future!

Just skimming I did see an entry of interest to the GPU crowd: Syntax and semantics of a GPU kernel programming language by John Wickerson.


This document accompanies the article “The Design and Implementation of a Verification Technique for GPU Kernels” by Adam Betts, Nathan Chong, Alastair F. Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson and John Wickerson. It formalises all of the definitions provided in Sections 3 and 4 of the article.

I first saw this in a tweet by Computer Science.

400 GTEPS on 4096 GPUs

Saturday, August 9th, 2014

Breadth-First Graph Search Uses 2D Domain Decomposition – 400 GTEPS on 4096 GPUs by Rob Farber.

From the post:

Parallel Breadth-First Search is a standard benchmark and the basis of many other graph algorithms. The challenge li[]es in partitioning the graph across multiple nodes in a cluster while avoiding load-imbalance and communications delays. The authors of the paper, “Parallel Breadth First Search on the Kepler Architecture” utilize an interesting 2D decomposition of the graph adjacency matrix. Tests on R-MAT graphs shows large graph performance ranging from 1.1 GTEP on a single K20 to 396 GTEP using 4096 GPUs. The tests also compared performance against the method of Beamer (10 GTEP single SMP device and 240 GTEP on 115k cores).

See Rob’s post for background on the distributed DFS problem and additional references.

Graph processing continues to improve at an impressive rate but I wonder how applicable some techniques are to intersections of graphs?

The optimization of using a bitmap to mark vertices visited (Scalable Graph Exploration on Multicore Processors, Agarwal, et al., 2010), cited by authors of Parallel Distributed Breadth First Search on the Kepler Architecture saying:

Then, to reduce the work, we used an integer map to keep track of visited vertices. Agarwal et al., first introduced this optimization using a bitmap that has been used in almost all subsequent works.

appears to be stumbling block to tracking a vertex that appears in intersecting graphs.

Or would you track visited vertices in each intersecting graph separately? And communicate results from each intersecting graph?

550 talks related to big data [GPU Technology Conference]

Sunday, August 3rd, 2014

550 talks related to big data by Amy.

Amy has forged links to forty-four (44) topic areas at the GPU Technology Conference 2014.

Definitely a post to bookmark!

You may remember that GPUs were what Bryan Thompson and others are using to achieve 3 Billion Traversed Edges Per Second (TEPS) for graph work. Non-kiddie graph work.


MapGraph:… [3 billion Traversed Edges Per Second (TEPS) on a GPU]

Saturday, August 2nd, 2014

MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs by Zhisong Fu, Michael Personick, and Bryan Thompson.


High performance graph analytics are critical for a long list of application domains. In recent years, the rapid advancement of many-core processors, in particular graphical processing units (GPUs), has sparked a broad interest in developing high performance parallel graph programs on these architectures. However, the SIMT architecture used in GPUs places particular constraints on both the design and implementation of the algorithms and data structures, making the development of such programs difficult and time-consuming.

We present MapGraph, a high performance parallel graph programming framework that delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU. MapGraph provides a high-level abstraction that makes it easy to write graph programs and obtain good parallel speedups on GPUs. To deliver high performance, MapGraph dynamically chooses among different scheduling strategies depending on the size of the frontier and the size of the adjacency lists for the vertices in the frontier. In addition, a Structure Of Arrays (SOA) pattern is used to ensure coalesced memory access. Our experiments show that, for many graph analytics algorithms, an implementation, with our abstraction, is up to two orders of magnitude faster than a parallel CPU implementation and is comparable to state-of-the-art, manually optimized GPU implementations. In addition, with our abstraction, new graph analytics can be developed with relatively little effort.

Those of us who remember Bryan Thompson from the early days of topic maps are not surprised to see his name on a paper with phrases like: “…delivers up to 3 billion Traversed Edges Per Second (TEPS) on a GPU,” and “…is up to two orders of magnitude faster than a parallel CPU implementation….”

Heavy sledding but definitely worth the effort.

Oh, btw, did I mention this is an open source project?

I first saw this in MapGraph: speeding up graph processing with GPUs by Danny Bickson.

Random Forests…

Monday, July 7th, 2014

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams by Diego Marron, Albert Bifet, Gianmarco De Francisci Morales.


Random Forests is a classical ensemble method used to improve the performance of single tree classifiers. It is able to obtain superior performance by increasing the diversity of the single classifiers. However, in the more challenging context of evolving data streams, the classifier has also to be adaptive and work under very strict constraints of space and time. Furthermore, the computational load of using a large number of classifiers can make its application extremely expensive. In this work, we present a method for building Random Forests that use Very Fast Decision Trees for data streams on GPUs. We show how this method can benefit from the massive parallel architecture of GPUs, which are becoming an efficient hardware alternative to large clusters of computers. Moreover, our algorithm minimizes the communication between CPU and GPU by building the trees directly inside the GPU. We run an empirical evaluation and compare our method to two well know machine learning frameworks, VFML and MOA. Random Forests on the GPU are at least 300x faster while maintaining a similar accuracy.

The authors should get a special mention for honesty in research publishing. Figure 11 shows their GPU Random Forest algorithm seeming to scale almost constantly. The authors explain:

In this dataset MOA scales linearly while GPU Random Forests seems to scale almost constantly. This is an effect of the scale, as GPU Random Forests runs in milliseconds instead of minutes.

How fast/large are your data streams?

I first saw this in a tweet by Stefano Bertolo.


Thursday, June 19th, 2014

Rth: a Flexible Parallel Computation Package for R by Norm Matloff.

From the post:

The key feature of Rth is in the word flexible in the title of this post, which refers to the fact that Rth can be used on two different kinds of platforms for parallel computation: multicore systems and Graphics Processing Units (GPUs). You all know about the former–it’s hard to buy a PC these days that is not at least dual-core–and many of you know about the latter. If your PC or laptop has a somewhat high-end graphics card, this enables extremely fast computation on certain kinds of problems. So, whether have, say, a quad-core PC or a good NVIDIA graphics card, you can run Rth for fast computation, again for certain types of applications. And both multicore and GPUs are available in the Amazon EC2 cloud service.

Rth Quick Start

Our Rth home page tells you the GitHub site at which you can obtain the package, and how to install it. (We plan to place it on CRAN later on.) Usage is simple, as in this example:

Rth is an example of what I call Pretty Good Parallelism (an allusion to Pretty Good Privacy). For certain applications it can get you good speedup on two different kinds of common platforms (multicore, GPU). Like most parallel computation systems, it works best on very regular, “embarrassingly parallel” problems. For very irregular, complex apps, one may need to resort to very detailed C code to get a good speedup.

Rth has not been tested on Windows so I am sure the authors would appreciate reports on your use of Rth with Windows.

Contributions of new Rth functions are solicited. At least if you don’t mind making parallel processing easier for everyone. 😉

I first saw this in a tweet by Christopher Lalanne.

Puck, a high-speed GPU-powered parser

Monday, June 9th, 2014

Puck, a high-speed GPU-powered parser by David Hall.

From the post:

I’m pleased to announce yet another new parser called Puck. Puck is a lightning-fast parser that uses Nvidia GPUs to do most of its computation. It is capable of parsing over 400 sentences per second, or about half a million words per minute. (Most CPU constituency parsers of its quality are on the order of 10 sentences per second.)

Puck is based on the same grammars used in the Berkeley Parser, and produces nearly identical trees. Puck is only available for English right now.

For more information about Puck, please see the project github page ( , or the accompanying paper (

Because of some its dependencies are not formally released yet (namely the wonderful JavaCL library), I can’t push artifacts to Maven Central. Instead I’ve uploaded a fat assembly jar here: (See the readme on github for how to use it.) It’s better used as a command line tool, anyway.

Even more motivation for learning to use GPUs!

I first saw this in a tweet by Jason Baldridge.

Modern GPU

Friday, June 6th, 2014

Modern GPU by Sean Baxter.

From the webpage:

Modern GPU is code and commentary intended to promote new and productive ways of thinking about GPU computing.

This project is a library, an algorithms book, a tutorial, and a best-practices guide. If you are new to CUDA, start here. If you’re already familiar with CUDA, are ready for a challenge, and want to learn design patterns for parallel programming, enjoy this series. (emphasis in original)

Just in time for the weekend! And you don’t need an iOS device to read the text. 😉

Skimming the FAQ this is serious work but will return serious dividends as well.


Jetson TK1:… [$192.00]

Friday, April 4th, 2014

Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere by Mark Harris.

From the post:

Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.

Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, an NVIDIA 4-plus-1 quad-core ARM Cortex-A15 CPU, integrated video encoding and decoding support, image/signal processing, and many other system-level features. The Kepler GPU in Tegra K1 is built on the same high-performance, energy-efficient Kepler GPU architecture that is found in our high-end GeForce, Quadro, and Tesla GPUs for graphics and computing. That makes it the only mobile processor today that supports CUDA 6 for computing and full desktop OpenGL 4.4 and DirectX 11 for graphics.

Tegra K1 is a parallel processor capable of over 300 GFLOP/s of 32-bit floating point computation. Not only is that a huge achievement in a processor with such a low power footprint (Tegra K1 power consumption is in the range of 5 Watts for real workloads), but K1′s support for CUDA and desktop graphics APIs means that much of your existing compute and graphics software will compile and run largely as-is on this platform.

Are you old enough to remember looking at the mini-computers on the back of most computer zines?

And then sighing at the price tag?

Times have changed!

Order Jetson TK1 Now, just $192

Jetson TK1 is available to pre-order today for $192. In the United States, it is available from the NVIDIA website, as well as and Micro Center. See the Jetson TK1 page for details on international orders.

Some people, General Clapper comes to mind, use supercomputers to mine dots that are already connected together (phone data).

Other people, create algorithms to assist users in connecting dots between diverse and disparate data sources.

You know who my money is riding on.


Apache Tez 0.3 Released!

Monday, March 10th, 2014

Apache Tez 0.3 Released! by Bikas Saha.

From the post:

The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focusing on fundamentals and ironing out several key functions. The major action areas in this release were

  1. Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.
  2. Scalability. We tested the software on large clusters, very large data sets and large applications processing tens of TB each to make sure it scales well with both data-sets and machines.
  3. Fault Tolerance. Apache Tez executes a complex DAG workflow that can be subject to multiple failure conditions in clusters of commodity hardware and is highly resilient to these and other sorts of failures.
  4. Stability. A large number of bug fixes went into this release as early adopters and testers put the software through its paces and reported issues.

To prove the stability and performance of Tez, we executed complex jobs comprised of more than 50 different stages and tens of thousands of tasks on a fairly large cluster (> 300 Nodes, > 30TB data). Tez passed all our tests and we are certain that new adopters can integrate confidently with Tez and enjoy the same benefits as Apache Hive & Apache Pig have already.

I am curious how the Hadoop community is going to top 2013. I suspect Tez is going to be part of that answer!

CUDA 6, Available as Free Download, …

Wednesday, March 5th, 2014

CUDA 6, Available as Free Download, Makes Parallel Programming Easier, Faster by George Millington.

From the post:

We’re always striving to make parallel programming better, faster and easier for developers creating next-gen scientific, engineering, enterprise and other applications.

With the latest release of the CUDA parallel programming model, we’ve made improvements in all these areas.

Available now to all developers on the CUDA website, the CUDA 6 Release Candidate is packed with several new features that are sure to please developers.

A few highlights:

  • Unified Memory – This major new feature lets CUDA applications access CPU and GPU memory without the need to manually copy data from one to the other. This is a major time saver that simplifies the programming process, and makes it easier for programmers to add GPU acceleration in a wider range of applications.
  • Drop-in Libraries – Want to instantly accelerate your application by up to 8X? The new drop-in libraries can automatically accelerate your BLAS and FFTW calculations by simply replacing the existing CPU-only BLAS or FFTW library with the new, GPU-accelerated equivalent.
  • Multi-GPU Scaling – Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node. This provides over nine teraflops of double-precision performance per node, supporting larger workloads than ever before (up to 512GB).

And there’s more.

In addition to the new features, the CUDA 6 platform offers a full suite of programming tools, GPU-accelerated math libraries, documentation and programming guides.

To keep informed about the latest CUDA developments, and to access a range parallel programing tools and resources, we encourage you to sign up for the free CUDA/GPU Computing Registered Developer Program at the NVIDIA Developer Zone website.

The only sad note is that processing power continues to out-distance the ability to document and manipulate the semantics of data.

Not unlike having a car that can cross the North American continent in a hour but not having a map of locations between the coasts.

You arrive quickly, but is it where you wanted to go?

Map-D: A GPU Database…

Thursday, February 6th, 2014

Map-D: A GPU Database for Real-time Big Data Analytics and Interactive Visualization by Todd Mostak (map-D) and Tom Graham (map-D). (MP4)

From the description:

map-D makes big data interactive for anyone! map-D is a super-fast GPU database that allows anyone to interact and visualize streaming big data in real time. Its unique architecture runs 70-1,000x faster than other in-memory databases or big data analytics platforms. To boot, it works with any size or kind of dataset; works with data that is streaming live on to the system; uses cheap, off-the-shelf hardware; is easily is focused on learning from big data. At the moment, the map-D team is working on projects with MIT CSAIL, the Harvard Center for Geographic Analysis and the Harvard-Smithsonian Center for Astrophysics. Join Todd Mostak and Tom Graham, key members of the map-D team, as they demonstrate the speed and agility of map-D and describe the live processing, search and mapping of over 1 billion tweets.

I have been haunting the GTC On-Demand page waiting for this to be posted.

I had to download the MP4. (Approximately 124 MB) Suspect they are creating a lot of traffic at the GTC On-Demand page.

As a bonus, see also:

Map-D: GPU-Powered Databases and Interactive Social Science Research in Real Time by Tom Graham (Map_D) and Todd Mostak (Map_D) (streaming) or PDF.

From the description:

Map-D (Massively Parallel Database) uses multiple NVIDIA GPUs to interactively query and visualize big data in real-time. Map-D is an SQL-enabled column store that generates 70-400X speedups over other in-memory databases. This talk discusses the basic architecture of the system, the advantages and challenges of running queries on the GPU, and the implications of interactive and real-time big data analysis in the social sciences and beyond.

Suggestions of more links/papers on Map-D greatly appreciated!


PS: Just so you aren’t too shocked, the Twitter demo involves scanning a billion row database in 5 mili-seconds.