Archive for the ‘CUDA’ Category


Thursday, October 17th, 2013

cudaMap: a GPU accelerated program for gene expression connectivity mapping by Darragh G McArt, Peter Bankhead, Philip D Dunne, Manuel Salto-Tellez, Peter Hamilton, Shu-Dong Zhang.


BACKGROUND: Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping.

RESULTS: cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance.

CONCLUSION: Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from

Or to put that in lay terms, the goal is to establish the connections between human diseases, genes that underlie them and drugs that treat them.

Going from several days to ten (10) minutes is quite a gain in performance.

This is processing of experimental data but is it a window into techniques for scaling topic maps?

I first saw this in a tweet by Stefano Bertolo.

Third Age of Computing?

Monday, August 26th, 2013

The ‘third era’ of app development will be fast, simple, and compact by Rik Myslewski.

From the post:

The tutorial was conducted by members of the HSA – heterogeneous system architecture – Foundation, a consortium of SoC vendors and IP designers, software companies, academics, and others including such heavyweights as ARM, AMD, and Samsung. The mission of the Foundation, founded last June, is “to make it dramatically easier to program heterogeneous parallel devices.”

As the HSA Foundation explains on its website, “We are looking to bring about applications that blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing of DSP via high bandwidth shared memory access with greater application performance at low power consumption.”

Last Thursday, HSA Foundation president and AMD corporate fellow Phil Rogers provided reporters with a pre-briefing on the Hot Chips tutorial, and said the holy grail of transparent “write once, use everywhere” programming for shared-memory heterogeneous systems appears to be on the horizon.

According to Rogers, heterogeneous computing is nothing less than the third era of computing, the first two being the single-core era and the multi-core era. In each era of computing, he said, the first programming models were hard to use but were able to harness the full performance of the chips.


Exactly how HSA will get there is not yet fully defined, but a number of high-level features are accepted. Unified memory addressing across all processor types, for example, is a key feature of HSA. “It’s fundamental that we can allocate memory on one processor,” Rogers said, “pass a pointer to another processor, and execute on that data – we move the compute rather than the data.”

Rik does a deep dive, with references ranging from the HSA Programmer’s Reference Manual to Project Sumatra, which aims to bring data-parallel algorithms to Java 9 (2015).

The only discordant note is that Nvidia and Intel are both missing from the HSA Foundation. Invited but not present.

Customers of Nvidia and/or Intel (I’m both) should contact Nvidia (Contact us) and Intel (contact us) and urge them to join the HSA Foundation. And pass this request along.

Sharing of memory is one of the advantages of HSA (heterogeneous systems architecture) and it is where the semantics of shared data will come to the fore.

I haven’t read the available HSA documents in detail, but the HSA Programmer’s Reference Manual appears to presume that shared data has only one semantic. (It never says that but that is my current impression.)

We have seen that the semantics of data is not “transparent.” The same demonstration illustrates that data doesn’t always have the same semantic.

Simply because I am pointed to a particular memory location, there is no reason to presume I should approach that data with the same semantics.

For example, what if I have a Social Security Number (SSN)? In processing that number for the Social Security Administration (SSA), it may serve to recall claim history, eligibility, etc. If I am accessing the same data to compare it to SSN records maintained by the Federal Bureau of Investigation (FBI), it may no longer be a unique identifier in the same sense as at the SSA.

Same “data,” but different semantics.

Who you gonna call? Topic Maps!

PS: Perhaps not as part of the running code but to document the semantics you are using to process data. Same data, same memory location, multiple semantics.
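To make the PS concrete, here is a minimal Python sketch of documenting more than one semantic for the same stored value. Everything in it is an assumption made for illustration: the `Semantics` record, the `interpretations` table and the placeholder number are hypothetical, and nothing here comes from the HSA documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Semantics:
    """Hypothetical record of how one consumer interprets a shared value."""
    consumer: str
    meaning: str
    unique_identifier: bool


# One stored value, shared rather than copied (an obviously fake placeholder).
shared_value = "000-00-0000"

# The same value, documented once per consumer.
interpretations = {
    "SSA": Semantics("SSA", "key for claim history and eligibility", True),
    "FBI": Semantics("FBI", "one attribute compared against other records", False),
}


def describe(consumer: str) -> str:
    """Report how a given consumer treats the shared value."""
    s = interpretations[consumer]
    role = "a unique identifier" if s.unique_identifier else "not a unique identifier"
    return f"{consumer} reads {shared_value!r} as {role}: {s.meaning}"


print(describe("SSA"))
print(describe("FBI"))
```

The value never changes and is never duplicated. Only the documented interpretation differs, which is the point of the SSN example.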

GPU Scripting and Code Generation with PyCUDA

Saturday, May 11th, 2013

GPU Scripting and Code Generation with PyCUDA by Andreas Klöckner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.


High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

The authors argue that while measurements of the productivity gains from PyCUDA are missing, the widespread use of PyCUDA is an indication of its usefulness.

Point taken.

More important, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.
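The "code generation" half of the title is easy to illustrate without a GPU. The sketch below only builds CUDA C source text from a Python template. The template, the `generate_kernel` helper and the kernel name are my own assumptions, not code from the chapter. On a machine with PyCUDA and an NVIDIA GPU, a string like this is the kind of thing one would hand to `pycuda.compiler.SourceModule`. That step is left out so the sketch runs anywhere.

```python
from string import Template

# Hypothetical template for an element-wise kernel; the placeholders are
# filled in at run time, so one Python function can emit many variants.
KERNEL_TEMPLATE = Template("""
__global__ void ${name}(${dtype} *dest, const ${dtype} *a, const ${dtype} *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dest[i] = a[i] ${op} b[i];
    }
}
""")


def generate_kernel(name: str, dtype: str, op: str) -> str:
    """Return CUDA C source text for an element-wise binary operation."""
    if op not in {"+", "-", "*", "/"}:
        raise ValueError(f"unsupported operator: {op!r}")
    return KERNEL_TEMPLATE.substitute(name=name, dtype=dtype, op=op)


source = generate_kernel("multiply_them", "float", "*")
print(source)
```

Generating the source at run time is what lets a scripting language specialize a kernel for the data type and operation actually in use.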


Tuesday, November 6th, 2012

Halide – a language for image processing and computational photography

From the website:

Halide is a new programming language designed to make it easier to write high-performance image processing code on modern machines. Its current front end is an embedding in C++. Hardware targets include x86-64/SSE, ARM v7/NEON, and CUDA.

If you need a reason to learn C++, this could be it.

I first pointed to Halide at: Halide Wins Image Processing Gold Medal!.

Writing a modular GPGPU program in Java

Monday, August 6th, 2012

Writing a modular GPGPU program in Java by Masayuki Ioki, Shumpei Hozumi, and Shigeru Chiba.


This paper proposes a Java to CUDA runtime program translator for scientific-computing applications. Traditionally, these applications have been written in Fortran or C without using a rich modularization mechanism. Our translator enables those applications to be written in Java and run on GPGPUs while exploiting a rich modularization mechanism in Java. This translator dynamically generates optimized CUDA code from a Java program given at bytecode level when the program is running. By exploiting dynamic type information given at translation, the translator devirtualizes dynamic method dispatches and flattens objects into simple data representation in CUDA. To do this, a Java program must be written to satisfy certain constraints.

This paper also shows that the performance overheads due to Java and WootinJ are not significantly high.

Just in case you are starting to work on topic map processing routines for GPGPUs.

Something to occupy your time during the “dog days” of August.

Accelerating SQL Database Operations on a GPU with CUDA (merging spreadsheet data?)

Tuesday, January 31st, 2012

Accelerating SQL Database Operations on a GPU with CUDA by Peter Bakkum and Kevin Skadron.


Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.
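For a sense of what "no new language" means in practice, here is an ordinary CPU-side SELECT using Python's standard `sqlite3` module. It is not the paper's GPU implementation, and the `readings` table and its columns are hypothetical. It only shows the kind of filtered SELECT a user would keep writing unchanged if the query engine underneath were accelerated.

```python
import sqlite3

# In-memory database with a hypothetical table of 1000 rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor INTEGER, value REAL)"
)
rows = [(i, i % 4, i * 0.5) for i in range(1, 1001)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# A plain SELECT with a filter: the same check is applied to every row,
# which is the kind of work that maps well onto many parallel threads.
query = (
    "SELECT id, value FROM readings "
    "WHERE sensor = ? AND value > ? ORDER BY id"
)
result = conn.execute(query, (2, 400.0)).fetchall()
print(len(result), result[:3])
```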

Important lessons to be learned from this paper:

  • Don’t invent new languages for the average user to learn.
  • Avoid the need to modify existing programs
  • Write against common software

Remember that 75% of the BI market is still using spreadsheets. For all sorts of data but numeric data in particular.

I don’t have any experience with importing files into Excel but I assume there is a macro language that can be used to create import processes.

Curious if there has been any work on creating import macros for Excel that incorporate merging as part of those imports?

That would:

  • Not be a new language for users to learn.
  • Avoid modification of existing programs (or data)
  • Be written against common software
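As a rough illustration of merging during import, here is a sketch in Python rather than an Excel macro, so it is a stand-in for the idea and not an answer to the question above. The two sheets, their headers and the `KEY_COLUMNS` set are all hypothetical. Rows from different sheets are merged when their key columns, which carry different headers, hold the same value.

```python
import csv
import io

# Two hypothetical exported sheets that name the same key column differently.
SHEET_A = "cust_id,name\n17,Acme\n42,Globex\n"
SHEET_B = "Customer ID,city\n42,Springfield\n99,Shelbyville\n"

# Assumption: these headers are known to identify the same subject.
KEY_COLUMNS = {"cust_id", "Customer ID"}


def import_and_merge(*sheets: str) -> dict:
    """Read each sheet and merge rows that share the same key value."""
    merged = {}
    for text in sheets:
        for row in csv.DictReader(io.StringIO(text)):
            key_col = next(col for col in row if col in KEY_COLUMNS)
            key = row.pop(key_col)
            merged.setdefault(key, {}).update(row)
    return merged


merged = import_and_merge(SHEET_A, SHEET_B)
for key, fields in sorted(merged.items()):
    print(key, fields)
```

Customer 42 appears in both sheets and ends up as a single record with both a name and a city. The other two rows pass through untouched.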

I am not sure about the requirements for merging numeric data but that should make the exploration process all the more enjoyable.

PGStrom (PostgreSQL + GPU)

Tuesday, January 31st, 2012


From the webpage:

PG-Strom is a module of FDW (foreign data wrapper) of PostgreSQL database. It was designed to utilize GPU devices to accelerate sequential scan on massive amount of records with complex qualifiers. Its basic concept is CPU and GPU should focus on the workload with their advantage, and perform concurrently. CPU has much more flexibility, thus, it has advantage on complex stuff such as Disk-I/O, on the other hand, GPU has much more parallelism of numerical calculation, thus, it has advantage on massive but simple stuff such as check of qualifiers for each rows.

The below figure is a basic concept of PG-Strom. Now, on sequential scan workload, vanilla PostgreSQL does iteration of fetch a tuple and checks of qualifiers for each tuples. If we could consign GPU the workload of green portion, it enables to reduce workloads of CPU, thus, it shall be able to load more tuples in advance. Eventually, it should allow to provide shorter response-time on complex queries towards large amount of data.

Requires setting up the table for the GPU ahead of time but performance increase is reported to be 10x – 20x.
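The division of labor described on the webpage can be sketched in plain Python. This is a CPU-only toy under my own assumptions, not PG-Strom code: `fetch_chunks` stands in for the I/O side that stays on the CPU, and `check_qualifiers` stands in for the bulk qualifier check that would be handed to the GPU. All names and the sample table are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[int, float, float]


def fetch_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """CPU side: pull tuples from storage and group them into chunks."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def check_qualifiers(chunk: List[Row], qualifier: Callable[[Row], bool]) -> List[bool]:
    """Stand-in for the offloaded step: one independent check per row."""
    return [qualifier(row) for row in chunk]


def scan(rows: Iterable[Row], qualifier: Callable[[Row], bool], chunk_size: int = 4) -> List[Row]:
    """Sequential scan that evaluates qualifiers a whole chunk at a time."""
    matches: List[Row] = []
    for chunk in fetch_chunks(rows, chunk_size):
        mask = check_qualifiers(chunk, qualifier)
        matches.extend(row for row, keep in zip(chunk, mask) if keep)
    return matches


table = [(i, float(i), float(i * i)) for i in range(10)]
hits = scan(table, lambda r: r[1] + r[2] > 50.0)
print([r[0] for r in hits])
```

Because each row's check is independent of every other row's, the chunk is the natural unit to ship to a device while the CPU fetches the next one.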

It occurs to me that GPUs should be well suited for graph processing. Yes? Will have to look into that and report back.


Friday, December 2nd, 2011


From the post:

We are glad to announce the new version 3.1 of rCUDA. It has been developed in a joint collaboration with the Parallel Architectures Group from the Technical University of Valencia.

The rCUDA framework enables the concurrent usage of CUDA-compatible devices remotely.

rCUDA employs the socket API for the communication between clients and servers. Thus, it can be useful in three different environments:

  • Clusters. To reduce the number of GPUs installed in High Performance Clusters. This leads to increased GPU usage and therefore energy savings as well as other related savings like acquisition costs, maintenance, space, cooling, etc.
  • Academia. In commodity networks, to offer access to a few high performance GPUs concurrently to many students.
  • Virtual Machines. To enable the access to the CUDA facilities on the physical machine.

The current version of rCUDA (v3.1) implements most of the functions in the CUDA Runtime API version 4.0, excluding only those related with graphics interoperability. rCUDA 3.1 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides.

This was mentioned in the Letting GPUs run free post but I thought it merited a separate entry. This is very likely to be important.

Letting GPUs run free

Friday, December 2nd, 2011

Letting GPUs run free by Dan Olds.

From the post:

One of the most interesting things I saw at SC11 was a joint Mellanox and University of Valencia demonstration of rCUDA over Infiniband. With rCUDA, applications can access a GPU (or multiple GPUs) on any other node in the cluster. It makes GPUs a sharable resource and is a big step towards making them as virtualisable (I don’t think that’s a word, but going to go with it anyway) as any other compute resource.

There aren’t a lot of details out there yet, there’s this press release from Mellanox and Valencia and this explanation of the rCUDA project.

This is a big deal. To me, the future of computing will be much more heterogeneous and hybrid than homogeneous and, well, some other word that means ‘common’ and begins with ‘H’. We’re moving into a mindset where systems are designed to handle particular workloads, rather than workloads that are modified to run sort of well on whatever systems are cheapest per pound or flop.


Wednesday, November 2nd, 2011


If you need to access an NVIDIA CUDA interface for statistical calculations, GPUStats may be of assistance.

From the webpage:

gpustats is a PyCUDA-based library implementing functionality similar to that present in scipy.stats. It implements a simple framework for specifying new CUDA kernels and extending existing ones. Here is a (partial) list of target functionality:

  • Probability density functions (pdfs). These are intended to speed up likelihood calculations in particular in Bayesian inference applications, such as in PyMC
  • Random variable generation using CURAND
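For readers wondering what a "likelihood calculation" looks like before it is moved to a GPU, here is a CPU reference in standard-library Python. It does not use gpustats or its API, and the function names are my own. It shows the per-observation density evaluation that a library of this kind would run in parallel.

```python
import math


def normal_logpdf(x: float, mean: float, sd: float) -> float:
    """Log of the normal density at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_likelihood(data, mean: float, sd: float) -> float:
    """Sum of independent per-observation terms: one term per data point."""
    return sum(normal_logpdf(x, mean, sd) for x in data)


observations = [0.3, -1.2, 0.8, 2.1]
print(log_likelihood(observations, mean=0.0, sd=1.0))
```

Each term in the sum depends on a single observation, so the evaluation splits cleanly across threads, with a reduction at the end.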

MATLAB GPU / CUDA experiences

Thursday, July 28th, 2011

MATLAB GPU / CUDA experiences and tutorials on my laptop – Introduction

From the post:

These days it seems that you can’t talk about scientific computing for more than 5 minutes without someone bringing up the topic of Graphics Processing Units (GPUs). Originally designed to make computer games look pretty, GPUs are massively parallel processors that promise to revolutionise the way we compute.

A brief glance at the specification of a typical laptop suggests why GPUs are the new hotness in numerical computing. Take my new one for instance, a Dell XPS L702X, which comes with a Quad-Core Intel i7 Sandybridge processor running at up to 2.9Ghz and an NVidia GT 555M with a whopping 144 CUDA cores. If you went back in time a few years and told a younger version of me that I’d soon own a 148 core laptop then young Mike would be stunned. He’d also be wondering ‘What’s the catch?’

Parallel computing has been around for years but in the form of GPUs it has reached the hands of hackers and innovators. Will your next topic map application take advantage of parallel processing?

Thrust Graph Library

Sunday, April 17th, 2011

Thrust Graph Library

From the website:

Thrust Graph Library provides graph containers, algorithms, and other concepts like those of the Boost Graph Library. The library is based on Thrust, a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL).

The R Journal, Issue 2/2

Friday, December 31st, 2010

The R Journal, Issue 2/2 has arrived!

Download complete issue.

Or Individual articles.

A number of topic map relevant papers are in this issue, ranging from stringr: modern, consistent string processing, Hadley Wickham; to the edgy cudaBayesreg: Bayesian Computation in CUDA, Adelino Ferreira da Silva; to a technique that started in the late 1950’s, The RecordLinkage Package: Detecting Errors in Data, Murat Sariyar and Andreas Borg.

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA

Sunday, October 10th, 2010

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA

Authors: S.A. Arul Shalom, Manoranjan Dash, Minh Tue

Keywords: CUDA, hierarchical clustering, high performance computing, computations using graphics hardware, complete linkage


Graphics Processing Units in today’s desktops can well be thought of as a high performance parallel processor. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the domain of Data mining. Two types of Hierarchical clustering algorithms are realized on GPU using CUDA. Speed gains from 15 times up to about 90 times have been realized. The challenges involved in invoking Graphical hardware for such Data mining algorithms and effects of CUDA blocks are discussed. It is interesting to note that block size of 8 is optimal for GPU with 128 internal processors.
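Complete linkage, one of the keywords above, is simple to state in code. The following is a small CPU-only sketch in Python, not the authors' CUDA implementation, and the function name and sample points are hypothetical. It repeatedly merges the two clusters whose farthest members are closest.

```python
import math
from itertools import combinations


def complete_linkage(points, num_clusters):
    """Agglomerative clustering with complete linkage; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(complete_linkage(points, 3))
```

The pairwise distance evaluations inside `linkage` are independent of one another, which is the sort of work one would expect to benefit from a GPU.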

GPUs offer a great deal of processing power and programming them may provoke deeper insights into subject identification and mapping.

Topic mappers may be able to claim NVIDIA based software/hardware and/or Sony Playstation 3 and 4 units (Cell Broadband Engine) as a business expense (check with your tax advisor).

A GPU based paper for TMRA 2011 anyone?