Archive for the ‘CUDA’ Category

GPU Scripting and Code Generation with PyCUDA

Saturday, May 11th, 2013

GPU Scripting and Code Generation with PyCUDA by Andreas Klockner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.

Abstract:

High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

The author’s argue that while measurement of the productivity gains from PyCUDA are missing, spread use of PyCUDA is an indication of its usefulness.

Point taken.

More importantly, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.

Halide

Tuesday, November 6th, 2012

Halide – a language for image processing and computational photography

From the website:

Halide is a new programming language designed to make it easier to write high-performance image processing code on modern machines. Its current front end is an embedding in C++. Hardware targets include x86-64/SSE, ARM v7/NEON, and CUDA.

If you need a reason to learn C++, this could be it.

I first pointed to Halide at: Halide Wins Image Processing Gold Medal!.

Writing a modular GPGPU program in Java

Monday, August 6th, 2012

Writing a modular GPGPU program in Java by Masayuki Ioki, Shumpei Hozumi, and Shigeru Chiba.

Abstract:

This paper proposes a Java to CUDA runtime program translator for scientific-computing applications. Traditionally, these applications have been written in Fortran or C without using a rich modularization mechanism. Our translator enables those applications to be written in Java and run on GPGPUs while exploiting a rich modularization mechanism in Java. This translator dynamically generates optimized CUDA code from a Java program given at bytecode level when the program is running. By exploiting dynamic type information given at translation, the translator devirtualizes dynamic method dispatches and flattens objects into simple data representation in CUDA. To do this, a Java program must be written to satisfy certain constraints.

This paper also shows that the performance overheads due to Java and WootinJ are not significantly high.

Just in case you are starting to work on topic map processing routines for GPGPUs.

Something to occupy your time during the “dog days” of August.

Accelerating SQL Database Operations on a GPU with CUDA (merging spreadsheet data?)

Tuesday, January 31st, 2012

Accelerating SQL Database Operations on a GPU with CUDA by Peter Bakkum and Kevin Skadron.

Abstract:

Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the eff ort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.

Important lessons to be learned from this paper:

  • Don’t invent new languages for the average user to learn.
  • Avoid the need to modify existing programs
  • Write against common software

Remember that 75% of the BI market is still using spreadsheets. For all sorts of data but numeric data in particular.

I don’t have any experience with importing files into Excel but I assume there is a macro language that can used to create import processes.

Curious if there has been any work on creating import macros for Excel that incorporate merging as part of those imports?

That would:

  • Not be a new language for users to learn.
  • Avoid modification of existing programs (or data)
  • Be written against common software

I am not sure about the requirements for merging numeric data but that should make the exploration process all the more enjoyable.

PGStrom (PostgreSQL + GPU)

Tuesday, January 31st, 2012

PGStrom

From the webpage:

PG-Strom is a module of FDW (foreign data wrapper) of PostgreSQL database. It was designed to utilize GPU devices to accelarate sequential scan on massive amount of records with complex qualifiers. Its basic concept is CPU and GPU should focus on the workload with their advantage, and perform concurrently. CPU has much more flexibility, thus, it has advantage on complex stuff such as Disk-I/O, on the other hand, GPU has much more parallelism of numerical calculation, thus, it has advantage on massive but simple stuff such as check of qualifiers for each rows.

The below figure is a basic concept of PG-Strom. Now, on sequential scan workload, vanilla PostgreSQL does iteration of fetch a tuple and checks of qualifiers for each tuples. If we could consign GPU the workload of green portion, it enables to reduce workloads of CPU, thus, it shall be able to load more tuples in advance. Eventually, it should allow to provide shorter response-time on complex queries towards large amount of data.

Requires setting up the table for the GPU ahead of time but performance increase is reported to be 10x – 20x.

It occurs to me that GPUs should be well suited for graph processing. Yes? Will have to look into that and report back.

rCUDA

Friday, December 2nd, 2011

rCUDA

From the post:

We are glad to announce the new version 3.1 of rCUDA. It has been developed in a joint collaboration with the Parallel Architectures Group from the Technical University of Valencia.

The rCUDA framework enables the concurrent usage of CUDA-compatible devices remotely.

rCUDA employs the socket API for the communication between clients and servers. Thus, it can be useful in three different environments:

  • Clusters. To reduce the number of GPUs installed in High Performance Clusters. This leads to increased GPU usage and therefore energy savings as well as other related savings like acquisition costs, maintenance, space, cooling, etc.
  • Academia. In commodity networks, to offer access to a few high performance GPUs concurrently to many students.
  • Virtual Machines. To enable the access to the CUDA facilities on the physical machine.

The current version of rCUDA (v3.1) implements most of the functions in the CUDA Runtime API version 4.0, excluding only those related with graphics interoperability. rCUDA 3.1 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides.

This was mentioned in the Letting GPUs run free post but I thought it merited a separate entry. This is very likely to be important.

Letting GPUs run free

Friday, December 2nd, 2011

Letting GPUs run free by Dan Olds.

From the post:

One of the most interesting things I saw at SC11 was a joint Mellanox and University of Valencia demonstration of rCUDA over Infiniband. With rCUDA, applications can access a GPU (or multiple GPUs) on any other node in the cluster. It makes GPUs a sharable resource and is a big step towards making them as virtualisable (I don’t think that’s a word, but going to go with it anyway) as any other compute resource.

There aren’t a lot of details out there yet, there’s this press release from Mellanox and Valencia and this explanation of the rCUDA project.

This is a big deal. To me, the future of computing will be much more heterogeneous and hybrid than homogeneous and, well, some other word that means ‘common’ and begins with ‘H’. We’re moving into a mindset where systems are designed to handle particular workloads, rather than workloads that are modified to run sort of well on whatever systems are cheapest per pound or flop.

GPUStats

Wednesday, November 2nd, 2011

GPUStats

If you need to access a NVIDIA CUDA interface for statistical calculations, GPUStats may be of assistance.

From the webpage:

gpustats is a PyCUDA-based library implementing functionality similar to that present in scipy.stats. It implements a simple framework for specifying new CUDA kernels and extending existing ones. Here is a (partial) list of target functionality:

  • Probability density functions (pdfs). These are intended to speed up likelihood calculations in particular in Bayesian inference applications, such as in PyMC
  • Random variable generation using CURAND

MATLAB GPU / CUDA experiences

Thursday, July 28th, 2011

MATLAB GPU / CUDA experiences and tutorials on my laptop – Introduction

From the post:

These days it seems that you can’t talk about scientific computing for more than 5 minutes without somone bringing up the topic of Graphics Processing Units (GPUs). Originally designed to make computer games look pretty, GPUs are massively parallel processors that promise to revolutionise the way we compute.

A brief glance at the specification of a typical laptop suggests why GPUs are the new hotness in numerical computing. Take my new one for instance, a Dell XPS L702X, which comes with a Quad-Core Intel i7 Sandybridge processor running at up to 2.9Ghz and an NVidia GT 555M with a whopping 144 CUDA cores. If you went back in time a few years and told a younger version of me that I’d soon own a 148 core laptop then young Mike would be stunned. He’d also be wondering ‘What’s the catch?’

Parallel computing has been around for years but in the form of GPUs it has reached the hands of hackers and innovators. Will your next topic map application take advantage of parallel processing?

Thrust Graph Library

Sunday, April 17th, 2011

Thrust Graph Library

From the website:

Thrust Graph Library provides graph container, algorithm, and other concepts like a Boost Graph Library. This Library based on the thrust, which is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL).

The R Journal, Issue 2/2

Friday, December 31st, 2010

The R Journal, Issue 2/2 has arrived!

Download complete issue.

Or Individual articles.

A number of topic map relevant papers are in this issue, ranging from stringr: modern, consistent string processing, Hadley Wickham; to the edgy cudaBayesreg: Bayesian Computation in CUDA, Adelino Ferreira da Silva; to a technique that started in the late 1950′s, The RecordLinkage Package: Detecting Errors in Data, Murat Sariyar and Andreas Borg.

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA

Sunday, October 10th, 2010

An Approach for Fast Hierarchical Agglomerative Clustering Using Graphics Processors with CUDA Authors: S.A. Arul Shalom, Manoranjan Dash, Minh Tue Keywords: CUDA. Hierarchical clustering, High performance Computing, Computations using Graphics hardware, complete linkage

Abstract:

Graphics Processing Units in today’s desktops can well be thought of as a high performance parallel processor. Each single processor within the GPU is able to execute different tasks independently but concurrently. Such computational capabilities of the GPU are being exploited in the domain of Data mining. Two types of Hierarchical clustering algorithms are realized on GPU using CUDA. Speed gains from 15 times up to about 90 times have been realized. The challenges involved in invoking Graphical hardware for such Data mining algorithms and effects of CUDA blocks are discussed. It is interesting to note that block size of 8 is optimal for GPU with 128 internal processors.

GPUs offer a great deal of processing power and programming them may provoke deeper insights into subject identification and mapping.

Topic mappers may be able to claim NVIDIA based software/hardware and/or Sony Playstation 3 and 4 units (Cell Broadband Engine) as a business expense (check with your tax advisor).

A GPU based paper for TMRA 2011 anyone?