## Archive for the ‘GPU’ Category

### Medusa: Simplified Graph Processing on GPUs

Saturday, May 11th, 2013

Medusa: Simplified Graph Processing on GPUs by Jianlong Zhong, Bingsheng He.

Abstract:

Graphs are the de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest path. However, it is difficult to write correct and efficient GPU programs and even more difficult for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs, and embraces a runtime system to automatically execute those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architecture features of GPU for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that (1) Medusa greatly simplifies implementation of GPGPU programs for graph processing, with much fewer lines of source code written by developers; (2) The optimization techniques significantly improve the performance of the runtime system, making its performance comparable with or better than the manually tuned GPU graph operations.

Just in case you are interested in high performance graph processing.

### GPU Scripting and Code Generation with PyCUDA

Saturday, May 11th, 2013

GPU Scripting and Code Generation with PyCUDA by Andreas Klockner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.

Abstract:

High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

The author’s argue that while measurement of the productivity gains from PyCUDA are missing, spread use of PyCUDA is an indication of its usefulness.

Point taken.

More importantly, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.

### Massively Parallel Suffix Array Queries…

Saturday, May 11th, 2013

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.

Abstract:

Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.

### Data-rich astronomy: mining synoptic sky surveys [Data Bombing]

Saturday, May 11th, 2013

Data-rich astronomy: mining synoptic sky surveys by Stefano Cavuoti.

Abstract:

In the last decade a new generation of telescopes and sensors has allowed the production of a very large amount of data and astronomy has become, a data-rich science; this transition is often labeled as: “data revolution” and “data tsunami”. The first locution puts emphasis on the expectations of the astronomers while the second stresses, instead, the dramatic problem arising from this large amount of data: which is no longer computable with traditional approaches to data storage, data reduction and data analysis. In a new, age new instruments are necessary, as it happened in the Bronze age when mankind left the old instruments made out of stone to adopt the new, better ones made with bronze. Everything changed, even the social structure. In a similar way, this new age of Astronomy calls for a new generation of tools and, for a new methodological approach to many problems, and for the acquisition of new skills. The attempts to find a solution to this problems falls under the umbrella of a new discipline which originated by the intersection of astronomy, statistics and computer science: Astroinformatics, (Borne, 2009; Djorgovski et al., 2006).

Dissertation by the same Stefano Cavuoti of: Astrophysical data mining with GPU….

Along with every new discipline comes semantics that are transparent to insiders and opaque to others.

Not out of malice but economy. Why explain a term if all those attending the discussion understand what it means?

But that lack of explanation, like our current ignorance about the means used to construct the pyramids, can come back to bite you.

In some cases far more quickly than intellectual curiosity about ancient monuments by the tin hat crowd.

Take the continuing failure of data integration by the U.S. intelligence services for example.

Rather than the current mule-like resistance to sharing, I would data bomb the other intelligence services with incompatible data exports every week.

Full sharing, for all they would be able to do with it.

Unless they had a topic map.

### Virtual School summer courses…

Monday, May 6th, 2013

Virtual School summer courses on data-intensive and many-core computing

From the webpage:

Graduate students, post-docs and professionals from academia, government, and industry are invited to sign up now for two summer school courses offered by the Virtual School of Computational Science and Engineering.

These Virtual School courses will be delivered to sites nationwide using high-definition videoconferencing technologies, allowing students to participate at a number of convenient locations where they will be able to work with a cohort of fellow computational scientists, have access to local experts, and interact in real time with course instructors.

The Data Intensive Summer School focuses on the skills needed to manage, process, and gain insight from large amounts of data. It targets researchers from the physical, biological, economic, and social sciences who need to deal with large collections of data. The course will cover the nuts and bolts of data-intensive computing, common tools and software, predictive analytics algorithms, data management, and non-relational database models.

(…)

The Proven Algorithmic Techniques for Many-core Processors summer school will present students with the seven most common and crucial algorithm and data optimization techniques to support successful use of GPUs for scientific computing.

Studying many current GPU computing applications, the course instructors have learned that the limits of an application’s scalability are often related to some combination of memory bandwidth saturation, memory contention, imbalanced data distribution, or data structure/algorithm interactions. Successful GPU application developers often adjust their data structures and problem formulation specifically for massive threading and executed their threads leveraging shared on-chip memory resources for bigger impact. The techniques presented in the course can improve performance of applicable kernels by 2-10X in current processors while improving future scalability.

(…)

For more information about the Proven Algorithmic Techniques for Many-core Processors course, including pre-requisites and course topics, visit http://www.vscse.org/summerschool/2013/manycore.html.

Think of it as summer camp. For $100 (waived at some locations), it would be hard to do better. ### Introduction to Parallel Programming Friday, May 3rd, 2013 Introduction to Parallel Programming by John Owens, David Luebke, Cheng-Han Lee and Mike Roberts. (UDACITY) Class Summary: Learn the fundamentals of parallel computing with the GPU and the CUDA programming environment! In this class, you’ll learn about parallel programming by coding a series of image processing algorithms, such as you might find in Photoshop or Instagram. You’ll be able to program and run your assignments on high-end GPUs, even if you don’t own one yourself. What Should I Know? We expect students to have a solid experience with the C programming language and basic knowledge of data structures and algorithms. What Will I Learn? You’ll master the fundamentals of massively parallel computing by using CUDA C/C++ to program modern GPUs. You’ll learn the GPU programming model and architecture, key algorithms and parallel programming patterns, and optimization techniques. Your assignments will illustrate these concepts through image processing applications, but this is a parallel computing course and what you learn will translate to any application domain. Most of all we hope you’ll learn how to think in parallel. In Fast Database Emerges from MIT Class… [Think TweetMap] you read about a new SQL database based on GPUs. What new approach is going to emerge from your knowing more about GPUs and parallel programming? ### History of the Modern GPU Series Wednesday, April 24th, 2013 History of the Modern GPU Series From the post: Graham Singer over at Techspot posted a series of articles a few weeks ago covering the history of the modern GPU. It is well-written and in-depth. For GPU affectionados, this is a nice read. There are 4 parts to the series: Just in case you are excited about the GPU news reported below, a bit of history might not hurt. ### Fast Database Emerges from MIT Class… [Think TweetMap] Wednesday, April 24th, 2013 Fast Database Emerges from MIT Class, GPUs and Student’s Invention by Ian B. Murphy. Details the invention of MapD by Todd Mostak. From the post: MapD, At A Glance: MapD is a new database in development at MIT, created by Todd Mostak. • MapD stands for “massively parallel database.” • The system uses graphics processing units (GPUs) to parallelize computations. Some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce. • A MapD server costs around$5,000 and runs on the same power as five light bulbs.
• MapD runs at between 1.4 and 1.5 teraflops, roughly equal to the fastest supercomputer in 2000.
• MapD uses SQL to query data.
• Mostak intends to take the system open source sometime in the next year.

Sam Madden (MIT) describes MapD this way:

Madden said there are three elements that make Mostak’s database a disruptive technology. The first is the millisecond response time for SQL queries across “huge” datasets. Madden, who was a co-creator of the Vertica columnar database, said MapD can do in milliseconds what Vertica can do in minutes. That difference in speed is everything when doing iterative research, he said.

The second is the very tight coupling between data processing and visually rendering the data; this is a byproduct of building the system from GPUs from the beginning. That adds the ability to visualize the results of the data processing in under a second. Third is the cost to build the system. MapD runs in a server that costs around $5,000. “He can do what a 1000 node MapReduce cluster would do on a single processor for some of these applications,” Madden said. Not a lot of technical detail but you could start learning CUDA while waiting for the open source release. At 1.4 to 1.5 teraflops on$5,000 worth of hardware, how will clusters will retain their customer base?

### Welcome to TweetMap ALPHA

Wednesday, April 24th, 2013

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak, (currently a researcher at MIT), and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

### Astrophysical data mining with GPU…

Tuesday, April 9th, 2013

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescape’, Giuseppe Longo, Giorgio Ventre.

Abstract:

We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.

BTW, DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html.

In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.

Makes me curious about the “noise” in other communications.

The “signal” is fairly easy to identify in astronomy, but what about in text or speech?

I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?

Or noise in a conversation that is clearly audible?

If we have 100% signal, how do we explain failing to understand a message in speech or writing?

If it is not “noise,” then what is the problem?

### Deploying Graph Algorithms on GPUs: an Adaptive Solution

Sunday, April 7th, 2013

Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)

From the post:

Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.

A development that may surprise some graph software vendors, there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPU.

This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”

I first saw this in a tweet by Stefano Bertolo.

### STScI’s Engineering and Technology Colloquia

Tuesday, April 2nd, 2013

STScI’s Engineering and Technology Colloquia Series Webcasts by Bruce Berriman.

From the post:

Last week, I wrote a post about Michelle Borkin’s presentation on Astronomical Medicine and Beyond, part of the Space Telescope Science Institute’s (STScI) Engineering and Technology Colloquia Series. STScI archives and posts on-line all the presentations in this series. The talks go back to 2008 (with one earlier one dating to 2001), are generally given monthly or quarterly, and represent a rich source of information on many aspects of engineering and technology. The archive includes, where available, abstracts, Power Point Slides, videos for download, and for the more recent presentations, webcasts.

Definitely a astronomy/space flavor but also includes:

Scientific Data Visualization by Adam Bly (Visualizing.org, Seed Media Group).

Knowledge Retention & Transfer: What You Need to Know by Jay Liebowitz (UMUC).

Fast Parallel Processing Using GPUs for Accelerating Image Processing by Tom Reed (Nvidia Corporation).

Every field is struggling with the same data/knowledge issues, often using different terminologies or examples.

We can all struggle separately or we can learn from others.

Which approach do you use?

### Duplicate Detection on GPUs

Saturday, March 23rd, 2013

Duplicate Detection on GPUs by Benedikt Forchhammer, Thorsten Papenbrock, Thomas Stening, Sven Viehmeier, Uwe Draisbach, Felix Naumann.

Abstract:

With the ever increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware.

Synonyms: Duplicate detection = entity matching = record linkage (and all the other alternatives for those terms).

This looks wicked cool!

I first saw this in a tweet by Stefano Bertolo.

### Graph Databases, GPUs, and Graph Analytics

Wednesday, February 20th, 2013

Graph Databases, GPUs, and Graph Analytics by Bryan Thompson.

From the post:

For people who want to track what we’ve been up to on the XDATA project, there are three surveys articles that we’ve produced:

- Literature Review of Graph Databases (Bryan Thompson, SYSTAP)
- Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
- Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)

It may be too early to take off for a weekend of reading but I wish….

### Getting Started with ArrayFire – a 30-minute Jump Start

Thursday, January 10th, 2013

Getting Started with ArrayFire – a 30-minute Jump Start

From the post:

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts.

ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax for ArrayFire. He also provided an overview of existing efforts in GPU software, and compared them to the extensive capabilities of ArrayFire.

If you need to push the limits of current performance, GPUs are one way to go.

Maybe 2013 will be your GPU year!

### Fast Parallel Sorting Algorithms on GPUs

Wednesday, December 5th, 2012

Fast Parallel Sorting Algorithms on GPUs by Bilal Jan, Bartolomeo Montrucchio, Carlo Ragusa, Fiaz Gul Khan, Omar Khan.

Abstract:

This paper presents a comparative analysis of the three widely used parallel sorting algorithms: OddEven sort, Rank sort and Bitonic sort in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside we have implemented novel parallel algorithm: min-max butterfly network, for finding minimum and maximum in large data sets. All algorithms have been implemented exploiting data parallelism model, for achieving high performance, as available on multi-core GPUs using the OpenCL specification. Our results depicts minimum speed-up19x of bitonic sort against oddeven sorting technique for small queue sizes on CPU and maximum of 2300x speed-up for very large queue sizes on Nvidia Quadro 6000 GPU architecture. Our implementation of full-butterfly network sorting results in relatively better performance than all of the three sorting techniques: bitonic, odd-even and rank sort. For min-max butterfly network, our findings report high speed-up of Nvidia quadro 6000 GPU for high data set size reaching 224 with much lower sorting time.

Is there a GPU in your topic map processing future?

I first saw this in a tweet by Stefano Bertolo.

### Efficient similarity search on multimedia databases [No Petraeus Images, Yet]

Wednesday, November 14th, 2012

Efficient similarity search on multimedia databases by Mariela Lopresti, Natalia Miranda, Fabiana Piccoli, Nora Reyes.

Abstract:

Manipulating and retrieving multimedia data has received increasing attention with the advent of cloud storage facilities. The ability of querying by similarity over large data collections is mandatory to improve storage and user interfaces. But, all of them are expensive operations to solve only in CPU; thus, it is convenient to take into account High Performance Computing (HPC) techniques in their solutions. The Graphics Processing Unit (GPU) as an alternative HPC device has been increasingly used to speedup certain computing processes. This work introduces a pure GPU architecture to build the Permutation Index and to solve approximate similarity queries on multimedia databases. The empirical results of each implementation have achieved different level of speedup which are related with characteristics of GPU and the particular database used.

No images have been published, yet, in the widening scandal around David Petraeus.

When they do, searching multimedia databases such as Flickr, Facebook, YouTube and others will be a hot issue.

Once found, there is the problem of finding unique ones again and duplicates not again.

### hgpu.org

Thursday, November 8th, 2012

hgpu.org – high performance computing on graphics processing units

Wealth of GPU computing resources. Will take days to explore fully (if then).

Highest level view:

• Applications – Where it’s used
• Hardware – Specs and reviews
• Programming – Algorithms and techniques
• Resources – Source Code, tutorials, books, etc.
• Tools – GPU Sources

Homepage is rather “busy” but packed with information (as opposed to gadgets). Lists the most recent entries, most viewed papers, most recent source code and events.

One special item to note:

Free GPU computing node at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

Oh, did I mention that registration is free?

If you don’t get a multi-GPU unit under the Christmas tree, you can still hum along.

### Efficient implementation of data flow graphs on multi-gpu clusters

Thursday, November 8th, 2012

Abstract:

Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, tasks synchronization. In this paper, we propose a high level programming model based on a data flow graph (DFG) allowing an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus particularly on the efficient implementation of communications by automating computation-communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-gpu benchmark, a 3D granulometry application developed for research on materials and an application for computing visual saliency maps.

Analysis of the statistics of sizes in images (granulometry) and focusing on a particular place of interest in an image (visual saliency) were interesting use cases.

May or may not be helpful in particular cases, depending on your tests for subject identity.

### Java for graphics cards

Sunday, August 19th, 2012

Java for graphics cards

From the post:

Phil Pratt-Szeliga, a postgraduate at Syracuse University in New York, has released the source code of his Rootbeer GPU compiler on Github. The developer presented the software at the High Performance Computing and Communication conference in Liverpool in June. The slides from this presentation can be found in the documentation section of the Github directory.

Short summary of Phil Pratt-Szeliga’s GPU compiler.

Is it a waste to have GPU cycles lying around or is there some more fundamental issue at stake?

To what degree does chip architecture drive choices at higher levels of abstraction?

Suggestions of ways to explore that question?

### Writing a modular GPGPU program in Java

Monday, August 6th, 2012

Writing a modular GPGPU program in Java by Masayuki Ioki, Shumpei Hozumi, and Shigeru Chiba.

Abstract:

This paper proposes a Java to CUDA runtime program translator for scientific-computing applications. Traditionally, these applications have been written in Fortran or C without using a rich modularization mechanism. Our translator enables those applications to be written in Java and run on GPGPUs while exploiting a rich modularization mechanism in Java. This translator dynamically generates optimized CUDA code from a Java program given at bytecode level when the program is running. By exploiting dynamic type information given at translation, the translator devirtualizes dynamic method dispatches and flattens objects into simple data representation in CUDA. To do this, a Java program must be written to satisfy certain constraints.

This paper also shows that the performance overheads due to Java and WootinJ are not significantly high.

Just in case you are starting to work on topic map processing routines for GPGPUs.

Something to occupy your time during the “dog days” of August.

### Cloud-Hosted GPUs And Gaming-As-A-Service

Friday, May 18th, 2012

From the post:

NVIDIA is all buckled up to redefine the dynamics of gaming. The company has spilled the beans over three novel cloud technologies aimed at accelerating the available remote computational power by endorsing the number-crunching potential of its very own (and redesigned) graphical processing units.

At the heart of each of the three technologies lies the latest Kepler GPU architecture, custom-tailored for utility in volumetric datacenters. Through virtualization software, a number of users achieve access through the cutting-edge computational capability of the GPUs.

Jen-Hsun Huang, NVIDIA’s president and CEO, firmly believes that the Kepler cloud GPU technology is bound to take cloud computing to an entirely new level. He advocates that the GPU has become a significant constituent of contemporary computing devices. Digital artists are essentially dependent upon the GPU for conceptualizing their thoughts. Touch devices owe a great deal to the GPU for delivering a streamlined graphical experience.

With the introduction of the cloud GPU, NVIDIA is all set to change the game—literally. NVIDIA’s cloud-based GPU will bring an amazingly pleasant experience to gamers on a hunt to play in an untethered manner from a console or personal computer.

First in line is the NVIDIA VGX platform, an enterprise-level execution of the Kepler cloud technologies, primarily targeting virtualized desktop performance boosts. The company is hopeful that ventures will make use of this particular platform to ensure flawless remote computing and cater to the most computationally starved applications to be streamed directly to a notebook, tablet or any other mobile device variant. Jeff Brown, GM at NVIDIA’s Professional Solutions Group, is reported to have marked the VGX as the starting point for a “new era in desktop virtualization” that promises a cost-effective virtualization solution offering “an experience almost indistinguishable from a full desktop”.

Results with GPUs have been encouraging and spreading their availability as a cloud-based GPU should lead to a wider variety of experiences.

The emphasis here is making the lives of gamers more pleasant but one expects serious uses, such as graph processing, to not be all that far behind.

### Ohio State University Researcher Compares Parallel Systems

Tuesday, April 3rd, 2012

Ohio State University Researcher Compares Parallel Systems

From the post:

Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

“We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

Thinking that for subject identity and “merging” in a “big data” context, that we will need a deeper understanding of specific identity and merging requirements. So that the result of that study is one or more well-defined combinatorial problems.

That is to say that understanding one or more combinatorial problems precedes proposing a solution.

You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study

Or view the article (assuming you have access):

Parallel solution of the subset-sum problem: an empirical study

Abstract (of the article):

The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 1010. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 1011 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 1012 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.

### Accelerating SQL Database Operations on a GPU with CUDA (merging spreadsheet data?)

Tuesday, January 31st, 2012

Accelerating SQL Database Operations on a GPU with CUDA by Peter Bakkum and Kevin Skadron.

Abstract:

Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the eff ort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.

Important lessons to be learned from this paper:

• Don’t invent new languages for the average user to learn.
• Avoid the need to modify existing programs
• Write against common software

Remember that 75% of the BI market is still using spreadsheets. For all sorts of data but numeric data in particular.

I don’t have any experience with importing files into Excel but I assume there is a macro language that can used to create import processes.

Curious if there has been any work on creating import macros for Excel that incorporate merging as part of those imports?

That would:

• Not be a new language for users to learn.
• Avoid modification of existing programs (or data)
• Be written against common software

I am not sure about the requirements for merging numeric data but that should make the exploration process all the more enjoyable.

### PGStrom (PostgreSQL + GPU)

Tuesday, January 31st, 2012

PGStrom

From the webpage:

PG-Strom is a module of FDW (foreign data wrapper) of PostgreSQL database. It was designed to utilize GPU devices to accelarate sequential scan on massive amount of records with complex qualifiers. Its basic concept is CPU and GPU should focus on the workload with their advantage, and perform concurrently. CPU has much more flexibility, thus, it has advantage on complex stuff such as Disk-I/O, on the other hand, GPU has much more parallelism of numerical calculation, thus, it has advantage on massive but simple stuff such as check of qualifiers for each rows.

The below figure is a basic concept of PG-Strom. Now, on sequential scan workload, vanilla PostgreSQL does iteration of fetch a tuple and checks of qualifiers for each tuples. If we could consign GPU the workload of green portion, it enables to reduce workloads of CPU, thus, it shall be able to load more tuples in advance. Eventually, it should allow to provide shorter response-time on complex queries towards large amount of data.

Requires setting up the table for the GPU ahead of time but performance increase is reported to be 10x – 20x.

It occurs to me that GPUs should be well suited for graph processing. Yes? Will have to look into that and report back.

### Ålenkå

Monday, January 30th, 2012

Ålenkå

If you don’t mind alpha code, ålenkå was pointed out in the bitmap posting I cited earlier today.

From its homepage:

Alenka is a modern analytical database engine written to take advantage of vector based processing and high bandwidth of modern GPUs.

Features include:

Vector-based processing
CUDA programming model allows a single operation to be applied to an entire set of data at once.

Self optimizing compression
Ultra fast compression and decompression performed directly inside GPU

Column-based storage
Minimize disk I/O by only accessing the relevant data

Data load times measured in minutes, not in hours.

Open source and free

Apologies for the name spelling differences, Ålenkå versus Alenka. I suspect it has something to do with character support in whatever produced the readme file, but can’t say for sure.

The benchmarks (there is that term again) are impressive.

Would semantic benchmarks be different from the ones used in IR currently? Different from precision and recall? What about range (same subject but identified differently) or accuracy (different identifications but same subject, how many false positives)?

### rCUDA

Friday, December 2nd, 2011

rCUDA

From the post:

We are glad to announce the new version 3.1 of rCUDA. It has been developed in a joint collaboration with the Parallel Architectures Group from the Technical University of Valencia.

The rCUDA framework enables the concurrent usage of CUDA-compatible devices remotely.

rCUDA employs the socket API for the communication between clients and servers. Thus, it can be useful in three different environments:

• Clusters. To reduce the number of GPUs installed in High Performance Clusters. This leads to increased GPU usage and therefore energy savings as well as other related savings like acquisition costs, maintenance, space, cooling, etc.
• Academia. In commodity networks, to offer access to a few high performance GPUs concurrently to many students.
• Virtual Machines. To enable the access to the CUDA facilities on the physical machine.

The current version of rCUDA (v3.1) implements most of the functions in the CUDA Runtime API version 4.0, excluding only those related with graphics interoperability. rCUDA 3.1 targets the Linux OS (for 32- and 64-bit architectures) on both client and server sides.

This was mentioned in the Letting GPUs run free post but I thought it merited a separate entry. This is very likely to be important.

### Letting GPUs run free

Friday, December 2nd, 2011

Letting GPUs run free by Dan Olds.

From the post:

One of the most interesting things I saw at SC11 was a joint Mellanox and University of Valencia demonstration of rCUDA over Infiniband. With rCUDA, applications can access a GPU (or multiple GPUs) on any other node in the cluster. It makes GPUs a sharable resource and is a big step towards making them as virtualisable (I don’t think that’s a word, but going to go with it anyway) as any other compute resource.

There aren’t a lot of details out there yet, there’s this press release from Mellanox and Valencia and this explanation of the rCUDA project.

This is a big deal. To me, the future of computing will be much more heterogeneous and hybrid than homogeneous and, well, some other word that means ‘common’ and begins with ‘H’. We’re moving into a mindset where systems are designed to handle particular workloads, rather than workloads that are modified to run sort of well on whatever systems are cheapest per pound or flop.

### GPUs: Path into the future

Thursday, December 1st, 2011

GPUs: Path into the future

From the introduction:

With the announcement of a new Blue Waters petascale system that includes a considerable amount of GPU capability, it is clear GPUs are the future of supercomputing. Access magazine’s Barbara Jewett recently sat down with Wen-mei Hwu, a professor of electrical and computer engineering at the University of Illinois, a co-principal investigator on the Blue Waters project, and an expert in computer architecture, especially GPUs.

Find out why you should start thinking about GPU systems, now.

### A Common GPU n-Dimensional Array for Python and C

Tuesday, November 29th, 2011

A Common GPU n-Dimensional Array for Python and C by Frédéric Bastien, Arnaud Bergeron, Pascal Vincent and Yoshua Bengio

From the webpage:

Currently there are multiple incompatible array/matrix/n-dimensional base object implementations for GPUs. This hinders the sharing of GPU code and causes duplicate development work. This paper proposes and presents a first version of a common GPU n-dimensional array(tensor) named GpuNdArray that works with both CUDA and OpenCL. It will be usable from python, C and possibly other languages.

Apologies, all I can give you today is a pointer to the accepted papers for Big Learning, Day 1, first paper, which promises a PDF soon.

I didn’t check the PDF link yesterday when I saw it. My bad.

Anyway, there are a lot of other interesting papers at this site and I will update this entry when this paper appears. The conference is December 16-17, 2011 so it may not be too long of a wait.