Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 5, 2014

CUDA 6, Available as Free Download, …

Filed under: GPU,NVIDIA — Patrick Durusau @ 5:02 pm

CUDA 6, Available as Free Download, Makes Parallel Programming Easier, Faster by George Millington.

From the post:

We’re always striving to make parallel programming better, faster and easier for developers creating next-gen scientific, engineering, enterprise and other applications.

With the latest release of the CUDA parallel programming model, we’ve made improvements in all these areas.

Available now to all developers on the CUDA website, the CUDA 6 Release Candidate is packed with several new features that are sure to please developers.

A few highlights:

  • Unified Memory – This major new feature lets CUDA applications access CPU and GPU memory without the need to manually copy data from one to the other. This is a major time saver that simplifies the programming process, and makes it easier for programmers to add GPU acceleration in a wider range of applications.
  • Drop-in Libraries – Want to instantly accelerate your application by up to 8X? The new drop-in libraries can automatically accelerate your BLAS and FFTW calculations by simply replacing the existing CPU-only BLAS or FFTW library with the new, GPU-accelerated equivalent.
  • Multi-GPU Scaling – Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node. This provides over nine teraflops of double-precision performance per node, supporting larger workloads than ever before (up to 512GB).

And there’s more.

In addition to the new features, the CUDA 6 platform offers a full suite of programming tools, GPU-accelerated math libraries, documentation and programming guides.

To keep informed about the latest CUDA developments, and to access a range of parallel programming tools and resources, we encourage you to sign up for the free CUDA/GPU Computing Registered Developer Program at the NVIDIA Developer Zone website.
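
The Unified Memory item is the one most likely to change day-to-day CUDA code. A minimal sketch of what it looks like in practice (my own example, not NVIDIA's sample code; managed allocations need a compute capability 3.0+ device):

    // SAXPY with Unified Memory: no explicit cudaMemcpy in either direction.
    // Build with: nvcc -arch=sm_30 saxpy_um.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];          // y = a*x + y
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));    // visible to host and device
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }   // host writes
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);             // device reads/writes
        cudaDeviceSynchronize();                     // required before the host reads again
        printf("y[0] = %f\n", y[0]);                 // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

The same program written with cudaMalloc would need two host-to-device copies and one device-to-host copy; that bookkeeping is what Unified Memory removes.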

The only sad note is that processing power continues to out-distance the ability to document and manipulate the semantics of data.

Not unlike having a car that can cross the North American continent in an hour but not having a map of locations between the coasts.

You arrive quickly, but is it where you wanted to go?

February 6, 2014

Map-D: A GPU Database…

Filed under: GPU,MapD,NVIDIA — Patrick Durusau @ 8:34 pm

Map-D: A GPU Database for Real-time Big Data Analytics and Interactive Visualization by Todd Mostak (map-D) and Tom Graham (map-D). (MP4)

From the description:

map-D makes big data interactive for anyone! map-D is a super-fast GPU database that allows anyone to interact with and visualize streaming big data in real time. Its unique architecture runs 70-1,000x faster than other in-memory databases or big data analytics platforms. To boot, it works with any size or kind of dataset; works with data that is streaming live on to the system; uses cheap, off-the-shelf hardware; is easily scalable.

map-D is focused on learning from big data. At the moment, the map-D team is working on projects with MIT CSAIL, the Harvard Center for Geographic Analysis and the Harvard-Smithsonian Center for Astrophysics. Join Todd Mostak and Tom Graham, key members of the map-D team, as they demonstrate the speed and agility of map-D and describe the live processing, search and mapping of over 1 billion tweets.

I have been haunting the GTC On-Demand page waiting for this to be posted.

I had to download the MP4 (approximately 124 MB). I suspect they are creating a lot of traffic at the GTC On-Demand page.

As a bonus, see also:

Map-D: GPU-Powered Databases and Interactive Social Science Research in Real Time by Tom Graham (Map_D) and Todd Mostak (Map_D) (streaming) or PDF.

From the description:

Map-D (Massively Parallel Database) uses multiple NVIDIA GPUs to interactively query and visualize big data in real-time. Map-D is an SQL-enabled column store that generates 70-400X speedups over other in-memory databases. This talk discusses the basic architecture of the system, the advantages and challenges of running queries on the GPU, and the implications of interactive and real-time big data analysis in the social sciences and beyond.

Suggestions of more links/papers on Map-D greatly appreciated!

Enjoy!

PS: Just so you aren’t too shocked, the Twitter demo involves scanning a billion-row database in 5 milliseconds.

February 3, 2014

Parallel implementation of 3D protein structure similarity searches…

Filed under: Bioinformatics,GPU,Similarity,Similarity Retrieval — Patrick Durusau @ 8:41 pm

Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA by Dariusz Mrozek, Milosz Brozek, and Bozena Malysiak-Mrozek.

Abstract:

Searching for similar 3D protein structures is one of the primary processes employed in the field of structural bioinformatics. However, the computational complexity of this process means that it is constantly necessary to search for new methods that can perform such a process faster and more efficiently. Finding molecular substructures that complex protein structures have in common is still a challenging task, especially when entire databases containing tens or even hundreds of thousands of protein structures must be scanned. Graphics processing units (GPUs) and general purpose graphics processing units (GPGPUs) can perform many time-consuming and computationally demanding processes much more quickly than a classical CPU can. In this paper, we describe the GPU-based implementation of the CASSERT algorithm for 3D protein structure similarity searching. This algorithm is based on the two-phase alignment of protein structures when matching fragments of the compared proteins. The GPU (GeForce GTX 560Ti: 384 cores, 2GB RAM) implementation of CASSERT (“GPU-CASSERT”) parallelizes both alignment phases and yields an average 180-fold increase in speed over its CPU-based, single-core implementation on an Intel Xeon E5620 (2.40GHz, 4 cores). In this paper, we show that massive parallelization of the 3D structure similarity search process on many-core GPU devices can reduce the execution time of the process, allowing it to be performed in real time. GPU-CASSERT is available at: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm.

Seventeen pages of heavy sledding, but an average 180-fold increase in speed? That’s worth the effort.

Sorry, I got distracted. How difficult did you say your subject similarity/identity problem was? 😉

February 1, 2014

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs

Filed under: GPU,K-Nearest-Neighbors — Patrick Durusau @ 3:44 pm

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs by Fabian Gieseke, Justin Heinermann, Cosmin Oancea, Christian Igel. (Journal of Machine Learning Research W&CP 32 (1): 172-180, 2014)

Abstract:

We present a new approach for combining k-d trees and graphics processing units for nearest neighbor search. It is well known that a direct combination of these tools leads to a non-satisfying performance due to conditional computations and suboptimal memory accesses. To alleviate these problems, we propose a variant of the classical k-d tree data structure, called buffer k-d tree, which can be used to reorganize the search. Our experiments show that we can take advantage of both the hierarchical subdivision induced by k-d trees and the huge computational resources provided by today’s many-core devices. We demonstrate the potential of our approach in astronomy, where hundreds of millions of nearest neighbor queries have to be processed.

The complexity of your data may or may not be a barrier to this technique. The authors report that specialized solutions exist for feature spaces of fewer than four dimensions, and that the benefits of their approach extend to feature spaces of up to 27 dimensions.

Pure speculation on my part, but it could be the case that at some level of size or complexity of data, a general solution that is equally applicable to every data set isn’t possible.
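
For a sense of scale, the brute-force baseline that tree-based methods compete against is trivially parallel: one thread per reference point, then a reduction. A rough sketch for a single query (my own illustration, not the paper's buffer k-d tree):

    // Brute-force nearest neighbor for one query point.
    // Build with: nvcc -O2 nn_brute.cu
    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/extrema.h>

    __global__ void sq_distances(int n, int d, const float *points,
                                 const float *query, float *dist) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float s = 0.0f;
        for (int k = 0; k < d; ++k) {                 // low-dimensional feature space
            float diff = points[i * d + k] - query[k];
            s += diff * diff;
        }
        dist[i] = s;
    }

    int main() {
        const int n = 1 << 20, d = 3;
        thrust::device_vector<float> points(n * d, 1.0f);   // placeholder data
        thrust::device_vector<float> query(d, 0.5f);
        thrust::device_vector<float> dist(n);
        sq_distances<<<(n + 255) / 256, 256>>>(
            n, d,
            thrust::raw_pointer_cast(points.data()),
            thrust::raw_pointer_cast(query.data()),
            thrust::raw_pointer_cast(dist.data()));
        // Reduction on the GPU: index of the smallest squared distance.
        thrust::device_vector<float>::iterator it =
            thrust::min_element(dist.begin(), dist.end());
        printf("nearest index = %ld, squared distance = %f\n",
               (long)(it - dist.begin()), (float)*it);
        return 0;
    }

The buffer k-d tree's contribution is avoiding most of that work while still keeping the GPU's cores busy, which is exactly the tension the abstract describes.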

I first saw this in a tweet by Stefano Bertolo.

January 29, 2014

Map-D (the details)

Filed under: GPU,MapD — Patrick Durusau @ 9:15 pm

MIT Spinout Exploits GPU Memory for Vast Visualization by Alex Woodie.

From the post:

An MIT research project turned open source project dubbed the Massively Parallel Database (Map-D) is turning heads for its capability to generate visualizations on the fly from billions of data points. The software—an SQL-based, column-oriented database that runs in the memory of GPUs—can deliver interactive analysis of 10TB datasets with millisecond latencies. For this reason, its creator feels comfortable calling it “the fastest database in the world.”

Map-D is the brainchild of Todd Mostak, who created the software while taking a class in database development at MIT. By optimizing the database to run in the memory of off-the-shelf graphics processing units (GPUs), Mostak found that he could create a mini supercomputer cluster that offered an order of magnitude better performance than a database running on regular CPUs.

“Map-D is an in-memory column store coded into the onboard memory of GPUs and CPUs,” Mostak said today during a webinar on Map-D. “It’s really designed from the ground up to maximize whatever hardware it’s using, whether it’s running on Intel CPU or Nvidia GPU. It’s optimized to maximize the throughput, meaning if a GPU has this much memory bandwidth, what we really try to do is make sure we’re hitting that memory bandwidth.”

During the webinar, Mostak and Tom Graham, his fellow co-founder of the startup Map-D, demonstrated the technology’s capability to interactively analyze datasets composed of a billion individual records, constituting more than 1TB of data. The demo included a heat map of Twitter posts made from 2010 to the present. Map-D’s “TweetMap” (which the company also demonstrated at the recent SC 2013 conference) runs on eight K40 Tesla GPUs, each with 12 GB of memory, in a single node configuration.

You really need to try the TweetMap example. This rocks!

The details on TweetMap:

You can search tweet text, heatmap results, identify and animate trends, share maps and regress results against census data.

For each click Map-D scans the entire database and visualizes results in real-time. Unlike many other tweetmap demos, nothing is canned or pre-rendered. Recent tweets also stream live onto the system and are available for view within seconds of broadcast.

TweetMap is powered by 8 NVIDIA Tesla K40 GPUs with a total of 96GB of GPU memory in a single node. While we sometimes switch between interesting datasets of various size, for the most part TweetMap houses over 1 billion tweets from 2010 to the present.
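
To see why a column store maps so well onto GPU memory bandwidth, here is a stripped-down sketch of the kind of per-column filtered scan an engine like this runs on every click (my illustration, not Map-D code):

    // Count the rows of one column that satisfy a range predicate.
    // One thread per row; the reads are coalesced, streaming and bandwidth-bound.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void count_in_range(int n, const int *col, int lo, int hi,
                                   unsigned int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && col[i] >= lo && col[i] < hi)
            atomicAdd(count, 1u);   // a real engine would use per-block reductions
    }

    int main() {
        const int n = 1 << 24;                        // ~16M rows in one column
        int *col;
        unsigned int *count;
        cudaMallocManaged(&col, n * sizeof(int));
        cudaMallocManaged(&count, sizeof(unsigned int));
        for (int i = 0; i < n; ++i) col[i] = i % 100; // fake column values
        *count = 0;
        count_in_range<<<(n + 255) / 256, 256>>>(n, col, 10, 20, count);
        cudaDeviceSynchronize();
        printf("rows matching predicate: %u\n", *count);
        cudaFree(col);
        cudaFree(count);
        return 0;
    }

With memory bandwidth in the hundreds of GB/s per card, eight K40s can stream a narrow column across a billion rows in milliseconds, which is where the interactive feel comes from.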

Imagine interactive “merging” of subjects based on their properties.

Come to think of it, don’t GPUs handle edges between nodes? As in graphs? 😉

A couple of links for more information, although I suspect the list of resources on Map-D is going to grow by leaps and bounds:

Resources page (includes videos of demonstrations).

An Overview of MapD (Massively Parallel Database) by Todd Mostak. (whitepaper)

January 15, 2014

MPGraph: [GPU = 3 Billion Traversed Edges Per Second]

Filed under: GPU,Graphs,Parallel Programming — Patrick Durusau @ 3:32 pm

mpgraph Beta: Massively Parallel Graph processing on GPUs

From the webpage:

MPGraph is Massively Parallel Graph processing on GPUs.

The MPGraph API makes it easy to develop high performance graph analytics on GPUs. The API is based on the Gather-Apply-Scatter (GAS) model as used in GraphLab. To deliver high performance computation and efficiently utilize the high memory bandwidth of GPUs, MPGraph’s CUDA kernels use multiple sophisticated strategies, such as vertex-degree-dependent dynamic parallelism granularity and frontier compaction.

MPGraph is up to two orders of magnitude faster than parallel CPU implementations on up to 24 CPU cores and has performance comparable to a state-of-the-art manually optimized GPU implementation.

New algorithms can be implemented in a few hours that fully exploit the data-level parallelism of the GPU and offer throughput of up to 3 billion traversed edges per second on a single GPU.

Before some wag blows off the “3 billion traversed edges per second on a single GPU” by calling MPGraph a “graph compute engine,” consider this performance graphic:

[Image: MPGraph performance, showing BFS speedup over GraphLab. Comparison is a single NVIDIA K20 versus up to 24 CPU cores using a 3.33 GHz X5680 CPU chipset.]

Don’t let name calling keep you from seeking the graph performance you need.

Flying an F-16 requires more user skill than a VW. But when you need an F-16, don’t settle for a VW because it’s easier.
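
If you want to see roughly what the hardware is doing during those traversals, the heart of a level-synchronous BFS is an expand-the-frontier kernel. A hand-rolled sketch over a CSR graph (my own code, not MPGraph's GAS API):

    // Level-synchronous BFS on a CSR graph: one thread per frontier vertex.
    // Build with: nvcc -O2 bfs_sketch.cu
    #include <cstdio>
    #include <utility>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void bfs_level(int frontier_size, const int *frontier,
                              const int *row_offsets, const int *col_indices,
                              int *labels, int level,
                              int *next_frontier, int *next_size) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontier_size) return;
        int v = frontier[t];
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int u = col_indices[e];
            if (atomicCAS(&labels[u], -1, level + 1) == -1)   // claim unvisited neighbor
                next_frontier[atomicAdd(next_size, 1)] = u;
        }
    }

    int main() {
        // Tiny CSR graph with directed edges 0->1, 0->2, 1->3, 2->3.
        std::vector<int> row = {0, 2, 3, 4, 4}, col = {1, 2, 3, 3};
        const int n = 4;
        int *d_row, *d_col, *d_labels, *d_front, *d_next, *d_next_size;
        cudaMalloc(&d_row, row.size() * sizeof(int));
        cudaMalloc(&d_col, col.size() * sizeof(int));
        cudaMalloc(&d_labels, n * sizeof(int));
        cudaMalloc(&d_front, n * sizeof(int));
        cudaMalloc(&d_next, n * sizeof(int));
        cudaMalloc(&d_next_size, sizeof(int));
        cudaMemcpy(d_row, row.data(), row.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_col, col.data(), col.size() * sizeof(int), cudaMemcpyHostToDevice);
        std::vector<int> labels(n, -1);
        labels[0] = 0;                                    // source vertex 0 at level 0
        cudaMemcpy(d_labels, labels.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        int front_size = 1, source = 0, level = 0;
        cudaMemcpy(d_front, &source, sizeof(int), cudaMemcpyHostToDevice);
        while (front_size > 0) {
            int zero = 0;
            cudaMemcpy(d_next_size, &zero, sizeof(int), cudaMemcpyHostToDevice);
            bfs_level<<<(front_size + 255) / 256, 256>>>(front_size, d_front, d_row, d_col,
                                                         d_labels, level, d_next, d_next_size);
            cudaMemcpy(&front_size, d_next_size, sizeof(int), cudaMemcpyDeviceToHost);
            std::swap(d_front, d_next);                   // the next level's frontier
            ++level;
        }
        cudaMemcpy(labels.data(), d_labels, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int v = 0; v < n; ++v) printf("vertex %d reached at level %d\n", v, labels[v]);
        return 0;
    }

Frameworks like MPGraph add the pieces that make this fast on real graphs, such as frontier compaction and degree-aware work distribution, which is where the claimed throughput comes from.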

GTC On-Demand

Filed under: Conferences,GPU,HPC — Patrick Durusau @ 3:05 pm

GTC On-Demand

While running down presentations at prior GPU Technology Conferences, I found this gold mine of presentations and slides on GPU computing.

Counting “presentationTitle” in the page source says 385 presentations!

Enjoy!

Hardware for Big Data, Graphs and Large-scale Computation

Filed under: BigData,GPU,Graphs,NVIDIA — Patrick Durusau @ 2:51 pm

Hardware for Big Data, Graphs and Large-scale Computation by Rob Farber.

From the post:

Recent announcements by Intel and NVIDIA indicate that massively parallel computing with GPUs and Intel Xeon Phi will no longer require passing data via the PCIe bus. The bad news is that these standalone devices are still in the design phase and are not yet available for purchase. Instead of residing on the PCIe bus as a second-class system component like a disk or network controller, the new Knights Landing processor announced by Intel at ISC’13 will be able to run as a standalone processor just like a Sandy Bridge or any other multi-core CPU. Meanwhile, NVIDIA’s release of native ARM compilation in CUDA 5.5 provides a necessary next step toward Project Denver, which is NVIDIA’s integration of a 64-bit ARM processor and a GPU. This combination, termed a CP-GP (or ceepee-geepee) in the media, can leverage the energy savings and performance of both architectures.

Of course, the NVIDIA strategy also opens the door to the GPU acceleration of mobile phones and other devices in the ARM-dominated low-power, consumer and real-time markets. In the near 12- to 24-month timeframe, customers should start seeing big-memory standalone systems based on Intel and NVIDIA technology that only require power and a network connection. The need for a separate x86 computer to host one or more GPU or Intel Xeon Phi coprocessors will no longer be a requirement.

The introduction of standalone GPU and Intel Xeon Phi devices will affect the design decisions made when planning the next generation of leadership class supercomputers, enterprise data center procurements, and teraflop/s workstations. It also will affect the software view in programming these devices, because the performance limitations of the PCIe bus and the need to work with multiple memory spaces will no longer be compulsory.

Rob provides a great peek at hardware that is coming and at current high performance computing, in particular for processing graphs.

Resources mentioned in Rob’s post without links:

Rob’s Intel Xeon Phi tutorial at Dr. Dobb’s:

Programming Intel’s Xeon Phi: A Jumpstart Introduction

CUDA vs. Phi: Phi Programming for CUDA Developers

Getting to 1 Teraflop on the Intel Phi Coprocessor

Numerical and Computational Optimization on the Intel Phi

Rob’s GPU Technology Conference presentations:

Simplifying Portable Killer Apps with OpenACC and CUDA-5 Concisely and Efficiently.

Clicking GPUs into a Portable, Persistent and Scalable Massive Data Framework.

(The links are correct but put you one presentation below Rob’s. Scroll up one. Sorry. It was that or use an incorrect link to put you at the right location.)

mpgraph (part of XDATA)

Other resources you may find of interest:

Rob Farber – Dr. Dobb’s – Current article listing.

Hot-Rodding Windows and Linux App Performance with CUDA-Based Plugins by Rob Farber (with source code for Windows and Linux).

Rob Farber’s wiki: http://gpucomputing.net/ (Warning: The site seems to be flaky. If it doesn’t load, try again.)

OpenCL (Khronos)

Rob Farber’s Code Project tutorials:

(Part 9 was published in February of 2012. Some updating may be necessary.)

December 31, 2013

Augur:…

Filed under: Bayesian Models,GPU,Machine Learning,Probabilistic Programming,Scala — Patrick Durusau @ 2:40 pm

Augur: a Modeling Language for Data-Parallel Probabilistic Inference by Jean-Baptiste Tristan, et al.

Abstract:

It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. In this paper, we present a probabilistic programming language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. Our language is fully integrated within the Scala programming language and benefits from tools such as IDE support, type-checking, and code completion. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

A very good paper but the authors should highlight the caveat in the introduction:

We claim that many MCMC inference algorithms are highly data-parallel (Hillis & Steele, 1986; Blelloch, 1996) if we take advantage of the conditional independence relationships of the input model (e.g. the assumption of i.i.d. data makes the likelihood independent across data points).

(Where i.i.d. = Independent and identically distributed random variables.)
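
Spelled out (my gloss, not the paper's notation), the i.i.d. assumption is what factorizes the likelihood into per-data-point terms:

    p(x_1, \ldots, x_N \mid \theta) \;=\; \prod_{i=1}^{N} p(x_i \mid \theta)
    \quad\Longrightarrow\quad
    \log p(x_1, \ldots, x_N \mid \theta) \;=\; \sum_{i=1}^{N} \log p(x_i \mid \theta)

Each log-likelihood term can be evaluated by its own GPU thread (or block), and the sum is a standard parallel reduction.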

That assumption does allow for parallel processing, but users should be cautious about accepting assumptions about data.

The algorithms will still work, even if your assumptions about the data are incorrect.

But the answer you get may not be as useful as you would like.

I first saw this in a tweet by Stefano Bertolo.

Efficient Large-Scale Graph Processing…

Filed under: GPU,Graphs — Patrick Durusau @ 2:05 pm

Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems by Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrao Costa, and Matei Ripeanu.

Abstract:

The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but also most graph algorithms entail memory access patterns with poor locality, data-dependent parallelism and a low compute-to-memory access ratio. Moreover, most real-world graphs have a highly heterogeneous node degree distribution, hence partitioning these graphs for parallel processing and simultaneously achieving access locality and load-balancing is difficult.

This work starts from the hypothesis that hybrid platforms (e.g., GPU-accelerated systems) have both the potential to cope with the heterogeneous structure of real graphs and to offer a cost-effective platform for high-performance graph processing. This work assesses this hypothesis and presents an extensive exploration of the opportunity to harness hybrid systems to process large-scale graphs efficiently. In particular, (i) we present a performance model that estimates the achievable performance on hybrid platforms; (ii) informed by the performance model, we design and develop TOTEM – a processing engine that provides a convenient environment to implement graph algorithms on hybrid platforms; (iii) we show that further performance gains can be extracted using partitioning strategies that aim to produce partitions that each match the strengths of the processing element it is allocated to; finally, (iv) we demonstrate the performance advantages of the hybrid system through a comprehensive evaluation that uses real and synthetic workloads (as large as 16 billion edges), multiple graph algorithms that stress the system in various ways, and a variety of hardware configurations.

Graph processing that avoids the problems with clusters by using a single node.

Yes, a single node. Best to avoid this solution if you are a DoD contractor. 😉

If you are not a DoD (or NSA) contractor, the Totem project (subject of this paper), describes itself this way:

The goal of this project is to understand the challenges in supporting graph algorithms on commodity, hybrid platforms; platforms that consist of processors optimized for sequential processing and accelerators optimized for massively-parallel processing.

This will fill the gap between current graph processing platforms that are either expensive (e.g., supercomputers) or inefficient (e.g., commodity clusters). Our hypothesis is that hybrid platforms (e.g., GPU-supported large-memory nodes and GPU supported clusters) can bridge the performance-cost chasm, and offer an attractive graph-processing solution for many graph-based applications such as social networks and web analysis.

If you are facing performance-cost issues with graph processing, this is definitely research you need to be watching.

Totem software is available for downloading.

I first saw this in a tweet by Stefano Bertolo.

November 8, 2013

ParLearning 2014

ParLearning 2014: The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics.

Dates:

Workshop Paper Due: December 30, 2013
Author Notification: February 14, 2014
Camera-ready Paper Due: March 14, 2014
Workshop: May 23, 2014 Phoenix, AZ, USA

From the webpage:

Data-driven computing needs no introduction today. The case for using data for strategic advantages is exemplified by web search engines, online translation tools and many more examples. The past decade has seen 1) the emergence of multicore architectures and accelerators as GPGPUs, 2) widespread adoption of distributed computing via the map-reduce/hadoop eco-system and 3) democratization of the infrastructure for processing massive datasets ranging into petabytes by cloud computing. The complexity of the technological stack has grown to an extent where it is imperative to provide frameworks to abstract away the system architecture and orchestration of components for massive-scale processing. However, the growth in volume and heterogeneity in data seems to outpace the growth in computing power. A “collect everything” culture stimulated by cheap storage and ubiquitous sensing capabilities contribute to increasing the noise-to-signal ratio in all collected data. Thus, as soon as the data hits the processing infrastructure, determining the value of information, finding its rightful place in a knowledge representation and determining subsequent actions are of paramount importance. To use this data deluge to our advantage, a convergence between the field of Parallel and Distributed Computing and the interdisciplinary science of Artificial Intelligence seems critical. From application domains of national importance as cyber-security, health-care or smart-grid to providing real-time situational awareness via natural interface based smartphones, the fundamental AI tasks of Learning and Inference need to be enabled for large-scale computing across this broad spectrum of application domains.

Many of the prominent algorithms for learning and inference are notorious for their complexity. Adopting parallel and distributed computing appears as an obvious path forward, but the mileage varies depending on how amenable the algorithms are to parallel processing and secondly, the availability of rapid prototyping capabilities with low cost of entry. The first issue represents a wider gap as we continue to think in a sequential paradigm. The second issue is increasingly recognized at the level of programming models, and building robust libraries for various machine-learning and inferencing tasks will be a natural progression. As an example, scalable versions of many prominent graph algorithms written for distributed shared memory architectures or clusters look distinctly different from the textbook versions that generations of programmers have grown with. This reformulation is difficult to accomplish for an interdisciplinary field like Artificial Intelligence for the sheer breadth of the knowledge spectrum involved. The primary motivation of the proposed workshop is to invite leading minds from AI and Parallel & Distributed Computing communities for identifying research areas that require most convergence and assess their impact on the broader technical landscape.

Taking full advantage of parallel processing remains a distant goal. This workshop looks like a good concrete step towards that goal.

November 3, 2013

A multi-Teraflop Constituency Parser using GPUs

Filed under: GPU,Grammar,Language,Parsers,Parsing — Patrick Durusau @ 4:45 pm

A multi-Teraflop Constituency Parser using GPUs by John Canny, David Hall and Dan Klein.

Abstract:

Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.

Just in case you are interested in parsing “unstructured” data, mostly what is also called “texts.”

I first saw the link BIDParse: GPU-accelerated natural language parser at hgpu.org. Then I started looking for the paper. 😉

Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud

Filed under: Cloud Computing,GPU,Graphs — Patrick Durusau @ 4:28 pm

Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud by Jianlong Zhong and Bingsheng He.

Abstract:

Recently, we have witnessed that cloud providers start to offer heterogeneous computing environments. There have been wide interests in both cluster and cloud of adopting graphics processors (GPUs) as accelerators for various applications. On the other hand, large-scale processing is important for many data-intensive applications in the cloud. In this paper, we propose to leverage GPUs to accelerate large-scale graph processing in the cloud. Specifically, we develop an in-memory graph processing engine G2 with three non-trivial GPU-specific optimizations. Firstly, we adopt fine-grained APIs to take advantage of the massive thread parallelism of the GPU. Secondly, G2 embraces a graph partition based approach for load balancing on heterogeneous CPU/GPU architectures. Thirdly, a runtime system is developed to perform transparent memory management on the GPU, and to perform scheduling for an improved throughput of concurrent kernel executions from graph tasks. We have conducted experiments on a local cluster of three nodes and an Amazon EC2 virtual cluster of eight nodes. Our preliminary results demonstrate that 1) GPU is a viable accelerator for cloud-based graph processing, and 2) the proposed optimizations further improve the performance of GPU-based graph processing engine.

GPUs in the cloud anyone?

The future of graph computing isn’t clear but it certainly promises to be interesting!

I first saw this in a tweet by Stefano Bertolo.

October 17, 2013

cudaMap:…

Filed under: Bioinformatics,CUDA,Genomics,GPU,NVIDIA,Topic Map Software,Topic Maps — Patrick Durusau @ 3:16 pm

cudaMap: a GPU accelerated program for gene expression connectivity mapping by Darragh G McArt, Peter Bankhead, Philip D Dunne, Manuel Salto-Tellez, Peter Hamilton, Shu-Dong Zhang.

Abstract:

BACKGROUND: Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take > 2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping.

RESULTS: cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance.

CONCLUSION: Emerging ‘omics’ technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.

Or to put that in lay terms, the goal is to establish the connections between human diseases, genes that underlie them and drugs that treat them.

Going from several days to ten (10) minutes is quite a gain in performance.

This is processing of experimental data but is it a window into techniques for scaling topic maps?

I first saw this in a tweet by Stefano Bertolo.

September 27, 2013

Large Graphs on multi-GPUs

Filed under: GPU,Graphs — Patrick Durusau @ 4:05 pm

Large Graphs on multi-GPUs by Enrico Mastrostefano.

Abstract:

We studied the problem of developing an efficient BFS algorithm to explore large graphs having billions of nodes and edges. The size of the problem requires a parallel computing architecture. We proposed a new algorithm that performs a distributed BFS and the corresponding implementation on multiGPUs clusters. As far as we know, this is the first attempt to implement a distributed graph algorithm on that platform. Our study shows how most straightforward BFS implementations present significant computation and communication overheads. The main reason is that, at each iteration, the number of processed edges is greater than the number actually needed to determine the parent or the distance array (the standard output of the BFS): there is always redundant information at each step.

Reducing as much as possible this redundancy is essential in order to improve performances by minimizing the communication overhead. To this purpose, our algorithm performs, at each BFS level, a pruning procedure on the set of nodes that will be visited (NLFS). This step reduces both the amount of work required to enqueue new vertices and the size of messages exchanged among different tasks. To implement this pruning procedure efficiently is not trivial: none of the earlier works on GPU tackled that problem directly. The main issue being how to employ a sufficient large number of threads and balance their workload, to fully exploit the GPU computing power.

To that purpose, we developed a new mapping of data elements to CUDA threads that uses a binary search function at its core. This mapping permits to process the entire Next Level Frontier Set by mapping each element of the set to one CUDA thread (perfect load-balancing) so the available parallelism is exploited at its best. This mapping allows for an efficient filling of a global array that, for each BFS level, contains all the neighbors of the vertices in the queue as required by the pruning procedure (based on sort and unique operations) of the array.

This mapping is a substantial contribution of our work: it is quite simple and general and can be used in different contexts. We wish to highlight that it is this operation (and not the sorting) that makes possible to exploit at its best the computing power of the GPU. To speed up the sort and unique operations we rely on very efficient implementations (like the radix sort) available in the CUDA Thrust library. We have shown that our algorithm has good scaling properties and, with 128 GPUs, it can traverse 3 billion edges per second (3 GTEPS for an input graph with 228 vertices). By comparing our results with those obtained on different architectures we have shown that our implementation is better or comparable to state-of-the-art implementations.

Among the operations that are performed during the BFS, the pruning of the NLFS is the most expensive in terms of execution time. Moreover, the overall computational time is greater than the time spent in communications. Our experiments show that the ratio between the time spent in computation and the time spent in communication reduces by increasing the number of tasks. For instance, with 4 GPUs the ratio is 2.125 whereas by using 64 GPUs the value is 1.12. The result can be explained as follows. In order to process the largest possible graph, the memory of each GPU is fully used and thus the subgraph assigned to each processor has a maximum (fixed) size. When the graph size increases we use more GPUs and the number of messages exchanged among nodes increases accordingly.

To maintain a good scalability using thousands GPUs we need to further improve the communication mechanism that is, in the present implementation, quite simple. To this purpose, many studies employed a 2D partitioning of the graph to reduce the number of processors involved in communication. Such partitioning could be, in principle, implemented in our code and it will be the subject of a future work. (paragraphing was inserted into the abstract for readability)

Without any paragraph breaks the abstract was very difficult to read. Apologies if I have incorrectly inserted paragraph breaks.

If you have access to multiple GPUs, this should be very interesting work.
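
The thread mapping is the part worth seeing in code. The usual way to phrase it (and, as far as I can tell from the abstract, the spirit of the author's mapping) is: take an exclusive prefix sum over the degrees of the frontier vertices, launch one thread per edge, and let each thread binary-search the prefix sum to find the vertex that owns its edge. A sketch, not the thesis code:

    // One thread per frontier *edge*, balanced regardless of degree skew.
    // Build with: nvcc -O2 balanced_expand.cu
    #include <cstdio>
    #include <vector>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/scan.h>

    // Largest f such that deg_prefix[f] <= edge_id (deg_prefix is an exclusive scan).
    __device__ int owner_vertex(const int *deg_prefix, int frontier_size, int edge_id) {
        int lo = 0, hi = frontier_size - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (deg_prefix[mid] <= edge_id) lo = mid; else hi = mid - 1;
        }
        return lo;
    }

    __global__ void expand_edges(int total_edges, int frontier_size,
                                 const int *frontier, const int *deg_prefix,
                                 const int *row_offsets, const int *col_indices,
                                 int *neighbors_out) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= total_edges) return;
        int f = owner_vertex(deg_prefix, frontier_size, e);       // which frontier slot
        int v = frontier[f];                                      // the owning vertex
        int offset = e - deg_prefix[f];                           // which of v's edges
        neighbors_out[e] = col_indices[row_offsets[v] + offset];  // gather the neighbor
        // A sort + unique pass over neighbors_out (e.g. with Thrust) would then
        // prune duplicates: the NLFS pruning described in the abstract.
    }

    int main() {
        // CSR graph: 0 -> {1,2,3}, 1 -> {3}; current frontier = {0, 1}.
        std::vector<int> h_row = {0, 3, 4, 4, 4}, h_col = {1, 2, 3, 3};
        std::vector<int> h_frontier = {0, 1}, h_deg = {3, 1};     // frontier degrees
        thrust::device_vector<int> row(h_row), col(h_col);
        thrust::device_vector<int> frontier(h_frontier), deg(h_deg);
        thrust::device_vector<int> deg_prefix(h_deg.size());
        thrust::exclusive_scan(deg.begin(), deg.end(), deg_prefix.begin());  // {0, 3}
        const int total_edges = 4;                                // 3 + 1
        thrust::device_vector<int> out(total_edges);
        expand_edges<<<1, 256>>>(total_edges, (int)h_frontier.size(),
                                 thrust::raw_pointer_cast(frontier.data()),
                                 thrust::raw_pointer_cast(deg_prefix.data()),
                                 thrust::raw_pointer_cast(row.data()),
                                 thrust::raw_pointer_cast(col.data()),
                                 thrust::raw_pointer_cast(out.data()));
        thrust::host_vector<int> h_out = out;
        for (int i = 0; i < total_edges; ++i)
            printf("edge %d -> neighbor %d\n", i, h_out[i]);
        return 0;
    }

Because every thread gets exactly one edge, a single high-degree vertex no longer stalls its neighbors' threads, which is the load-balancing point the abstract makes.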

May 11, 2013

Medusa: Simplified Graph Processing on GPUs

Filed under: GPU,Graphs,Networks,Parallel Programming — Patrick Durusau @ 12:30 pm

Medusa: Simplified Graph Processing on GPUs by Jianlong Zhong, Bingsheng He.

Abstract:

Graphs are the de facto data structures for many applications, and efficient graph processing is a must for the application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest path. However, it is difficult to write correct and efficient GPU programs and even more difficult for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs, and embraces a runtime system to automatically execute those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architecture features of GPU for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that (1) Medusa greatly simplifies implementation of GPGPU programs for graph processing, with much fewer lines of source code written by developers; (2) The optimization techniques significantly improve the performance of the runtime system, making its performance comparable with or better than the manually tuned GPU graph operations.

Just in case you are interested in high performance graph processing. 😉

GPU Scripting and Code Generation with PyCUDA

Filed under: CUDA,GPU,Python — Patrick Durusau @ 10:47 am

GPU Scripting and Code Generation with PyCUDA by Andreas Klockner, Nicolas Pinto, Bryan Catanzaro, Yunsup Lee, Paul Ivanov, Ahmed Fasih.

Abstract:

High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyCUDA is a package that attempts to join the two together. This chapter argues that in doing so, a programming environment is created that is greater than just the sum of its two parts. We would like to note that nearly all of this chapter applies in unmodified form to PyOpenCL, a sister project of PyCUDA, whose goal it is to realize the same concepts as PyCUDA for OpenCL.

The authors argue that while measurements of the productivity gains from PyCUDA are missing, widespread use of PyCUDA is an indication of its usefulness.

Point taken.

More importantly, in my view, is PyCUDA’s potential to make use of GPUs more widespread.

Widespread use will uncover better algorithms, data structures, appropriate problems for GPUs, etc., potentially more quickly than occasional use.

Massively Parallel Suffix Array Queries…

Filed under: GPU,Parallel Programming,Suffix Array,Translation — Patrick Durusau @ 10:27 am

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.

Abstract:

Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.

Data-rich astronomy: mining synoptic sky surveys [Data Bombing]

Filed under: Astroinformatics,BigData,Data Mining,GPU — Patrick Durusau @ 10:11 am

Data-rich astronomy: mining synoptic sky surveys by Stefano Cavuoti.

Abstract:

In the last decade a new generation of telescopes and sensors has allowed the production of a very large amount of data and astronomy has become a data-rich science; this transition is often labeled as: “data revolution” and “data tsunami”. The first locution puts emphasis on the expectations of the astronomers while the second stresses, instead, the dramatic problem arising from this large amount of data: which is no longer computable with traditional approaches to data storage, data reduction and data analysis. In a new age, new instruments are necessary, as it happened in the Bronze age when mankind left the old instruments made out of stone to adopt the new, better ones made with bronze. Everything changed, even the social structure. In a similar way, this new age of Astronomy calls for a new generation of tools, for a new methodological approach to many problems, and for the acquisition of new skills. The attempts to find a solution to these problems fall under the umbrella of a new discipline which originated by the intersection of astronomy, statistics and computer science: Astroinformatics (Borne, 2009; Djorgovski et al., 2006).

This is the dissertation by the same Stefano Cavuoti who wrote Astrophysical data mining with GPU… (below).

Along with every new discipline comes semantics that are transparent to insiders and opaque to others.

Not out of malice but economy. Why explain a term if all those attending the discussion understand what it means?

But that lack of explanation, like our current ignorance about the means used to construct the pyramids, can come back to bite you.

In some cases far more quickly than intellectual curiosity about ancient monuments by the tin hat crowd.

Take the continuing failure of data integration by the U.S. intelligence services for example.

Rather than the current mule-like resistance to sharing, I would data bomb the other intelligence services with incompatible data exports every week.

Full sharing, for all they would be able to do with it.

Unless they had a topic map.

May 6, 2013

Virtual School summer courses…

Filed under: BigData,CS Lectures,GPU,Multi-Core — Patrick Durusau @ 7:02 pm

Virtual School summer courses on data-intensive and many-core computing

From the webpage:

Graduate students, post-docs and professionals from academia, government, and industry are invited to sign up now for two summer school courses offered by the Virtual School of Computational Science and Engineering.

These Virtual School courses will be delivered to sites nationwide using high-definition videoconferencing technologies, allowing students to participate at a number of convenient locations where they will be able to work with a cohort of fellow computational scientists, have access to local experts, and interact in real time with course instructors.

The Data Intensive Summer School focuses on the skills needed to manage, process, and gain insight from large amounts of data. It targets researchers from the physical, biological, economic, and social sciences who need to deal with large collections of data. The course will cover the nuts and bolts of data-intensive computing, common tools and software, predictive analytics algorithms, data management, and non-relational database models.

(…)

For more information about the Data-Intensive Summer School, including pre-requisites and course topics, visit http://www.vscse.org/summerschool/2013/bigdata.html.

The Proven Algorithmic Techniques for Many-core Processors summer school will present students with the seven most common and crucial algorithm and data optimization techniques to support successful use of GPUs for scientific computing.

Studying many current GPU computing applications, the course instructors have learned that the limits of an application’s scalability are often related to some combination of memory bandwidth saturation, memory contention, imbalanced data distribution, or data structure/algorithm interactions. Successful GPU application developers often adjust their data structures and problem formulation specifically for massive threading and execute their threads leveraging shared on-chip memory resources for bigger impact. The techniques presented in the course can improve performance of applicable kernels by 2-10X in current processors while improving future scalability.

(…)

For more information about the Proven Algorithmic Techniques for Many-core Processors course, including pre-requisites and course topics, visit http://www.vscse.org/summerschool/2013/manycore.html.

Think of it as summer camp. For $100 (waived at some locations), it would be hard to do better.

May 3, 2013

Introduction to Parallel Programming

Filed under: GPU,NVIDIA,Parallel Programming — Patrick Durusau @ 1:44 pm

Introduction to Parallel Programming by John Owens, David Luebke, Cheng-Han Lee and Mike Roberts. (UDACITY)

Class Summary:

Learn the fundamentals of parallel computing with the GPU and the CUDA programming environment! In this class, you’ll learn about parallel programming by coding a series of image processing algorithms, such as you might find in Photoshop or Instagram. You’ll be able to program and run your assignments on high-end GPUs, even if you don’t own one yourself.

What Should I Know?

We expect students to have a solid experience with the C programming language and basic knowledge of data structures and algorithms.

What Will I Learn?

You’ll master the fundamentals of massively parallel computing by using CUDA C/C++ to program modern GPUs. You’ll learn the GPU programming model and architecture, key algorithms and parallel programming patterns, and optimization techniques. Your assignments will illustrate these concepts through image processing applications, but this is a parallel computing course and what you learn will translate to any application domain. Most of all we hope you’ll learn how to think in parallel.
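
To give a flavor of what those image-processing assignments look like, the canonical first exercise in this kind of course is a one-thread-per-pixel kernel such as RGBA-to-greyscale. A rough sketch (not the actual course starter code):

    // RGBA to greyscale, one thread per pixel, using the usual luma weights.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void rgba_to_grey(const uchar4 *rgba, unsigned char *grey,
                                 int rows, int cols) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x >= cols || y >= rows) return;
        uchar4 p = rgba[y * cols + x];
        grey[y * cols + x] =
            (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }

    int main() {
        const int rows = 480, cols = 640, n = rows * cols;
        uchar4 *rgba;
        unsigned char *grey;
        cudaMallocManaged(&rgba, n * sizeof(uchar4));
        cudaMallocManaged(&grey, n * sizeof(unsigned char));
        for (int i = 0; i < n; ++i) rgba[i] = make_uchar4(200, 100, 50, 255); // synthetic image
        dim3 block(16, 16);
        dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
        rgba_to_grey<<<grid, block>>>(rgba, grey, rows, cols);
        cudaDeviceSynchronize();
        printf("grey[0] = %d\n", grey[0]);   // 0.299*200 + 0.587*100 + 0.114*50 is about 124
        cudaFree(rgba);
        cudaFree(grey);
        return 0;
    }

Every pixel is independent, so the mapping from image to thread grid is the whole trick; the later assignments (blur, histogram, and so on) add the shared-memory and reduction patterns the course description mentions.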

In Fast Database Emerges from MIT Class… [Think TweetMap] you read about a new SQL database based on GPUs.

What new approach is going to emerge from your knowing more about GPUs and parallel programming?

April 24, 2013

History of the Modern GPU Series

Filed under: GPU,Programming — Patrick Durusau @ 6:06 pm

History of the Modern GPU Series

From the post:

Graham Singer over at Techspot posted a series of articles a few weeks ago covering the history of the modern GPU. It is well-written and in-depth.

For GPU aficionados, this is a nice read. There are 4 parts to the series:

  1. Part 1: (1976 – 1995) The Early Days of 3D Consumer Graphics
  2. Part 2: (1995 – 1999) 3Dfx Voodoo: The Game-changer
  3. Part 3: (2000 – 2006) The Nvidia vs. ATI Era Begins
  4. Part 4: (2006 – 2013) The Modern GPU: Stream processing units a.k.a. GPGPU

Just in case you are excited about the GPU news reported below, a bit of history might not hurt.

😉

Fast Database Emerges from MIT Class… [Think TweetMap]

Filed under: GPU,MapD,SQL — Patrick Durusau @ 4:39 pm

Fast Database Emerges from MIT Class, GPUs and Student’s Invention by Ian B. Murphy.

Details the invention of MapD by Todd Mostak.

From the post:

MapD, At A Glance:

MapD is a new database in development at MIT, created by Todd Mostak.

  • MapD stands for “massively parallel database.”
  • The system uses graphics processing units (GPUs) to parallelize computations. Some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce.
  • A MapD server costs around $5,000 and runs on the same power as five light bulbs.
  • MapD runs at between 1.4 and 1.5 teraflops, roughly equal to the fastest supercomputer in 2000.
  • MapD uses SQL to query data.
  • Mostak intends to take the system open source sometime in the next year.

Sam Madden (MIT) describes MapD this way:

Madden said there are three elements that make Mostak’s database a disruptive technology. The first is the millisecond response time for SQL queries across “huge” datasets. Madden, who was a co-creator of the Vertica columnar database, said MapD can do in milliseconds what Vertica can do in minutes. That difference in speed is everything when doing iterative research, he said.

The second is the very tight coupling between data processing and visually rendering the data; this is a byproduct of building the system from GPUs from the beginning. That adds the ability to visualize the results of the data processing in under a second. Third is the cost to build the system. MapD runs in a server that costs around $5,000.

“He can do what a 1000 node MapReduce cluster would do on a single processor for some of these applications,” Madden said.

Not a lot of technical detail but you could start learning CUDA while waiting for the open source release.

At 1.4 to 1.5 teraflops on $5,000 worth of hardware, how will clusters retain their customer base?

Welcome to TweetMap ALPHA

Filed under: GPU,Maps,SQL,Tweets — Patrick Durusau @ 3:57 pm

Welcome to TweetMap ALPHA

From the introduction popup:

TweetMap is an instance of MapD, a massively parallel database platform being developed through a collaboration between Todd Mostak (currently a researcher at MIT) and the Harvard Center for Geographic Analysis (CGA).

The tweet database presented here starts on 12/10/2012 and ends 12/31/2012. Currently 95 million tweets are available to be queried by time, space, and keyword. This could increase to billions and we are working on real time streaming from tweet-tweeted to tweet-on-the-map in under a second.

MapD is a general purpose SQL database that can be used to provide real-time visualization and analysis of just about any very large data set. MapD makes use of commodity Graphic Processing Units (GPUs) to parallelize hard compute jobs such as that of querying and rendering very large data sets on-the-fly.

This is a real treat!

Try something popular, like “gaga,” without the quotes.

Remember this is running against 95 million tweets.

Impressive! Yes?

April 9, 2013

Astrophysical data mining with GPU…

Filed under: Astroinformatics,BigData,Data Mining,Genetic Algorithms,GPU — Patrick Durusau @ 10:02 am

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters by Stefano Cavuoti, Mauro Garofalo, Massimo Brescia, Maurizio Paolillo, Antonio Pescapé, Giuseppe Longo, Giorgio Ventre.

Abstract:

We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.

BTW, DAMEWARE (DAta Mining Web Application REsource): http://dame.dsf.unina.it/beta_info.html.

In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.
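
The “inherently parallel” part is easy to picture: every thread can score one candidate solution independently, and selection, crossover and mutation follow. A toy sketch of the fitness-evaluation step (mine, not the GAME/DAMEWARE code):

    // Each thread scores one chromosome against a toy objective
    // (squared distance to a target vector); higher fitness is better.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void evaluate_fitness(int pop_size, int genes,
                                     const float *population,
                                     const float *target, float *fitness) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= pop_size) return;
        float err = 0.0f;
        for (int g = 0; g < genes; ++g) {
            float d = population[i * genes + g] - target[g];
            err += d * d;
        }
        fitness[i] = -err;
    }

    int main() {
        const int pop_size = 4096, genes = 32;
        float *pop, *target, *fit;
        cudaMallocManaged(&pop, pop_size * genes * sizeof(float));
        cudaMallocManaged(&target, genes * sizeof(float));
        cudaMallocManaged(&fit, pop_size * sizeof(float));
        for (int g = 0; g < genes; ++g) target[g] = 1.0f;
        for (int i = 0; i < pop_size * genes; ++i) pop[i] = (i % 7) * 0.25f;  // fake genomes
        evaluate_fitness<<<(pop_size + 255) / 256, 256>>>(pop_size, genes, pop, target, fit);
        cudaDeviceSynchronize();
        printf("fitness[0] = %f\n", fit[0]);
        cudaFree(pop);
        cudaFree(target);
        cudaFree(fit);
        return 0;
    }

Since fitness evaluation dominates the training cost, parallelizing just this step is where most of the reported 200x comes from.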

Makes me curious about the “noise” in other communications.

The “signal” is fairly easy to identify in astronomy, but what about in text or speech?

I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?

Or noise in a conversation that is clearly audible?

If we have 100% signal, how do we explain failing to understand a message in speech or writing?

If it is not “noise,” then what is the problem?

April 7, 2013

Deploying Graph Algorithms on GPUs: an Adaptive Solution

Filed under: Algorithms,GPU,Graphs,Networks — Patrick Durusau @ 5:46 am

Deploying Graph Algorithms on GPUs: an Adaptive Solution by Da Li and Michela Becchi. (27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013)

From the post:

Thanks to their massive computational power and their SIMT computational model, Graphics Processing Units (GPUs) have been successfully used to accelerate a wide variety of regular applications (linear algebra, stencil computations, image processing and bioinformatics algorithms, among others). However, many established and emerging problems are based on irregular data structures, such as graphs. Examples can be drawn from different application domains: networking, social networking, machine learning, electrical circuit modeling, discrete event simulation, compilers, and computational sciences. It has been shown that irregular applications based on large graphs do exhibit runtime parallelism; moreover, the amount of available parallelism tends to increase with the size of the datasets. In this work, we explore an implementation space for deploying a variety of graph algorithms on GPUs. We show that the dynamic nature of the parallelism that can be extracted from graph algorithms makes it impossible to find an optimal solution. We propose a runtime system able to dynamically transition between different implementations with minimal overhead, and investigate heuristic decisions applicable across algorithms and datasets. Our evaluation is performed on two graph algorithms: breadth-first search and single-source shortest paths. We believe that our proposed mechanisms can be extended and applied to other graph algorithms that exhibit similar computational patterns.

A development that may surprise some graph software vendors: there are “no optimal solution[s] across graph problems and datasets” for graph algorithms on GPUs.

This paper points towards an adaptive technique that may prove to be “resilient to the irregularity and heterogeneity of real world graphs.”

I first saw this in a tweet by Stefano Bertolo.

April 2, 2013

STScI’s Engineering and Technology Colloquia

Filed under: Astroinformatics,GPU,Image Processing,Knowledge Management,Visualization — Patrick Durusau @ 5:49 am

STScI’s Engineering and Technology Colloquia Series Webcasts by Bruce Berriman.

From the post:

Last week, I wrote a post about Michelle Borkin’s presentation on Astronomical Medicine and Beyond, part of the Space Telescope Science Institute’s (STScI) Engineering and Technology Colloquia Series. STScI archives and posts on-line all the presentations in this series. The talks go back to 2008 (with one earlier one dating to 2001), are generally given monthly or quarterly, and represent a rich source of information on many aspects of engineering and technology. The archive includes, where available, abstracts, Power Point Slides, videos for download, and for the more recent presentations, webcasts.

Definitely an astronomy/space flavor but also includes:

Scientific Data Visualization by Adam Bly (Visualizing.org, Seed Media Group).

Knowledge Retention & Transfer: What You Need to Know by Jay Liebowitz (UMUC).

Fast Parallel Processing Using GPUs for Accelerating Image Processing by Tom Reed (Nvidia Corporation).

Every field is struggling with the same data/knowledge issues, often using different terminologies or examples.

We can all struggle separately or we can learn from others.

Which approach do you use?

March 23, 2013

Duplicate Detection on GPUs

Filed under: Duplicates,GPU,Record Linkage — Patrick Durusau @ 7:01 pm

Duplicate Detection on GPUs by Benedikt Forchhammer, Thorsten Papenbrock, Thomas Stening, Sven Viehmeier, Uwe Draisbach, Felix Naumann.

Abstract:

With the ever increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware.

Synonyms: Duplicate detection = entity matching = record linkage (and all the other alternatives for those terms).
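
The attribute-wise similarity stage is the part that parallelizes most naturally: one thread per candidate pair. A stripped-down sketch that scores pairs by the fraction of matching attribute hashes (my illustration, not the authors' workflow, which uses richer similarity measures):

    // One thread per candidate pair of records; similarity = fraction of
    // attribute hashes that match exactly. Threshold the scores afterwards
    // to flag likely duplicates.
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void pair_similarity(int num_pairs, int num_attrs,
                                    const int2 *candidate_pairs,       // (record a, record b)
                                    const unsigned int *records,       // num_records x num_attrs hashes
                                    float *similarity) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= num_pairs) return;
        const unsigned int *a = records + candidate_pairs[p].x * num_attrs;
        const unsigned int *b = records + candidate_pairs[p].y * num_attrs;
        int matches = 0;
        for (int k = 0; k < num_attrs; ++k)
            if (a[k] == b[k]) ++matches;
        similarity[p] = (float)matches / num_attrs;
    }

    int main() {
        const int num_records = 4, num_attrs = 3, num_pairs = 2;
        unsigned int h_records[num_records * num_attrs] = {
            11, 22, 33,    // record 0
            11, 22, 34,    // record 1: two of three attributes match record 0
            50, 60, 70,    // record 2
            51, 60, 70 };  // record 3: two of three attributes match record 2
        int2 h_pairs[num_pairs] = { make_int2(0, 1), make_int2(2, 3) };
        unsigned int *records;
        int2 *pairs;
        float *sim;
        cudaMallocManaged(&records, sizeof(h_records));
        cudaMallocManaged(&pairs, sizeof(h_pairs));
        cudaMallocManaged(&sim, num_pairs * sizeof(float));
        memcpy(records, h_records, sizeof(h_records));
        memcpy(pairs, h_pairs, sizeof(h_pairs));
        pair_similarity<<<1, 256>>>(num_pairs, num_attrs, pairs, records, sim);
        cudaDeviceSynchronize();
        for (int p = 0; p < num_pairs; ++p) printf("pair %d: similarity %.2f\n", p, sim[p]);
        cudaFree(records);
        cudaFree(pairs);
        cudaFree(sim);
        return 0;
    }

The pair-selection and clustering stages the abstract mentions are harder to parallelize well, which is where the paper's redesign earns its 10x.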

This looks wicked cool!

I first saw this in a tweet by Stefano Bertolo.

February 20, 2013

Graph Databases, GPUs, and Graph Analytics

Filed under: GPU,Graph Analytics,Graphs — Patrick Durusau @ 9:25 pm

Graph Databases, GPUs, and Graph Analytics by Bryan Thompson.

From the post:

For people who want to track what we’ve been up to on the XDATA project, there are three survey articles that we’ve produced:

Literature Review of Graph Databases (Bryan Thompson, SYSTAP)
Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)

Simply awesome reading.

It may be too early to take off for a weekend of reading but I wish….

January 10, 2013

Getting Started with ArrayFire – a 30-minute Jump Start

Filed under: GPU,HPC — Patrick Durusau @ 1:46 pm

Getting Started with ArrayFire – a 30-minute Jump Start

From the post:

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts.

ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax for ArrayFire. He also provided an overview of existing efforts in GPU software, and compared them to the extensive capabilities of ArrayFire.

If you need to push the limits of current performance, GPUs are one way to go.

Maybe 2013 will be your GPU year!
