Archive for the ‘HPC’ Category

Why Would #1 Spy on #2?

Wednesday, May 29th, 2013

Confirmation: China has a 50+ Petaflop system.

That confirmation casts even more doubt on the constant drum roll of “China spying on the U.S.” allegations.

Who wants to spy on second place technology?

The further U.S.-based technology falls behind, due to the lack of investment in R&D by government and industry, the more you can expect the hysterical accusations against China and others to ramp up.

Can’t possibly be that three-month profit goals and lower government spending led to a self-inflicted lack of R&D.

Must be someone stealing the technology we didn’t invest to invent. Has to be. ;-)

The new Chinese system is a prick to the delusional American Exceptionalism balloon.

There will be others.

High-Performance and Parallel Computing with R

Tuesday, April 9th, 2013

High-Performance and Parallel Computing with R by Dirk Eddelbuettel.

From the webpage:

This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing (HPC) with R. In this context, we are defining ‘high-performance computing’ rather loosely as just about anything related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling.

Here you will find R packages for:

  • Explicit parallelism
  • Implicit parallelism
  • Grid computing
  • Hadoop
  • Random numbers
  • Resource managers and batch schedulers
  • Applications
  • GPUs
  • Large memory and out-of-memory data
  • Easier interfaces for Compiled code
  • Profiling tools

Despite HPC advances over the last decade, semantics remain an unsolved problem.

Perhaps raw computational capacity isn’t the key to semantics.

If not, some different approach is waiting to be discovered.

I first saw this in a tweet by One R Tip a Day.

FLOPS Fall Flat for Intelligence Agency

Friday, March 29th, 2013

FLOPS Fall Flat for Intelligence Agency by Nicole Hemsoth.

From the post:

The Intelligence Advanced Research Projects Activity (IARPA) is putting out some RFI feelers in hopes of pushing new boundaries with an HPC program. However, at the core of their evaluation process is an overt dismissal of current popular benchmarks, including floating operations per second (FLOPS).

To uncover some missing pieces for their growing computational needs, IARPA is soliciting for “responses that illuminate the breadth of technologies” under the HPC umbrella, particularly the tech that “isn’t already well-represented in today’s HPC benchmarks.”

The RFI points to the general value of benchmarks (Linpack, for instance) as necessary metrics to push research and development, but argues that HPC benchmarks have “constrained the technology and architecture options for HPC system designers.” More specifically, in this case, floating point benchmarks are not quite as valuable to the agency as data-intensive system measurements, particularly as they relate to some of the graph and other so-called big data problems the agency is hoping to tackle using HPC systems.

Responses are due by Apr 05, 2013 4:00 pm Eastern.
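The RFI's point about graph problems can be made concrete with a small sketch. The breadth-first search below does no floating point arithmetic at all, so a FLOPS figure says nothing about how fast it runs. The function name `bfs_edge_count` and the choice to count traversed edges as the unit of work are assumptions for illustration, not anything the RFI specifies.

```python
from collections import deque


def bfs_edge_count(adjacency, start):
    """Breadth-first search that reports the work done as edges traversed.

    `adjacency` maps each node to a list of its neighbours (hypothetical
    input format). Returns the set of reached nodes and the edge count.
    """
    seen = {start}
    queue = deque([start])
    edges_traversed = 0
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            # Each step is a pointer chase and a set lookup, not a FLOP.
            edges_traversed += 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen, edges_traversed


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(bfs_edge_count(graph, "a"))
```

Dividing the edge count by wall-clock time would give a throughput figure that depends on memory access rather than arithmetic, which is the sort of measurement the quoted passage seems to have in mind.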

Not that I expect most of you to respond to this RFI but I mention it as a step in the right direction for the processing of semantics.

Semantics are not native to vector fields and so every encoding of semantics in a vector field is a mapping.

Likewise, every extraction of semantics from a vector field is the reverse of that mapping process.

The impact of this mapping/unmapping of semantics to and from a vector field on interpretation is unclear.

As mapping and unmapping decisions are interpretative, it seems reasonable to conclude there is some impact. How much isn’t known.

Vector fields are easy for high FLOPS systems to process but do you want a fast inaccurate answer or one that bears some resemblance to reality as experienced by others?

Graph databases, to name one alternative, are the current rage, at least according to graph database vendors.

But saying “graph database” isn’t the same as usefully capturing semantics with a graph database.

Or processing semantics once captured.

What we need is an alternative to FLOPS that represents effective processing of semantics.

Suggestions?

Getting Started with ArrayFire – a 30-minute Jump Start

Thursday, January 10th, 2013

Getting Started with ArrayFire – a 30-minute Jump Start

From the post:

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts.

ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax for ArrayFire. He also provided an overview of existing efforts in GPU software, and compared them to the extensive capabilities of ArrayFire.

If you need to push the limits of current performance, GPUs are one way to go.

Maybe 2013 will be your GPU year!

Parallel Computing – Prof. Alan Edelman

Saturday, December 29th, 2012

Parallel Computing – Prof. Alan Edelman MIT Course Number 18.337J / 6.338J.

From the webpage:

This is an advanced interdisciplinary introduction to applied parallel computing on modern supercomputers. It has a hands-on emphasis on understanding the realities and myths of what is possible on the world’s fastest machines. We will make prominent use of the Julia Language software project.

A “modern supercomputer” may be in your near term future. Would not hurt to start preparing now.

Similar courses that you would recommend?

The Cooperative Computing Lab

Monday, December 17th, 2012

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

2013 International Supercomputing Conference

Monday, December 3rd, 2012

2013 International Supercomputing Conference

Important Dates

  • Abstract Submission Deadline: Sunday, January 27, 2013, 23:59 AoE
  • Full Paper Submission Deadline: Sunday, February 10, 2013, 23:59 AoE
  • Author Notification: Sunday, March 10, 2013
  • Rebuttal Phase Starts: Sunday, March 10, 2013
  • Rebuttal Phase Ends: Sunday, March 17, 2013
  • Notification of Acceptance: Friday, March 22, 2013
  • Camera-Ready Submission: Sunday, April 7, 2013

From the call for papers:

  • Architectures (multicore/manycore systems, heterogeneous systems, network technology and programming models) 
  • Algorithms and Analysis (scalability on future architectures, performance evaluation and tuning) 
  • Large-Scale Simulations (workflow management, data analysis and visualization, coupled simulations and industrial simulations) 
  • Future Trends (Exascale HPC, HPC in the Cloud) 
  • Storage and Data (file systems and tape libraries, data intensive applications and databases) 
  • Software Engineering in HPC (application of methods, surveys) 
  • Supercomputing Facility (batch job management, job mix and system utilization and monitoring and administration tools) 
  • Scalable Applications: 50k+ (ISC Research thrust). The Research Paper committee encourages scientists to submit parallelization approaches that lead to scalable applications on more than 50,000 (CPU or GPU) cores
  • Submissions on other innovative aspects of high-performance computing are also welcome. 

Did I mention it will be in Leipzig, Germany? ;-)

SC12 Salt Lake City, Utah (Proceedings)

Thursday, November 22nd, 2012

SC12 Salt Lake City, Utah

Proceedings from SC12 are online!

ACM Digital Library: SC12 Conference Proceedings

IEEE Xplore: SC12 Conference Proceedings

Everything from graphs to search and lots in between.

Enjoy!

hgpu.org

Thursday, November 8th, 2012

hgpu.org – high performance computing on graphics processing units

Wealth of GPU computing resources. Will take days to explore fully (if then).

Highest level view:

  • Applications – Where it’s used
  • Hardware – Specs and reviews
  • Programming – Algorithms and techniques
  • Resources – Source Code, tutorials, books, etc.
  • Tools – GPU Sources

Homepage is rather “busy” but packed with information (as opposed to gadgets). Lists the most recent entries, most viewed papers, most recent source code and events.

One special item to note:

Free GPU computing node at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

Oh, did I mention that registration is free?

If you don’t get a multi-GPU unit under the Christmas tree, you can still hum along.

Efficient implementation of data flow graphs on multi-gpu clusters

Thursday, November 8th, 2012

Efficient implementation of data flow graphs on multi-gpu clusters by Vincent Boulos, Sylvain Huet, Vincent Fristot, Luc Salvo and Dominique Houzet.

Abstract:

Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, tasks synchronization. In this paper, we propose a high level programming model based on a data flow graph (DFG) allowing an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus particularly on the efficient implementation of communications by automating computation-communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-gpu benchmark, a 3D granulometry application developed for research on materials and an application for computing visual saliency maps.
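The computation-communication overlap mentioned in the abstract can be illustrated with a toy, hand-written version. This is only a sketch of the general idea using Python threads, not the authors' DFG framework and not GPU code; `transfer`, `compute` and `run_overlapped` are hypothetical names, and `transfer` merely stands in for a host-to-device copy.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(block):
    """Stand-in for moving a block of data to a device (assumption)."""
    return list(block)


def compute(block):
    """Stand-in for a kernel: sum of squares of the block."""
    return sum(x * x for x in block)


def run_overlapped(blocks):
    """Process blocks while the next block's transfer is already in flight."""
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, blocks[0])
        for next_block in blocks[1:]:
            current = pending.result()                  # wait for this block's data
            pending = pool.submit(transfer, next_block)  # start the next transfer
            results.append(compute(current))            # compute while it runs
        results.append(compute(pending.result()))       # last block
    return results


if __name__ == "__main__":
    print(run_overlapped([[1, 2], [3], [4, 5]]))
```

The paper's contribution, as the abstract describes it, is automating this kind of overlap from a data flow graph so the programmer does not have to write it by hand as done here.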

Analysis of the statistics of sizes in images (granulometry) and focusing on a particular place of interest in an image (visual saliency) were interesting use cases.

May or may not be helpful in particular cases, depending on your tests for subject identity.

A Strong ARM for Big Data [Semantics Not Included]

Monday, October 22nd, 2012

A Strong ARM for Big Data (Datanami – Sponsored Content by Calxeda)

From the post:

Burgeoning data growth is one of the foremost challenges facing IT and businesses today. Multiple analyst groups, including Gartner, have reported that information volume is growing at a minimum rate of 59 percent annually. At the same time, companies increasingly are mining this data for invaluable business insight that can give them a competitive advantage.

The challenge the industry struggles with is figuring out how to build cost-effective infrastructures so data scientists can derive these insights for their organizations to make timely, more intelligent decisions. As data volumes continue their explosive growth and algorithms to analyze and visualize that data become more optimized, something must give.

Past approaches that primarily relied on using faster, larger systems just are not able to keep pace. There is a need to scale-out, instead of scaling-up, to help in managing and understanding Big Data. As a result, this has focused new attention on different technologies such as in-memory databases, I/O virtualization, high-speed interconnects, and software frameworks such as Hadoop.

To take full advantage of these network and software innovations requires re-examining strategies for compute hardware. For maximum performance, a well-balanced infrastructure based on densely packed, power-efficient processors coupled with fast network interconnects is needed. This approach will help unlock applications and open new opportunities in business and high performance computing (HPC). (emphasis added)

I like powerful hardware as much as the next person. Either humming within earshot or making the local grid blink when it comes online.

Still, hardware/software tools for big data need to come with the warning label: “Semantics not included.”

To soften the disappointment when big data appliances and/or software arrive and the bottom line stays the same, or gets worse.

Using big data, or rather effective use of big data, that is improving your bottom line, requires semantics, your semantics.

Ohio State University Researcher Compares Parallel Systems

Tuesday, April 3rd, 2012

Ohio State University Researcher Compares Parallel Systems

From the post:

Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

“We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

What I found fascinating about this approach was the comparison of:

the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

I am thinking that for subject identity and “merging” in a “big data” context, we will need a deeper understanding of specific identity and merging requirements, so that the result of that study is one or more well-defined combinatorial problems.

That is to say that understanding one or more combinatorial problems precedes proposing a solution.

You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study

Or view the article (assuming you have access):

Parallel solution of the subset-sum problem: an empirical study

Abstract (of the article):

The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 10^10. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10^11 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10^12 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.
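For readers who have not met the dynamic programming table the abstract refers to, here is a minimal serial sketch. It is not the alternating word algorithm or the Cray XMT code from the paper; the function name `subset_sum` and the trick of using a Python integer as a bit vector are illustrative assumptions, and the sizes are assumed to be non-negative integers.

```python
def subset_sum(sizes, target):
    """Return True if some subset of `sizes` adds up to exactly `target`.

    Serial sketch only. Bit s of `reachable` is 1 when some subset sums
    to s, so the integer plays the role of one row of the table.
    Assumes every size is a non-negative integer.
    """
    if target < 0:
        return False
    mask = (1 << (target + 1)) - 1   # keep only the sums 0..target
    reachable = 1                    # the empty subset sums to 0
    for size in sizes:
        # Shifting left by `size` adds this object to every reachable sum.
        reachable |= (reachable << size) & mask
    return bool((reachable >> target) & 1)


if __name__ == "__main__":
    print(subset_sum([3, 34, 4, 12, 5, 2], 9))
    print(subset_sum([3, 34, 4, 12, 5, 2], 30))
```

Each of the n objects touches a bit vector whose length grows with the target sum, which is where the "number of objects times sum of sizes" cost in the abstract comes from, and why table size in bits decides which architecture fits.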

The Heterogeneous Programming Jungle

Saturday, March 24th, 2012

The Heterogeneous Programming Jungle by Michael Wolfe.

Michael starts off with one definition of “heterogeneous:”

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation. These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

And while he returns to that definition in the end, another form of heterogeneity is lurking not far behind:

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

So there is the heterogeneity of attached coprocessors and, just as importantly, of the processors with coprocessors.

His post concludes with:

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

These are languages that share common subjects (think of their target architectures) and so are ripe for a topic map that co-locates their approaches to a particular architecture. Being able to incorporate official and non-official documentation, tests, sample code, etc., might enable faster progress in this area.

The future of HPC processors is almost upon us. It will not do to be tardy.

An Application Driven Analysis of the ParalleX Execution Model (here be graph’s mention)

Tuesday, January 10th, 2012

An Application Driven Analysis of the ParalleX Execution Model by Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser and Thomas Sterling.

Just in case you feel the need for more information about ParalleX after that post about the LSU software release. ;-)

Abstract:

Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of parallel applications. It is suggested that the root cause of these challenges is the incoherent coupling between the current enabling technologies, such as Non-Uniform Memory Access of present multicore nodes equipped with optional hardware accelerators and the decades older execution model, i.e., the Communicating Sequential Processes (CSP) model best exemplified by the message passing interface (MPI) application programming interface. A new execution model, ParalleX, is introduced as an alternative to the CSP model. In this paper, an overview of the ParalleX execution model is presented along with details about a ParalleX-compliant runtime system implementation called High Performance ParalleX (HPX). Scaling and performance results for an adaptive mesh refinement numerical relativity application developed using HPX are discussed. The performance results of this HPX-based application are compared with a counterpart MPI-based mesh refinement code. The overheads associated with HPX are explored and hardware solutions are introduced for accelerating the runtime system.

Graphaholics should also note:

Today’s conventional parallel programming methods such as MPI [1] and systems such as distributed memory massively parallel processors (MPPs) and Linux clusters exhibit poor efficiency and constrained scalability for this class of applications. This severely hinders scientific advancement. Many other classes of applications exhibit similar properties, especially graph/tree data structures that have non uniform data access patterns. (emphasis added)

I like that, “non uniform data access patterns.”

My “gut” feeling is that this will prove very useful for processing semantics. Since semantics originate from us and have “non uniform data access patterns.”

Granted a lot of work between here and there, especially since the semantics side of the house is fond of declaring victory in favor of the latest solution.

You would think after years, decades, centuries, no, millennia of one “ultimate” solution after another, we would be a little more wary of such pronouncements. I suspect the problem is that programmers come by their proverbial laziness honestly. They get it from us. It is easier to just fall into line with whatever seems like a passable solution and to not worry about all the passable solutions that went before.

That is no doubt easier but imagine where medicine, chemistry, physics, or even computers would be if they had adopted such a model. True, we have to use models that work now, but at the same time we should encourage new, different, even challenging models that may (or may not) be better at capturing human semantics. Models that change even as we do.

LSU Releases First Open Source ParalleX Runtime Software System

Tuesday, January 10th, 2012

LSU Releases First Open Source ParalleX Runtime Software System

From the press release:

Louisiana State University’s Center for Computation & Technology (CCT) has delivered the first freely available open-source runtime system implementation of the ParalleX execution model. The HPX, or High Performance ParalleX, runtime software package is a modular, feature-complete, and performance oriented representation of the ParalleX execution model targeted at conventional parallel computing architectures such as SMP nodes and commodity clusters.

HPX is being provided to the open community for experimentation and application to achieve high efficiency and scalability for dynamic adaptive and irregular computational problems. HPX is a library of C++ functions that supports a set of critical mechanisms for dynamic adaptive resource management and lightweight task scheduling within the context of a global address space. It is solidly based on many years of experience in writing highly parallel applications for HPC systems.

The two-decade success of the communicating sequential processes (CSP) execution model and its message passing interface (MPI) programming model has been seriously eroded by challenges of power, processor core complexity, multi-core sockets, and heterogeneous structures of GPUs. Both efficiency and scalability for some current (strong scaled) applications and future Exascale applications demand new techniques to expose new sources of algorithm parallelism and exploit unused resources through adaptive use of runtime information.

The ParalleX execution model replaces CSP to provide a new computing paradigm embodying the governing principles for organizing and conducting highly efficient scalable computations greatly exceeding the capabilities of today’s problems. HPX is the first practical, reliable, and performance-oriented runtime system incorporating the principal concepts of ParalleX model publicly provided in open source release form.