Archive for the ‘HPC’ Category

GPU + Russian Algorithm Bests Supercomputer

Thursday, June 30th, 2016

No need for supercomputers

From the post:


Senior researchers Vladimir Pomerantcev and Olga Rubtsova, working under the guidance of Professor Vladimir Kukulin (SINP MSU), were able to use on an ordinary desktop PC with GPU to solve complicated integral equations of quantum mechanics — previously solved only with the powerful, expensive supercomputers. According to Vladimir Kukulin, the personal computer does the job much faster: in 15 minutes it is doing the work requiring normally 2-3 days of the supercomputer time.

The main problem in solving the scattering equations of multiple quantum particles was the calculation of the integral kernel — a huge two-dimensional table, consisting of tens or hundreds of thousands of rows and columns, with each element of such a huge matrix being the result of extremely complex calculations. But this table appeared to look like a monitor screen with tens of billions of pixels, and with a good GPU it was quite possible to calculate all of these. Using the software developed in Nvidia and having written their own programs, the researchers split their calculations on the many thousands of streams and were able to solve the problem brilliantly.

“We reached the speed we couldn’t even dream of,” Vladimir Kukulin said. “The program computes 260 million of complex double integrals on a desktop computer within three seconds only. No comparison with supercomputers! My colleague from the University of Bochum in Germany (recently deceased, mournfully), whose lab did the same, carried out the calculations by one of the largest supercomputers in Germany with the famous blue gene architecture that is actually very expensive. And what his group is seeking for two or three days, we do in 15 minutes without spending a dime.”

The most amazing thing is that the desired quality of graphics processors and a huge amount of software to them exist for ten years already, but no one used them for such calculations, preferring supercomputers. Anyway, our physicists surprised their Western counterparts pretty much.

One of the principal beneficiaries of the US restricting the export of the latest generation of computer technology to the former USSR, was of course Russia.

Deprived of the latest hardware, Russian mathematicians and computer scientists were forced to be more efficient with equipment that was one or two generations off the latest mark for computing.

Parity between the USSR and the USA in nuclear weapons is testimony to their success and the failure of US export restriction policies.

For the technical details: V.N. Pomerantsev, V.I. Kukulin, O.A. Rubtsova, S.K. Sakhiev. Fast GPU-based calculations in few-body quantum scattering. Computer Physics Communications, 2016; 204: 121 DOI: 10.1016/j.cpc.2016.03.018.

Will a GPU help you startle your colleagues in the near future?

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox

Wednesday, October 1st, 2014

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox by Daniel Crankshaw, et al.

Abstract:

To support complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to support training complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the deployment and serving of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of recommending products, targeting advertisements, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at “Big Data” scale.

Early Warning: Alpha code drop expected December 2014.

If you want to get ahead of the curve I suggest you start reading this paper soon. Very soon.

Written from the perspective of end-user facing applications but applicable to author-facing applications for real time interaction with subject identification.

Supercomputing frontiers and innovations

Saturday, August 9th, 2014

Supercomputing frontiers and innovations (New Journal)

From the homepage:

Parallel scientific computing has entered a new era. Multicore processors on desktop computers make parallel computing a fundamental skill required by all computer scientists. High-end systems have surpassed the Petaflop barrier, and significant efforts are devoted to the development of the next generation of hardware and software technologies towards Exascale systems. This is an exciting time for computing as we begin the journey on the road to exascale computing. ‘Going to the exascale’ will mean radical changes in computing architecture, software, and algorithms – basically, vastly increasing the levels of parallelism to the point of billions of threads working in tandem – which will force radical changes in how hardware is designed and how we go about solving problems. There are many computational and technical challenges ahead that must be overcome. The challenges are great, different than the current set of challenges, and exciting research problems await us.

This journal, Supercomputing Frontiers and Innovations, gives an introduction to the area of innovative supercomputing technologies, prospective architectures, scalable and highly parallel algorithms, languages, data analytics, issues related to computational co-design, and cross-cutting HPC issues as well as papers on supercomputing education and massively parallel computing applications in science and industry.

This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge. We hope you find this journal timely, interesting, and informative. We welcome your contributions, suggestions, and improvements to this new journal. Please join us in making this exciting new venture a success. We hope you will find Supercomputing Frontiers and Innovations an ideal venue for the publication of your team’s next exciting results.

Becoming “massively parallel” isn’t going to free “computing applications in science and industry” from semantics. If anything, the more complex applications become, the easier it will be to mislay semantics, to the user’s peril.

Semantic efforts that did not scale for applications in the last decade face even dimmer prospects in the face of “big data” and massively parallel applications.

I suggest we move the declaration of semantics closer to or at the authors of content/data. At least as a starting point for discussion/research.

Current issue.

Jetson TK1:… [$192.00]

Friday, April 4th, 2014

Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere by Mark Harris.

From the post:

Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.

Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, an NVIDIA 4-plus-1 quad-core ARM Cortex-A15 CPU, integrated video encoding and decoding support, image/signal processing, and many other system-level features. The Kepler GPU in Tegra K1 is built on the same high-performance, energy-efficient Kepler GPU architecture that is found in our high-end GeForce, Quadro, and Tesla GPUs for graphics and computing. That makes it the only mobile processor today that supports CUDA 6 for computing and full desktop OpenGL 4.4 and DirectX 11 for graphics.

Tegra K1 is a parallel processor capable of over 300 GFLOP/s of 32-bit floating point computation. Not only is that a huge achievement in a processor with such a low power footprint (Tegra K1 power consumption is in the range of 5 Watts for real workloads), but K1′s support for CUDA and desktop graphics APIs means that much of your existing compute and graphics software will compile and run largely as-is on this platform.

Are you old enough to remember looking at the mini-computers on the back of most computer zines?

And then sighing at the price tag?

Times have changed!

Order Jetson TK1 Now, just $192

Jetson TK1 is available to pre-order today for $192. In the United States, it is available from the NVIDIA website, as well as newegg.com and Micro Center. See the Jetson TK1 page for details on international orders.

Some people, General Clapper comes to mind, use supercomputers to mine dots that are already connected together (phone data).

Other people, create algorithms to assist users in connecting dots between diverse and disparate data sources.

You know who my money is riding on.

You?

Orbital Computing – Electron Orbits That Is.

Monday, March 10th, 2014

Physicist proposes a new type of computing at SXSW. Check out orbital computing by Stacey Higginbotham.

From the post:

The demand for computing power is constantly rising, but we’re heading to the edge of the cliff in terms of increasing performance — both in terms of the physics of cramming more transistors on a chip and in terms of the power consumption. We’ve covered plenty of different ways that researchers are trying to continue advancing Moore’s Law — this idea that the number of transistors (and thus the performance) on a chip doubles every 18 months — especially the far out there efforts that take traditional computer science and electronics and dump them in favor of using magnetic spin, quantum states or probabilistic logic.

We’re going to add a new impossible that might become possible to that list thanks to Joshua Turner, a physicist at the SLAC National Accelerator Laboratory, who has proposed using the orbits of electrons around the nucleus of an atom as a new means to generate the binary states (the charge or lack of a charge that transistors use today to generate zeros and ones) we use in computing. He calls this idea orbital computing and the big takeaway for engineers is that one can switch the state of an electron’s orbit 10,000 times faster than you can switch the state of a transistor used in computing today.

That means you can still have the features of computing in that you use binary programming, but you just can compute more in less time. To get us to his grand theory, Turner had to take the SXSW audience through how computing works, how transistors work, the structure of atoms, the behavior of subatomic particles and a bunch of background on X-rays.

This would have been a presentation to see: Bits, Bittier Bits & Qubits: Physics of Computing

Try this SLAC Search for some publications by Joshua Turner.

It’s always fun to read about how computers will be able to process data more quickly. A techie sort of thing.

On the other hand, going 10,000 times faster with semantically heterogeneous data, will get you to the wrong answer 10,000 times faster.

If you realize the answer is wrong, you may have time to try again.

What if you don’t realize the answer is wrong?

Do you really want to be the customs agent who stops a five year old because their name is similar to that of a known terrorist? Because the machine said they could not fly?

Excited about going faster, worried about data going by too fast for anyone to question its semantics.

The Data Avalanche in Astrophysics

Tuesday, February 4th, 2014

The Data Avalanche in Astrophysics (podcast)

From the post:

On today’s edition of Soundbite, we’ll be talking with Dr. Kirk Borne, Professor of Astrophysics and Computational Science at George Mason University about managing the data avalanche in astronomy.

Borne has been involved in a number of data-intensive astrophysics projects, including data mining on the Galaxy Zoo database of galaxy classifications. We’ll talk about some of his experiences and what challenges lie ahead for astronomy as well what some established and emerging tools, including Graph databases, languages like Python and R, and approaches will lend to his field and big data research in general.

During the podcast, Dr. Borne talks about the rising use of graphs over the last several years on supercomputers to analyze astronomical data.

You’ll get the impression that graphs are not a recent item in high-performance computing. Which just happens to be a correct impression.

GTC On-Demand

Wednesday, January 15th, 2014

GTC On-Demand

While running down presentations at prior GPU Technology Conferences, I found this gold mine of presentations and slides on GPU computing.

Counting “presentationTitle” in the page source says 385 presentations!

Enjoy!

Exploiting Parallelism and Scalability (XPS)

Monday, January 13th, 2014

Exploiting Parallelism and Scalability (XPS) NSF

Full Proposal Window: February 10, 2014 – February 24, 2014

Synopsis:

Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism is abundant at many levels. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies alone has ended. Thus, parallelism has become critically important.

The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from services and applications down to the micro-architecture– and will be built on new concepts, theories, and foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, new programming models and languages, new hardware architectures, compilers, operating systems and run-time systems, and must exploit domain and application-specific knowledge. Research is also needed on energy efficiency, communication efficiency, and on enabling the division of effort between edge devices and clouds.

The January 10th webinar for this activity hasn’t been posted yet.

Without semantics, XPS will establish a new metric:

GFS: Garbage per Femtosecond.

Supercomputing on the cheap with Parallella

Tuesday, December 10th, 2013

Supercomputing on the cheap with Parallella by Federico Lucifredi.

From the post:

Packing impressive supercomputing power inside a small credit card-sized board running Ubuntu, Adapteva‘s $99 ARM-based Parallella system includes the unique Ephiphany numerical accelerator that promises to unleash industrial strength parallel processing on the desktop at a rock-bottom price. The Massachusetts-based startup recently ran a successfully funded Kickstarter campaign and gained widespread attention only to run into a few roadblocks along the way. Now, with their setbacks behind them, Adapteva is slated to deliver its first units mid-December 2013, with volume shipping in the following months.

What makes the Parallella board so exciting is that it breaks new ground: imagine an Open Source Hardware board, powered by just a few Watts of juice, delivering 90 GFLOPS of number crunching. Combine this with the possibility of clustering multiple boards, and suddenly the picture of an exceedingly affordable desktop supercomputer emerges.

This review looks in-depth at a pre-release prototype board (so-called Generation Zero, a development run of 50 units), giving you a pretty complete overview of what the finished board will look like.

Whether you participate in this aspect of the computing revolution or not, you will be impacted by it.

The more successful Parallela and similar efforts become in bringing desktop supercomputing, the more pressure there will be on cloud computing providers to match those capabilities at lower prices.

Another point of impact will be non-production experimentation with parallel processing. Which may, like Thomas Edison, discover (or re-discover) 10,000 ways that don’t work but discover 1 that far exceeds anyone’s expectations.

That is to say that supercomputing will become cheap enough to tolerate frequent failure while experimenting with it.

What would you like to invent for supercomputing?

Programming model for supercomputers of the future

Wednesday, June 26th, 2013

Programming model for supercomputers of the future

From the post:

The demand for even faster, more effective, and also energy-saving computer clusters is growing in every sector. The new asynchronous programming model GPI might become a key building block towards realizing the next generation of supercomputers.

The demand for even faster, more effective, and also energy-saving computer clusters is growing in every sector. The new asynchronous programming model GPI from Fraunhofer ITWM might become a key building block towards realizing the next generation of supercomputers.

High-performance computing is one of the key technologies for numerous applications that we have come to take for granted – everything from Google searches to weather forecasting and climate simulation to bioinformatics requires an ever increasing amount of computing ressources. Big data analysis additionally is driving the demand for even faster, more effective, and also energy-saving computer clusters. The number of processors per system has now reached the millions and looks set to grow even faster in the future. Yet something has remained largely unchanged over the past 20 years and that is the programming model for these supercomputers. The Message Passing Interface (MPI) ensures that the microprocessors in the distributed systems can communicate. For some time now, however, it has been reaching the limits of its capability.

“I was trying to solve a calculation and simulation problem related to seismic data,” says Dr. Carsten Lojewski from the Fraunhofer Institute for Industrial Mathematics ITWM. “But existing methods weren’t working. The problems were a lack of scalability, the restriction to bulk-synchronous, two-sided communication, and the lack of fault tolerance. So out of my own curiosity I began to develop a new programming model.” This development work ultimately resulted in the Global Address Space Programming Interface – or GPI – which uses the parallel architecture of high-performance computers with maximum efficiency.

GPI is based on a completely new approach: an asynchronous communication model, which is based on remote completion. With this approach, each processor can directly access all data – regardless of which memory it is on and without affecting other parallel processes. Together with Rui Machado, also from Fraunhofer ITWM, and Dr. Christian Simmendinger from T-Systems Solutions for Research, Dr. Carsten Lojewski is receiving a Joseph von Fraunhofer prize this year.

The post concludes with the observation that “…GPI is a tool for specialists….”

Rather surprising since it hasn’t been that many years ago that Hadoop was a tool for specialists. Or that “data mining” was a tool for specialists.

In the last year both Hadoop and “data mining” have come within reach of nearly average users.

GPI if successful for a broad range of problems, a few years will find it under the hood of any nearby cluster.

Perhaps sooner if you take an interest in it.

Why Would #1 Spy on #2?

Wednesday, May 29th, 2013

Confirmation: China has a 50+ Petaflop system.

That confirmation casts even more doubt on the constant drum roll of “China spying on the U.S.” allegations.

Who wants to spy on second place technology?

The further U.S.-based technology falls behind, due to the lack of investment in R&D by government and industry, expect the the hysterical accusations against China and others to ramp up.

Can’t possibly be that three month profit goals and lowering government spending led to a self-inflicted lack of R&D.

Must be someone stealing the technology we didn’t invest to invent. Has to be. 😉

The new Chinese system is a prick to the delusional American Exceptionalism balloon.

There will be others.

High-Performance and Parallel Computing with R

Tuesday, April 9th, 2013

High-Performance and Parallel Computing with R by Dirk Eddelbuettel.

From the webpage:

This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing (HPC) with R. In this context, we are defining ‘high-performance computing’ rather loosely as just about anything related to pushing R a little further: using compiled code, parallel computing (in both explicit and implicit modes), working with large objects as well as profiling.

Here you will find R packages for:

  • Explicit parallelism
  • Implicit parallelism
  • Grid computing
  • Hadoop
  • Random numbers
  • Resource managers and batch schedulers
  • Applications
  • GPUs
  • Large memory and out-of-memory data
  • Easier interfaces for Compiled code
  • Profiling tools

Despite HPC advances over the last decade, semantics remain an unsolved problem.

Perhaps raw computational capacity isn’t the key to semantics.

If not, some different approach awaits to be discovered.

I first saw this in a tweet by One R Tip a Day.

FLOPS Fall Flat for Intelligence Agency

Friday, March 29th, 2013

FLOPS Fall Flat for Intelligence Agency by Nicole Hemsoth.

From the post:

The Intelligence Advanced Research Projects Activity (IARPA) is putting out some RFI feelers in hopes of pushing new boundaries with an HPC program. However, at the core of their evaluation process is an overt dismissal of current popular benchmarks, including floating operations per second (FLOPS).

To uncover some missing pieces for their growing computational needs, IARPA is soliciting for “responses that illuminate the breadth of technologies” under the HPC umbrella, particularly the tech that “isn’t already well-represented in today’s HPC benchmarks.”

The RFI points to the general value of benchmarks (Linpack, for instance) as necessary metrics to push research and development, but argues that HPC benchmarks have “constrained the technology and architecture options for HPC system designers.” More specifically, in this case, floating point benchmarks are not quite as valuable to the agency as data-intensive system measurements, particularly as they relate to some of the graph and other so-called big data problems the agency is hoping to tackle using HPC systems.

Responses are due by Apr 05, 2013 4:00 pm Eastern.

Not that I expect most of you to respond to this RFI but I mention it as a step in the right direction for the processing of semantics.

Semantics are not native to vector fields and so every encoding of semantics in a vector field is a mapping.

As is every extraction of semantic from a vector field is the reverse of that mapping process.

The impact of this mapping/unmapping of semantics to and from a vector field on interpretation are unclear.

As mapping and unmapping decisions are interpretative, it seems reasonable to conclude there is some impact. How much isn’t known.

Vector fields are easy for high FLOPS systems to process but do you want a fast inaccurate answer or one that bears some resemblance to reality as experienced by others?

Graph databases, to name one alternative, are the current rage, at least according to graph database vendors.

But saying “graph database,” isn’t the same as usefully capturing semantics with a graph database.

Or processing semantics once captured.

What we need is an alternative to FLOPS that represents effective processing of semantics.

Suggestions?

Getting Started with ArrayFire – a 30-minute Jump Start

Thursday, January 10th, 2013

Getting Started with ArrayFire – a 30-minute Jump Start

From the post:

In case you missed it, we recently held a webinar on the ArrayFire GPU Computing Library. This webinar was part of an ongoing series of webinars that will help you learn more about the many applications of ArrayFire, while interacting with AccelerEyes GPU computing experts.

ArrayFire is the world’s most comprehensive GPU software library. In this webinar, James Malcolm, who has built many of ArrayFire’s core components, walked us through the basic principles and syntax for ArrayFire. He also provided an overview of existing efforts in GPU software, and compared them to the extensive capabilities of ArrayFire.

If you need to push the limits of current performance, GPUs are one way to go.

Maybe 2013 will be your GPU year!

Parallel Computing – Prof. Alan Edelman

Saturday, December 29th, 2012

Parallel Computing – Prof. Alan Edelman MIT Course Number 18.337J / 6.338J.

From the webpage:

This is an advanced interdisciplinary introduction to applied parallel computing on modern supercomputers. It has a hands-on emphasis on understanding the realities and myths of what is possible on the world’s fastest machines. We will make prominent use of the Julia Language software project.

A “modern supercomputer” may be in your near term future. Would not hurt to start preparing now.

Similar courses that you would recommend?

The Cooperative Computing Lab

Monday, December 17th, 2012

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

2013 International Supercomputing Conference

Monday, December 3rd, 2012

2013 International Supercomputing Conference

Important Dates

Abstract Submission Deadline Sunday, January 27, 2013
23:59 pm, AoE
Full Paper Submission Deadline Sunday, February 10, 2013
23:59 pm, AoE
Author Notification Sunday, March 10, 2013
Rebuttal Phase Starts Sunday, March 10, 2013
Rebuttal Phase Ends Sunday, March 17, 2013
Notification of Acceptance Friday, March 22, 2013
Camera-Ready Submission Sunday, April 7, 2013

From the call for papers:

  • Architectures (multicore/manycore systems, heterogeneous systems, network technology and programming models) 
  • Algorithms and Analysis (scalability on future architectures, performance evaluation and tuning) 
  • Large-Scale Simulations (workflow management, data analysis and visualization, coupled simulations and industrial simulations) 
  • Future Trends (Exascale HPC, HPC in the Cloud) 
  • Storage and Data (file systems and tape libraries, data intensive applications and databases) 
  • Software Engineering in HPC (application of methods, surveys) 
  • Supercomputing Facility (batch job management, job mix and system utilization and monitoring and administration tools) 
  • Scalable Applications: 50k+ (ISC Research thrust). The Research Paper committee encourages scientists to submit parallelization approaches that lead to scalable applications on more than 50,000 (CPU or GPU) cores
  • Submissions on other innovative aspects of high-performance computing are also welcome. 

Did I mention it will be in Leipzig, Germany? 😉

SC12 Salt Lake City, Utah (Proceedings)

Thursday, November 22nd, 2012

SC12 Salt Lake City, Utah

Proceeding from SC12 are online!

ACM Digital Library: SC12 Conference Proceedings

IEEE Xplore: SC12 Conference Proceedings

Everything from graphs to search and lots in between.

Enjoy!

hgpu.org

Thursday, November 8th, 2012

hgpu.org – high performance computing on graphics processing units

Wealth of GPU computing resources. Will take days to explore fully (if then).

Highest level view:

  • Applications – Where it’s used
  • Hardware – Specs and reviews
  • Programming – Algorithms and techniques
  • Resources – Source Code, tutorials, books, etc.
  • Tools – GPU Sources

Homepage is rather “busy” but packed with information (as opposed to gadgets). Lists the most recent entries, most viewed papers, most recent source code and events.

One special item to note:

Free GPU computing node at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

Oh, did I mention that registration is free?

If you don’t get a multi-GPU unit under the Christmas tree, you can still hum along.

Efficient implementation of data flow graphs on multi-gpu clusters

Thursday, November 8th, 2012

Efficient implementation of data flow graphs on multi-gpu clusters by Vincent Boulos, Sylvain Huet, Vincent Fristot, Luc Salvo and Dominique Houzet.

Abstract:

Nowadays, it is possible to build a multi-GPU supercomputer, well suited for implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, tasks synchronization. In this paper, we propose a high level programming model based on a data flow graph (DFG) allowing an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus particularly on the efficient implementation of communications by automating computation-communication overlap, which can lead to significant speedups as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-gpu benchmark, a 3D granulometry application developed for research on materials and an application for computing visual saliency maps.

Analysis of the statistics of sizes in images (granulometry) and focusing on a particular place of interest in an image (visual saliency) were interesting use cases.

May or may not be helpful in particular cases, depending on your tests for subject identity.

A Strong ARM for Big Data [Semantics Not Included]

Monday, October 22nd, 2012

A Strong ARM for Big Data (Datanami – Sponsored Content by Calxeda)

From the post:

Burgeoning data growth is one of the foremost challenges facing IT and businesses today. Multiple analyst groups, including Gartner, have reported that information volume is growing at a minimum rate of 59 percent annually. At the same time, companies increasingly are mining this data for invaluable business insight that can give them a competitive advantage.

The challenge the industry struggles with is figuring out how to build cost-effective infrastructures so data scientists can derive these insights for their organizations to make timely, more intelligent decisions. As data volumes continue their explosive growth and algorithms to analyze and visualize that data become more optimized, something must give.

Past approaches that primarily relied on using faster, larger systems just are not able to keep pace. There is a need to scale-out, instead of scaling-up, to help in managing and understanding Big Data. As a result, this has focused new attention on different technologies such as in-memory databases, I/O virtualization, high-speed interconnects, and software frameworks such as Hadoop.

To take full advantage of these network and software innovations requires re-examining strategies for compute hardware. For maximum performance, a well-balanced infrastructure based on densely packed, power-efficient processors coupled with fast network interconnects is needed. This approach will help unlock applications and open new opportunities in business and high performance computing (HPC). (emphasis added)

I like powerful hardware as much as the next person. Either humming within earshot or making the local grid blink when it comes online.

Still, hardware/software tools for big data need to come with the warning label: “Semantics not included.

To soften the disappointment when big data appliances and/or software arrive and the bottom line stays the same, or gets worse.

Using big data, or rather effective use of big data, that is improving your bottom line, requires semantics, your semantics.

Ohio State University Researcher Compares Parallel Systems

Tuesday, April 3rd, 2012

Ohio State University Researcher Compares Parallel Systems

From the post:

Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

“We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

What I found fascinating about this approach was the comparison of:

the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

Thinking that for subject identity and “merging” in a “big data” context, that we will need a deeper understanding of specific identity and merging requirements. So that the result of that study is one or more well-defined combinatorial problems.

That is to say that understanding one or more combinatorial problems precedes proposing a solution.

You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study

Or view the article (assuming you have access):

Parallel solution of the subset-sum problem: an empirical study

Abstract (of the article):

The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 1010. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 1011 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 1012 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.

The Heterogeneous Programming Jungle

Saturday, March 24th, 2012

The Heterogeneous Programming Jungle by Michael Wolfe.

Michael starts off with one definition of “heterogeneous:”

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation.These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

And while he returns to that definition in the end, another form of heterogeneity is lurking not far behind:

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

So there is the heterogeneity of attached coprocessor and, just as importantly, of the processors with coprocessors.

His post concludes with:

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

These are languages that share common subjects (think of their target architectures) and so are ripe for a topic map that co-locates their approaches to a particular architecture. Being able to incorporate official and non-official documentation, tests, sample code, etc., might enable faster progress in this area.

The future of HPC processors is almost upon us. It will not do to be tardy.

An Application Driven Analysis of the ParalleX Execution Model (here be graph’s mention)

Tuesday, January 10th, 2012

An Application Driven Analysis of the ParalleX Execution Model by Matthew Anderson, Maciej Brodowicz, Hartmut Kaiser and Thomas Sterling.

Just in case you feel the need for more information about ParalleX after that post about the LSU software release. 😉

Abstract:

Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future machine performance is approached by identifying the factors which currently challenge the scalability of parallel applications. It is suggested that the root cause of these challenges is the incoherent coupling between the current enabling technologies, such as Non-Uniform Memory Access of present multicore nodes equipped with optional hardware accelerators and the decades older execution model, i.e., the Communicating Sequential Processes (CSP) model best exemplified by the message passing interface (MPI) application programming interface. A new execution model, ParalleX, is introduced as an alternative to the CSP model. In this paper, an overview of the ParalleX execution model is presented along with details about a ParalleX-compliant runtime system implementation called High Performance ParalleX (HPX). Scaling and performance results for an adaptive mesh refinement numerical relativity application developed using HPX are discussed. The performance results of this HPX-based application are compared with a counterpart MPI-based mesh refinement code. The overheads associated with HPX are explored and hardware solutions are introduced for accelerating the runtime system.

Graphaholics should also note:

Today’s conventional parallel programming methods such as MPI [1] and systems such as distributed memory massively parallelvprocessors (MPPs) and Linux clusters exhibit poor efficiency and constrained scalability for this class of applications. This severely hinders scientifi c advancement. Many other classes of applications exhibit similar properties, especially graph/tree data structures that have non uniform data access patterns. (emphasis added)

I like that, “non uniform data access patterns.”

My “gut” feeling is that this will prove very useful for processing semantics. Since semantics originate from us and have “non uniform data access patterns.”

Granted a lot of work between here and there, especially since the semantics side of the house is fond of declaring victory in favor of the latest solution.

You would think after years, decades, centuries, no, millenia of one “ultimate” solution after another, we would be a little more wary of such pronouncements. I suspect the problem is that programmers come by their proverbial laziness honestly. They get it from us. It is easier to just fall into line with whatever seems like a passable solution and to not worry about all the passable solutions that went before.

That is no doubt easier but imagine where medicine, chemistry, physics, or even computers would be if they had adopted such a model. True, we have to use models that work now, but at the same time we should encourage new, different, even challenging models that may (or may not) be better at capturing human semantics. Models that change even as we do.

LSU Releases First Open Source ParalleX Runtime Software System

Tuesday, January 10th, 2012

LSU Releases First Open Source ParalleX Runtime Software System

From the press release:

Louisiana State University’s Center for Computation & Technology (CCT) has delivered the first freely available open-source runtime system implementation of the ParalleX execution model. The HPX, or High Performance ParalleX, runtime software package is a modular, feature-complete, and performance oriented representation of the ParalleX execution model targeted at conventional parallel computing architectures such as SMP nodes and commodity clusters.

HPX is being provided to the open community for experimentation and application to achieve high efficiency and scalability for dynamic adaptive and irregular computational problems. HPX is a library of C++ functions that supports a set of critical mechanisms for dynamic adaptive resource management and lightweight task scheduling within the context of a global address space. It is solidly based on many years of experience in writing highly parallel applications for HPC systems.

The two-decade success of the communicating sequential processes (CSP) execution model and its message passing interface (MPI) programming model has been seriously eroded by challenges of power, processor core complexity, multi-core sockets, and heterogeneous structures of GPUs. Both efficiency and scalability for some current (strong scaled) applications and future Exascale applications demand new techniques to expose new sources of algorithm parallelism and exploit unused resources through adaptive use of runtime information.

The ParalleX execution model replaces CSP to provide a new computing paradigm embodying the governing principles for organizing and conducting highly efficient scalable computations greatly exceeding the capabilities of today’s problems. HPX is the first practical, reliable, and performance-oriented runtime system incorporating the principal concepts of ParalleX model publicly provided in open source release form.