Archive for the ‘Cray’ Category

Entry-Level HPC: Proven at a Petaflop, Affordably Priced!

Monday, June 4th, 2012

Entry-Level HPC: Proven at a Petaflop, Affordably Priced!

AMD sponsored this content at

As a long-time admirer of Cray, I had to repost:

Computing needs at many commercial enterprises, research universities, and government labs continue to grow as more complex problems are explored using ever-more sophisticated modeling and analysis programs.

A new class of Cray XE6 and Cray XK6 high performance computing (HPC) systems, based on AMD Opteron™ processors, now offer teraFLOPS of processing power, reliability, utilization rates, and other advantages of high-end supercomputers, but with a great low purchase price. Entry-level supercomputing systems in this model line target midrange HPC applications, have an expected performance in the 6.5 teraflop to 200 teraFLOPS range, and scale in price from $200,000 to $3 million.

These systems can give organizations an alternative to high-end HPC clusters. One potential advantage of these entry-level systems is that they are designed to deliver supercomputing reliability and sustained performance. Users can be confident their jobs will run to completion. And the systems also offer predictability. “There is reduced OS noise, so you get similar run times every time,” said Margaret Williams, senior vice president of HPC Systems at Cray Inc.

Not enough to get you into “web scale” data but certainly enough for many semantic integration problems.

Ohio State University Researcher Compares Parallel Systems

Tuesday, April 3rd, 2012

Ohio State University Researcher Compares Parallel Systems

From the post:

Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

“We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

What I found fascinating about this approach was the comparison of:

the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

I think that for subject identity and “merging” in a “big data” context, we will need a deeper understanding of specific identity and merging requirements, so that the result of that study is one or more well-defined combinatorial problems.

That is to say that understanding one or more combinatorial problems precedes proposing a solution.

You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study

Or view the article (assuming you have access):

Parallel solution of the subset-sum problem: an empirical study

Abstract (of the article):

The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 10^10. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10^11 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10^12 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.
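
To make the “dynamic programming table” in the abstract concrete, here is a minimal sequential sketch in Python. It is not the paper's parallel algorithm, and it is not the alternating word algorithm. The function name `subset_sum_reachable` and the use of a Python integer as one row of table bits are my own assumptions for illustration. Bit s of the row is set when some subset of the objects seen so far sums to s, so the work grows with the number of objects times the target sum.

```python
def subset_sum_reachable(sizes, target):
    """Return True if some subset of `sizes` sums exactly to `target`.

    Sequential sketch of the pseudo-polynomial dynamic program: one table
    row of (target + 1) bits is updated once per input object.
    """
    if target < 0:
        return False
    mask = (1 << (target + 1)) - 1   # keep only bits 0..target
    row = 1                          # bit 0 set: the empty subset sums to 0
    for size in sizes:
        if size < 0:
            raise ValueError("sizes must be non-negative")
        # Shifting by `size` adds this object to every sum reached so far.
        row |= (row << size) & mask
    return bool((row >> target) & 1)


if __name__ == "__main__":
    # 4 + 5 = 9, so the first call reports a solution.
    print(subset_sum_reachable([3, 34, 4, 12, 5, 2], 9))
    print(subset_sum_reachable([3, 34, 4, 12, 5, 2], 30))
```

The row of bits is what the abstract's problem sizes (in bits) refer to, as I read it: once that table no longer fits in a machine's memory, that machine is out of its comfort zone.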

Is That A Graph In Your Cray?

Saturday, March 3rd, 2012

If you want more information about graph processing in Cray’s uRiKA (I did), try: High-performance Computing Applied to Semantic Databases by Eric L. Goodman, Edward Jimenez, David Mizell, Sinan al-Saffar, Bob Adolf, and David Haglin.


To-date, the application of high-performance computing resources to Semantic Web data has largely focused on commodity hardware and distributed memory platforms. In this paper we make the case that more specialized hardware can offer superior scaling and close to an order of magnitude improvement in performance. In particular we examine the Cray XMT. Its key characteristics, a large, global shared memory, and processors with a memory-latency tolerant design, offer an environment conducive to programming for the Semantic Web and have engendered results that far surpass current state of the art. We examine three fundamental pieces requisite for a fully functioning semantic database: dictionary encoding, RDFS inference, and query processing. We show scaling up to 512 processors (the largest configuration we had available), and the ability to process 20 billion triples completely in memory.

Unusual to see someone apologize for only having “…512 processors (the largest configuration we had available)….,” but that isn’t why I am citing the paper. 😉

The “dictionary encoding” (think indexing) techniques may prove instructive, even if you don’t have time on a Cray XMT. The techniques presented achieve a compression ratio for the raw data of between 3.2 and 4.4.
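
For readers who have not met dictionary encoding before, here is a minimal single-threaded sketch in Python. It is only the general idea (replace each distinct string with a small integer), not the paper's parallel XMT implementation. The function names and the example.org terms are hypothetical.

```python
def dictionary_encode(triples):
    """Replace each distinct string term with a small integer ID.

    Returns (encoded_triples, term_to_id, id_to_term).
    """
    term_to_id = {}
    id_to_term = []
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in term_to_id:
                # First time we see this term: assign the next free ID.
                term_to_id[term] = len(id_to_term)
                id_to_term.append(term)
            ids.append(term_to_id[term])
        encoded.append(tuple(ids))
    return encoded, term_to_id, id_to_term


def dictionary_decode(encoded, id_to_term):
    """Map integer triples back to their original string terms."""
    return [tuple(id_to_term[i] for i in triple) for triple in encoded]


if __name__ == "__main__":
    # Hypothetical example data, not taken from the paper.
    data = [
        ("<http://example.org/alice>", "<http://example.org/knows>", "<http://example.org/bob>"),
        ("<http://example.org/bob>", "<http://example.org/knows>", "<http://example.org/carol>"),
    ]
    encoded, term_to_id, id_to_term = dictionary_encode(data)
    print(encoded)
    print(dictionary_decode(encoded, id_to_term) == data)
```

Long, frequently repeated URIs shrink to one integer each, which is where the compression comes from.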

Take special note of the statement: “To simplify the discussion, we consider only semantic web data represented in N-Triples.” Actually the system presented processes only subject, edge, object triples. Unlike Neo4j, for instance, it isn’t a generalized graph engine.

Specialized hardware/software is great but let’s be clear about that upfront. You may need more than RDF graphs can offer. Like edges with properties.
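
A small Python sketch of what “edges with properties” means in practice. This is my own illustration, not anything from the paper or from Neo4j's API; the `Edge` class, the `edge_to_triples` helper, and the “source”/“label”/“target” names are all hypothetical. With bare triples there is nowhere to hang a property on an edge, so one workaround is to introduce an extra node that stands for the edge itself.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A property-graph style edge: a labeled link that carries its own properties."""
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


def edge_to_triples(edge, edge_id):
    """Express a property-carrying edge using only triples.

    The edge becomes a node (`edge_id`), and each part of the edge,
    including every property, becomes a triple about that node.
    """
    triples = [
        (edge_id, "source", edge.source),
        (edge_id, "label", edge.label),
        (edge_id, "target", edge.target),
    ]
    for key, value in sorted(edge.properties.items()):
        triples.append((edge_id, key, str(value)))
    return triples


if __name__ == "__main__":
    knows = Edge("alice", "knows", "bob", {"since": 2009})
    for triple in edge_to_triples(knows, "e1"):
        print(triple)
```

One edge with one property turns into four triples, which is the sort of overhead worth knowing about before committing to a triples-only engine.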

Other specializations include a “closure” process with several simplifications that enable a single pass through the RDFS rule set, and querying that doesn’t allow a variable in the predicate position.
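
To show what a variable in the predicate position buys you, here is a naive triple-pattern matcher in Python. It only illustrates the shape of such a query; it is not the paper's query engine, and the `match` function and the sample data are hypothetical. `None` plays the role of a variable.

```python
# Hypothetical sample data for illustration only.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "emailed", "bob"),
    ("bob", "knows", "carol"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


if __name__ == "__main__":
    # Fixed predicate: who does alice know?
    print(match(TRIPLES, subject="alice", predicate="knows"))
    # Variable predicate: how are alice and bob related, in any way?
    print(match(TRIPLES, subject="alice", obj="bob"))
```

The second query, “how are these two things related?”, is the one that needs a variable predicate, and it is a fairly natural question to ask of a graph.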

Granted, this results in a hardware/software combination that can claim “interactivity” on large data sets, but what is the cost of making that a requirement?

Take the best-known “connect the dots” problem of this century, 9/11. Analysts did not need “interactivity” measured in nanoseconds with large data sets. Batch processing that lasted for a week or more would have been more than sufficient. Most of the information that was known was “known” by various parties for months.

More than that, the amount of relevant data was quite small when compared to the “Semantic Web.” There were known suspects (as there are now), with known associates, with known travel patterns, so eliminating all the business/frequent flyers from travel data is a one-time filter, plus any females over 40 traveling on US passports (grandmothers). Similar criteria can reduce information clutter, allowing analysts to focus on important data, as opposed to paging through “hits” in a simulation of useful activity.

I would put batch processing of graphs of relevant information against interactive churning of big data in a restricted graph model any day. How about you?

Cray Parlays Supercomputing Technology Into Big Data Appliance

Friday, March 2nd, 2012

Cray Parlays Supercomputing Technology Into Big Data Appliance by Michael Feldman.

From the post:

For the first time in its history, Cray has built something other than a supercomputer. On Wednesday, the company’s newly hatched YarcData division launched “uRiKA,” a hardware-software solution aimed at real-time knowledge discovery with terascale-sized data sets. The system is designed to serve businesses and government agencies that need to do high-end analytics in areas as diverse as social networking, financial management, healthcare, supply chain management, and national security.

As befits Cray’s MO, their target market for uRiKA, (pronounced Eureka) is slanted toward the cutting edge. It uses a graph-based data approach to do interactive analytics with large, complex, and often dynamic data sets. “We are not trying to be everything for everybody,” says YarcData general manager Arvind Parthasarathi. (emphasis added) (YarcData is a new division at Cray. Just a little more name confusion for everyone.)

Read the article for the hardware/performance stats but consider the following on graphs:

More to the point, uRiKA is designed to analyze graphs rather than simple tabular databases. A graph, one of the fundamental data abstractions in computer science, is basically a structure whose objects are linked together by some relationship. It is especially suited to structures like website links, social networks, and genetic maps — essentially any data set where the relationships between the objects are as important as the objects themselves.

This type of application exists further up the analytics food chain than most business intelligence or data mining applications. In general, a lot of these more traditional applications involve searching for particular items or deriving simple relationships. The YarcData technology is focused on relationship discovery. And since it uses graph structures, the system can support graph-based reasoning and deductions to uncover new relationships.

A typical example is pattern-based queries — does x resemble y? This might not lead to a definitive answer, but will provide a range of possibilities, which can then be further refined. So, for example, one of YarcData’s early customers is a government agency that is interested in finding “persons of interest.” They maintain profiles of terrorists, criminals or other ne’er-do-wells, and are using uRiKA to search for patterns of specific behaviors and activities. A credit card company could use the same basic algorithms to search for fraudulent transactions.

YarcData uses the term “relationship analytics” to describe this approach. While that might sound a bit Oprah-ish, it certainly emphasizes the importance of extracting knowledge from how the objects are connected rather than just their content. This is not to be confused with relational databases, which are organized in tabular form and use simpler forms of querying.


After data is ingested, it needs to be converted to an internal format called RDF, or Resource Description Framework (in case you were wondering, uRiKA stands for Universal RDF Integration Knowledge Appliance), an industry standard graph format for representing information in the Web. According to Mufti, they are providing tools for RDF data conversion and are also laying the groundwork for a standards-based software that allows for third-party conversion tools.

Industry standard is a common theme here. uRiKA’s software internals include SUSE Linux, Java, Apache, WSO2, Google Gadgets, and Relfinder. That stack of interfaces allows users to write or port analytics applications to the platform without having to come up with a uRiKA-specific implementation. So Java, J2EE, SPARQL, and Gadget apps are all fair game. YarcData thinks this will be key to encouraging third-party developers to build applications on top of the system, since it doesn’t require them to use a whole new programming language or API.
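
The article gives no detail on the RDF conversion step, so here is a rough Python sketch of what turning one tabular record into N-Triples lines could look like. This is a guess at the general shape of such a conversion, not YarcData's tooling. The base URI, the function names, and the column names are hypothetical, and the sketch assumes key values and column names are already safe to embed in a URI.

```python
def escape_literal(text):
    """Escape the characters that need escaping in an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row, key_column, base="http://example.org/"):
    """Turn one table row (a dict) into N-Triples lines, one per non-key column.

    Assumption: the key value and the column names need no URI encoding.
    """
    subject = f"<{base}record/{row[key_column]}>"
    lines = []
    for column, value in row.items():
        if column == key_column:
            continue  # the key is already used to build the subject URI
        predicate = f"<{base}column/{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    record = {"id": "42", "name": 'Ann "A" Lee', "city": "Columbus"}
    for line in row_to_ntriples(record, key_column="id"):
        print(line)
```

Every cell becomes its own triple, so a wide table multiplies into many statements, which is part of why the choice of RDF deserves an explanation.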

At least as of today, CrayDoc has no documentation on conversion to the “…industry standard graph format…” RDF or the details of its graph operations.

Parthasarathi talks about shoehorning data into relational databases. I wonder why uRiKA shoehorns data into RDF?

Perhaps the documentation, when it appears, will explain the choice of RDF as a “data format.” (I know RDF isn’t a format, I am just repeating what the article says.)

I am curious because efficient graph structures are going to be necessary for a variety of problems. Has Cray/YarcData compared graph structures, RDF and others for performance on particular problems? If so, are the tests and test data available?

Before laying out sums in the “low hundreds of thousands of dollars,” I would want to know I wasn’t brute-forcing solutions when less costly and more elegant solutions existed.