Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 6, 2014

Spark officially sets a new record in large-scale sorting

Filed under: Hadoop,Sorting,Spark — Patrick Durusau @ 7:16 pm

Spark officially sets a new record in large-scale sorting by Reynold Xin.

From the post:

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record.

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

If you are not already familiar with Spark, see the project homepage and/or the extensive documentation page. (Be careful, you can easily lose yourself in the Spark documentation.)

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress