BigDataBench: A Big Data Benchmark Suite from Internet Services, by Lei Wang et al.
Abstract:
As the architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure to benchmark and evaluate these systems rises. However, the complexity, diversity, frequently changing workloads, and rapid evolution of big data systems pose great challenges for big data benchmarking. Considering the broad use of big data systems, and for the sake of fairness, big data benchmarks must cover a diversity of data and workloads, which is a prerequisite for evaluating big data systems and architecture. Most state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence are not suited to the purposes mentioned above.
This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite, BigDataBench, not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks along the dimensions of application scenarios, operations/algorithms, data types, data sources, software stacks, and application types; together they are comprehensive enough to fairly measure and evaluate big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench.
Also, we comprehensively characterize the 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, the Intel Xeon E5645, we have the following observations. First, in comparison with traditional benchmarks, including PARSEC, HPCC, and SPEC CPU, big data applications have very low operation intensity, defined as the ratio of the total number of instructions to the total number of bytes of memory accessed. Second, the volume of the data input has a non-negligible impact on micro-architecture characteristics, which may pose challenges for simulation-based big data architecture research. Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the number of L1 instruction cache (L1I) misses per 1000 instructions (MPKI for short) of the big data applications is higher than in the traditional benchmarks; we also find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.
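For concreteness, here is a minimal Python sketch (not from the paper) showing how the two metrics quoted above are computed from raw hardware-counter totals; the function names and the counter values are illustrative only.

```python
def operation_intensity(total_instructions, total_memory_bytes):
    # Operation intensity: total instructions retired divided by the
    # total number of bytes read from / written to memory.
    return total_instructions / total_memory_bytes

def mpki(miss_count, total_instructions):
    # Misses per kilo (1000) instructions, e.g. for L1I cache misses.
    return miss_count * 1000.0 / total_instructions

# Illustrative counter values only, not measurements from the paper:
print(operation_intensity(2.5e12, 8.0e11))   # instructions per byte
print(mpki(3.1e9, 2.5e12))                    # L1I MPKI
```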
An excellent summary of current big data benchmarks, along with data sets and diverse benchmarks for varying big data inputs.
I emphasize diverse because we all know “big data” covers a wide variety of data. Unfortunately, that hasn’t always been a point of emphasis. This paper corrects that oversight.
From the user manual for BigDataBench 2.1, here are summaries of the data sets and benchmarks:
| No. | Data Set | Data Size |
|-----|----------|-----------|
| 1 | Wikipedia Entries | 4,300,000 English articles |
| 2 | Amazon Movie Reviews | 7,911,684 reviews |
| 3 | Google Web Graph | 875,713 nodes, 5,105,039 edges |
| 4 | Facebook Social Network | 4,039 nodes, 88,234 edges |
| 5 | E-commerce Transaction Data | two tables: one with 4 columns and 38,658 rows, one with 6 columns and 242,735 rows |
| 6 | ProfSearch Person Resumes | 278,956 resumes |
Table 2: The Summary of BigDataBench

| Application Scenarios | Operations & Algorithms | Data Type | Data Source | Software Stack | Application Type |
|-----------------------|-------------------------|-----------|-------------|----------------|------------------|
| Micro Benchmarks | Sort | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
| | Grep | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
| | WordCount | Unstructured | Text | MapReduce, Spark, MPI | Offline Analytics |
| | BFS | Unstructured | Graph | MapReduce, Spark, MPI | Offline Analytics |
| Basic Datastore Operations (“Cloud OLTP”) | Read | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
| | Write | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
| | Scan | Semi-structured | Table | HBase, Cassandra, MongoDB, MySQL | Online Services |
| Relational Query | Select Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
| | Aggregate Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
| | Join Query | Structured | Table | Impala, Shark, MySQL, Hive | Realtime Analytics |
| Search Engine | Nutch Server | Structured | Table | Hadoop | Online Services |
| | PageRank | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
| | Index | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
| Social Network | Olio Server | Structured | Table | MySQL | Online Services |
| | K-means | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
| | Connected Components | Unstructured | Graph | Hadoop, MPI, Spark | Offline Analytics |
| E-commerce | Rubis Server | Structured | Table | MySQL | Online Services |
| | Collaborative Filtering | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
| | Naive Bayes | Unstructured | Text | Hadoop, MPI, Spark | Offline Analytics |
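To make the micro-benchmark rows above concrete, a WordCount over unstructured text in Spark looks roughly like the sketch below. This is not BigDataBench’s own implementation, and the HDFS paths are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

# Placeholder input path; BigDataBench generates its text inputs from
# the Wikipedia Entries data set.
lines = sc.textFile("hdfs:///path/to/input-text")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

counts.saveAsTextFile("hdfs:///path/to/wordcount-output")
```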
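Likewise, the Read/Write/Scan operations in the “Cloud OLTP” group boil down to simple key-value calls against the data store. Here is a sketch against HBase using the third-party happybase client (not part of BigDataBench); the host, table, and column family names are hypothetical.

```python
import happybase

# Hypothetical HBase Thrift server, table, and column family.
connection = happybase.Connection("hbase-host")
table = connection.table("usertable")

# Write: insert one row.
table.put(b"user1", {b"cf:field0": b"value0"})

# Read: fetch one row by key.
row = table.row(b"user1")
print(row)

# Scan: iterate over a range of row keys.
for key, data in table.scan(row_start=b"user0", row_stop=b"user9"):
    print(key, data)
```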
I first saw this in a tweet by Stefano Bertolo.