The Week in Big Data Research from Datanami.
A new feature from Datanami that highlights academic research on big data.
In last Friday’s post you will find:
MapReduce-Based Data Stream Processing over Large History Data
Abstract:
With the development of Internet of Things applications based on sensor data, how to process high speed data stream over large scale history data brings a new challenge. This paper proposes a new programming model RTMR, which improves the real-time capability of traditional batch processing based MapReduce by preprocessing and caching, along with pipelining and localizing. Furthermore, to adapt the topologies to application characteristics and cluster environments, a model analysis based RTMR cluster constructing method is proposed. The benchmark built on the urban vehicle monitoring system shows RTMR can provide the real-time capability and scalability for data stream processing over large scale data.
Mastiff: A MapReduce-based System for Time-Based Big Data Analytics
Abstract:
Existing MapReduce-based warehousing systems are not specially optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and are required to be stored persistently for a long period of time, 2) applications usually process data in some time period so that typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems including current MapReduce-based solutions do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experiments results with diverse workloads, we will show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.
Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets
Abstract:
Scientific datasets, such as HDF5 and PnetCDF, have been used widely in many scientific applications. These data formats and libraries provide essential support for data analysis in scientific discovery and innovations. In this research, we present an approach to boost data analysis, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into datasets. We discuss how the FASM can improve data analysis performance. It is currently evaluated with the PnetCDF on synthetic and real data, but can also be implemented in other libraries. The FASM can potentially lead to a new dataset design and can have an impact on data analysis.
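To give a feel for why a small amount of stored statistics can speed up analysis, here is a minimal Python sketch. It is not the authors' implementation and does not use PnetCDF: a plain list stands in for the dataset, and the names `BlockStats`, `build_stats`, and `count_above` are hypothetical. The assumption is that the data is split into fixed-size blocks and that a minimum and maximum are recorded for each block.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockStats:
    """Hypothetical per-block metadata: index range plus min and max."""
    start: int      # index of the first element in the block
    end: int        # index one past the last element in the block
    minimum: float
    maximum: float


def build_stats(data: Sequence[float], block_size: int) -> List[BlockStats]:
    """Compute the statistics once, when the data is written."""
    stats = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stats.append(BlockStats(start, start + len(block), min(block), max(block)))
    return stats


def count_above(data: Sequence[float], stats: List[BlockStats],
                threshold: float) -> Tuple[int, int]:
    """Count values greater than threshold.

    Returns (count, blocks_read), where blocks_read is the number of
    blocks whose values actually had to be scanned.
    """
    count = 0
    blocks_read = 0
    for s in stats:
        if s.maximum <= threshold:
            continue                      # no value in this block can match
        if s.minimum > threshold:
            count += s.end - s.start      # every value matches, no scan needed
            continue
        blocks_read += 1                  # mixed block: scan its values
        count += sum(1 for v in data[s.start:s.end] if v > threshold)
    return count, blocks_read
```

Only blocks whose range straddles the threshold are scanned; the others are answered from the metadata alone. Whether that resembles what FASM does internally is a guess on my part, so read the paper for the real design.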
MapReduce Performance Evaluation on a Private HPC Cloud
Abstract:
The convergence of accessible cloud computing resources and big data trends have introduced unprecedented opportunities for scientific computing and discovery. However, HPC cloud users face many challenges when selecting valid HPC configurations. In this paper, we report a set of performance evaluations of data intensive benchmarks on a private HPC cloud to help with the selection of such configurations. More precisely, we study the effect of virtual machines core-count on the performance of 3 benchmarks widely used by the MapReduce community. We notice that depending on the computation to communication ratios of the studied applications, using higher core-counts virtual machines do not always lead to higher performance for data-intensive applications.
I manage to visit Datanami once or twice a day. Usually not for as long as I should. 😉 Do visit; I think you will be pleasantly surprised.
PS: You will be seeing some of these articles in separate posts. Thought the cutting/bleeding edge types would like notice sooner rather than later.