Big Data Processing with Apache Spark – Part 1: Introduction by Srini Penchikala.
From the post:
What is Spark
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab, open sourced in 2010, and later became a top-level Apache project.
Spark has several advantages over other big data technologies such as Hadoop MapReduce and Storm.
First, Spark gives us a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch versus real-time streaming data).
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.
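That in-memory speedup comes largely from Spark's ability to keep intermediate data cached in memory between operations. As a rough illustration (a minimal sketch of my own, assuming the Scala spark-shell, which provides `sc`, and a hypothetical log path), caching looks like this:

```scala
// Minimal caching sketch; the HDFS path is a hypothetical placeholder.
val logs = sc.textFile("hdfs:///logs/app.log")
val errors = logs.filter(line => line.contains("ERROR")).cache() // keep the filtered data in memory

errors.count()                                // first action scans the file and populates the cache
errors.filter(_.contains("timeout")).count()  // later actions reuse the in-memory data
```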
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data from within the shell.
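To give a feel for those operators, here is a minimal word-count sketch that can be pasted into the Scala spark-shell; the input path is a placeholder, not something from the article:

```scala
// Word count using a few of Spark's high-level operators (flatMap, map, reduceByKey).
val counts = sc.textFile("input.txt")      // placeholder path
  .flatMap(line => line.split("\\s+"))     // split each line into words
  .map(word => (word, 1))                  // pair each word with a count of 1
  .reduceByKey(_ + _)                      // sum the counts per word

counts.take(10).foreach(println)           // print a small sample of the results
```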
In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities on their own or combine them into a single data pipeline.
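To sketch how those pieces can be combined in one program, here is a small example (my own, not from the article) that builds a data set with the core RDD API and then queries it with Spark SQL, using the SQLContext API from the Spark 1.x releases current at the time; the file name and schema are hypothetical:

```scala
// Combining the core RDD API with Spark SQL in one pipeline (Spark 1.x-era API).
import org.apache.spark.sql.SQLContext

case class Event(user: String, action: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val events = sc.textFile("events.csv")             // hypothetical input: "user,action" lines
  .map(_.split(","))
  .filter(_.length == 2)
  .map(fields => Event(fields(0), fields(1)))
  .toDF()                                          // hand the RDD off to Spark SQL

events.registerTempTable("events")
sqlContext.sql("SELECT action, COUNT(*) FROM events GROUP BY action").show()
```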
In this first installment of the Apache Spark article series, we'll look at what Spark is, how it compares with a typical MapReduce solution, and how it provides a complete suite of tools for big data processing.
If the rest of this series of posts is as comprehensive as this one, this will be a great overview of Apache Spark! Looking forward to additional posts in this series.