Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 11, 2014

Spark Breaks Previous Large-Scale Sort Record

Filed under: BigData,Hadoop,Spark — Patrick Durusau @ 10:28 am

Spark Breaks Previous Large-Scale Sort Record by Reynold Xin.

From the post:

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.

To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

Additionally, while no official petabyte (PB) sort competition exists, we pushed Spark further to also sort 1 PB of data (10 trillion records) on 190 machines in under 4 hours. This PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines). To the best of our knowledge, this is the first petabyte-scale sort ever done in a public cloud.

Bottom line: Spark sorted 100 TB of data in 23 minutes on 206 machines, beating the previous record of 72 minutes set on 2,100 machines.

Read Reynold’s post and then get thee to Apache Spark!

I first saw this in a tweet by paco nathan.

October 1, 2014

Integrating Kafka and Spark Streaming: Code Examples and State of the Game

Filed under: Avro,Kafka,Spark — Patrick Durusau @ 7:55 pm

Integrating Kafka and Spark Streaming: Code Examples and State of the Game by Michael G. Noll.

From the post:

Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization.

In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.

If mid-week is when you like to brush up on emerging technologies, Michael’s post is a good place to start.

The post is well organized and has enough notes, asides and references to enable you to duplicate the example and to expand your understanding of Kafka and Spark Streaming.
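Michael's example code is Scala, with Avro and Twitter Bijection handling serialization. For orientation only, here is a minimal PySpark Streaming sketch of the read side. It assumes a Spark release that ships the Python Kafka receiver (pyspark.streaming.kafka), a local ZooKeeper at localhost:2181, and a topic named "events"; none of those details come from Michael's post.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # shipped with later Spark releases

sc = SparkContext("local[2]", "kafka-streaming-sketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Hypothetical ZooKeeper quorum, consumer group, and topic map {topic: partitions}.
stream = KafkaUtils.createStream(ssc, "localhost:2181", "demo-group", {"events": 1})

# Each record arrives as a (key, value) pair; count the values in every batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()

Writing back to Kafka, as Michael's application does, has no built-in helper here; his Scala code with Bijection is the place to look for that half.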

September 30, 2014

Open Sourcing ml-ease

Filed under: Hadoop,Machine Learning,Spark — Patrick Durusau @ 6:25 pm

Open Sourcing ml-ease by Deepak Agarwal.

From the post:

LinkedIn data science and engineering is happy to release the first version of ml-ease, an open-source large scale machine learning library. ml-ease supports model fitting/training on a single machine, a Hadoop cluster and a Spark cluster with emphasis on scalability, speed, and ease-of-use. ml-ease is a useful tool for developers working on big data machine learning applications, and we’re looking forward to feedback from the open-source community. ml-ease currently supports ADMM logistic regression for binary response prediction with L1 and L2 regularization on Hadoop clusters.

See Deepak’s post for more details and news of future machine learning algorithms to be released!

September 14, 2014

Pig is Flying: Apache Pig on Apache Spark (aka “Spork”)

Filed under: Pig,Spark — Patrick Durusau @ 4:19 pm

Pig is Flying: Apache Pig on Apache Spark by Mayur Rustagi.

From the post:

Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.

We focus our efforts on three simple initiatives:

  • Make data processing more powerful
  • Make data processing more simple
  • Make data processing 100x faster than before

As a data mashing platform, the first key initiative is to combine the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark.

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for non-symmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.

For the uninitiated, Spark is open source Big Data infrastructure that enables distributed fault-tolerant in-memory computation. As the kernel for the distributed computation, it empowers developers to write testable, readable, and powerful Big Data applications in a number of languages including Python, Java, and Scala.

Introduction to and how to get started using Spork (Pig-on-Spark).

I know, more proof that Phil Karlton was correct in saying:

There are only two hard things in Computer Science: cache invalidation and naming things.

😉

August 13, 2014

TF-IDF using flambo

Filed under: Clojure,DSL,Spark,TF-IDF — Patrick Durusau @ 6:48 pm

TF-IDF using flambo by Muslim Baig.

From the post:

flambo is a Clojure DSL for Spark created by the data team at Yieldbot. It allows you to create and manipulate Spark data structures using idiomatic Clojure. The following tutorial demonstrates typical flambo API usage and facilities by implementing the classic tf-idf algorithm.

The complete runnable file of the code presented in this tutorial is located under the flambo.example.tfidf namespace, in the flambo/test/flambo/example directory. We recommend you download flambo and follow along in your REPL.

Working through the Clojure code you will get a better understanding of the TF-IDF algorithm.

I don’t know if it was intentional, but the division of the data into “documents” illustrates one of the fundamental questions for most indexing techniques:

What do you mean by document?

It is a non-trivial question and one that has a major impact on the results of the algorithm.

If I get to choose what is considered a “document,” then I can weight the results while using the same algorithm as everyone else.

Think about it. My “documents” may have the term “example” in each one, as opposed to “example” appearing three times in a single document. See the last section in the Wikipedia article tf-idf for the impact of such splitting.

Other algorithms are subject to similar manipulation. It isn’t ever enough to know the algorithms applied to data, you need to see the data itself.
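To see the point in miniature, here is a plain Python sketch (standard tf-idf with a log inverse document frequency, not flambo's implementation) in which the same words score very differently depending on where the document boundaries fall:

import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Raw term frequency times log(N / document frequency)."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

text = "example of an example within an example".split()

# One document holding all three occurrences, in a three-document corpus.
corpus_a = [text, "unrelated words here".split(), "more filler text".split()]
print(tf_idf("example", text, corpus_a))         # tf=3, df=1 -> 3 * log(3), about 3.30

# The same words split so "example" appears once in each of three documents.
corpus_b = [["example", "of"], ["an", "example"], ["within", "an", "example"]]
print(tf_idf("example", corpus_b[0], corpus_b))  # tf=1, df=3 -> 1 * log(1) = 0.0

Same algorithm, same words; the only thing that changed is what counted as a document.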

July 28, 2014

Oryx 2:…

Filed under: Machine Learning,Spark — Patrick Durusau @ 6:56 pm

Oryx 2: Lambda architecture on Spark for real-time large scale machine learning

From the overview:

This is a redesign of the Oryx project as “Oryx 2.0”. The primary design goals are:

1. A more reusable platform for lambda-architecture-style designs, with batch, speed and serving layers

2. Make each layer usable independently

3. Fuller support for common machine learning needs

  • Test/train set split and evaluation
  • Parallel model build
  • Hyper-parameter selection

4. Use newer technologies like Spark and Streaming in order to simplify:

  • Remove separate in-core implementations for scale-down
  • Remove custom data transport implementation in favor of message queues like Apache Kafka
  • Use a ‘real’ streaming framework instead of reimplementing a simple one
  • Remove complex MapReduce-based implementations in favor of Apache Spark-based implementations

5. Support more input (i.e. not just CSV)

Initial import was three days ago if you are interested in being in on the beginning!

July 1, 2014

Flambo

Filed under: Clojure,Spark — Patrick Durusau @ 7:19 pm

Flambo

From the webpage:

Flambo is a Clojure DSL for Spark. It allows you to create and manipulate Spark data structures using idiomatic Clojure.

The README is the recommended place to get started.

Cool!

May 30, 2014

Apache™ Spark™ v1.0

Filed under: Hadoop,Spark,SQL — Patrick Durusau @ 1:30 pm

Apache™ Spark™ v1.0

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today the availability of Apache Spark v1.0, the super-fast, Open Source large-scale data processing and advanced analytics engine.

Apache Spark has been dubbed a “Hadoop Swiss Army knife” for its remarkable speed and ease of use, allowing developers to quickly write applications in Java, Scala, or Python, using its built-in set of over 80 high-level operators. With Spark, programs can run up to 100x faster than Apache Hadoop MapReduce in memory.

“1.0 is a huge milestone for the fast-growing Spark community. Every contributor and user who’s helped bring Spark to this point should feel proud of this release,” said Matei Zaharia, Vice President of Apache Spark.

Apache Spark is well-suited for machine learning, interactive queries, and stream processing. It is 100% compatible with Hadoop’s Distributed File System (HDFS), HBase, Cassandra, as well as any Hadoop storage system, making existing data immediately usable in Spark. In addition, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box.

New in v1.0, Apache Spark offers strong API stability guarantees (backward-compatibility throughout the 1.X series), a new Spark SQL component for accessing structured data, as well as richer integration with other Apache projects (Hadoop YARN, Hive, and Mesos).

Spark Homepage.

A more technical note on the release from the project:

Spark 1.0.0 is a major release marking the start of the 1.X line. This release brings both a variety of new features and strong API compatibility guarantees throughout the 1.X line. Spark 1.0 adds a new major component, Spark SQL, for loading and manipulating structured data in Spark. It includes major extensions to all of Spark’s existing standard libraries (ML, Streaming, and GraphX) while also enhancing language support in Java and Python. Finally, Spark 1.0 brings operational improvements including full support for the Hadoop/YARN security model and a unified submission process for all supported cluster managers.

You can download Spark 1.0.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.
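If you want a quick feel for the new Spark SQL component from Python, a minimal sketch looks something like the following. The method names shifted across the early 1.x releases, so treat the exact calls as assumptions rather than the documented 1.0.0 API.

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "spark-sql-sketch")
sqlContext = SQLContext(sc)

# Build a schema-carrying RDD from plain Python objects.
people = sc.parallelize([Row(name="Ada", age=36), Row(name="Grace", age=45)])
table = sqlContext.inferSchema(people)   # early-1.x call; later releases use createDataFrame()
table.registerTempTable("people")        # registerAsTable() in 1.0.0 itself

adults = sqlContext.sql("SELECT name FROM people WHERE age > 40")
print(adults.collect())                  # [Row(name=u'Grace')]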

What a nice way to start the weekend!

I first saw this in a tweet by Sean Owen.

May 15, 2014

Distributed LIBLINEAR:

Filed under: Machine Learning,MPI,Spark,Virtual Machines — Patrick Durusau @ 10:23 am

Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments

From the webpage:

MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR. Currently only two solvers are supported:

  • L2-regularized logistic regression (LR)
  • L2-regularized L2-loss linear SVM

NOTICE: This extension can only run on Unix-like systems. (We test it on Ubuntu 13.10.) Python and Matlab interfaces are not supported.

Spark LIBLINEAR is a Spark implementation based on LIBLINEAR and integrated with Hadoop distributed file system. This package is developed using Scala. Currently it supports the same two solvers as MPI LIBLINEAR.

If you are unfamiliar with LIBLINEAR:

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports

  • L2-regularized classifiers
    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
  • L1-regularized classifiers (after version 1.4)
    L2-loss linear SVM and logistic regression (LR)
  • L2-regularized support vector regression (after version 1.9)
    L2-loss linear SVR and L1-loss linear SVR.

Main features of LIBLINEAR include

  • Same data format as LIBSVM, our general-purpose SVM solver, and also similar usage
  • Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
  • Cross validation for model selection
  • Probability estimates (logistic regression only)
  • Weights for unbalanced data
  • MATLAB/Octave, Java, Python, Ruby interfaces
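The distributed packages are driven from the command line, but if you want to see what the two supported solvers do before building a cluster, scikit-learn wraps the same single-machine LIBLINEAR code. A rough sketch (my example, not part of MPI LIBLINEAR or Spark LIBLINEAR):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# L2-regularized logistic regression (LR), the first supported solver.
lr = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# L2-regularized L2-loss linear SVM, the second supported solver.
svm = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0).fit(X, y)

print(lr.score(X, y), svm.score(X, y))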

You will also find instructions for creating distributed environments using VirtualBox for both MPI LIBLINEAR and Spark LIBLINEAR. I am going to post on that separately to draw attention to it.

The phrase “standalone computer” is rapidly becoming a misnomer. Forward-looking algorithm designers and power users will begin gaining experience with the new distributed “normal” at every opportunity.

I first saw this in a tweet by Reynold Xin.

May 11, 2014

…Technology-Assisted Review in Electronic Discovery…

Filed under: Machine Learning,Spark — Patrick Durusau @ 7:16 pm

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery by Gordon V. Cormack & Maura R. Grossman.

Abstract:

Using a novel evaluation toolkit that simulates a human reviewer in the loop, we compare the effectiveness of three machine-learning protocols for technology-assisted review as used in document review for discovery in legal proceedings. Our comparison addresses a central question in the deployment of technology-assisted review: Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? On eight review tasks — four derived from the TREC 2009 Legal Track and four derived from actual legal matters — recall was measured as a function of human review effort. The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P<0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents. Among passive-learning methods, significantly less human review effort (P<0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of "stabilization" -- determining when training is adequate, and therefore may stop.

New acronym for me: TAR (technology-assisted review).

If you are interested in legal discovery, take special note that the authors have released a TAR evaluation toolkit.

This article and its references will repay a close reading several times over.
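The protocols being compared are easy to mock up. Below is a toy continuous-active-learning loop on synthetic data (my sketch, not the authors' TAR evaluation toolkit): seed with documents a keyword search might have surfaced, then repeatedly train a classifier, send the documents it ranks most likely responsive to the reviewer, and fold the judgments back in.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a review collection: 1 = responsive, 0 = not responsive.
X, truth = make_classification(n_samples=2000, n_features=30, weights=[0.9],
                               random_state=0)

# Non-random seed: pretend a keyword search surfaced 10 responsive and
# 10 non-responsive documents for initial review.
rng = np.random.RandomState(0)
seed = list(rng.choice(np.where(truth == 1)[0], 10, replace=False)) + \
       list(rng.choice(np.where(truth == 0)[0], 10, replace=False))
labels = {int(i): int(truth[i]) for i in seed}

reviewed, batch = set(labels), 50
for _ in range(10):                                   # continuous active learning
    idx = list(labels)
    clf = LogisticRegression(max_iter=1000).fit(X[idx], [labels[i] for i in idx])
    pool = [i for i in range(len(X)) if i not in reviewed]
    scores = clf.predict_proba(X[pool])[:, 1]
    # Relevance feedback: review the highest-ranked unreviewed documents next.
    picks = [pool[j] for j in np.argsort(-scores)[:batch]]
    for i in picks:
        labels[i] = int(truth[i])                     # simulated human judgment
    reviewed.update(picks)

recall = sum(truth[i] for i in reviewed) / truth.sum()
print(f"recall after reviewing {len(reviewed)} documents: {recall:.2f}")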

April 17, 2014

Cloudera Live (beta)

Filed under: Cloudera,Hadoop,Hive,Impala,Oozie,Solr,Spark — Patrick Durusau @ 4:57 pm

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?

Enjoy!

March 27, 2014

Apache Mahout, “…Ya Gotta Hit The Road”

Filed under: H20,Machine Learning,Mahout,MapReduce,Spark — Patrick Durusau @ 3:24 pm

The news in Derrick Harris’ “Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce” reminded me of a line from Tommy, “Just as the gypsy queen must do, ya gotta hit the road.”

From the post:

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xdata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of statistical computations — including deep learning models — on data stored in the Hadoop Distributed File System.

Support for multiple data frameworks is yet another reason to learn Mahout.

February 28, 2014

New Hue Demos:…

Filed under: Hadoop YARN,Hue,Oozie,Spark — Patrick Durusau @ 7:35 pm

New Hue Demos: Spark UI, Job Browser, Oozie Scheduling, and YARN Support by Justin Kestelyn.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, is now a standard across the ecosystem — shipping within multiple software distributions and sandboxes. One of the reasons for its success is an agile developer community behind it that is constantly rolling out new features to its users.

Just as important, the Hue team is diligent in its documentation and demonstration of those new features via video demos. In this post, for your convenience, I bring you the most recent examples (released since December):

  • The new Spark Igniter App
  • Using YARN and Job Browser
  • Job Browser with YARN Security
  • Apache Oozie crontab scheduling

All short but all worthwhile. Nice way to start off your Saturday morning. The kids have cartoons and you have Hue. 😉

February 23, 2014

How Companies are Using Spark

Filed under: BigData,Hadoop,Spark — Patrick Durusau @ 7:50 pm

How Companies are Using Spark, and Where the Edge in Big Data Will Be by Matei Zaharia.

Description:

While the first big data systems made a new class of applications possible, organizations must now compete on the speed and sophistication with which they can draw value from data. Future data processing platforms will need to not just scale cost-effectively, but to allow ever more real-time analysis, and to support both simple queries and today’s most sophisticated analytics algorithms. Through the Spark project at Apache and Berkeley, we’ve brought six years of research to enable real-time and complex analytics within the Hadoop stack.

At time mark 1:53, Matei says when size of storage is no longer an advantage, you can gain an advantage by:

Speed: how quickly can you go from data to decisions?

Sophistication: can you run the best algorithms on the data?

As you might suspect, I strongly disagree that those are the only two points where you can gain an advantage with Big Data.

How about including:

Data Quality: How do you make data semantics explicit?

Data Management: Can you re-use data by knowing its semantics?

You can run sophisticated algorithms on data and make quick decisions, but if your data is GIGO (garbage in, garbage out), I don’t see the competitive edge.

Nothing against Spark; managing video streams with only 1 second of buffering was quite impressive.

To be fair, Matei does include ClearStoryData as one of his examples and ClearStory says that they merge data based on its semantics. Unfortunately, the website doesn’t mention any details other than there is a “patent pending.”

But in any event, I do think data quality and data management should be explicit items in any big data strategy.

At least so long as you want big data and not big garbage.

February 19, 2014

Spark Graduates Apache Incubator

Filed under: Graphs,GraphX,Hadoop,Spark — Patrick Durusau @ 12:07 pm

Spark Graduates Apache Incubator by Tiffany Trader.

From the post:

As we’ve touched on before, Hadoop was designed as a batch-oriented system, and its real-time capabilities are still emerging. Those eagerly awaiting this next evolution will be pleased to hear about the graduation of Apache Spark from the Apache Incubator. On Sunday, the Apache Spark Project committee unanimously voted to promote the fast data-processing tool out of the Apache Incubator.

Databricks refers to Apache Spark as “a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics.” The computing framework supports Java, Scala, and Python and comes with a set of more than 80 high-level operators baked-in.

Spark runs on top of existing Hadoop clusters and is being pitched as a “more general and powerful alternative to Hadoop’s MapReduce.” Spark promises performance gains up to 100 times faster than Hadoop MapReduce for in-memory datasets, and 10 times faster when running on disk.

BTW, the most recent release, 0.9.0, includes GraphX.

Spark homepage.

February 13, 2014

Forbes on Graphs

Filed under: Cloudera,Dendrite,Graphs,Spark,Titan — Patrick Durusau @ 8:23 pm

Big Data Solutions Through The Combination Of Tools by Ben Lorica.

From the post:

As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data hub, to include Spark(1) and Spark Streaming. Users benefit by gaining automatic access to analytic engines that come with Spark(2). Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.

Another recent example is Dendrite(3) – an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS, into a Graph exploration and analysis tool for business analysts:

Another contender in the graph space!

Interesting that Spark comes up a second time for today.

Having Forbes notice a technology gives it credence, don’t you think?

I first saw this in a tweet by aurelius.

January 28, 2014

SparkR

Filed under: Parallel Programming,R,Spark,Statistics — Patrick Durusau @ 5:36 pm

Large scale data analysis made easier with SparkR by Shivaram Venkataraman.

From the post:

R is a widely used statistical programming language and supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited as the runtime is single threaded and can only process data sets that fit in a single machine.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a light-weight frontend to use Spark from R. SparkR allows users to create and transform RDDs in R and interactively run jobs from the R shell on a Spark cluster. You can try out SparkR today by installing it from our github repo.

Be mindful of the closing caveat:

Right now, SparkR works well for algorithms like gradient descent that are parallelizable but requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLLib. More examples and details about SparkR can be found at http://amplab-extras.github.io/SparkR-pkg.

Early days for SparkR but it has a lot of promise.

I first saw this in a tweet by Jason Trost.

January 10, 2014

SIMR – Spark on top of Hadoop

Filed under: Hadoop,Spark — Patrick Durusau @ 3:58 pm

SIMR – Spark on top of Hadoop by Danny Bickson.

From the post:

Just learned from my collaborator Aapo Kyrola that the Spark team have now released a plugin which allows running Spark on top of Hadoop, without installing anything and without administrator privileges. This will probably encourage many more companies to try out Spark, which significantly improves on Hadoop performance.

The tools for data are getting easier to use every day.

Which moves the semantic wall a little closer with each improvement.

Efficiently processing TB of data only to confess it isn’t clear what the data may or may not mean, isn’t going to win IT any friends.

November 22, 2013

Spark and Elasticsearch

Filed under: ElasticSearch,Spark — Patrick Durusau @ 4:41 pm

Spark and Elasticsearch by Barnaby Gray.

From the post:

If you work in the Hadoop world and have not yet heard of Spark, drop everything and go check it out. It’s a really powerful, intuitive and fast map/reduce system (and some).

Where it beats Hadoop/Pig/Hive hands down is it’s not a massive stack of quirky DSLs built on top of layers of clunky Java abstractions – it’s a simple, pure Scala functional DSL with all the flexibility and succinctness of Scala. And it’s fast, and properly interactive – query, bam response snappiness – not query, twiddle fingers, wait a bit.. response.

And if you’re into search, you’ll no doubt have heard of Elasticsearch – a distributed restful search engine built upon Lucene.

They’re perfect bedfellows – crunch your raw data and spit it out into a search index ready for serving to your frontend. At the company I work for we’ve built the google-analytics-esque part of our product around this combination.

It’s so fast, it flies – we can process raw event logs at 250,000 events/s without breaking a sweat on a meagre EC2 m1.large instance. (bold emphasis added)

Don’t you just hate it when bloggers hold back? 😉

I’m not endorsing this solution but I do appreciate a post with attitude and useful information.

Enjoy!

November 21, 2013

Putting Spark to Use:…

Filed under: Hadoop,MapReduce,Spark — Patrick Durusau @ 5:43 pm

Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications by Justin Kestelyn.

From the post:

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.

Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.

Fast and Easy Big Data Processing with Spark

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch.

In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. As illustrated in the figure below, this feature enables:

I would not use the following example to promote Spark:

One of Spark’s most useful features is the interactive shell, bringing Spark’s capabilities to the user immediately – no IDE and code compilation required. The shell can be used as the primary tool for exploring data interactively, or as means to test portions of an application you’re developing.

The screenshot below shows a Spark Python shell in which the user loads a file and then counts the number of lines that contain “Holiday”.

(Screenshot: Spark Python shell example)

Isn’t that just:

grep Holiday WarAndPeace.txt | wc -l
15

?

Grep doesn’t require an IDE or compilation either. Of course, grep isn’t reading from an HDFS file.

The “file.filter(lambda line: "Holiday" in line).count()” works but some of us prefer the terseness of Unix.

Unix text tools for HDFS?

August 27, 2013

Analytics and Machine Learning at Scale [Aug. 29-30]

Filed under: BlinkDB,GraphX,Mesos,MLBase,Shark,Spark,Tachyon — Patrick Durusau @ 6:40 pm

AMP Camp Three – Analytics and Machine Learning at Scale

From the webpage:

AMP Camp Three – Analytics and Machine Learning at Scale will be held in Berkeley California, August 29-30, 2013. AMP Camp 3 attendees and online viewers will learn to solve big data problems using components of the Berkeley Data Analytics Stack (BDAS) and cutting edge machine learning algorithms.

Live streaming!

Sessions will cover (among other things): Mesos, Spark, Shark, Spark Streaming, BlinkDB, MLbase, Tachyon and GraphX.

Talk about a jolt before the weekend!

August 18, 2013

Distributed Machine Learning with Spark using MLbase

Filed under: Machine Learning,MLBase,Spark — Patrick Durusau @ 1:01 pm

Apache Spark: Distributed Machine Learning with Spark using MLbase by Ameet Talwalkar and Evan Sparks.

From the description:

In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.

Useful links:

http://mlbase.org

http://spark-project.org/

http://incubator.apache.org/projects/spark.html

Suggestion: When you make a video of a presentation, don’t include members of the audience eating (pizza in this case). It’s distracting.

From: http://mlbase.org

  • MLlib: A distributed low-level ML library written directly against the Spark runtime that can be called from Scala and Java. The current library includes common algorithms for classification, regression, clustering and collaborative filtering, and will be included as part of the Spark v0.8 release.
  • MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib. It will be released in conjunction with MLlib as a separate project.
  • ML Optimizer: This layer aims to simplify ML problems for End Users by automating the task of model selection. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI. This component is under active development.
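For a concrete taste of the MLlib layer, here is a minimal logistic regression run from PySpark. (Python bindings for MLlib landed slightly after the Spark v0.8 release mentioned above, so treat the module path and call as assumptions rather than part of that release.)

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "mllib-sketch")

# Tiny two-class toy set: a label followed by two features.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 1.5]),
    LabeledPoint(1.0, [3.0, 0.0]),
    LabeledPoint(1.0, [2.5, 0.5]),
])

model = LogisticRegressionWithSGD.train(data, iterations=100)
print(model.predict([2.8, 0.2]))   # expect class 1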

The goal of this project, to make machine learning easier for developers and end users, is a laudable one.

And it is the natural progression of a technology from being experimental to common use.

On the other hand, I am uneasy about the weight users will put on results, while not understanding biases or uncertainties that are cooked into the data or algorithms.

I don’t think there is a solution to the bias/uncertainty problem other than to become more knowledgeable about machine learning.

Not that you will win an argument with an end user who keeps pointing to a result as though it were untouched by human biases.

But you may be able to better avoid such traps for yourself and your clients.

May 20, 2013

GraphX: A Resilient Distributed Graph System on Spark

Filed under: Graphs,GraphX,Spark — Patrick Durusau @ 10:23 am

GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica.

Abstract:

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Of particular note is the use of an immutable graph as the core data structure for GraphX.

The authors report that GraphX performs less well than PowerGraph (GraphLab 2.1) but promise performance gains and offsetting gains in productivity.
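GraphX’s API is Scala, but the paper’s core idea, treating a graph as ordinary distributed tables and doing graph computation with data-parallel joins, can be sketched with plain PySpark RDDs. The toy below (my illustration, not the GraphX operators) computes out-degrees with one join-based message-passing step:

from pyspark import SparkContext

sc = SparkContext("local", "graph-as-tables-sketch")

# Vertices and edges stored as plain (key, value) tables.
vertices = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
edges = sc.parallelize([(1, 2), (1, 3), (2, 3)])   # (source, destination) pairs

# One "message passing" step expressed with data-parallel operators:
# every edge sends the message 1 to its source, and messages are summed per vertex.
messages = edges.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)
out_degrees = vertices.leftOuterJoin(messages).mapValues(lambda v: (v[0], v[1] or 0))

print(out_degrees.collect())   # [(1, ('a', 2)), (2, ('b', 1)), (3, ('c', 0))], order may vary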

I didn’t find any additional resources at AMPLab on GraphX but did find:

Spark project homepage, and,

Screencasts on Spark

Both will benefit you when more information emerges on GraphX.

Graph Landscape Survey

Filed under: GraphBuilder,GraphLab,Graphs,Neo4j,Pregel,Spark — Patrick Durusau @ 9:41 am

Improving options for unlocking your graph data by Ben Lorica.

From the post:

The popular open source project GraphLab received a major boost early this week when a new company, comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data, is at the upcoming GraphLab Workshop (on July 1st in SF).
(…)

Ben summarizes graph resources for:

  • Data wrangling: creating graphs
  • Data management and search
  • Graph-parallel frameworks
  • Machine-learning and analytics
  • Visualization

It would be hard to find a better starting place for investigating the buzz about graphs.

I first saw this in An Overview of Graph Processing Frameworks by Danny Bickson.

February 7, 2013

A Quick Guide to Hadoop Map-Reduce Frameworks

Filed under: Hadoop,Hive,MapReduce,Pig,Python,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 10:45 am

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides to MapReduce frameworks:

Thanks Alex!

December 8, 2012

GraphLab vs. Piccolo vs. Spark

Filed under: GraphLab,Graphs,Networks,Piccolo,Spark — Patrick Durusau @ 7:26 pm

GraphLab vs. Piccolo vs. Spark by Danny Bickson.

From the post:

I got an interesting case study from Cui Henggang, a first-year graduate student at the CMU Parallel Data Lab. Cui implemented GMM on GraphLab to compare its performance to Piccolo and Spark. His collaborators on this project were Jinliang Wei and Wei Dai. The algorithm is described in Chris Bishop, Pattern Recognition and Machine Learning, Chapter 9.2, page 438.

Danny reports Cui will be releasing his report and posting his GMM code to the graphical models toolkit (GraphLab).

I will post a pointer when the report appears, here and probably in a new post as well.
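The algorithm in question is the standard EM fit for a Gaussian mixture model (Bishop, section 9.2). A single-machine sketch with scikit-learn, for readers who want to see what is being benchmarked (my example, not the GraphLab, Piccolo, or Spark implementations):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Two well-separated Gaussian clusters in two dimensions.
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
               rng.normal(loc=5.0, scale=1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                 # EM: alternate responsibilities (E) and parameter updates (M)

print(gmm.means_)          # should land near (0, 0) and (5, 5)
print(gmm.predict(X[:5]))  # cluster assignments for the first five points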

November 26, 2012

Shark (Hive on Spark)

Filed under: Shark,Spark — Patrick Durusau @ 4:57 pm

Shark (Hive on Spark)

From the webpage:

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can answer Hive QL queries up to 70 times faster than Hive without modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions.

We released Shark 0.2 on Oct 15, 2012. The new version is much more stable and also features significant performance improvements.

Getting Started

See our documentation on Github to get started. It takes around 5 mins to set up Shark on a single node for a quick spin, and about 20 mins on an Amazon EC2 cluster.

Fast Execution Engine

Shark is built on top of Spark, a data-parallel execution engine that is fast and fault-tolerant. Even if data are on disk, Shark can be noticeably faster than Hive because of the fast execution engine. It avoids the high task launching overhead of Hadoop MapReduce and does not require materializing intermediate data between stages on disk. Thanks to this fast engine, Shark can answer queries in sub-second latency.

They say that imitation is the sincerest form of flattery.

In software, do claims of compatibility with your software mean the same thing?

It isn’t possible to know which database solutions will be around in five years but the rapid emergence of alternative solutions certainly is exciting!

November 14, 2012

Spark: Making Big Data Analytics Interactive and Real-Time (Video Lecture)

Filed under: CS Lectures,Spark,Tweets — Patrick Durusau @ 5:41 am

Spark: Making Big Data Analytics Interactive and Real-Time by Matei Zaharia. Post from Marti Hearst.

From the post:

Spark is the hot next thing for Hadoop / MapReduce, and yesterday Matei Zaharia, a PhD student in UC Berkeley’s AMP Lab, gave us a terrific lecture about how it works and what’s coming next. The key idea is to make analysis of big data interactive and able to respond in real time. Matei also gave a live demo.

(slides here)

Spark: Lightning-Fast Cluster Computing (website).

Another great lecture from Marti’s class on Twitter and Big Data.

October 23, 2012

Deploying a GraphLab/Spark/Mesos cluster on EC2

Filed under: Amazon Web Services AWS,Clustering (servers),GraphLab,Spark — Patrick Durusau @ 10:10 am

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu) who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

This tutorial is a very new beta release. Please contact me if you are brave enough to try it out.

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

September 9, 2012

Seven reasons why I like Spark

Filed under: BigData,Spark — Patrick Durusau @ 12:57 pm

Seven reasons why I like Spark by Ben Lorica.

From the post:

A large portion of this week’s Amp Camp at UC Berkeley is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why:

Hadoop integration: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop Ecosystem. Getting Spark to work with MapR is straightforward.

The Spark interactive Shell: Spark is written in Scala, and has its own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.

The Spark Analytic Suite:


(Figure courtesy of Matei Zaharia)

Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there are a set of EC2 scripts available.

Resilient Distributed Datasets (RDDs):
RDDs are distributed objects that can be cached in memory across a cluster of compute nodes. They are the fundamental data objects used in Spark. The crucial thing is that fault tolerance is built in: RDDs are automatically rebuilt if something goes wrong. If you need to test something out, RDDs can even be used interactively from the Spark interactive shell.

Be sure to follow the link to the AMP workshop (August 21-22, 2012) for videos on the Spark framework.
