Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 11, 2015

termsql

Filed under: Spark,SQL,SQLite,Text Analytics — Patrick Durusau @ 2:06 pm

termsql

From the webpage:

Convert text from a file or from stdin into SQL table and query it instantly. Uses sqlite as backend. The idea is to make SQL into a tool on the command line or in scripts.

Online manual: http://tobimensch.github.io/termsql

So what can it do?

  • convert text/CSV files into sqlite database/table
  • work on stdin data on-the-fly
  • it can be used as swiss army knife kind of tool for extracting information from other processes that send their information to termsql via a pipe on the command line or in scripts
  • termsql can also pipe into another termsql of course
  • you can quickly sort and extract data
  • creates string/integer/float column types automatically
  • gives you the syntax and power of SQL on the command line
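
To make the idea concrete, here is a minimal Python/sqlite3 sketch of what termsql automates: load delimited text from stdin into an in-memory SQLite table, then query it with ordinary SQL. This illustrates the concept only; it is not termsql’s own code, and the table and column names are assumptions.

import sqlite3
import sys

# Read whitespace-delimited rows from stdin (termsql does this parsing for you).
rows = [line.split() for line in sys.stdin if line.strip()]
width = max(len(r) for r in rows)
rows = [r + [None] * (width - len(r)) for r in rows]  # pad ragged rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (%s)" % ", ".join("col%d" % i for i in range(width)))
conn.executemany("INSERT INTO tbl VALUES (%s)" % ", ".join("?" * width), rows)

# Ordinary SQL now works on the text, e.g. sort on the second column.
for row in conn.execute("SELECT * FROM tbl ORDER BY col1 LIMIT 10"):
    print(row)

Run it with something like ps aux | python textsql.py (the script name is hypothetical), which is exactly the flavor of pipeline termsql is built for.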

Sometimes you need the esoteric and sometimes not!

Enjoy!

I first saw this in a tweet by Christophe Lalanne.

March 3, 2015

KillrWeather

Filed under: Akka,Cassandra,Spark,Time Series — Patrick Durusau @ 2:31 pm

KillrWeather

From the post:

KillrWeather is a reference application (which we are constantly improving) showing how to easily leverage and integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations in asynchronous Akka event-driven environments. This application focuses on the use case of time series data.

The site doesn’t give enough emphasis to the importance of time series data. Yes, weather is an easy example of time series data, but consider this admittedly incomplete listing of its other uses:

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

(Time Series)

Mastering KillrWeather will put you on the road to many other uses of time series data.

Enjoy!

I first saw this in a tweet by Chandra Gundlapalli.

February 25, 2015

Download the Hive-on-Spark Beta

Filed under: Hive,Hive-on-Spark,Spark — Patrick Durusau @ 4:00 pm

Download the Hive-on-Spark Beta by Xuefu Zhang.

From the post:

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Many anxious users have inquired about its availability in the last few months. Some users even built Hive-on-Spark from the branch code and tried it in their testing environments, and then provided us valuable feedback. The team is thrilled to see this level of excitement and early adoption, and has been working around the clock to deliver the product at an accelerated pace.

Thanks to this hard work, significant progress has been made in the last six months. (The project is currently incubating in Cloudera Labs.) All major functionality is now in place, including different flavors of joins and integration with Spark, HiveServer2, and YARN, and the team has made initial but important investments in performance optimization, including split generation and grouping, supporting vectorization and cost-based optimization, and more. We are currently focused on running benchmarks, identifying and prototyping optimization areas such as dynamic partition pruning and table caching, and creating a roadmap for further performance enhancements for the near future.

Two months ago, we announced the availability of an Amazon Machine Image (AMI) for a hands-on experience. Today, we even more proudly present you a Hive-on-Spark beta via CDH parcel. You can download that parcel here. (Please note that in this beta release only HDFS, YARN, Apache ZooKeeper, and Hive are supported. Other components, such as Apache Pig, Apache Oozie, and Impala, might not work as expected.) The “Getting Started” guide will help you get your Hive queries up and running on the Spark engine without much trouble.

We welcome your feedback. For assistance, please use user@hive.apache.org or the Cloudera Labs discussion board.

We will update you again when GA is available. Stay tuned!

If you are snowbound this week, this may be what you have been looking for!

I have listed this under both Hive and Spark separately but am confident enough of its success that I created Hive-on-Spark as well.

Enjoy!

February 19, 2015

Introducing DataFrames in Spark for Large Scale Data Science

Filed under: Data Frames,Spark — Patrick Durusau @ 10:12 am

Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin, Michael Armbrust and Davies Liu.

From the post:

Today, we are excited to announce a new DataFrame API designed to make big data processing even easier for a wider audience.

When we first open sourced Spark, we aimed to provide a simple API for distributed data processing in general-purpose programming languages (Java, Python, Scala). Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs). This was an incredibly powerful API: tasks that used to take thousands of lines of code to express could be reduced to dozens.

As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. The new DataFrames API was created with this goal in mind. This API is inspired by data frames in R and Python (Pandas), but designed from the ground-up to support modern big data and data science applications. As an extension to the existing RDD API, DataFrames feature:

  • Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • Support for a wide array of data formats and storage systems
  • State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • Seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)

For new users familiar with data frames in other programming languages, this API should make them feel at home. For existing Spark users, this extended API will make Spark easier to program, and at the same time improve performance through intelligent optimizations and code-generation.

The DataFrame API will ship with Spark 1.3 in early March.
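
A minimal PySpark sketch of the kind of code the DataFrame API enables (the JSON file and column names below are invented for illustration; the calls are the Spark 1.3-era Python API):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-sketch")
sqlContext = SQLContext(sc)

# Load a JSON file into a DataFrame; the schema is inferred automatically.
df = sqlContext.jsonFile("people.json")

# Relational-style transformations instead of hand-rolled RDD code.
adults = df.filter(df.age > 21).select(df.name, df.age)
adults.groupBy("age").count().show()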

BTW, the act of using a dataframe creates a new subject, yes? How are you going to document the semantics of such subjects? I didn’t notice a place to write down that information.

That’s a good question to ask of many of the emerging big/large/ginormous data tools. I have trouble remembering what I meant from notes yesterday and that’s not an uncommon experience. Imagine six months from now. Or when you are at your third client this month and the first one calls for help.

Remember: To many eyes undocumented subjects are opaque.

I first saw this in a tweet by Sebastian Raschka.

February 6, 2015

Big Data Processing with Apache Spark – Part 1: Introduction

Filed under: BigData,Spark — Patrick Durusau @ 2:45 pm

Big Data Processing with Apache Spark – Part 1: Introduction by Srini Penchikala.

From the post:

What is Spark

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project.

Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm.

First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data etc) as well as the source of data (batch v. real-time streaming data).

Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk.

Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators. And you can use it interactively to query data within the shell.

In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.

In this first installment of Apache Spark article series, we’ll look at what Spark is, how it compares with a typical MapReduce solution and how it provides a complete suite of tools for big data processing.
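
For a sense of how compact that API is, here is the canonical PySpark word count, a hedged sketch with an assumed input path:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)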

If the rest of this series of posts is as comprehensive as this one, this will be a great overview of Apache Spark! Looking forward to additional posts in this series.

February 2, 2015

Configuring IPython Notebook Support for PySpark

Filed under: Python,Spark — Patrick Durusau @ 4:50 pm

Configuring IPython Notebook Support for PySpark by John Ramey.

From the post:

Apache Spark is a great way for performing large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python. After a discussion with a coworker, we were curious whether PySpark could run from within an IPython Notebook. It turns out that this is fairly straightforward by setting up an IPython profile.

A quick setup note for a useful configuration of PySpark and IPython Notebook.
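
The approach boils down to an IPython profile whose startup script puts PySpark on the Python path. A hedged sketch of that pattern follows; the profile name, paths, and py4j glob are assumptions to check against your Spark install, and this is the Python 2 style of the time, not John’s exact script.

# ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
# (create the profile first with: ipython profile create pyspark)
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME")
if not spark_home:
    raise ValueError("SPARK_HOME is not set")

# Put PySpark and its bundled py4j on the Python path.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

# Start a SparkContext the same way the pyspark shell does.
execfile(os.path.join(spark_home, "python", "pyspark", "shell.py"))

After that, launching with ipython notebook --profile=pyspark should give you a notebook with sc already defined.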

Good example of it being unnecessary to solve every problem to make a useful contribution.

Enjoy!

January 24, 2015

A first look at Spark

Filed under: R,Spark — Patrick Durusau @ 10:57 am

A first look at Spark by Joseph Rickert.

From the post:

Apache Spark, the open-source, cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training and overall hype is on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month's Strata + Hadoop World conference, for example, will offer three serious Spark training sessions: Apache Spark Advanced Training, SparkCamp and Spark developer certification with additional spark related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.

If you don't know much about Spark but want to learn more, a good place to start is the video of Reza Zadeh's keynote talk at the ACM Data Science Camp held last October at eBay in San Jose that has been recently posted.

After reviewing the high points of Reza Zadeh's presentation, Joseph points out another 4 hours+ of videos on using Spark and R together.

A nice collection for getting started with Spark and seeing how to use a standard tool (R) with an emerging one (Spark).

I first saw this in a tweet by Christophe Lalanne.

January 22, 2015

Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

Filed under: Akka,Cassandra,Kafka,Spark,Streams — Patrick Durusau @ 3:47 pm

Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka by Helena Edelson.

From the post:

On Tuesday, January 13 I gave a webinar on Apache Spark, Spark Streaming and Cassandra. Over 1700 registrants from around the world signed up. This is a follow-up post to that webinar, answering everyone’s questions. In the talk I introduced Spark, Spark Streaming and Cassandra with Kafka and Akka and discussed why these particular technologies are a great fit for lambda architecture due to some key features and strategies they all have in common, and their elegant integration together. We walked through an introduction to implementing each, then showed how to integrate them into one clean streaming data platform for real-time delivery of meaning at high velocity. All this in a highly distributed, asynchronous, parallel, fault-tolerant system.

Video | Slides | Code | Diagram

About The Presenter: Helena Edelson is a committer on several open source projects including the Spark Cassandra Connector, Akka and previously Spring Integration and Spring AMQP. She is a Senior Software Engineer on the Analytics team at DataStax, a Scala and Big Data conference speaker, and has presented at various Scala, Spark and Machine Learning Meetups.

I have long contended that it is possible to have a webinar that has little if any marketing fluff and maximum technical content. Helena’s presentation is an example of that type of webinar.

Very much worth the time to watch.

BTW, the webinar being so full of content, questions were answered as part of this blog post. Technical webinars just don’t get any better organized than this one.

Perhaps technical webinars should be marked with TW and others with CW (for c-suite webinars), to prevent disorientation in the first case and disappointment in the second.

January 20, 2015

Spark Summit East Agenda (New York, March 18-19 2015)

Filed under: Conferences,Spark — Patrick Durusau @ 3:05 pm

Spark Summit East Agenda (New York, March 18-19 2015)

Registration

The plenary and track sessions are on day one. Databricks is offering three training courses on day two.

The track sessions were divided into developer, applications and data science tracks. To assist you in finding your favorite speakers, I have collapsed that listing and sorted it by the first listed speaker’s last name. I certainly hope all of these presentations will be video recorded!

Take good notes and blog about your favorite sessions! Ping me with a pointer to your post. Thanks!

I first saw this in a tweet by Helena Edelson.

Improved Fault-tolerance and Zero Data Loss in Spark Streaming

Filed under: Spark,Streams — Patrick Durusau @ 11:38 am

Improved Fault-tolerance and Zero Data Loss in Spark Streaming by Tathagata Das.

From the post:

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system. Since its beginning, Spark Streaming has included support for recovering from failures of both driver and worker machines. However, for some data sources, input data could get lost while recovering from the failures. In Spark 1.2, we have added preliminary support for write ahead logs (also known as journaling) to Spark Streaming to improve this recovery mechanism and give stronger guarantees of zero data loss for more data sources. In this blog, we are going to elaborate on how this feature works and how developers can enable it to get those guarantees in Spark Streaming applications.

Background

Spark and its RDD abstraction is designed to seamlessly handle failures of any worker nodes in the cluster. Since Spark Streaming is built on Spark, it enjoys the same fault-tolerance for worker nodes. However, the demand of high uptimes of a Spark Streaming application require that the application also has to recover from failures of the driver process, which is the main application process that coordinates all the workers. Making the Spark driver fault-tolerant is tricky because it is an arbitrary user program with arbitrary computation patterns. However, Spark Streaming applications have an inherent structure in the computation — it runs the same Spark computation periodically on every micro-batch of data. This structure allows us to save (aka, checkpoint) the application state periodically to reliable storage and recover the state on driver restarts.

For sources like files, this driver recovery mechanism was sufficient to ensure zero data loss as all the data was reliably stored in a fault-tolerant file system like HDFS or S3. However, for other sources like Kafka and Flume, some of the received data that was buffered in memory but not yet processed could get lost. This is because of how Spark applications operate in a distributed manner. When the driver process fails, all the executors running in a standalone/yarn/mesos cluster are killed as well, along with any data in their memory. In case of Spark Streaming, all the data received from sources like Kafka and Flume are buffered in the memory of the executors until their processing has completed. This buffered data cannot be recovered even if the driver is restarted. To avoid this data loss, we have introduced write ahead logs in Spark Streaming in the Spark 1.2 release.

Solid piece on the principles and technical details you will need for zero data loss in Spark Streaming. With suggestions for changes that may be necessary to support zero data loss at no loss in throughput. The latter being a non-trivial consideration.
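
For orientation, a minimal PySpark sketch of switching the feature on; the configuration key is my reading of the Spark 1.2-era docs and the checkpoint path is an assumption, so verify both against your version:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Assumed Spark 1.2-era setting for the write ahead log described in the post.
conf = (SparkConf()
        .setAppName("wal-sketch")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# The log is written alongside checkpoint data, so this must be fault-tolerant
# storage such as HDFS or S3.
ssc.checkpoint("hdfs:///checkpoints/wal-sketch")

ssc.socketTextStream("localhost", 9999).count().pprint()

ssc.start()
ssc.awaitTermination()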

Curious, I understand that many systems require zero data loss but do you have examples of systems where some data loss is acceptable? To what extent is data loss acceptable? (Given lost baggage rates, is airline baggage one of those?)

January 8, 2015

New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live

Filed under: Cloudera,Spark — Patrick Durusau @ 2:06 pm

New Advanced Analytics and Data Wrangling Tutorials on Cloudera Live by Alex Gutow.

From the post:

When it comes to learning Apache Hadoop and CDH (Cloudera’s open source platform including Hadoop), there is no better place to start than Cloudera Live. With a quick, one-button deployment option, Cloudera Live launches a four-node Cloudera cluster that you can learn and experiment in free for two-weeks. To help plan and extend the capabilities of your cluster, we also offer various partner deployments. Building on the addition of interactive tutorials and Tableau and Zoomdata integration, we have added a new tutorial on Apache Spark and a new Trifacta partner deployment.

One of the most popular tools in the Hadoop ecosystem is Apache Spark. This easy-to-use, general-purpose framework is extensible across multiple use cases – including batch processing, iterative advanced analytics, and real-time stream processing. With support and development from multiple industry vendors and partner tools, Spark has quickly become a standard within Hadoop.

With the new tutorial, “Relationship Strength Analytics Using Spark,” it will walk you through the basics of Spark and how you can utilize the same, unified enterprise data hub to launch into advanced analytics. Using the example of product relationships, it will walk you through how to discover what products are commonly viewed together, how to optimize product campaigns together for better sales, and discover other insights about product relationships to help build advanced recommendations.

There is enough high-grade educational material for data science that I think, with some slicing and dicing, an entire curriculum could be fashioned out of online resources alone.

A number of Cloudera tutorials would find their way into such a listing.

Enjoy!

December 25, 2014

Announcing Spark Packages

Filed under: Spark — Patrick Durusau @ 3:06 pm

Announcing Spark Packages by Xiangrui Meng and Patrick Wendell.

From the post:

Today, we are happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and makes it easy for developers to contribute packages.

Spark Packages will feature integrations with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Thanks to the package authors, the initial listing of packages includes scientific computing libraries, a job execution server, a connector for importing Avro data, tools for launching Spark on Google Compute Engine, and many others. We expect this list to grow substantially in 2015, and to help fuel this growth we’re continuing to invest in extension points to Spark such as the Spark SQL data sources API, the Spark streaming Receiver API, and the Spark ML pipeline API. Package authors who submit a listing retain full rights to their code, including their choice of open-source license.

Please give Spark Packages a try and let us know if you have any questions when working with the site! We expect to extend the site in the coming months while also building mechanisms in Spark to make using packages even easier. We hope Spark Packages lets you find even more great ways to work with Spark.

I hope this site catches on across the growing Spark community. If used, it has the potential to be a real time saver for anyone interested in Spark.

Looking forward to seeing this site grow over 2015.

I first saw this in a tweet by Christophe Lalanne.

December 22, 2014

Spark 1.2.0 released

Filed under: GraphX,Hadoop,Spark — Patrick Durusau @ 7:26 pm

Spark 1.2.0 released

From the post:

We are happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark’s largest release ever, with contributions from 172 developers and more than 1,000 commits!

This release brings operational and performance improvements in Spark core including a new network transport subsystem designed for very large shuffles. Spark SQL introduces an API for external data sources along with Hive 13 support, dynamic partitioning, and the fixed-precision decimal type. MLlib adds a new pipeline-oriented package (spark.ml) for composing multiple algorithms. Spark Streaming adds a Python API and a write ahead log for fault tolerance. Finally, GraphX has graduated from alpha and introduces a stable API.

Visit the release notes to read about the new features, or download the release today.

It looks like Christmas came a bit early this year. 😉

Lots of goodies to try out!

December 19, 2014

New in Cloudera Labs: SparkOnHBase

Filed under: Cloudera,HBase,Spark — Patrick Durusau @ 2:59 pm

New in Cloudera Labs: SparkOnHBase by Ted Malaska.

From the post:

Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.

In this post, we will share the work being done in Cloudera Labs to make integrating Spark and HBase super-easy in the form of the SparkOnHBase project. (As with everything else in Cloudera Labs, SparkOnHBase is not supported and there is no timetable for possible support in the future; it’s for experimentation only.) You’ll learn common patterns of HBase integration with Spark and see Scala and Java examples for each. (It may be helpful to have the SparkOnHBase repository open as you read along.)

Is it too late to amend my wish list to include an eighty-hour week with Spark? 😉

This is an excellent opportunity to follow along with lab quality research on an important technology.

The Cloudera Labs discussion group strikes me as dreadfully underused.

Enjoy!

December 17, 2014

Leveraging UIMA in Spark

Filed under: Spark,Text Mining,UIMA — Patrick Durusau @ 5:01 pm

Leveraging UIMA in Spark by Philip Ogren.

Description:

Much of the Big Data that Spark welders tackle is unstructured text that requires text processing techniques. For example, performing named entity extraction on tweets or sentiment analysis on customer reviews are common activities. The Unstructured Information Management Architecture (UIMA) framework is an Apache project that provides APIs and infrastructure for building complex and robust text analytics systems. A typical system built on UIMA defines a collection of analysis engines (such as e.g. a tokenizer, part-of-speech tagger, named entity recognizer, etc.) which are executed according to arbitrarily complex flow control definitions. The framework makes it possible to have interoperable components in which best-of-breed solutions can be mixed and matched and chained together to create sophisticated text processing pipelines. However, UIMA can seem like a heavy weight solution that has a sprawling API, is cumbersome to configure, and is difficult to execute. Furthermore, UIMA provides its own distributed computing infrastructure and run time processing engines that overlap, in their own way, with Spark functionality. In order for Spark to benefit from UIMA, the latter must be light-weight and nimble and not impose its architecture and tooling onto Spark.

In this talk, I will introduce a project that I started called uimaFIT which is now part of the UIMA project (http://uima.apache.org/uimafit.html). With uimaFIT it is possible to adopt UIMA in a very light-weight way and leverage it for what it does best: text processing. An entire UIMA pipeline can be encapsulated inside a single function call that takes, for example, a string input parameter and returns named entities found in the input string. This allows one to call a Spark RDD transform (e.g. map) that performs named entity recognition (or whatever text processing tasks your UIMA components accomplish) on string values in your RDD. This approach requires little UIMA tooling or configuration and effectively reduces UIMA to a text processing library that can be called rather than requiring full-scale adoption of another platform. I will prepare a companion resource for this talk that will provide a complete, self-contained, working example of how to leverage UIMA using uimaFIT from within Spark.

The necessity of creating light-weight ways to bridge the gaps between applications and frameworks is a signal that every solution is trying to be the complete solution. Since we have different views of what any “complete” solution would look like, wheels are re-invented time and time again. Along with all the parts necessary to use those wheels. Resulting in a tremendous duplication of effort.

A component based approach attempts to do one thing. Doing any one thing well, is challenging enough. (Self-test: How many applications do more than one thing well? Assuming they do one thing well. BTW, for programmers, the test isn’t that other programs fail to do it any better.)

Until more demand results in easy-to-pipeline components, Philip’s uimaFIT is a great way to incorporate text processing from UIMA into Spark.
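
The call pattern Philip describes is easy to sketch independently of UIMA: wrap whatever text processor you have in a plain function and hand it to an RDD transform. The extractor below is a trivial stand-in, not uimaFIT, and the input path is assumed; it only shows the shape of the map call.

import re

from pyspark import SparkContext

def extract_entities(text):
    # Stand-in for a uimaFIT pipeline wrapped in one function call:
    # string in, entities out. Here it just grabs capitalized tokens.
    return re.findall(r"\b[A-Z][a-z]+\b", text)

sc = SparkContext(appName="entity-sketch")

tweets = sc.textFile("hdfs:///data/tweets.txt")  # input path assumed
entity_counts = (tweets.flatMap(extract_entities)
                       .map(lambda entity: (entity, 1))
                       .reduceByKey(lambda a, b: a + b))

print(entity_counts.take(10))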

Enjoy!

December 16, 2014

Apache Spark I & II [Pacific Northwest Scala 2014]

Filed under: BigData,Spark — Patrick Durusau @ 5:49 pm

Apache Spark I: From Scala Collections to Fast Interactive Big Data with Spark by Evan Chan.

Description:

This session introduces you to Spark by starting with something basic: Scala collections and functional data transforms. We then look at how Spark expands the functional collection concept to enable massively distributed, fast computations. The second half of the talk is for those of you who want to know the secrets to make Spark really fly for querying tabular datasets. We will dive into row vs columnar datastores and the facilities that Spark has for enabling interactive data analysis, including Spark SQL and the in-memory columnar cache. Learn why Scala’s functional collections are the best foundation for working with data!

Apache Spark II: Streaming Big Data Analytics with Team Apache, Scala & Akka by Helena Edelson.

Description:

In this talk we will step into Spark over Cassandra with Spark Streaming and Kafka. Then put it in the context of an event-driven Akka application for real-time delivery of meaning at high velocity. We will do this by showing how to easily integrate Apache Spark and Spark Streaming with Apache Cassandra and Apache Kafka using the Spark Cassandra Connector. All within a common use case: working with time-series data, which Cassandra excels at for data locality and speed.

Back to back excellent presentations on Spark!

I need to replace my second monitor (died last week) so I can run the video at full screen with a REPL open!

Enjoy!

December 14, 2014

GearPump

Filed under: Actor-Based,Akka,Hadoop YARN,Samza,Spark,Storm,Tez — Patrick Durusau @ 7:30 pm

GearPump (GitHub)

From the wiki homepage:

GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

What originally caught my attention was this passage on the GitHub page:

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

Think about that for a second.

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

The GitHub page features a word count example and pointers to the wiki with more examples.

What if every topic “knew” the index value of every topic that should merge with it on display to a user?

When added to a topic map it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.

There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

Well, it might make a difference when the CIA want to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.
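
A toy Python sketch of the bookkeeping, purely to make the idea concrete (the field names are mine, not any topic map standard):

# Each stored topic carries the ids of every topic it should merge with,
# discovered ahead of time by broadcasting its merging property values.
topics = {
    "t1": {"names": ["Apache Spark"], "merge_with": ["t7"]},
    "t7": {"names": ["Spark (cluster computing)"], "merge_with": ["t1"]},
}

def merged_view(topic_id, store):
    # Client-side merge: pull the requested topic plus everything it points at.
    seen, names, queue = set(), [], [topic_id]
    while queue:
        tid = queue.pop()
        if tid in seen:
            continue
        seen.add(tid)
        names.extend(store[tid]["names"])
        queue.extend(store[tid]["merge_with"])
    return {"ids": sorted(seen), "names": names}

print(merged_view("t1", topics))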

I first saw this in a tweet from Suneel Marthi.

December 9, 2014

Data Science with Hadoop: Predicting Airline Delays – Part 2

Filed under: Hadoop,Hortonworks,Machine Learning,Python,R,Spark — Patrick Durusau @ 5:25 pm

Using machine learning algorithms, Spark and Scala – Part 2 by Ofer Mendelevitch and Beau Plath.

From the post:

In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. And with Spark on YARN, data workers can simultaneously use Spark for data science workloads alongside other data access engines–all accessing the same shared dataset on the same cluster.


The next installment in this series continues the analysis with the same dataset, but this time with R!

The bar for user introductions to technology is getting higher even as we speak!

Data Science with Apache Hadoop: Predicting Airline Delays (Part 1)

Filed under: Hadoop,Hortonworks,Machine Learning,Python,R,Spark — Patrick Durusau @ 5:06 pm

Using machine learning algorithms, Pig and Python – Part 1 by Ofer Mendelevitch.

From the post:

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.


It is a common misconception that the way data scientists apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or in usage of siloed clusters. Not so: no dramatic change; no dedicated clusters; using existing modeling tools will suffice.

In fact, the big change is in what is known as “feature engineering”—the process by which very large raw data is transformed into a “feature matrix.” Enabled by Apache Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by RAM or compute power of a single node.

Since the output of the feature engineering step (the “feature matrix”) tends to be relatively small in size (typically in the MB or GB scale), a common choice is to run the learning algorithm on a single machine (often with multiple cores and high amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python’s Scikit-learn, or SAS.

In this multi-part blog post and its accompanying IPython Notebook, we will demonstrate an example step-by-step solution to a supervised learning problem. We will show how to solve this problem with various tools and libraries and how they integrate with Hadoop. In part I we focus on Apache PIG, Python, and Scikit-learn, while in subsequent parts, we will explore and examine other alternatives such as R or Spark/ML-Lib.
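
The division of labor the post describes is easy to picture in code: Hadoop does the heavy feature engineering, then the small feature matrix is fit on a single machine with an off-the-shelf library. A hedged sketch of that handoff (the file name, label column, and era-appropriate scikit-learn module layout are assumptions):

import pandas as pd
from sklearn.cross_validation import train_test_split  # scikit-learn of that era
from sklearn.linear_model import LogisticRegression

# The (small) feature matrix produced by the Pig/Spark feature-engineering
# step, exported as CSV.
features = pd.read_csv("flight_features.csv")
X = features.drop("delayed", axis=1)
y = features["delayed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy: %.3f" % model.score(X_test, y_test))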

With the IPython notebook, this becomes a great example of how to provide potential users hands-on experience with a technology.

An example that Solr, for instance, might well want to imitate.

PS: When I was traveling, a simpler way to predict flight delays was to just ping me for my travel plans. 😉 You?

December 5, 2014

Databricks to run two massive online courses on Apache Spark

Filed under: BigData,Spark — Patrick Durusau @ 8:06 pm

Databricks to run two massive online courses on Apache Spark by Ameet Talwalkar and Anthony Joseph.

From the post:

In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines.

Apache Spark offers analysts and engineers a powerful tool for building these pipelines, and learning to build such pipelines will soon be a lot easier. Databricks is excited to be working with professors from University of California Berkeley and University of California Los Angeles to produce two new upcoming Massive Open Online Courses (MOOCs). Both courses will be freely available on the edX MOOC platform in spring 2015. edX Verified Certificates are also available for a fee.

The first course, called Introduction to Big Data with Apache Spark, will teach students about Apache Spark and performing data analysis. Students will learn how to apply data science techniques using parallel programming in Spark to explore big (and small) data. The course will include hands-on programming exercises including Log Mining, Textual Entity Recognition, Collaborative Filtering that teach students how to manipulate data sets using parallel processing with PySpark (part of Apache Spark). The course is also designed to help prepare students for taking the Spark Certified Developer exam. The course is being taught by Anthony Joseph, a professor at UC Berkeley and technical advisor at Databricks, and will start on February 23rd, 2015.

The second course, called Scalable Machine Learning, introduces the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and provides hands-on experience using PySpark. It presents an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. Students will use Spark to implement scalable algorithms for fundamental statistical models while tackling real-world problems from various domains. The course is being taught by Ameet Talwalkar, an assistant professor at UCLA and technical advisor at Databricks, and will start on April 14th, 2015.

2015 is going to be here before you know it! The time to start practicing with Spark in a sandbox or on a local machine is now.

Looking forward to 2015!

November 27, 2014

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Filed under: Graphs,GraphX,Neo4j,Spark — Patrick Durusau @ 8:20 pm

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX by Kenny Bastani.

From the post:

I’ve just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. This docker image is a great addition to Neo4j if you’re looking to do easy PageRank or community detection on your graph data. Additionally, the results of the graph analysis are applied back to Neo4j.

This gives you the ability to optimize your recommendation-based Cypher queries by filtering and sorting on the results of the analysis.

This rocks!

If you were looking for an excuse to investigate Docker or Spark or GraphX or Neo4j, it has arrived!

Enjoy!

November 26, 2014

Spark, D3, data visualization and Super Cow Powers

Filed under: D3,Spark — Patrick Durusau @ 7:56 pm

Spark, D3, data visualization and Super Cow Powers by Mateusz Fedoryszak.

From the post:

Did you know that the amount of milk given by a cow depends on the number of days since its last calving? A plot of this correlation is called a lactation curve. Read on to find out how do we use Apache Spark and D3 to find out how much milk we can expect on a particular day.

(Figure: lactation curve chart)

There are things that, except for a client’s request, I would never have been curious about. 😉

How are you using Spark?

I first saw this in a tweet by Anna Pawlicka.

November 21, 2014

Building an AirPair Dashboard Using Apache Spark and Clojure

Filed under: Clojure,Functional Programming,Spark — Patrick Durusau @ 8:19 pm

Building an AirPair Dashboard Using Apache Spark and Clojure by Sébastien Arnaud.

From the post:

Have you ever wondered how much you should charge per hour for your skills as an expert on AirPair.com? How many opportunities are available on this platform? Which skills are in high demand? My name is Sébastien Arnaud and I am going to introduce you to the basics of how to gather, refine and analyze publicly available data using Apache Spark with Clojure, while attempting to generate basic insights about the AirPair platform.

1.1 Objectives

Here are some of the basic questions we will be trying to answer through this tutorial:

  • What is the lowest and the highest hourly rate?
  • What is the average hourly rate?
  • What is the most sought out skill?
  • What is the skill that pays the most on average?

1.2 Chosen tools

In order to make this journey a lot more interesting, we are going to step out of the usual comfort zone of using a standard iPython notebook using pandas with matplotlib. Instead we are going to explore one of the hottest data technologies of the year: Apache Spark, along with one of the up and coming functional languages of the moment: Clojure.

As it turns out, these two technologies can work relatively well together thanks to Flambo, a library developed and maintained by YieldBot's engineers who have improved upon clj-spark (an earlier attempt from the Climate Corporation). Finally, in order to share the results, we will attempt to build a small dashboard using DucksBoard, which makes pushing data to a public dashboard easy.

In addition to illustrating the use of Apache Spark and Clojure, Sébastien also covers harvesting data from Twitter and processing it into a useful format.

Definitely worth some time over a weekend!

Apache Hive on Apache Spark: The First Demo

Filed under: Cloudera,Hive,Spark — Patrick Durusau @ 3:39 pm

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up on your technologies-to-master list!

November 17, 2014

Apache Spark RefCardz

Filed under: Spark — Patrick Durusau @ 3:43 pm

Apache Spark RefCardz by Ashwini Kuntamukkala.

Counting the cover, four (4) of the eight (8) pages don’t qualify for inclusion in a cheat sheet or refcard. Depending on how hard you want to push that, the count of pages that shouldn’t be in a cheat sheet or refcard could easily be six (6) out of the eight (8).

The reasoning being that cheat sheets or refcards are meant for practitioners or experts who have forgotten a switch on date or bc.

The “extra” information present on this RefCard is useful but you will rapidly outgrow it. Unless you routinely need help installing Apache Spark or working a basic word count problem.

A two (2) page (front/back) cheatsheet for Spark would be more useful.

November 16, 2014

Spark: Parse CSV file and group by column value

Filed under: Linux OS,Spark — Patrick Durusau @ 7:24 pm

Spark: Parse CSV file and group by column value by Mark Needham.

Mark parses a 1GB file that details 4 million crimes from the City of Chicago.

And he does it two ways: Using Unix and Spark.

Results? One way took more than 2 minutes, the other way, less than 10 seconds.

Place your bets with office staff and then visit Mark’s post for the results.
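
For comparison, a rough PySpark sketch of the Spark side of such an exercise; this is not Mark’s code, and the file path and column index are assumptions about the Chicago crime extract:

from pyspark import SparkContext

sc = SparkContext(appName="crimes-by-type")

crimes = sc.textFile("crimes.csv")  # assumed path to the Chicago extract
header = crimes.first()

counts = (crimes.filter(lambda line: line != header)
                .map(lambda line: line.split(","))
                .map(lambda fields: (fields[5], 1))  # assumed: primary crime type column
                .reduceByKey(lambda a, b: a + b)
                .sortBy(lambda pair: pair[1], ascending=False))

for crime_type, count in counts.take(10):
    print(crime_type, count)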

November 13, 2014

Spark for Data Science: A Case Study

Filed under: Linux OS,Spark — Patrick Durusau @ 7:28 pm

Spark for Data Science: A Case Study by Casey Stella.

From the post:

I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command line oriented utilities. With composability comes power and with specialization comes simplicity. Although, sometimes if two utilities are used all the time, sometimes it makes sense for either:

  • A utility that specializes in a very common use-case
  • One utility to provide basic functionality from another utility

For example, one thing that I find myself doing a lot of is searching a directory recursively for files that contain an expression:

find /path/to/root -exec grep -l "search phrase" {} \;

Despite the fact that you can do this, specialized utilities, such as ack have come up to simplify this style of querying. Turns out, there’s also power in not having to consult the man pages all the time. Another example, is the interaction between uniq and sort. uniq presumes sorted data. Of course, you need not sort your data using the Unix utility sort, but often you find yourself with a flow such as this:

sort filename.dat | uniq > uniq.dat

This is so common that a -u flag was added to sort to support this flow, like so:

sort -u filename.dat > uniq.dat

Now, obviously, uniq has utilities beyond simply providing distinct output from a stream, such as providing counts for each distinct occurrence. Even so, it’s nice for the situation where you don’t need the full power of uniq for the minimal functionality of uniq to be a part of sort. These simple motivating examples got me thinking:

  • Are there opportunities for folding another command’s basic functionality into another command as a feature (or flag) as in sort and uniq?
  • Can we answer the above question in a principled, data-driven way?

This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So, I’m going to take you through a little journey doing some simple analysis and illustrate the general steps. We’re going to cover

  1. Data Gathering
  2. Data Engineering
  3. Data Analysis
  4. Presentation of Results and Conclusions

We’ll close with my impressions of using Spark as an analytics platform. Hope you enjoy!

All of that is just the setup for a very cool walk through a data analysis example with Spark.
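
To make Casey’s data-driven question concrete: the analysis amounts to counting which commands sit next to each other in pipelines, which takes only a few lines in Spark. A hedged sketch, not Casey’s pipeline (the history file and the naive parsing are simplifications):

from pyspark import SparkContext

def pipeline_pairs(line):
    # Split a shell history line on '|' and yield adjacent command pairs.
    commands = [seg.strip().split()[0] for seg in line.split("|") if seg.strip()]
    return zip(commands, commands[1:])

sc = SparkContext(appName="pipe-pairs")

history = sc.textFile("bash_history.txt")  # input path assumed
pair_counts = (history.filter(lambda line: "|" in line)
                      .flatMap(pipeline_pairs)
                      .map(lambda pair: (pair, 1))
                      .reduceByKey(lambda a, b: a + b)
                      .sortBy(lambda kv: kv[1], ascending=False))

for (left, right), count in pair_counts.take(10):
    print(left, "|", right, count)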

Enjoy!

November 6, 2014

Spark officially sets a new record in large-scale sorting

Filed under: Hadoop,Sorting,Spark — Patrick Durusau @ 7:16 pm

Spark officially sets a new record in large-scale sorting by Reynold Xin.

From the post:

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record.

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

If you are not already familiar with Spark, see the project homepage and/or the extensive documentation page. (Be careful, you can easily lose yourself in the Spark documentation.)

November 3, 2014

Using Apache Spark and Neo4j for Big Data Graph Analytics

Filed under: BigData,Graphs,Hadoop,HDFS,Spark — Patrick Durusau @ 8:29 pm

Using Apache Spark and Neo4j for Big Data Graph Analytics by Kenny Bastani.

From the post:


Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Graph databases

I’ve been working with graph database technologies for the last few years and I have yet to become jaded by its powerful ability to combine both the transformation of data with analysis. Graph databases like Neo4j are solving problems that relational databases cannot.

Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.

But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?

Mazerunner for Neo4j

Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

Mazerunner is an alpha release with PageRank as its only algorithm.

It has a great deal of potential so worth your time to investigate further.

October 27, 2014

Apache Flink (formerly Stratosphere) Competitor to Spark

Filed under: Apache Flink,Spark — Patrick Durusau @ 3:28 pm

From the Apache Flink 0.6 release page:

What is Flink?

Apache Flink is a general-purpose data processing engine for clusters. It runs on YARN clusters on top of data stored in Hadoop, as well as stand-alone. Flink currently has programming APIs in Java and Scala. Jobs are executed via Flink's own runtime engine. Flink features:

Robust in-memory and out-of-core processing: once read, data stays in memory as much as possible, and is gracefully de-staged to disk in the presence of memory pressure from limited memory or other applications. The runtime is designed to perform very well both in setups with abundant memory and in setups where memory is scarce.

POJO-based APIs: when programming, you do not have to pack your data into key-value pairs or some other framework-specific data model. Rather, you can use arbitrary Java and Scala types to model your data.

Efficient iterative processing: Flink contains explicit "iterate" operators that enable very efficient loops over data sets, e.g., for machine learning and graph applications.

A modular system stack: Flink is not a direct implementation of its APIs but a layered system. All programming APIs are translated to an intermediate program representation that is compiled and optimized via a cost-based optimizer. Lower-level layers of Flink also expose programming APIs for extending the system.

Data pipelining/streaming: Flink's runtime is designed as a pipelined data processing engine rather than a batch processing engine. Operators do not wait for their predecessors to finish in order to start processing data. This results in very efficient handling of large data sets.

The latest version is Apache Flink 0.6.1

See more information at the incubator homepage. Or consult the Apache Flink mailing lists.

The Quickstart is… wait for it… word count on Hamlet. Nothing against the Bard, but you do know that everyone dies at the end. Yes? Seems like a depressing example.

What would you suggest as an example application for this type of software?

I first saw this on Danny Bickson’s blog as Apache flink.

