Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 9, 2014

Data Science with Apache Hadoop: Predicting Airline Delays (Part 1)

Filed under: Hadoop,Hortonworks,Machine Learning,Python,R,Spark — Patrick Durusau @ 5:06 pm

Using machine learning algorithms, Pig and Python – Part 1 by Ofer Mendelevitch.

From the post:

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.


It is a common misconception that the way data scientists apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or in usage of siloed clusters. Not so: no dramatic change; no dedicated clusters; using existing modeling tools will suffice.

In fact, the big change is in what is known as “feature engineering”—the process by which very large raw data is transformed into a “feature matrix.” Enabled by Apache Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by RAM or compute power of a single node.

Since the output of the feature engineering step (the “feature matrix”) tends to be relatively small in size (typically in the MB or GB scale), a common choice is to run the learning algorithm on a single machine (often with multiple cores and high amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python’s Scikit-learn, or SAS.

In this multi-part blog post and its accompanying IPython Notebook, we will demonstrate an example step-by-step solution to a supervised learning problem. We will show how to solve this problem with various tools and libraries and how they integrate with Hadoop. In part 1 we focus on Apache Pig, Python, and Scikit-learn, while in subsequent parts we will explore and examine other alternatives such as R and Spark/MLlib.
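To make the division of labor concrete, here is a minimal sketch of the single-machine half of that workflow: training a delay classifier on a feature matrix that an upstream Pig job has already produced. The file name, column names, and model choice below are assumptions for illustration, not the tutorial's own code.

```python
# Hedged sketch: fit a flight-delay classifier on a feature matrix produced
# upstream (e.g., by a Pig job). "features.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv("features.csv")      # feature matrix: small enough for one machine
y = df.pop("delayed")                 # binary label: 1 = delayed, 0 = on time
X = df.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```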

With the IPython notebook, this becomes a great example of how to provide potential users hands-on experience with a technology.

An example that Solr, for example, might well want to imitate.

PS: When I was traveling, a simpler way to predict flight delays was to just ping me for my travel plans. 😉 You?

December 7, 2014

Lab Report: The Final Grade [Normalizing Corporate Small Data]

Filed under: Cloudera,Data Conversion,Hadoop — Patrick Durusau @ 8:34 pm

Lab Report: The Final Grade by Dr. Geoffrey Malafsky.

From the post:

We have completed our TechLab series with Cloudera. Its objective was to explore the ability of Hadoop in general, and Cloudera’s distribution in particular, to meet the growing need for rapid, secure, adaptive merging and correction of core corporate data. I call this Corporate Small Data which is:

“Structured data that is the fuel of an organization’s main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. This is Small Data relative to the much ballyhooed Big Data of the Terabyte range.”1

Corporate Small Data does not include the predominant Big Data examples which are almost all stochastic use cases. These can succeed even if there is error in the source data and uncertainty in the results since the business objective is getting trends or making general associations. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative, such as for executive decision making, accounting, risk management, regulatory compliance, and security.

Dr. Malafsky gives Cloudera high marks (A-) for use in enterprises and what he describes as “data normalization.” Not in the relational database sense but more in the data cleaning sense.

While testing a Cloudera distribution at your next data cleaning exercise, ask yourself this question: OK, the processing worked great, but how do I avoid having to collect all the information I needed for this project again in the future?

December 3, 2014

Available Now: HDP 2.2

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 8:11 pm

Available Now: HDP 2.2 by Jim Walker.

From the post:

We are very pleased to announce that the Hortonworks Data Platform Version 2.2 (HDP) is now generally available for download. With thousands of enhancements across all elements of the platform spanning data access to security to governance, rolling upgrades and more, HDP 2.2 makes it even easier for our customers to incorporate HDP as a core component of Modern Data Architecture (MDA).

HDP 2.2 represents the very latest innovation from across the Hadoop ecosystem, where literally hundreds of developers have been collaborating with us to evolve each of the individual Apache Software Foundation (ASF) projects from the broader Apache Hadoop ecosystem. These projects have now been brought together into the complete and open Hortonworks Data Platform (HDP) delivering more than 100 new features and closing out thousands of issues across Apache Hadoop and its related projects.

These distinct ASF projects from across the Hadoop ecosystem span every aspect of the data platform and are easily categorized into:

  • Data management: this is the core of the platform, including Apache Hadoop and its subcomponents of HDFS and YARN, which is the architectural center of HDP.
  • Data access: this represents the broad range of options developers have to access and process data stored in HDFS, depending on their application requirements.
  • The supporting enterprise services of governance, operations and security that are fundamental to any enterprise data platform.

How many of the 100 new features will you try by the end of December, 2014? 😉

A sandbox edition is promised by December 9, 2014.

Tis the season to be jolly!

December 2, 2014

Announcing Apache Hadoop 2.6.0

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:32 am

Announcing Apache Hadoop 2.6.0 by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has released Apache Hadoop 2.6.0 !

In particular, we are excited about three major pieces in this release: heterogeneous storage in HDFS with SSD & Memory tiers, support for long-running services in YARN and rolling upgrades—the ability to upgrade your cluster software and restart upgraded nodes without taking the cluster down or losing work in progress. With YARN as its architectural center, Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways.

Many thanks to all of the contributors and committers who collaborated on this version and resolved a total of nearly 900 JIRA issues across four areas:

  • Hadoop Common: 231 JIRAs resolved
  • Hadoop HDFS: 305 JIRAs resolved
  • Hadoop YARN: 290 JIRAs resolved
  • Hadoop MapReduce: 70 JIRAs resolved

Highlights for Apache Hadoop 2.6.0

Here are some details about the most important features. For the complete list of features, improvements and bug fixes, see the sidebar and the release notes.

The post includes a nifty PNG file that lists the major issues with what I think were intended to be links to JIRA issues. Unfortunately the links point to the PNG file. I suspect a missing map directive. I posted a comment on same and hopefully it will be fixed soon.

In the meantime, enjoy the new release of Hadoop! (And thank the many contributors to the project this holiday season!)

November 30, 2014

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D

Filed under: Cloudera,Hadoop — Patrick Durusau @ 1:40 pm

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D by Justin Kestelyn.

From the announcement of Cloudera Labs, a list of existing projects and a call for your suggestions of others:

Apache Kafka is among the “charter members” of this program. Since its origin as proprietary LinkedIn infrastructure just a couple years ago for highly scalable and resilient real-time data transport, it’s now one of the hottest projects associated with Hadoop. To stimulate feedback about Kafka’s role in enterprise data hubs, today we are making a Kafka-Cloudera Labs parcel (unsupported) available for installation.

Other initial Labs projects include:

  • Exhibit
    Exhibit is a library of Apache Hive UDFs that usefully let you treat array fields within a Hive row as if they were “mini-tables” and then execute SQL statements against them for deeper analysis.
  • Hive-on-Spark Integration
    A broad community effort is underway to bring Apache Spark-based data processing to Apache Hive, reducing query latency considerably and allowing IT to further standardize on Spark for data processing.
  • Impyla
    Impyla is a Python (2.6 and 2.7) client for Impala, the open source MPP query engine for Hadoop. It communicates with Impala using the same standard protocol as ODBC/JDBC drivers (see the usage sketch after this list).
  • Oryx
    Oryx, a project jointly spearheaded by Cloudera Engineering and Intel, provides simple, real-time infrastructure for large-scale machine learning/predictive analytics applications.
  • RecordBreaker
    RecordBreaker, a project jointly developed by Hadoop co-founder Mike Cafarella and Cloudera, automatically turns your text-formatted data into structured Avro data–dramatically reducing data prep time.
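Of the projects above, Impyla is the easiest to try from a Python prompt. A minimal usage sketch, assuming an Impala daemon reachable at a hypothetical host on its default port and a hypothetical web_logs table:

```python
# Hedged Impyla sketch; the host name and the "web_logs" table are assumptions.
from impala.dbapi import connect

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()
cur.execute("SELECT referer, COUNT(*) AS hits FROM web_logs GROUP BY referer LIMIT 10")
for referer, hits in cur.fetchall():
    print(referer, hits)
cur.close()
conn.close()
```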

As time goes on, and some of the projects potentially graduate into CDH components (or otherwise remain as Labs projects), more names will join the list. And of course, we’re always interested in hearing your suggestions for new Labs projects.

Do you take the rapid development of the Hadoop ecosystem as a lesson about investment in R&D by companies both large and small?

Is one of your first questions to a startup: What are your plans for investing in open source R&D?

Other R&D labs that I should call out for special mention?

November 25, 2014

Announcing Apache Pig 0.14.0

Filed under: Hadoop,Pig — Patrick Durusau @ 8:23 pm

Announcing Apache Pig 0.14.0 by Daniel Dai.

From the post:

With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.

The Apache community just released Apache Pig 0.14.0, and the main feature is Pig on Tez. In this release, we closed 334 Jira tickets from 35 Pig contributors. Specific credit goes to the virtual team consisting of Cheolsoo Park, Rohini Palaniswamy, Olga Natkovich, Mark Wagner and Alex Bain who were instrumental in getting Pig on Tez working!

This blog gives a brief overview of Pig on Tez and other new features included in the release.

Pig on Tez

Apache Tez is an alternative execution engine focusing on performance. It offers a more flexible interface so Pig can compile into a better execution plan than is possible with MapReduce. The result is consistent performance improvements in both large and small queries.

Since it is the Thanksgiving holiday this week in the United States, this release reminds me to ask: why is turkey the traditional Thanksgiving meal? Everyone likes bacon better. 😉

November 24, 2014

Announcing Apache Hive 0.14

Filed under: Hadoop,Hive — Patrick Durusau @ 3:51 pm

Announcing Apache Hive 0.14 by Gunther Hagleitner.

From the post:

While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the de facto standard. Although many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate.

Last week, the Apache Hive community released Apache Hive 0.14, which includes the results of the first phase in the Stinger.next initiative and takes Hive beyond its read-only roots and extends it with ACID transactions. Thirty developers collaborated on this version and resolved more than 1,015 JIRA issues.

Although there are many new features in Hive 0.14, there are a few we would like to highlight. For the complete list of features, improvements, and bug fixes, see the release notes.

If you have been watching the work on Spark + Hive: Apache Hive on Apache Spark: The First Demo, then you know how important Hive is to the Hadoop ecosystem.

The highlights:

Transactions with ACID semantics (HIVE-5317)

Allows users to modify data using insert, update and delete SQL statements. This provides snapshot isolation and uses locking for writes. Now users can make corrections to fact tables and changes to dimension tables.

Cost Based Optimizer (CBO) (HIVE-5775)

Now the query compiler uses a more sophisticated cost based optimizer that generates query plans based on statistics on data distribution. This works really well with complex joins and joins with multiple large fact tables. The CBO generates bushy plans that execute much faster.

SQL Temporary Tables (HIVE-7090)

Temporary tables exist in scratch space that goes away when the user session disconnects. This allows users and BI tools to store temporary results and further process that data with multiple queries.
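To make the ACID highlight concrete, here is a minimal sketch of the kind of DDL and DML Hive 0.14 expects for transactional tables: bucketed, ORC-backed, and flagged transactional, on a server configured for transactions. The PyHive client, host, and table names are assumptions for illustration; any HiveServer2 client would do.

```python
# Hedged sketch of Hive 0.14 ACID usage; host, port and table are hypothetical.
from pyhive import hive

cur = hive.connect(host="hive-host.example.com", port=10000).cursor()

# ACID tables must be bucketed, stored as ORC, and marked transactional.
cur.execute("""
    CREATE TABLE dim_customer (id INT, name STRING, city STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true')
""")

# With HIVE-5317, corrections to dimension tables become plain SQL statements.
cur.execute("UPDATE dim_customer SET city = 'Atlanta' WHERE id = 42")
cur.execute("DELETE FROM dim_customer WHERE id = 99")
```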

Coming Next in Stinger.next: Sub-Second Queries

After Hive 0.14, we’re planning on working with the community to deliver sub-second queries and SQL:2011 Analytics coverage in Hive. We also plan to work on Hive-Spark integration for machine learning and operational reporting with Hive streaming ingest and transactions.

Hive is an example of how an open source project should be supported.

November 22, 2014

Open-sourcing tools for Hadoop

Filed under: Hadoop,Impala,Machine Learning,Parquet,Scalding — Patrick Durusau @ 4:48 pm

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs

Brushfire

Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.

More open source tools for your Hadoop installation!

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop. 😉

November 11, 2014

Discovering Patterns for Cyber Defense Using Linked Data Analysis [12th Nov., 10am PDT]

Filed under: Cybersecurity,Hadoop,Hortonworks,Linked Data — Patrick Durusau @ 5:22 pm

Discovering Patterns for Cyber Defense Using Linked Data Analysis

Wednesday, Nov. 12th | 10am PDT

I am always suspicious of one-day announcements of webinars. This post appeared on November 11th for a webinar on November 12th.

Only one way to find out, so I registered. Join me and see whether it is a substantive presentation or click-bait.

If enough people attend and then comment here, one way or the other, who knows? It might make a difference.

From the post:

Almost every week, news of a proprietary or customer data breach hits the news wave. While attackers have increased the level of sophistication in their tactics, so too have organizations advanced in their ability to build a robust, data-driven defense.

Apache Hadoop has emerged as the de facto big data platform, which makes it the perfect fit to accumulate cybersecurity data and diagnose the latest attacks.  As Enterprises roll out and grow their Hadoop implementations, they require effective ways for pinpointing and reasoning about correlated events within their data, and assessing their network security posture.

Join Hortonworks and Sqrrl to learn:

  • How Linked Data Analysis enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data
  • Effective ways to correlate events within your data and assess your network security posture
  • New techniques for discovering hidden patterns and detecting anomalies within your data
  • How Hadoop fits into your current data structure forming a secure, Modern Data Architecture

Register now to learn how combining the power of Hadoop and the Hortonworks Data Platform with massive, secure, entity-centric data models in Sqrrl Enterprise allows you to create a data-driven defense.

Bring your red pen. November 12, 2014 at 10am PDT. (That should be 1pm East Coast time.) See you then!

Massively Parallel Clustering: Overview

Filed under: Clustering,Hadoop,MapReduce — Patrick Durusau @ 3:35 pm

Massively Parallel Clustering: Overview by Grigory Yaroslavtsev.

From the post:

Clustering is one of the main vehicles of machine learning and data analysis.
In this post I will describe how to make three very popular sequential clustering algorithms (k-means, single-linkage clustering and correlation clustering) work for big data. The first two algorithms can be used for clustering a collection of feature vectors in \(d\)-dimensional Euclidean space (like the two-dimensional set of points on the picture below, while they also work for high-dimensional data). The last one can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

[Figure: clustering a two-dimensional set of points]

Besides optimizing different objective functions these algorithms also give qualitatively different types of clusterings.
K-means produces a set of exactly k clusters. Single-linkage clustering gives a hierarchical partitioning of the data, which one can zoom into at different levels and get any desired number of clusters.
Finally, in correlation clustering the number of clusters is not known in advance and is chosen by the algorithm itself in order to optimize a certain objective function.

All algorithms described in this post use the model for massively parallel computation that I described before.

I thought you might be interested in parallel clustering algorithms after the post on OSM-France. Don’t skip the model for massively parallel computation post. It and the discussion that follows are rich in resources on parallel clustering. Lots of links.

I take heart from the line:

The last one [Correlation Clustering] can be used for arbitrary objects as long as for any pair of them one can define some measure of similarity.

The words “some measure of similarity” should be taken as a warning that any particular “measure of similarity” should be examined closely and tested against the data so processed. It could be that the “measure of similarity” produces a desired result on a particular data set. You won’t know until you look.
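If you want a feel for the first two algorithms before scaling them out, a single-machine sketch with scikit-learn and SciPy (toy data; the cluster counts are arbitrary choices) looks like this:

```python
# Single-machine sketch of k-means and single-linkage clustering on toy data.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(500, 2)                              # 500 points in 2-D Euclidean space

km_labels = KMeans(n_clusters=5).fit_predict(X)          # exactly k clusters

tree = linkage(X, method="single")                       # hierarchical partitioning
sl_labels = fcluster(tree, t=5, criterion="maxclust")    # cut the tree at 5 clusters

print(np.bincount(km_labels), np.bincount(sl_labels))
```

Correlation clustering has no equally standard single-node library call, which only reinforces the point above: the measure of similarity, and the algorithm that consumes it, deserve scrutiny against your own data.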

November 6, 2014

Spark officially sets a new record in large-scale sorting

Filed under: Hadoop,Sorting,Spark — Patrick Durusau @ 7:16 pm

Spark officially sets a new record in large-scale sorting by Reynold Xin.

From the post:

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!

In case you missed our earlier blog post, using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache. This entry tied with a UCSD research team building high performance systems and we jointly set a new world record.

Winning this benchmark as a general, fault-tolerant system marks an important milestone for the Spark project. It demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes, from GBs to TBs to PBs. In addition, it validates the work that we and others have been contributing to Spark over the past few years.

If you are not already familiar with Spark, see the project homepage and/or the extensive documentation page. (Be careful, you can easily lose yourself in the Spark documentation.)

November 4, 2014

Tessera

Filed under: BigData,Hadoop,R,RHIPE,Tessera — Patrick Durusau @ 7:20 pm

Tessera

From the webpage:

The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

Divide and Recombine (D&R)

Tessera is powered by Divide and Recombine. In D&R, we seek meaningful ways to divide the data into subsets, apply statistical methods to each subset independently, and recombine the results of those computations in a statistically valid way. This enables us to use the existing vast library of methods available in R – no need to write scalable versions.

DATADR

The datadr R package provides a simple interface to D&R operations. The interface is back end agnostic, so that as new distributed computing technology comes along, datadr will be able to harness it. Datadr currently supports in-memory, local disk / multicore, and Hadoop back ends, with experimental support for Apache Spark. Regardless of the back end, coding is done entirely in R and data is represented as R objects.

TRELLISCOPE

Trelliscope is a D&R visualization tool based on Trellis Display that enables scalable, flexible, detailed visualization of data. Trellis Display has repeatedly proven itself as an effective approach to visualizing complex data. Trelliscope, backed by datadr, scales Trellis Display, allowing the analyst to break potentially very large data sets into many subsets, apply a visualization method to each subset, and then interactively sample, sort, and filter the panels of the display on various quantities of interest.

RHIPE

RHIPE is the R and Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs wholly from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R operations directly through RHIPE, although in this case you are programming at a lower level.
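The D&R recipe above is easy to see in miniature, even without Hadoop. A language-agnostic sketch in Python (not the datadr/RHIPE API; the subset count and the per-subset statistic are arbitrary choices):

```python
# Hedged sketch of Divide and Recombine: divide, fit each subset independently,
# then recombine the per-subset results. Not the datadr/RHIPE API.
import numpy as np
from multiprocessing import Pool

def fit_subset(subset):
    # apply a statistical method to one subset independently
    slope, intercept = np.polyfit(subset[:, 0], subset[:, 1], deg=1)
    return slope, intercept

if __name__ == "__main__":
    data = np.random.rand(1000000, 2)            # stand-in for a large dataset
    subsets = np.array_split(data, 100)          # divide

    with Pool() as pool:
        results = pool.map(fit_subset, subsets)  # fit in parallel

    print("recombined (mean) fit:", np.mean(results, axis=0))   # recombine
```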

Quite an impressive package for R and “big data.”

I first saw this in a tweet by Christophe Lalanne.

November 3, 2014

Using Apache Spark and Neo4j for Big Data Graph Analytics

Filed under: BigData,Graphs,Hadoop,HDFS,Spark — Patrick Durusau @ 8:29 pm

Using Apache Spark and Neo4j for Big Data Graph Analytics by Kenny Bastani.

From the post:


Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Graph databases

I’ve been working with graph database technologies for the last few years and I have yet to become jaded by its powerful ability to combine both the transformation of data with analysis. Graph databases like Neo4j are solving problems that relational databases cannot.

Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.

But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you’d be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?

Mazerunner for Neo4j

Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark’s GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

Mazerunner is an alpha release with PageRank as its only algorithm.

It has a great deal of potential so worth your time to investigate further.

October 29, 2014

AsterixDB: Better than Hadoop? Interview with Mike Carey

Filed under: AsterixDB,BigData,Hadoop — Patrick Durusau @ 3:24 pm

AsterixDB: Better than Hadoop? Interview with Mike Carey by Roberto V. Zicari.

The first two questions should be enough incentive to read the full interview and get your blood pumping in the middle of the week:

Q1. Why build a new Big Data Management System?

Mike Carey: When we started this project in 2009, we were looking at a “split universe” – there were your traditional parallel data warehouses, based on expensive proprietary relational DBMSs, and then there was the emerging Hadoop platform, which was free but low-function in comparison and wasn’t based on the many lessons known to the database community about how to build platforms to efficiently query large volumes of data. We wanted to bridge those worlds, and handle “modern data” while we were at it, by taking into account the key lessons from both sides.

To distinguish AsterixDB from current Big Data analytics platforms – which query but don’t store or manage Big Data – we like to classify AsterixDB as being a “Big Data Management System” (BDMS, with an emphasis on the “M”).
We felt that the Big Data world, once the initial Hadoop furor started to fade a little, would benefit from having a platform that could offer things like:

  • a flexible data model that could handle data scenarios ranging from “schema first” to “schema never”;
  • a full query language with at least the expressive power of SQL;
  • support for data storage, data management, and automatic indexing;
  • support for a wide range of query sizes, with query processing cost being proportional to the given query;
  • support for continuous data ingestion, hence the accumulation of Big Data;
  • the ability to scale up gracefully to manage and query very large volumes of data using commodity clusters; and,
  • built-in support for today’s common “Big Data data types”, such as textual, temporal, and simple spatial data.

So that’s what we set out to do.

Q2. What was wrong with the current Open Source Big Data Stack?

Mike Carey: First, we should mention that some reviewers back in 2009 thought we were crazy or stupid (or both) to not just be jumping on the Hadoop bandwagon – but we felt it was important, as academic researchers, to look beyond Hadoop and be asking the question “okay, but after Hadoop, then what?”

We recognized that MapReduce was great for enabling developers to write massively parallel jobs against large volumes of data without having to “think parallel” – just focusing on one piece of data (map) or one key-sharing group of data (reduce) at a time. As a platform for “parallel programming for dummies”, it was (and still is) very enabling! It also made sense, for expedience, that people were starting to offer declarative languages like Pig and Hive, compiling them down into Hadoop MapReduce jobs to improve programmer productivity – raising the level much like what the database community did in moving to the relational model and query languages like SQL in the 70’s and 80’s.

One thing that we felt was wrong for sure in 2009 was that higher-level languages were being compiled into an assembly language with just two instructions, map and reduce. We knew from Ted Codd and relational history that more instructions – like the relational algebra’s operators – were important – and recognized that the data sorting that Hadoop always does between map and reduce wasn’t always needed.

Trying to simulate everything with just map and reduce on Hadoop made “get something better working fast” sense, but not longer-term technical sense. As for HDFS, what seemed “wrong” about it under Pig and Hive was its being based on giant byte stream files and not on “data objects”, which basically meant file scans for all queries and lack of indexing. We decided to ask “okay, suppose we’d known that Big Data analysts were going to mostly want higher-level languages – what would a Big Data platform look like if it were built ‘on purpose’ for such use, instead of having incrementally evolved from HDFS and Hadoop?”

Again, our idea was to try and bring together the best ideas from both the database world and the distributed systems world. (I guess you could say that we wanted to build a Big Data Reese’s Cup… 🙂)

I knew words would fail me if I tried to describe the AsterixDB logo so I simply reproduce the logo:

[AsterixDB logo]

Read the interview in full and then grab a copy of AsterixDB.

The latest beta release is 0.8.6. The software is released under the Apache License 2.0.

October 27, 2014

On the Computational Complexity of MapReduce

Filed under: Algorithms,Complexity,Computer Science,Hadoop,Mathematics — Patrick Durusau @ 6:54 pm

On the Computational Complexity of MapReduce by Jeremy Kun.

From the post:

I recently wrapped up a fun paper with my coauthors Ben Fish, Adam Lelkes, Lev Reyzin, and Gyorgy Turan in which we analyzed the computational complexity of a model of the popular MapReduce framework. Check out the preprint on the arXiv.

As usual I’ll give a less formal discussion of the research here, and because the paper is a bit more technically involved than my previous work I’ll be omitting some of the more pedantic details. Our project started after Ben Moseley gave an excellent talk at UI Chicago. He presented a theoretical model of MapReduce introduced by Howard Karloff et al. in 2010, and discussed his own results on solving graph problems in this model, such as graph connectivity. You can read Karloff’s original paper here, but we’ll outline his model below.

Basically, the vast majority of the work on MapReduce has been algorithmic. What I mean by that is researchers have been finding more and cleverer algorithms to solve problems in MapReduce. They have covered a huge amount of work, implementing machine learning algorithms, algorithms for graph problems, and many others. In Moseley’s talk, he posed a question that caught our eye:

Is there a constant-round MapReduce algorithm which determines whether a graph is connected?

After we describe the model below it’ll be clear what we mean by “solve” and what we mean by “constant-round,” but the conjecture is that this is impossible, particularly for the case of sparse graphs. We know we can solve it in a logarithmic number of rounds, but anything better is open.

In any case, we started thinking about this problem and didn’t make much progress. To the best of my knowledge it’s still wide open. But along the way we got into a whole nest of more general questions about the power of MapReduce. Specifically, Karloff proved a theorem relating MapReduce to a very particular class of circuits. What I mean is he proved a theorem that says “anything that can be solved in MapReduce with so many rounds and so much space can be solved by circuits that are yae big and yae complicated, and vice versa.”

But this question is so specific! We wanted to know: is MapReduce as powerful as polynomial time, our classical notion of efficiency (does it equal P)? Can it capture all computations requiring logarithmic space (does it contain L)? MapReduce seems to be somewhere in between, but its exact relationship to these classes is unknown. And as we’ll see in a moment the theoretical model uses a novel communication model, and processors that never get to see the entire input. So this led us to a host of natural complexity questions:

  1. What computations are possible in a model of parallel computation where no processor has enough space to store even one thousandth of the input?
  2. What computations are possible in a model of parallel computation where processors can’t request or send specific information from/to other processors?
  3. How the hell do you prove that something can’t be done under constraints of this kind?
  4. How do you measure the increase of power provided by giving MapReduce additional rounds or additional time?

These questions are in the domain of complexity theory, and so it makes sense to try to apply the standard tools of complexity theory to answer them. Our paper does this, laying some brick for future efforts to study MapReduce from a complexity perspective.

Given the prevalence of MapReduce, progress on understanding what is or is not possible is an important topic.

The first two complexity questions strike me as the ones most relevant to topic map processing with MapReduce, depending upon the nature of your merging algorithm.

Enjoy!

October 11, 2014

Spark Breaks Previous Large-Scale Sort Record

Filed under: BigData,Hadoop,Spark — Patrick Durusau @ 10:28 am

Spark Breaks Previous Large-Scale Sort Record by Reynold Xin.

From the post:

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.

To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

Additionally, while no official petabyte (PB) sort competition exists, we pushed Spark further to also sort 1 PB of data (10 trillion records) on 190 machines in under 4 hours. This PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines). To the best of our knowledge, this is the first petabyte-scale sort ever done in a public cloud.

Bottom line: Sorted 100 TB of data in 23 minutes, beat old record of 72 minutes, on fewer machines.

Read Reynold’s post and then get thee to Apache Spark!

I first saw this in a tweet by paco nathan.

October 7, 2014

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:11 pm

The Definitive “Getting Started” Tutorial for Apache Hadoop + Your Own Demo Cluster by Justin Kestelyn.

From the post:

Most Hadoop tutorials take a piecemeal approach: they either focus on one or two components, or at best a segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise pragmatic.

This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle—from data ingestion through interactive data discovery—and does so while emphasizing the business questions concerned: What products do customers view on the Web, what do they like to buy, and is there a relationship between the two?

Getting those answers is a task that organizations with traditional infrastructure have been doing for years. However, the ones that bought into Hadoop do the same thing at greater scale, at lower cost, and on the same storage substrate (with no ETL, that is) upon which many other types of analysis can be done.

To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:

  • Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
  • Use Apache Avro to serialize/prepare that data for analysis
  • Create Apache Hive tables
  • Query those tables using Hive or Impala (via the Hue GUI)
  • Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts

I can’t imagine what “other” tutorials Justin has in mind. 😉

To be fair, I haven’t taken this particular tutorial. Which Hadoop tutorials would you suggest as comparisons to this one?

September 30, 2014

Open Sourcing ml-ease

Filed under: Hadoop,Machine Learning,Spark — Patrick Durusau @ 6:25 pm

Open Sourcing ml-ease by Deepak Agarwal.

From the post:

LinkedIn data science and engineering is happy to release the first version of ml-ease, an open-source large scale machine learning library. ml-ease supports model fitting/training on a single machine, a Hadoop cluster and a Spark cluster with emphasis on scalability, speed, and ease-of-use. ml-ease is a useful tool for developers working on big data machine learning applications, and we’re looking forward to feedback from the open-source community. ml-ease currently supports ADMM logistic regression for binary response prediction with L1 and L2 regularization on Hadoop clusters.
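As a single-machine point of reference for what ml-ease distributes with ADMM, L1- and L2-regularized logistic regression for a binary response looks like this in scikit-learn. This is an analogue for orientation, not ml-ease’s own API, and the data here is synthetic:

```python
# Regularized logistic regression for binary response prediction (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

# L1 drives many coefficients to exactly zero, which is why it is popular at scale.
print("non-zero L1 coefficients:", (l1_model.coef_ != 0).sum())
```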

See Deepak’s post for more details and news of future machine learning algorithms to be released!

September 29, 2014

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

Filed under: Hadoop,Storm — Patrick Durusau @ 6:52 pm

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 200 Open Source projects and initiatives, announced today that Apache™ Storm™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

“Apache Storm’s graduation is not only an indication of its maturity as a technology, but also of the robust, active community that develops and supports it,” said P. Taylor Goetz, Vice President of Apache Storm. “Storm’s vibrant community ensures that Storm will continue to evolve to meet the demands of real-time stream processing and computation use cases.”

Apache Storm is a high-performance, easy-to-implement distributed real-time computation framework for processing fast, large streams of data, adding reliable data processing capabilities to Apache Hadoop. Using Storm, a Hadoop cluster can efficiently process a full range of workloads, from real-time to interactive to batch.

As with all Apache products, Apache Storm software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. For documentation and ways to become involved with Apache Storm, visit http://storm.apache.org/ and @Apache_Storm on Twitter.

You will see many notices of Apache™ Storm™’s graduation to a Top-Level Project. Odds are you have already seen one. But, like the weather channel reporting rain at your location, someone may have missed the news. 😉

September 3, 2014

Best Map/Reduce Explanation

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:45 am

Michael Klishin tweeted today: “The best map/reduce explanation ever: https://pbs.twimg.com/media/Bwj9KO5IcAAdl4H.png:large

For your viewing pleasure:

[map/reduce cartoon]

It does have “side effects” though.

Is it lunch time yet?

August 26, 2014

Cloudera Navigator Demo

Filed under: Cloudera,Hadoop — Patrick Durusau @ 4:22 pm

Cloudera Navigator Demo

Not long (9:50) but useful demo of Cloudera Navigator.

There was a surprise or two.

The first one was the suggestion that if there are multiple columns with different names for zip code (the equivalent of postal codes), you should normalize all the columns to one name.

Understandable, but what if a column has a name that is non-intuitive to the user, such as CEP?

It appears that “searching” is on surface tokens and we all know the perils of that type of searching. More robust searching would allow for searching for any variant name of postal code, for example, and return the columns that shared the property of being a postal code, without regard to the column name.

The second surprise was that “normalization” as described sets the stage for repeating normalization with each data import. That sounds subject to human error as more and more data sets are imported.

The interface itself appears easy to use, assuming you are satisfied with opaque tokens for which you have to guess the semantics. You could be right but then on the other hand, you could be wrong.
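The property-based search argued for above does not require exotic tooling to prototype. A small sketch (the synonym set and column names are hypothetical) that finds every postal-code column regardless of what it happens to be called:

```python
# Sketch: locate columns by the property "is a postal code" rather than by a
# single surface token. Synonyms and column names are made up for illustration.
import pandas as pd

POSTAL_CODE_NAMES = {"zip", "zip_code", "zipcode", "postal_code", "cep"}

df = pd.DataFrame(columns=["customer_id", "ZipCode", "CEP", "order_total"])

postal_columns = [c for c in df.columns
                  if c.lower().replace(" ", "_") in POSTAL_CODE_NAMES]
print(postal_columns)   # ['ZipCode', 'CEP'] -- found without renaming anything
```

A richer version would record the mapping once (column to property) so the next data import does not require repeating the exercise by hand, which is the point above about normalization being error-prone.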

August 19, 2014

High Performance With Apache Tez (webinar)

Filed under: Hadoop,Tez — Patrick Durusau @ 7:28 pm

Build High Performance Data Processing Application Using Apache Tez by Ajay Singh.

From the post:

This week we continue our YARN webinar series with detailed introduction and a developer overview of Apache Tez. Designed to express fit-to-purpose data processing logic, Tez enables batch and interactive data processing applications spanning TB to PB scale datasets. Tez offers a customizable execution architecture that allows developers to express complex computations as dataflow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.

Tez graduated to an Apache top-level project in July 2014 and is now the workhorse of Apache Hive. With Tez, Hive 0.13 is an order of magnitude faster than its previous generation. To learn more about Tez, join us on Thursday, August 21st at 9 AM Pacific Time. We will review:

  • Tez Architecture
  • Developer APIs
  • Sample code


Something to get you in shape for the Fall!

August 13, 2014

Hadoop Ecosystem Guide Chart

Filed under: Hadoop,Hadoop YARN — Patrick Durusau @ 1:45 pm

As they say, you can’t tell the players without a program!

[Hadoop ecosystem guide chart]

From Greg Hill’s New To Hadoop? Here’s A Handy Guide To Get You Started (Part 1)

Greg’s post has a brief summary of each category.

Additional pieces that you will find handy are promised in a future post.

The Hadoop ecosystem is evolving rapidly so take this chart as a rough guide. More players are likely to appear in a matter of months if not weeks.

I first saw this in Joe Crobak’s Hadoop Weekly – July 28, 2014.

HDP 2.1 Tutorials

Filed under: Falcon,Hadoop,Hive,Hortonworks,Knox Gateway,Storm,Tez — Patrick Durusau @ 11:17 am

HDP 2.1 tutorials from Hortonworks:

  1. Securing your Data Lake Resource & Auditing User Access with HDP Security
  2. Searching Data with Apache Solr
  3. Define and Process Data Pipelines in Hadoop with Apache Falcon
  4. Interactive Query for Hadoop with Apache Hive on Apache Tez
  5. Processing streaming data in Hadoop with Apache Storm
  6. Securing your Hadoop Infrastructure with Apache Knox

The quality is what you have come to expect from Hortonworks tutorials, but the data sets are a bit dull.

What data sets would you suggest to spice up these tutorials?

August 4, 2014

Summingbird:… [VLDB 2014]

Filed under: Hadoop,Scala,Storm,Summingbird,Tweets — Patrick Durusau @ 4:07 pm

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations by Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin.

Abstract:

Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using data flow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the data flow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Heavy sledding but deeply interesting work. Particularly about “…integrating batch and online processing in a seamless fashion.”

I first saw this in a tweet by Jimmy Lin.

July 29, 2014

Hello World! – Hadoop, Hive, Pig

Filed under: Hadoop,Hive,Hortonworks,Pig — Patrick Durusau @ 7:10 pm

Hello World! – An introduction to Hadoop with Hive and Pig

A set of tutorials to be run on Sandbox v2.0.

From the post:

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0

The tutorials are presented in sections as listed below.

Maybe I have seen too many “Hello World!” examples but I was expecting the tutorials to go through the use of Hadoop, HCatalog, Hive and Pig to say “Hello World!”

You can imagine my disappointment when that wasn’t the case. 😉

A lot of work to say “Hello World!” but on the other hand, tradition is tradition.

July 24, 2014

Hadoop Summit Content Curation

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 10:04 am

Hadoop Summit Content Curation by Jules S. Damji.

From the post:

Although the Hadoop Summit San Jose 2014 has come and gone, the invaluable content—keynotes, sessions, and tracks—is available here. We’ve selected a few sessions for Hadoop developers, practitioners, and architects, curating them under Apache Hadoop YARN, the architectural center and the data operating system.

In most of the keynotes and tracks three themes resonated:

  1. Enterprises are transitioning from traditional Hadoop to modern Hadoop 2.
  2. YARN is an enabler, the central orchestrator that facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns—batch, interactive, streaming, and real-time—in Apache Hadoop 2.
  3. Apache Hadoop 2, as part of Modern Data Architecture (MDA), is enterprise ready.

It doesn’t matter if I have cable or DirecTV; there is never a shortage of material to watch. 😉

Enjoy!

July 21, 2014

Hadoop Doesn’t Cure HIV

Filed under: Data Integration,Hadoop — Patrick Durusau @ 9:53 am

If I were Gartner, I could get IBM to support my stating the obvious. I would have to dress it up by repeating a lot of other obvious things but that seems to be the role for some “analysts.”

If you need proof of that claim, consider this report: Hadoop Is Not a Data Integration Solution. Really? Did any sane person familiar with Hadoop think otherwise?

The “key” findings from the report:

  • Many Hadoop projects perform extract, transform and load workstreams. Although these serve a purpose, the technology lacks the necessary key features and functions of commercially-supported data integration tools.
  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.

All true, all obvious and all a function of Hadoop’s design. It never had data integration as a requirement so finding that it doesn’t do data integration isn’t a surprise.

If you switch between “commercially-supported data integration tools,” you will again be working “…one data element at a time,” because common data integration tools don’t capture their own semantics. That means you can’t re-use your prior data integration work with one tool when you transition to another. Does that sound like vendor lock-in?

Odd that Gartner didn’t mention that.

Perhaps that’s stating the obvious as well.

A topic mapping of your present data integration solution will enable you to capture and re-use your investment in its semantics, with any data integration solution.

Did I hear someone say “increased ROI?”

June 15, 2014

Analyzing 1.2 Million Network Packets…

Filed under: ElasticSearch,Hadoop,HBase,Hive,Hortonworks,Kafka,Storm — Patrick Durusau @ 4:19 pm

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.

May 31, 2014

Introducing Hadoop FlipBooks

Filed under: Documentation,Hadoop — Patrick Durusau @ 5:31 pm

Introducing Hadoop FlipBooks

From the post:

In line with the learning theme that HadoopSphere has been evangelizing, we are pleased to introduce a new feature named FlipBooks. A Hadoop flipbook is a quick reference guide for any topic giving a short summary of key concepts in the form of Q&A. Typically with a set of 4 questions, it tries to test your knowledge on the concept.

Curious what you think of this concept?

I looked at a couple of them but four (4) questions seems a bit short.

With the caution that it was probably twenty (20) years ago, I remember the drill software for the Novell Netware CNE program. Organized by subject/class as I recall and certainly a lot more than four (4) questions.

What software would you suggest for authoring similar drill material now?

