Archive for the ‘Hortonworks’ Category

Apache Spark on HDP: Learn, Try and Do

Thursday, June 4th, 2015

Apache Spark on HDP: Learn, Try and Do by Jules S. Damji.

I wanted to leave you with something fun to enjoy this evening. I am off to read a forty-eight (48) page bill that would make your ninth (9th) grade English teacher hurl. It’s really grim stuff that boils down to a lot of nothing but you have to parse through it to make that evident. More on that tomorrow.

From the post:

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

Spark as a distributed data processing and computing platform offers much of what developers desire and delight in, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data; to the data scientist, it offers machine learning libraries via the MLlib component; and to the data analyst, it offers SQL capabilities for inquiry.

In this blog, I summarize how you can get started, enjoy Spark’s delight, and commence on a quick journey to Learn, Try, and Do Spark on HDP, with a set of tutorials.

I don’t know which is more disturbing. That Spark was being discussed at a Memorial Day BBQ or that anyone was sober enough to remember it. Life seems to change when you are older than the average cardiologist.

Sorry! Where were we, oh, yes, Saptak Sen has collected a set of tutorials to introduce you to Spark on the HDP Sandbox.

Near the bottom of the page, Apache Zeppelin (incubating) is mentioned along with Spark. You could use it to enable exploration of a data set. You could also use it so that users “discover” on their own that your analysis of the data is indeed correct. 😉

Due diligence means seeing not only the data as processed but also the data from which it was drawn, what pre-processing was done on that data, the circumstances under which the “original” data came into being, and the algorithms applied at all stages, to name only a few considerations.

The demonstration of a result merits a “that’s interesting” until you have had time to verify it. “Trust” comes after verification.

MapR on Open Data Platform: Why we declined

Wednesday, April 29th, 2015

MapR on Open Data Platform: Why we declined by John Schroeder.

From the post:


Open Data Platform is “solving” problems that don’t need solving

Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues. Gartner analysts Merv Adrian and Nick Heudecker disclosed in a recent blog that less than 1% of companies surveyed thought that vendor lock-in or interoperability was an issue—dead last on the list of customer concerns. Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions. Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.

Open Data Platform participation lacks participation by the Hadoop leaders

~75% of Hadoop implementations run on MapR and Cloudera. MapR and Cloudera have both chosen not to participate. The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two.

I mention this post because it touches on two issues that should concern all users of Hadoop applications.

On “vendor lock-in” you will find the question that was asked was “…how many attendees considered vendor lock-in a barrier to investment in Hadoop. It came in dead last. With around 1% selecting it.” (Who Asked for an Open Data Platform?) Considering that it was asked in a Gartner webinar, as few as one person may have selected it. Not what I would call a representative sample.

Still, I think John is right in saying that vendor lock-in isn’t a real issue with Hadoop. Hadoop applications aren’t off-the-shelf items; they are custom constructs for your needs and data. Not much opportunity for vendor lock-in. You’re in greater danger of IT lock-in due to poor or non-existent documentation for your Hadoop application. If anyone tells you a Hadoop application doesn’t need documentation because you can “…read the code…,” they are building up job security, quite possibly at your future expense.

John is spot on about the Open Data Platform not including all of the Hadoop market leaders. As John says, Open Data Platform does not include those responsible for 75% of the existing Hadoop implementations.

I have seen that situation before in standards work and it never leads to a happy conclusion, for the participants, non-participants and especially the consumers, who are supposed to benefit from the creation of standards. Non-standards for a minority of the market only serve to confuse not overly clever consumers. To say nothing of the popular IT press.

The Open Data Platform also raises questions about how one goes about creating a standard. One approach is to create a standard based on your projection of market needs and to campaign for its adoption. Another is to create a definition of an “ODP Core” and see if it is used by customers in development contracts and purchase orders. If consumers find it useful, they will no doubt adopt it as a de facto standard. Formalization can follow in due course.

So long as we are talking about possible future standards, documentation practices more advanced than C-style comments would be a useful standard for Hadoop ecosystems.

Apache Spark, Now GA on Hortonworks Data Platform

Tuesday, April 14th, 2015

Apache Spark, Now GA on Hortonworks Data Platform by Vinay Shukla.

From the post:

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP)— now available on our downloads page. With HDP 2.2.4 Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1.

HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Now Spark is one of the many data access engines that works with YARN and that is supported in an HDP enterprise data lake. Spark provides HDP subscribers yet another way to derive value from any data, any application, anywhere.

What more need I say?

Get thee to the downloads page!

Jump-Start Big Data with Hortonworks Sandbox on Azure

Thursday, March 19th, 2015

Jump-Start Big Data with Hortonworks Sandbox on Azure by Saptak Sen.

From the post:

We’re excited to announce the general availability of Hortonworks Sandbox for Hortonworks Data Platform 2.2 on Azure.

Hortonworks Sandbox is already a very popular environment in which developers, data scientists, and administrators can learn and experiment with the latest innovations in the Hortonworks Data Platform.

The hundreds of innovations span Hadoop, Kafka, Storm, Hive, Pig, YARN, Ambari, Falcon, Ranger, and other components of which HDP is composed. Now you can deploy this environment for your learning and experimentation in a few clicks on Microsoft Azure.

Follow the guide to Getting Started with Hortonworks Sandbox with HDP 2.2 on Azure to set up your own dev-ops environment on the cloud in a few clicks.

We also provide step by step tutorials to help you get a jump-start on how to use HDP to implement a Modern Data Architecture at your organization.

The Hadoop Sandbox is an excellent way to explore the Hadoop ecosystem. If you trash the setup, just open another sandbox.

Add Hortonworks tutorials to the sandbox and you are less likely to do something really dumb. Or at least you will understand what happened and how to avoid it before you go into production. Always nice to keep the dumb mistakes on your desktop.

Now the Hortonworks Sandbox is on Azure. Same safe learning environment, but with the power to scale when you are ready to go live!

Hortonworks Establishes Data Governance Initiative

Monday, February 2nd, 2015

Hortonworks Establishes Data Governance Initiative

From the post:

Hortonworks® (NASDAQ:HDP), the leading contributor to and provider of enterprise Apache™ Hadoop®, today announced the creation of the Data Governance Initiative (DGI). DGI will develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. In addition to Hortonworks, the founding members of DGI are Aetna, Merck, and Target and Hortonworks’ technology partner SAS.

Enterprises adopting a modern data architecture must address certain realities when legacy and new data from disparate platforms are brought under management. DGI members will work with the open source community to deliver a comprehensive solution; offering fast, flexible and powerful metadata services, deep audit store and an advanced policy rules engine. It will also feature deep integration with Apache Falcon for data lifecycle management and Apache Ranger for global security policies. Additionally, the DGI solution will interoperate with and extend existing third-party data governance and management tools by shedding light on the access of data within Hadoop. Further DGI investment roadmap phases will be released in the coming weeks.

Supporting quotes

“This joint engineering initiative is another pillar in our unique open source development model,” said Tim Hall, vice president, product management at Hortonworks. “We are excited to partner with the other DGI members to build a completely open data governance foundation that meets enterprise requirements.”

“As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.

Further reading

Enterprise Hadoop: www.hortonworks.com/hadoop

Apache Falcon: http://hortonworks.com/hadoop/falcon/

Hadoop and a Modern Data Architecture: www.hortonworks.com/mda

For more information:

Mike Haro
408-438-8628
comms@hortonworks.com

Quite possibly an opportunity to push for topic map-like capabilities in an enterprise setting.

That will require affirmative action on the part of members of the TM community as it is unlikely Hortonworks and others will educate themselves on topic maps.

Suggestions?

Apache Ranger Audit Framework

Wednesday, December 24th, 2014

Apache Ranger Audit Framework by Madhan Neethiraj.

From the post:

Apache Ranger provides centralized security for the Enterprise Hadoop ecosystem, including fine-grained access control and centralized audit mechanism, all essential for Enterprise Hadoop. This blog covers various details of Apache Ranger’s audit framework options available with Apache Ranger Release 0.4.0 in HDP 2.2 and how they can be configured.

From the Ranger homepage:

Apache Ranger offers a centralized security framework to manage fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase. Using the Apache Ranger console, security administrators can easily manage policies for access to files, folders, databases, tables, or column. These policies can be set for individual users or groups and then enforced within Hadoop.

Security administrators can also use Apache Ranger to manage audit tracking and policy analytics for deeper control of the environment. The solution also provides an option to delegate administration of certain data to other group owners, with the aim of securely decentralizing data ownership.

Apache Ranger currently supports authorization, auditing and security administration of the following HDP components:
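As a rough illustration of the policy model the Ranger homepage describes, here is a hypothetical Python sketch of default-deny access checks over users and groups. This is not Ranger's actual engine or API; the resource names, groups, and users are invented:

```python
# Hypothetical sketch of centralized policy checking in the style Ranger
# describes: each policy names a resource, the users/groups it covers,
# and the access types it grants. NOT Ranger code, just the model.

policies = [
    {"resource": "hive:sales_db.orders", "groups": {"analysts"},
     "users": set(), "access": {"select"}},
    {"resource": "hdfs:/data/raw", "groups": set(),
     "users": {"etl_svc"}, "access": {"read", "write"}},
]

def is_allowed(user, user_groups, resource, access):
    """Return True if any policy grants `access` on `resource`
    to `user` directly or via one of `user_groups`."""
    for p in policies:
        if p["resource"] != resource:
            continue
        if access not in p["access"]:
            continue
        if user in p["users"] or user_groups & p["groups"]:
            return True
    return False  # default deny; the denial is what an audit log records

print(is_allowed("alice", {"analysts"}, "hive:sales_db.orders", "select"))  # True
print(is_allowed("alice", {"analysts"}, "hdfs:/data/raw", "read"))          # False
```

The point of centralizing checks like these, per the post, is that the same policy store feeds both enforcement and audit.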

And you are going to document the semantics of the settings, events and other log information… where?

Oh, aha, you know what those settings, events and other log information mean and…, not planning on getting hit by a bus are we? Or planning to stay in your present position forever?

No joke. I know someone training their replacements in ten-year-old markup technologies. Systems built on top of other systems. And they kept records. Lots of records.

Test your logs on a visiting Hadoop systems administrator. If they don’t get 100% correct on your logging, using whatever documentation you have, you had better start writing.

I hadn’t thought about the visiting Hadoop systems administrator idea before but that would be a great way to test the documentation for Hadoop ecosystems. Better to test it that way instead of after a natural or unnatural disaster.

Call it the Hadoop Ecosystem Documentation Audit. Give a tester tasks to perform, which must be accomplished with existing documentation. No verbal assistance. I suspect a standard set of tasks could be useful in defining such a process.
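The audit idea above can be made concrete as a simple score sheet. A minimal sketch, with invented task names; the pass/fail values stand in for what a visiting administrator could accomplish from your documentation alone:

```python
# Sketch of a "Hadoop Ecosystem Documentation Audit" score sheet.
# Task names are invented examples; substitute your own runbook items.
# Each tuple: (task, completed using only existing documentation?)

audit_tasks = [
    ("locate the NameNode logs", True),
    ("explain each custom Ranger audit field", False),
    ("restart the Hive metastore from documentation alone", True),
    ("identify where each raw input data set originates", False),
]

def audit_score(results):
    """Fraction of tasks completed with no verbal assistance."""
    passed = sum(1 for _, ok in results if ok)
    return passed / len(results)

score = audit_score(audit_tasks)
print(f"documentation audit score: {score:.0%}")  # 50%
```

Anything under 100% marks the sections you had better start writing.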

Announcing Apache Storm 0.9.3

Thursday, December 18th, 2014

Announcing Apache Storm 0.9.3 by Taylor Goetz

From the post:

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.

Now there’s an early holiday surprise!

Enjoy!

Data Science with Hadoop: Predicting Airline Delays – Part 2

Tuesday, December 9th, 2014

Using machine learning algorithms, Spark and Scala – Part 2 by Ofer Mendelevitch and Beau Plath.

From the post:

In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. And with Spark on YARN, data workers can simultaneously use Spark for data science workloads alongside other data access engines–all accessing the same shared dataset on the same cluster.

The next installment in this series continues the analysis with the same dataset, but this time with R!

The bar for user introductions to technology is getting higher even as we speak!

Data Science with Apache Hadoop: Predicting Airline Delays (Part 1)

Tuesday, December 9th, 2014

Using machine learning algorithms, Pig and Python – Part 1 by Ofer Mendelevitch.

From the post:

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.

It is a common misconception that the way data scientists apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or in usage of siloed clusters. Not so: no dramatic change; no dedicated clusters; using existing modeling tools will suffice.

In fact, the big change is in what is known as “feature engineering”—the process by which very large raw data is transformed into a “feature matrix.” Enabled by Apache Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by RAM or compute power of a single node.

Since the output of the feature engineering step (the “feature matrix”) tends to be relatively small in size (typically in the MB or GB scale), a common choice is to run the learning algorithm on a single machine (often with multiple cores and high amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python’s Scikit-learn, or SAS.

In this multi-part blog post and its accompanying IPython Notebook, we will demonstrate an example step-by-step solution to a supervised learning problem. We will show how to solve this problem with various tools and libraries and how they integrate with Hadoop. In part I we focus on Apache PIG, Python, and Scikit-learn, while in subsequent parts, we will explore and examine other alternatives such as R or Spark/ML-Lib
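The two-stage pattern the excerpt describes, large-scale aggregation into a small feature matrix followed by single-machine learning, can be sketched with toy data. Pure Python stands in here for the Pig/Hadoop step, and the carriers, hours, and features are invented for illustration:

```python
# Toy illustration of the two-stage pattern:
# (1) "feature engineering" reduces many raw records to a small feature
#     matrix (on a real cluster this step would run in Pig on Hadoop);
# (2) the small matrix then fits on one machine for modeling.

from collections import defaultdict

raw_flights = [  # (carrier, departure_hour, delayed?)
    ("AA", 8, 0), ("AA", 17, 1), ("AA", 18, 1),
    ("UA", 9, 0), ("UA", 7, 0), ("UA", 19, 1),
]

# Stage 1: aggregate raw records into one feature row per carrier.
agg = defaultdict(lambda: {"flights": 0, "delays": 0, "evening": 0})
for carrier, hour, delayed in raw_flights:
    agg[carrier]["flights"] += 1
    agg[carrier]["delays"] += delayed
    agg[carrier]["evening"] += int(hour >= 17)  # evening departures

# The "feature matrix": (delay rate, evening-departure rate) per carrier.
feature_matrix = {
    c: (round(v["delays"] / v["flights"], 2),
        round(v["evening"] / v["flights"], 2))
    for c, v in agg.items()
}
print(sorted(feature_matrix.items()))
# [('AA', (0.67, 0.67)), ('UA', (0.33, 0.33))]
```

Stage 2 would hand `feature_matrix` to whatever single-node tool you prefer (Scikit-learn in this part of the series).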

With the IPython notebook, this becomes a great example of how to provide potential users hands-on experience with a technology.

An example that Solr, for example, might well want to imitate.

PS: When I was traveling, a simpler way to predict flight delays was to just ping me for my travels plans. 😉 You?

Available Now: HDP 2.2

Wednesday, December 3rd, 2014

Available Now: HDP 2.2 by Jim Walker.

From the post:

We are very pleased to announce that the Hortonworks Data Platform Version 2.2 (HDP) is now generally available for download. With thousands of enhancements across all elements of the platform spanning data access to security to governance, rolling upgrades and more, HDP 2.2 makes it even easier for our customers to incorporate HDP as a core component of Modern Data Architecture (MDA).

HDP 2.2 represents the very latest innovation from across the Hadoop ecosystem, where literally hundreds of developers have been collaborating with us to evolve each of the individual Apache Software Foundation (ASF) projects from the broader Apache Hadoop ecosystem. These projects have now been brought together into the complete and open Hortonworks Data Platform (HDP) delivering more than 100 new features and closing out thousands of issues across Apache Hadoop and its related projects.

These distinct ASF projects from across the Hadoop ecosystem span every aspect of the data platform and are easily categorized into:

  • Data management: this is the core of the platform, including Apache Hadoop and its subcomponents of HDFS and YARN, which is the architectural center of HDP.
  • Data access: this represents the broad range of options for developers to access and process data, stored in HDFS and depending on their application requirements.
  • The supporting enterprise services of governance, operations and security that are fundamental to any enterprise data platform.

How many of the 100 new features will you try by the end of December, 2014? 😉

A sandbox edition is promised by December 9, 2014.

Tis the season to be jolly!

Discovering Patterns for Cyber Defense Using Linked Data Analysis [12th Nov., 10am PDT]

Tuesday, November 11th, 2014

Discovering Patterns for Cyber Defense Using Linked Data Analysis

Wednesday, Nov. 12th | 10am PDT

I am always suspicious of one-day announcements of webinars. This post appeared on November 11th for a webinar on November 12th.

Only one way to find out so I registered. Join me to find out: substantive presentation or click-bait.

If enough people attend and then comment here, one way or the other, who knows? It might make a difference.

From the post:

Almost every week, news of a proprietary or customer data breach hits the news wave. While attackers have increased the level of sophistication in their tactics, so too have organizations advanced in their ability to build a robust, data-driven defense.

Apache Hadoop has emerged as the de facto big data platform, which makes it the perfect fit to accumulate cybersecurity data and diagnose the latest attacks.  As Enterprises roll out and grow their Hadoop implementations, they require effective ways for pinpointing and reasoning about correlated events within their data, and assessing their network security posture.

Join Hortonworks and Sqrrl to learn:

  • How Linked Data Analysis enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data
  • Effective ways to correlate events within your data and assess your network security posture
  • New techniques for discovering hidden patterns and detecting anomalies within your data
  • How Hadoop fits into your current data structure forming a secure, Modern Data Architecture

Register now to learn how combining the power of Hadoop and the Hortonworks Data Platform with massive, secure, entity-centric data models in Sqrrl Enterprise allows you to create a data-driven defense.

Bring your red pen. November 12, 2014 at 10am PDT. (That should be 1pm East Coast time.) See you then!

HDP 2.1 Tutorials

Wednesday, August 13th, 2014

HDP 2.1 tutorials from Hortonworks:

  1. Securing your Data Lake Resource & Auditing User Access with HDP Security
  2. Searching Data with Apache Solr
  3. Define and Process Data Pipelines in Hadoop with Apache Falcon
  4. Interactive Query for Hadoop with Apache Hive on Apache Tez
  5. Processing streaming data in Hadoop with Apache Storm
  6. Securing your Hadoop Infrastructure with Apache Knox

The quality you have come to expect from Hortonworks tutorials, but the data sets are a bit dull.

What data sets would you suggest to spice up these tutorials?

Hello World! – Hadoop, Hive, Pig

Tuesday, July 29th, 2014

Hello World! – An introduction to Hadoop with Hive and Pig

A set of tutorials to be run on Sandbox v2.0.

From the post:

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0

The tutorials are presented in sections as listed below.

Maybe I have seen too many “Hello World!” examples but I was expecting the tutorials to go through the use of Hadoop, HCatalog, Hive and Pig to say “Hello World!”

You can imagine my disappointment when that wasn’t the case. 😉

A lot of work to say “Hello World!” but on the other hand, tradition is tradition.

Hadoop Summit Content Curation

Thursday, July 24th, 2014

Hadoop Summit Content Curation by Jules S. Damji.

From the post:

Although the Hadoop Summit San Jose 2014 has come and gone, the invaluable content—keynotes, sessions, and tracks—is available here. We’ve selected a few sessions for Hadoop developers, practitioners, and architects, curating them under Apache Hadoop YARN, the architectural center and the data operating system.

In most of the keynotes and tracks three themes resonated:

  1. Enterprises are transitioning from traditional Hadoop to modern Hadoop 2.
  2. YARN is an enabler, the central orchestrator that facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns—batch, interactive, streaming, and real-time—in Apache Hadoop 2.
  3. Apache Hadoop 2, as part of Modern Data Architecture (MDA), is enterprise ready.

It doesn’t matter whether I have cable or DirecTV, there is never a shortage of material to watch. 😉

Enjoy!

Analyzing 1.2 Million Network Packets…

Sunday, June 15th, 2014

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.
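A toy sketch of that merging problem: two sources make statements about the same subject (an IP address) under different field names, and a merge should preserve every statement rather than letting one source silently overwrite another. The addresses and fields below are invented:

```python
# Toy sketch of merging annotations about the same subject (here an IP)
# from sources that use different field names. Data is invented; real
# geolocation merging across law enforcement sources is far messier.

source_a = {"203.0.113.7": {"geo": "Atlanta, GA", "confidence": "low"}}
source_b = {"203.0.113.7": {"location": "~30 mi NE of Atlanta",
                            "seen_by": "local LE"}}

def merge_subject(ip, *sources):
    """Accumulate every (source, key, value) statement about one
    subject, instead of letting later sources overwrite earlier ones."""
    merged = []
    for i, src in enumerate(sources):
        for key, value in src.get(ip, {}).items():
            merged.append((f"source_{i}", key, value))
    return merged

result = merge_subject("203.0.113.7", source_a, source_b)
print(len(result))  # 4 statements preserved, none overwritten
```

Deciding that `geo` and `location` are statements about the same property of the same subject is exactly the subject identity work a naive dict update skips.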

Apache Hadoop 2.4.0 Released!

Thursday, April 10th, 2014

Apache Hadoop 2.4.0 Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.4.0! Thank you to every single one of the contributors, reviewers and testers!

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS (HDFS-4685)
  • Native support for Rolling Upgrades in HDFS (HDFS-5535)
  • Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
  • Full HTTPS support for HDFS (HDFS-5305)
  • Support for Automatic Failover of the YARN ResourceManager (YARN-149) (a.k.a Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
  • Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)

And of course:

Links

See Arun’s post for more details or just jump to the downloads links.

Hortonworks Data Platform 2.1

Wednesday, April 2nd, 2014

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM available now, full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

Apache Hadoop 2.3.0 Released!

Tuesday, February 25th, 2014

Apache Hadoop 2.3.0 Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.3.0!

hadoop-2.3.0 is the first release for the year 2014, and brings a number of enhancements to the core platform, in particular to HDFS.

With this release, there are two significant enhancements to HDFS:

  • Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
  • In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)

With support for heterogeneous storage classes in HDFS, we now can take advantage of different storage types on the same Hadoop clusters. Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, SSDs, Memory etc. More details on this major enhancement are available here.

Along similar lines, it is now possible to use memory available in the Hadoop cluster to centrally cache and administer data-sets in-memory in the Datanode’s address space. Applications such as MapReduce, Hive, Pig etc. can now request memory to be cached (for the curious, we use a combination of mmap, mlock to achieve this) and then read it directly off the Datanode’s address space for extremely efficient scans by avoiding disk altogether. As an example, Hive is taking advantage of this feature by implementing an extremely efficient zero-copy read path for ORC files – see HIVE-6347 for details.
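The mmap read path the quote alludes to can be illustrated with Python's standard library. This is a sketch of the mechanism only, not Datanode code; the file contents are invented:

```python
# Minimal illustration of reading a file through mmap, the mechanism
# the release notes say Datanodes use (together with mlock) for
# in-memory caching. A stdlib sketch, not HDFS code.

import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"cached block contents")
    path = f.name

with open(path, "rb") as f:
    # Map the file into this process's address space; reads then come
    # straight from the page cache with no explicit read() copies.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data = mm[:6]

os.unlink(path)
print(data)  # b'cached'
```

Locking the mapped pages (mlock) is what keeps a cached block pinned in RAM rather than subject to eviction.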

See Arun’s post for more details.

I guess there really is a downside to open source development.

It’s so much faster than commercial product development cycles. 😉 (Hard to keep up.)

Tutorial 1: Hello World… [Hadoop/Hive/Pig]

Monday, January 27th, 2014

Tutorial 1: Hello World – An Overview of Hadoop with Hive and Pig

Don’t be frightened!

The tutorial really doesn’t use big data tools to say “Hello World!” quickly, or even to say it many times. 😉

One of the clearer tutorials on big data tools.

You won’t quite be dangerous by the time you finish this tutorial but you should have a strong enough taste of the tools to want more.

Enjoy!

Empowering Half a Billion Users For Free –
Would You?

Wednesday, January 22nd, 2014

How To Use Microsoft Excel to Visualize Hadoop Data by Saptak Sen.

From the post:

Microsoft and Hortonworks have been working together for over two years now with the goal of bringing the power of Big Data to a billion people. As a result of that work, today we announced the General Availability of HDP 2.0 for Windows with the full power of YARN.

There are already over half a billion Excel users on this planet.

So, we have put together a short tutorial on the Hortonworks Sandbox where we walk through the end-to-end data pipeline using HDP and Microsoft Excel in the shoes of a data analyst at a financial services firm where she:

  • Cleans and aggregates 10 years of raw stock tick data from NYSE
  • Enriches the data model by looking up additional attributes from Wikipedia
  • Creates an interactive visualization on the model

You can find the tutorial here.

As part of this process you will experience how simple it is to integrate HDP with the Microsoft Power BI platform.

This integration is made possible by the community work to design and implement WebHDFS, an open REST API in Apache Hadoop. Microsoft used the API from Power Query for Excel to make the integration to Microsoft Business Intelligence platform seamless.

Happy Hadooping!!!
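Because WebHDFS is plain REST over HTTP, any client that can issue HTTP requests can reach HDFS files, which is what makes the Power Query integration possible. A minimal sketch of building a WebHDFS OPEN request URL; the host and file path are placeholders, while `op=OPEN` is a standard WebHDFS operation and 50070 the usual Hadoop 2 NameNode HTTP port:

```python
# Building a WebHDFS request URL. OPEN is part of the WebHDFS REST
# API; host, port, and file path below are placeholder values.

from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Return the WebHDFS v1 URL for `op` on `path`."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("sandbox.example.com", "/user/data/ticks.csv", "OPEN")
print(url)
# http://sandbox.example.com:50070/webhdfs/v1/user/data/ticks.csv?op=OPEN
```

Issuing an HTTP GET against a URL like that is all a client needs to stream the file, no Hadoop libraries required.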

Opening up Hadoop to half a billion users can’t do anything but drive the development of the Hadoop ecosystem.

Which will in turn return more benefits to the Excel user community, which will drive usage of Excel.

That’s what I call a smart business strategy.

You?

PS: Where are there similar strategies possible for subject identity?

HDP 2.0 for Windows is GA

Tuesday, January 21st, 2014

HDP 2.0 for Windows is GA by John Kreisa.

From the post:

We are excited to announce that the Hortonworks Data Platform 2.0 for Windows is publicly available for download. HDP 2 for Windows is the only Apache Hadoop 2.0 based platform that is certified for production usage on Windows Server 2008 R2 and Windows Server 2012 R2.

With this release, the latest in community innovation on Apache Hadoop is now available across all major Operating Systems. HDP 2.0 provides Hadoop coverage for more than 99% of the enterprises in the world, offering the most flexible deployment options from On-Premise to a variety of cloud solutions.

Unleashing YARN and Hadoop 2 on Windows

HDP 2.0 for Windows is a leap forward as it brings the power of Apache Hadoop YARN to Windows. YARN enables a user to interact with all data in multiple ways simultaneously – for instance making use of both realtime and batch processing – making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.

Excellent!

BTW, Microsoft is working with Hortonworks to make sure Apache Hadoop works seamlessly with Microsoft Windows and Azure.

I think they call that interoperability. Or something like that. 😉

MS SQL Server -> Hadoop

Thursday, January 16th, 2014

Community Tutorial 04: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop

From the webpage:

For a simple proof of concept I wanted to get data from MS SQL Server into the Hortonworks Sandbox in an automated fashion using Sqoop. Apache Sqoop provides a way of efficiently transferring bulk data between Apache Hadoop and relational databases. This tutorial will show you how to use Sqoop to import data into the Hortonworks Sandbox from a Microsoft SQL Server data source.
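The heart of such a job is a single `sqoop import` invocation. A sketch of assembling one for a SQL Server source (server name, database, table, and credentials are all hypothetical, and the SQL Server JDBC driver jar has to be on Sqoop’s classpath already):

```python
# Sketch: assemble a `sqoop import` command line for a SQL Server source.
# Every identifier here (host, database, table, user, paths) is hypothetical.

def sqoop_import_cmd(host, database, table, username, password_file):
    jdbc = f"jdbc:sqlserver://{host};databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", jdbc,
        "--table", table,
        "--username", username,
        "--password-file", password_file,  # safer than --password on the CLI
        "--target-dir", f"/user/sandbox/{table}",
        "-m", "1",  # a single mapper is fine for a proof of concept
    ]

cmd = sqoop_import_cmd("sqlhost", "AdventureWorks", "Customers",
                       "hadoop", "/user/sandbox/.sqlpass")
```

Run via `subprocess.run(cmd)` (or from cron, for the automated part), this lands the table under `/user/sandbox/Customers` as delimited files.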

You’ll have to test this one without me.

I have thought about setting up a MS SQL Server but never got around to it. 😉

Ready to learn Hadoop?

Monday, December 30th, 2013

Ready to learn Hadoop?

From the webpage:

Sign up for the challenge of learning the basics of Hadoop in two weeks! You will get one email every day for the next 14 days.

  • Hello World: Overview of Hadoop
  • Data Processing Using Apache Hadoop
  • Setting up ODBC Connections
  • Connecting to Enterprise Applications
  • Data Integration and ETL
  • Data Analytics
  • Data Visualization
  • Hadoop Use Cases: Web
  • Hadoop Use Cases: Business
  • Recap

You could do this entirely on your own but the daily email may help.

If nothing else, it will be a reminder that something fun is waiting for you after work.

Enjoy!

…Stinger Phase 3 Technical Preview

Saturday, December 21st, 2013

Announcing Stinger Phase 3 Technical Preview by Carter Shanklin.

From the post:

As an early Christmas present, we’ve made a technical preview of Stinger Phase 3 available. While just a preview by moniker, the release marks a significant milestone in the transformation of Hadoop from a batch-oriented system to a data platform capable of interactive data processing at scale and delivering on the aims of the Stinger Initiative.

Apache Tez and SQL: Interactive Query-IN-Hadoop

Tez is a low-level runtime engine not aimed directly at data analysts or data scientists. Frameworks need to be built on top of Tez to expose it to a broad audience… enter SQL and interactive query in Hadoop.

Stinger Phase 3 Preview combines the Tez execution engine with Apache Hive, Hadoop’s native SQL engine. Now, anyone who uses SQL tools in Hadoop can enjoy truly interactive data query and analysis.

We have already seen Apache Pig move to adopt Tez, and we will soon see others like Cascading do the same, unlocking many forms of interactive data processing natively in Hadoop. Tez is the technology that takes Hadoop beyond batch and into interactive, and we’re excited to see it available in a way that is easy to use and accessible to any SQL user.
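Part of what makes the preview so approachable is that switching between batch and interactive execution is just a session setting. A sketch of the kind of session a SQL user might run (the table and query are hypothetical):

```python
# Sketch: the statements a Hive session might issue to run the same query
# first on classic MapReduce, then on Tez. Table name and query are hypothetical.

def hive_session(engine, query):
    """Return the HiveQL statements for running `query` on a given engine."""
    assert engine in ("mr", "tez")
    return [
        f"SET hive.execution.engine={engine};",
        query,
    ]

query = "SELECT page, COUNT(*) FROM pageviews GROUP BY page;"
batch = hive_session("mr", query)         # classic MapReduce execution
interactive = hive_session("tez", query)  # same SQL, Tez runtime
```

The SQL doesn’t change at all; only the engine underneath does, which is the point of building Stinger on top of Hive rather than beside it.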

….

Further on in the blog Carter mentions that for real fun you need four (4) physical nodes and a fairly large dataset.

I have yet to figure out the price break point between a local cluster and using a cloud service. Suggestions on that score?

Storm Technical Preview Available Now!

Friday, December 13th, 2013

Storm Technical Preview Available Now! by Himanshu Bari.

From the post:

In October, we announced our intent to include and support Storm as part of Hortonworks Data Platform. With this commitment, we also outlined and proposed an open roadmap to improve the enterprise readiness of this key project. We are committed to doing this with a 100% open source approach and your feedback is immensely valuable in this process.

Today, we invite you to take a look at our Storm technical preview. This preview includes the latest release of Storm with instructions on how to install Storm on Hortonworks Sandbox and run a sample topology to familiarize yourself with the technology. This is the final pre-Apache release of Storm.
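The sample topology in previews like this is typically the streaming word count, and its core is small enough to sketch. This is only the per-tuple transformation logic, not the Storm spout/bolt wiring, and the function names are mine, not Storm’s:

```python
# Sketch of the per-tuple logic in a streaming word-count topology:
# a "split" bolt turns sentence tuples into word tuples, and a "count"
# bolt folds word tuples into running totals. Storm wiring omitted.

def split_sentence(sentence):
    """Split bolt: one sentence tuple in, one tuple per word out."""
    return sentence.lower().split()

def count_words(words, counts=None):
    """Count bolt: accumulate word tuples into running totals."""
    counts = {} if counts is None else counts
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = count_words(split_sentence("the quick brown fox jumps over the lazy dog"))
```

In Storm proper, each of these would live in a bolt’s `execute` method and the counts would be emitted downstream rather than held in a dict.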

You know this but I wanted to emphasize how your participation in alpha/beta/candidate/preview releases benefits not only the community but yourself as well.

Bugs that are found and squashed now won’t bother you (or anyone else) later in production.

Not to mention you get to exercise your skills before use of the software becomes routine.

Enjoy the weekend!

Getting Started Writing YARN Applications [Webinar – December 18th]

Friday, December 13th, 2013

Getting Started Writing YARN Applications by Lisa Sensmeier.

From the post:

There is a lot of information available on the benefits of Apache YARN but how do you get started building applications? On December 18 at 9am Pacific Time, Hortonworks will host a webinar and go over just that: what independent software vendors (ISVs) and developers need to do to take the first steps towards developing applications or integrating existing applications on YARN.

Register for the webinar here.

My experience with webinars has been uneven to say the least.

Every Mike McCandless webinar (live or recorded) has been a real treat. Great presentation skills, high value content and well organized.

I have seen other webinars with poor presentation skills, low value or mostly ad content that were poorly organized.

No promises on what you will see on the 18th of December but let’s hope for the former and not the latter. (No pressure, no pressure. 😉 )

Modern Healthcare Architectures Built with Hadoop

Monday, December 2nd, 2013

Modern Healthcare Architectures Built with Hadoop by Justin Sears.

From the post:

We have heard plenty in the news lately about healthcare challenges and the difficult choices faced by hospital administrators, technology and pharmaceutical providers, researchers, and clinicians. At the same time, consumers are experiencing increased costs without a corresponding increase in health security or in the reliability of clinical outcomes.

One key obstacle in the healthcare market is data liquidity (for patients, practitioners and payers) and some are using Apache Hadoop to overcome this challenge, as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture and how Hadoop can ease the pain caused by poor data liquidity.

As you would guess, I like the phrase data liquidity. 😉

And Justin lays out the areas where we are going to find “poor data liquidity.”

Source data comes from:

  • Legacy Electronic Medical Records (EMRs)
  • Transcriptions
  • PACS
  • Medication Administration
  • Financial
  • Laboratory (e.g. SunQuest, Cerner)
  • RTLS (for locating medical equipment & patient throughput)
  • Bio Repository
  • Device Integration (e.g. iSirona)
  • Home Devices (e.g. scales and heart monitors)
  • Clinical Trials
  • Genomics (e.g. 23andMe, Cancer Genomics Hub)
  • Radiology (e.g. RadNet)
  • Quantified Self Sensors (e.g. Fitbit, SmartSleep)
  • Social Media Streams (e.g. FourSquare, Twitter)

But then I don’t see what part of the Hadoop architecture addresses the problem of “poor data liquidity.”

Do you?

I thought I had found it when Charles Boicey (in the UCIH case study) says:

“Hadoop is the only technology that allows healthcare to store data in its native form. If Hadoop didn’t exist we would still have to make decisions about what can come into our data warehouse or the electronic medical record (and what cannot). Now we can bring everything into Hadoop, regardless of data format or speed of ingest. If I find a new data source, I can start storing it the day that I learn about it. We leave no data behind.”

But that’s not “data liquidity,” not in any meaningful sense of the word. Dumping your data to paper would be just as effective and probably less costly.

To be useful, “data liquidity” must have a sense of being integrated with data from diverse sources: presenting the clinician, researcher, health care facility, etc. with all the data about a patient, not just some of it.

I also checked the McKinsey & Company report “The ‘Big Data’ Revolution in Healthcare.” I didn’t expect them to miss the data integration question and they didn’t.

The second exhibit in the McKinsey and Company report (the full report):

[Exhibit 2 from the McKinsey report]

The part in red reads:

Integration of data pools required for major opportunities.

I take that to mean that in order to have meaningful healthcare reform, integration of health care data pools is the first step.

Do you disagree?

And if that’s true, that we need integration of health care data pools first, do you think Hadoop can accomplish that auto-magically?

I don’t either.

How to use R … in MapReduce and Hive

Friday, November 8th, 2013

How to use R and other non-Java languages in MapReduce and Hive by Tom Hanlon.

From the post:

I teach for Hortonworks and in class just this week I was asked to provide an example of using the R statistics language with Hadoop and Hive. The good news was that it can easily be done. The even better news is that it is actually possible to use a variety of tools: Python, Ruby, shell scripts and R to perform distributed fault tolerant processing of your data on a Hadoop cluster.

In this blog post I will provide an example of using R, http://www.r-project.org with Hive. I will also provide an introduction to other non-Java MapReduce tools.
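The mechanism that makes R (or Python, or Ruby, or shell) work here is the streaming contract used by Hadoop Streaming and Hive’s TRANSFORM: the script reads tab-separated records on stdin and writes tab-separated records to stdout, nothing more. A sketch of that contract in Python (the two-column record layout is hypothetical; an equivalent R script would obey the same contract):

```python
# Sketch of the streaming contract behind Hive TRANSFORM / Hadoop Streaming:
# tab-separated records in on stdin, tab-separated records out on stdout.
# The (key, numeric value) column layout here is hypothetical.

def transform(lines):
    """Read (key, value) records; emit (key, doubled value) records."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield f"{key}\t{float(value) * 2}"

# Exercising the contract on a couple of in-memory records:
demo = list(transform(["a\t1.5\n", "b\t2\n"]))
```

Wrapped in a `for record in transform(sys.stdin): print(record)` main block, the same script can be shipped to the cluster with `ADD FILE` and invoked from HiveQL via `TRANSFORM ... USING`, which is exactly how the R example in the post plugs in.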

If you wanted to follow along and run these examples in the Hortonworks Sandbox you would need to install R.

The Hortonworks Sandbox just keeps getting better!

Hortonworks Sandbox Version 2.0

Wednesday, October 30th, 2013

Hortonworks Sandbox Version 2.0

From the web page:

Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

Sandbox comes with:

Component          Version
Apache Hadoop      2.2.0
Apache Hive        0.12.0
Apache HCatalog    0.12.0
Apache HBase       0.96.0
Apache ZooKeeper   3.4.5
Apache Pig         0.12.0
Apache Sqoop       1.4.4
Apache Flume       1.4.0
Apache Oozie       4.0.0
Apache Ambari      1.4.1
Apache Mahout      0.8.0
Hue                2.3.0

If you check the same listing at the Hortonworks page, you will see the Hue lacks a hyperlink. I had forgotten why until I ran the link down. 😉

Enjoy!

HDP 2.0 and its YARN-based architecture…delivered!

Wednesday, October 23rd, 2013

HDP 2.0 and its YARN-based architecture…delivered! By Shaun Connolly.

From the post:

Typical delivery of enterprise software involves a very controlled date with a secret roadmap designed to wow prospects, customers, press and analysts…or at least that is the way it usually works. Open source, however, changes this equation.

As described here, the vision for extending Hadoop beyond its batch-only roots in support of interactive and real-time workloads was set by Arun Murthy back in 2008. The initiation of YARN, the key technology for enabling this vision, started in earnest in 2011, was declared GA by the community in the recent Apache Hadoop 2.2 release, and is now delivered for mainstream enterprises and the broader commercial ecosystem with the release of Hortonworks Data Platform 2.0.

HDP 2.0, and its YARN foundation, is a huge milestone for the Hadoop market since it unlocks that vision of gathering all data in Hadoop and interacting with that data in many ways and with predictable performance levels… but you know this because Apache Hadoop 2.2 went GA last week.

If you know Shaun, do you think he knows that only 6.4% of projects costing $10 million or more succeed? (Healthcare.gov website ‘didn’t have a chance in hell’)

HDP 2.0 did take longer than Healthcare.gov.

But putting 18 children in a room isn’t going to produce an 18 year old in a year.

Shaun does a great job in this post and points to other HDP 2.0 resources you will want to see.

BTW, maybe you should not mention the 6.4% success rate to Shaun. It might jinx how open source software development works and succeeds. 😉