Archive for the ‘Hortonworks’ Category

Hadoop, The Perfect App for OpenStack

Tuesday, April 16th, 2013

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrives, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

Tuesday, April 9th, 2013

Apache Hadoop Patterns of Use: Refine, Enrich and Explore by Jim Walter.

From the post:

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?” Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic. The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Enrich: Collect data, analyze and present salient results for online apps.
Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

If you are looking for detailed patterns of use, you will be disappointed.

Runs about nine (9) pages in very high level summary mode.

What remains to be written (to my knowledge) is a collection of use patterns with a realistic amount of detail from a cross-section of Hadoop users.

That would truly be a compelling resource for the community.

Microsoft and Hadoop, Sitting in a Tree…*

Wednesday, February 27th, 2013

Putting the Elephant in the Window by John Kreisa.

From the post:

For several years now Apache Hadoop has been fueling the fast growing big data market and has become the defacto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting but in order to do so with Hadoop they have had to use servers running the Linux operating system. That left a large number of organizations who standardize on Windows (According to IDC, Windows Server owned 73 percent of the market in 2012 – IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

We are very pleased to announce the availability of Hortonworks Data Platform for Windows providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows. HDP is the first and only Hadoop-based platform available on both Windows and Linux and provides interoperability across Windows, Linux and Windows Azure. With this release we are enabling a massive expansion of the Hadoop ecosystem. New participants in the community of developers, data scientist, data management professionals and Hadoop fans to build and run applications for Apache Hadoop natively on Windows. This is great news for Windows focused enterprises, service provides, software vendors and developers and in particular they can get going today with Hadoop simply by visiting our download page.

This release would not be possible without a strong partnership and close collaboration with Microsoft. Through the process of creating this release, we have remained true to our approach of community-driven enterprise Apache Hadoop by collecting enterprise requirements, developing them in open source and applying enterprise rigor to produce a 100-precent open source enterprise-grade Hadoop platform.

Now there is a very smart marketing move!

A smaller share of a larger market is always better than a large share of a small market.

(You need to be writing down these quips.) ;-)

Seriously, take note of how Hortonworks used the open source model.

They did not build Hadoop in their image and try to sell it to the world.

Hortonworks gathered requirements from others and built Hadoop to meet their needs.

Open source model in both cases, very different outcomes.

* I didn’t remember the rhyme beyond the opening line. Consulting the oracle (Wikipedia), I discovered Playground song. ;-)

The Stinger Initiative: Making Apache Hive 100 Times Faster

Wednesday, February 20th, 2013

The Stinger Initiative: Making Apache Hive 100 Times Faster by Alan Gates.

From the post:

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or though a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases. These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

Leveraging on existing skills and infrastructure.

Who knows? Hortonworks maybe about to start a trend!

Doing More with the Hortonworks Sandbox

Tuesday, February 5th, 2013

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced garnering incredibly positive response and feedback. We are as excited as you, and gratified that our goal providing the fastest onramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy:

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

Hadoop – “State of the Union” – Notes

Friday, January 25th, 2013

I took notes on Shaun Connolly’s “Hortonworks State of the Union and Vision for Apache Hadoop in 2013.”

Unless you are an open source/Hadoop innocent, you aren’t going to gain much new information from the webinar.

But there is another reason for watching the webinar.

That is the strategy Hortonworks is pursuing in developing the Hadoop ecosystem.

Shaun refers to it in various ways (warning, some paraphrasing): “Investing in making Hadoop work with existing infrastructure,” add Hadoop to (not replace) traditional data architectures, “customers don’t want more data silos.”

Rather than a rip-and-replace technology, Hortonworks is building a Hadoop ecosystem that interacts with and compliments existing data architectures.

Think about that for a moment.

Works with existing data architectures.

Which means everyone from ordinary users to power users and sysadmins, can work from what they know and gain the benefits of a Hadoop ecosystem.

True enough, over time some (all?) of their traditional data architectures may become more Hadoop based but that will be a gradual process.

In the meantime, the benefits of Hadoop will be made manifest in the context of familiar tooling.

When a familiar tool runs faster, acquires new capabilities, user’s notice the change along with the lack of a learning curve.

Watch the webinar for the strategy and think about how to apply it to your favorite semantic technology.

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop

Tuesday, January 22nd, 2013

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop by Cheryle Custer.

From the post:

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop. The Hortonworks Sandbox provides a powerful combination of hands-on, step-by-step tutorials paired with an easy to use Web interface designed to lower the learning curve for people who just want to explore and evaluate Hadoop, as quickly as possible.

BTW, the tutorials can be refreshed to load new tutorials as they are released.

A marketing/teaching strategy that merits imitation by others.

Mahout on Windows Azure…

Tuesday, January 22nd, 2013

Mahout on Windows Azure – Machine Learning Using Microsoft HDInsight by Istvan Szegedi.

From the post:

Our last post was about Microsoft and Hortonworks joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features.These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on Apache Mahout wiki.

Are you hearing Hadoop, Mahout, HBase, Hive, etc., as often as I am?

Does it make you wonder about Apache becoming the locus of transferable IT skills?

Something to think about as you are developing topic map ecosystems.

You can hand roll your own solutions.

Or build upon solutions that have widespread vendor support.

PS: Another great post from Istvan.

Hadoop Bingo [The More Things Change..., First Game 22nd Jan. 2013]

Sunday, January 20th, 2013

Don’t be Tardy for This Hadoop BINGO Party! by Kim Truong.

It had to happen. Virtual “door prizes” for attending webinars, now bingo games.

I’m considering a contest to guess what the next 1950′s/60′s marketing tool will appear next. ;-)

From the post:

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends, co-workers together and be prepared to geek out to Hadoop!

This 4-part series will have a mixture of amazing guest speakers covering topics such as Hortonworks 2013 vision and roadmaps for Apache Hadoop and Big Data, What’s new with Hortonworks Data Platform v1.2, How Luminar (an Entravision company) adopted Apache Hadoop, and use case on Hadoop, R and GoogleVis. This series will provide organizations an opportunity to gain a better understanding of Apache Hadoop and Big Data landscape and practical guidance on how to leverage Hadoop as part of your Big Data strategy.

How is that a party?

Don’t be confused. The True Value of Apache Hadoop is the series name and Hortonworks State of the Union and Vision for Apache Hadoop in 2013 is the first webinar title. My note on the “State of the Union.”

Don’t get me wrong. Entirely appropriate to recycle 1950′s/60′s techniques (or older).

We are people and people haven’t changed in terms of motivations, virtues or vices in recorded history.

If the past works, use it.

Hadoop “State of the Union” [Webinar]

Saturday, January 19th, 2013

Hortonworks State of the Union and Vision for Apache Hadoop in 2013 by Kim Rose.

From the post:

Who: Shaun Connolly, Vice President of Corporate Strategy, Hortonworks

When: Tuesday, January 22, 2013 at 1:00 p.m. ET/10:00am PT

Where: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html

Click to Tweet: #Hortonworks hosting “State of the Union” webinar to discuss 2013 vision for #Hadoop, 1/22 at 1 pm ET. Register here: http://bit.ly/VYJxKX

The “State of the Union” webinar is the first in a four-part Hortonworks webinar series titled, “The True Value of Apache Hadoop,” designed to inform attendees of key trends, future roadmaps, best practices and the tools necessary for the successful enterprise adoption of Apache Hadoop.

During the “State of the Union,” Connolly will look at key company highlights from 2012, including the release of the Hortonworks Data Platform (HDP)—the industry’s online 100-percent open source platform powered by Apache Hadoop—and the further development of the Hadoop ecosystem through partnerships with leading software vendors, such as Microsoft and Teradata. Connolly will also provide insight into upcoming initiatives and projects that the Company plans to focus on this year as well as topical advances in the Apache Hadoop community.

Attendees will learn:

  • How Hortonworks’ focus contributes to innovation within the Apache open source community while addressing enterprise requirements and ecosystem interoperability;
  • About the latest releases in the Hortonworks product offering; and
  • About Hortonworks’ roadmap and major areas of investment across core platform, data and operational services for productive operations and management.

For more information, or to register for the “State of the Union” webinar, please visit: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html.

You will learn more from this “State of the Union” address than any similarly titled presentations with Congressional responses and the sycophantic choruses that accompany them.

Hortonworks Data Platform 1.2 Available Now!

Friday, January 18th, 2013

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

There goes the weekend!

Big Graph Data on Hortonworks Data Platform

Thursday, December 13th, 2012

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures makes global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius‘ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

Why not RAID-0? It’s about Time and Snowflakes

Friday, November 9th, 2012

Why not RAID-0? It’s about Time and Snowflakes by Steve Loughran.

From the post:

A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disks array?”

Steve uses empirical data on disk storage to explain why to avoid RAID-0 when using Hadoop.

As nice a summary as you are likely to find.

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight…(Hortonworks Inside)

Thursday, October 25th, 2012

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside) by Russell Jurney.

From the post:

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out but to give you the idea of what the first proper interface on Hadoop feels like…

My thoughts on reading Russell’s post:

  • A live product demo that did not fail? Really?
  • Is that tatoo copyrighted?
  • Oh, yes, +1!, big data has become real for millions of users.

How’s that for a big data book, tutorial, consulting, semantic market explosion?

Why Microsoft is committed to Hadoop and Hortonworks

Thursday, October 25th, 2012

Why Microsoft is committed to Hadoop and Hortonworks (a buest post at Hortonworks by Microsoft’s Dave Campbell).

From the post:

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure. To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.

With this expanded partnership, the Hadoop community will reap the following benefits of Hadoop on Windows:

  • Insights to all users from all data:….
  • Enterprise-ready Hadoop with HDInsight:….
  • Simplicity of Windows for Hadoop:….
  • Extend your data warehouse with Hadoop:….
  • Seamless Scale and Elasticity of the Cloud:….

This is a very exciting milestone, and we hope you’ll join us for the ride as we continue partnering with Hortonworks to democratize big data. Download HDInsight today at Microsoft.com/BigData.

See Dave’s post for the details on “benefits of Hadoop on Windows” and then like the man says:

Download HDInsight today at Microsoft.com/BigData.

Enabling Big Data Insight for Millions of Windows Developers [Your Target Audience?]

Thursday, October 25th, 2012

Enabling Big Data Insight for Millions of Windows Developers by Shaun Connolly.

From the post:

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

You don’t have to wonder long what Shaun is reacting to:

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premise and cloud deployment that is fully compatible with Apache Hadoop.

Enabling big data insight isn’t the same as capturing those insights for later use or re-use.

May just be me, but that sounds like a great opportunity for topic maps.

Bringing semantics to millions of Windows developers that is.

HBase Futures

Monday, October 22nd, 2012

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

HBase at Hortonworks: An Update [Features, Consumer Side?]

Monday, October 22nd, 2012

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low latency Hadoop use-cases; As a publishing platform, HBase exposes data refined in Hadoop to outside systems; As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for latte’s on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?

Bigger Than A Bread Box

Thursday, October 18th, 2012

Hortonworks & Teradata: More Than Just an Elephant in a Box by Jim Walker.

I’m not going to wake up Christmas morning to find:

Teredata

But in case you are in the market for a big analytics hardware/software appliance, Jim writes:

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on machine that is ready to plug-in and bring big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach but it lacks integration of metadata and metadata exchange. With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Just as well.

Red doesn’t really go with my office decor. Runs more towards the hulking black server tower, except for the artificial pink tree in the corner. ;-)

Apache Hadoop YARN Meetup at Hortonworks

Thursday, October 18th, 2012

Apache Hadoop YARN Meetup at Hortonworks – Recap by Vinod Kumar Vavilapalli.

Just in case you missed the Apache Hadoop YARN meetup, summaries and slides are available for:

  • Chris Riccomini’s on “Building Applications on YARN”
  • YARN API Discussion
  • Efforts Underway

Enjoy!

Ready to Contribute to Apache Hadoop 2.0?

Tuesday, October 16th, 2012

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!

Details:

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

YARN Meetup at Hortonworks on Friday, Oct 12

Thursday, October 4th, 2012

YARN Meetup at Hortonworks on Friday, Oct 12 by Russell Jurney.

From the post:

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open-src and otherwise, are porting to work in YARN such as Storm, S4 and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc application on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, it’s current status and medium-term and long-term roadmap.

OK, it’s probably too late to get cheap tickets but if you are in New York on the 12th of October, take advantage of the opportunity!

And please blog about the meeting, with a note to yours truly! I will post a link to your posting.

Pig Out to Hadoop (Replay) [Restore Your Faith in Webinars]

Sunday, September 23rd, 2012

Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)

From the description:

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.

I should have been watching more closely for this webinar recording to get posted.

Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.

I have suffered through several lately where introductions took more time than actual technical content of the webinar.

Hard to know until you have already registered and spent time expecting substantive content.

Is there a public tally board for webinars on search, semantics, big data, etc.?

Welcome Hortonworks Data Platform 1.1

Wednesday, September 12th, 2012

Welcome Hortonworks Data Platform 1.1 by Jim Walker.

From the post:

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

New features include high availability, capturing data streams (Flume), improved operations management and performance increases.

For the details, see the post, documentation or even download Hortonworks Data Platform 1.1 for a spin.

Unlike Odo’s Klingon days, a day with several items from Hortonworks is a good day. Enjoy!

How To Take Big Data to the Cloud [Webinar - 13 Sept 2012 - 10 AM PDT]

Wednesday, September 12th, 2012

How To Take Big Data to the Cloud by Lisa Sensmeier.

From the post:

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the CloudsBig-Data-and-the-cloud

Setting up a big data cluster can be difficult, especially considering the assembly of all the all the equipment, power, and space to make it happen. One option to consider is using the cloud for a practical and economical way to go. The cloud is also used to provide extra capacity for an existing cluster or for test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks.

OK, it is a vendor/partner presentation but most of us work for vendors and use vendor created tools.

Yes?

The real question is whether tool X does what is necessary at a cost project Y can afford?

Whether vendor sponsored tool, service, home grown or otherwise.

Yes?

Looking forward to it!

Apache Hadoop YARN – NodeManager

Wednesday, September 12th, 2012

Apache Hadoop YARN – NodeManager by Vinod Kumar Vavilapalli

From the post:

In the previous post, we briefly covered the internals of Apache Hadoop YARN’s ResourceManager. In this post, which is the fourth in the multi-part YARN blog series, we are going to dig deeper into the NodeManager internals and some of the key-features that NodeManager exposes. Part one, two and three are available.

Introduction

The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to date with the ResourceManager (RM), overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers, tracking node-health, log’s management and auxiliary services which may be exploited by different YARN applications.

Administration isn’t high on the “exciting” list, although without good administration, things can get very “exciting.”

NodeManager gives you the monitoring tools to help avoid the latter form of excitement.

Meet the Committer, Part One: Alan Gates

Thursday, September 6th, 2012

Meet the Committer, Part One: Alan Gates by Kim Truong.

From the post:

Series Introduction

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (twitter:@alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

My only complaint is that the interview is too short!

Looking forward to the Pig webinar!

New ‘The Future of Apache Hadoop’ Season!

Wednesday, September 5th, 2012

OK, the real title is: Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

From the post:

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache Zookeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

Coming to a computer near you:

  • Pig Out on Hadoop (Alan Gates): Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
  • Deployment and Management of Hadoop Clusters with Ambari (Matt Foley): Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
  • Scaling Apache Zookeeper for the Next Generation of Hadoop Applications (Mahadev Konar): Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
  • YARN: The Future of Data Processing with Apache Hadoop ( Arun C. Murthy): Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET

Registration is open so get it on your calendar!

Pig Performance and Optimization Analysis

Tuesday, September 4th, 2012

Pig Performance and Optimization Analysis by Li Jie.

From the post:

In this post, Hortonworks Intern Li Jie talks about his work this summer on performance analysis and optimization of Apache Pig. Li is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.

If you need to optimize Pig operations, this is a very good starting place.

Be sure to grab a copy of Running TPC-H on Pig by Li Jie, Koichi Ishida, Xuan Wang and Muzhi Zhao, with its “Six Rules of Writing Efficient Pig Scripts.”

Expect to see all three of these authors in DBLP sooner rather than later.

DBLP: Shivnath Babu

Apache Hadoop YARN – ResourceManager

Friday, August 31st, 2012

Apache Hadoop YARN – ResourceManager by Arun Murthy

From the post:

This is the third post in the multi-part series to cover important aspects of the newly formed Apache Hadoop YARN sub-project. In our previous posts (part one, part two), we provided the background and an overview of Hadoop YARN, and then covered the key YARN concepts and walked you through how diverse user applications work within this new system.

In this post, we are going to delve deeper into the heart of the system – the ResourceManager.

In case your data processing needs run towards the big/large end of the spectrum.