Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 16, 2013

Hadoop Tutorials – Hortonworks

Filed under: Hadoop,HCatalog,HDFS,Hive,Hortonworks,MapReduce,Pig — Patrick Durusau @ 4:49 pm

With the GA release of Hadoop 2, it seems appropriate to list a set of tutorials for the Hortonworks Sandbox.

Tutorial 1: Hello World – An Overview of Hadoop with HCatalog, Hive and Pig

Tutorial 2: How To Process Data with Apache Pig

Tutorial 3: How to Process Data with Apache Hive

Tutorial 4: How to Use HCatalog, Pig & Hive Commands

Tutorial 5: How to Use Basic Pig Commands

Tutorial 6: How to Load Data for Hadoop into the Hortonworks Sandbox

Tutorial 7: How to Install and Configure the Hortonworks ODBC driver on Windows 7

Tutorial 8: How to Use Excel 2013 to Access Hadoop Data

Tutorial 9: How to Use Excel 2013 to Analyze Hadoop Data

Tutorial 10: How to Visualize Website Clickstream Data

Tutorial 11: How to Install and Configure the Hortonworks ODBC driver on Mac OS X

Tutorial 12: How to Refine and Visualize Server Log Data

Tutorial 13: How To Refine and Visualize Sentiment Data

Tutorial 14: How To Analyze Machine and Sensor Data
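
Several of the tutorials above (2, 4 and 5) center on Pig's load/filter/group dataflow style. As a rough sketch of that style, with made-up data and field names, and plain Python standing in for Pig Latin:

```python
from collections import defaultdict

# Toy relation, standing in for data LOADed from HDFS:
# (user, url, response_time_ms) -- invented for illustration.
logs = [
    ("alice", "/index", 120),
    ("bob", "/search", 340),
    ("alice", "/search", 95),
    ("bob", "/index", 210),
]

# FILTER: keep requests slower than 100 ms.
slow = [t for t in logs if t[2] > 100]

# GROUP BY user, then FOREACH ... GENERATE an average per group.
groups = defaultdict(list)
for user, _url, ms in slow:
    groups[user].append(ms)

avg_by_user = {user: sum(ms) / len(ms) for user, ms in groups.items()}
print(avg_by_user)  # {'alice': 120.0, 'bob': 275.0}
```

The real tutorials write the same three steps (FILTER, GROUP, FOREACH ... GENERATE) in Pig Latin against files in HDFS.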

By the time you finish these, I am sure there will be more tutorials or even proposed additions to the Hadoop stack!

(Updated December 3, 2013 to add #13 and #14.)

October 7, 2013

Hortonworks Sandbox – Default Instructional Tool?

Filed under: BigData,Eclipse,Hadoop,Hortonworks,Visualization — Patrick Durusau @ 10:07 am

Visualizing Big Data: Actuate, Hortonworks and BIRT

From the post:

Challenge

Hadoop stores data in key-value pairs. While the raw data is accessible to view, to be usable it needs to be presented in a more intuitive visualization format that will allow users to glean insights at a glance. While a business analytics tool can help business users gather those insights, to do so effectively requires a robust platform that can:

  • Work with expansive volumes of data
  • Offer standard and advanced visualizations, which can be delivered as reports, dashboards or scorecards
  • Be scalable to deliver these visualizations to a large number of users

Solution

When paired with Hortonworks, Actuate adds data visualization support for the Hadoop platform, using Hive queries to access data from Hortonworks. Actuate’s commercial product suite – built on open source Eclipse BIRT – extracts data from Hadoop, pulling data sets into interactive BIRT charts, dashboards and scorecards, allowing users to view and analyze data (see diagram below). With Actuate’s familiar approach to presenting information in easily modified charts and graphs, users can quickly identify patterns, resolve business issues and discover opportunities through personalized insights. This is further enhanced by Actuate’s inherent ability to combine Hadoop data with more traditional data sources in a single visualization screen or dashboard.

A BIRT/Hortonworks “Sandbox” for both the Eclipse open source and commercial versions of BIRT is now available. As a full HDP environment on a virtual machine, the Sandbox allows users to start benefiting quickly from Hortonworks’ distribution of Hadoop with BIRT functionality.
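
The "key-value pairs" the post starts from are easy to picture. A minimal sketch, with invented clickstream data, of rolling raw pairs up into the kind of chart-ready rows a BIRT dashboard would plot:

```python
from collections import Counter

# Raw key-value records of the kind a Hive query over Hadoop data
# might return: (page, hit) pairs from a clickstream (made-up data).
pairs = [("home", 1), ("products", 1), ("home", 1),
         ("checkout", 1), ("home", 1), ("products", 1)]

# Roll the pairs up into chart-ready (label, value) rows.
hits = Counter()
for page, n in pairs:
    hits[page] += n

chart_rows = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
print(chart_rows)  # [('home', 3), ('products', 2), ('checkout', 1)]
```

In the Actuate/Hortonworks setup this aggregation happens in Hive; the visualization layer only sees the summarized rows.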

If you know about “big data” you should be familiar with the Hortonworks Sandbox.

Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

What you may not know is that Hortonworks partners are creating additional tutorials based on the sandbox.

I count seven (7) to date and more are coming.

The Sandbox may become the default instructional tool for Hadoop.

That would be a benefit to all users, whatever the particulars of their environments.

October 1, 2013

Get Started with Hadoop

Filed under: BigData,Hadoop,Hortonworks — Patrick Durusau @ 6:16 pm

Get Started with Hadoop

If you want to avoid being a Gartner statistic or hear big data jokes involving the name of your enterprise, this is a page to visit.

Hortonworks, one of the leading contributors to the Hadoop ecosystem, has assembled resources targeted at developers, analysts and systems administrators.

There are videos, tutorials and even a Hadoop sandbox.

All of which are free.

The choice is yours: Spend enterprise funds and hope to avoid failure or spend some time and plan for success.

September 5, 2013

Stinger Phase 2:…

Filed under: Hive,Hortonworks,SQL,STINGER — Patrick Durusau @ 6:28 pm

Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop by Carter Shanklin.

From the post:

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

If you don’t think SQL is all that weird, ;-), this is a status update for you!

Serious progress is being made by a broad coalition of more than 60 developers.

Take the challenge and download HDP 2.0 Beta.

You can help build the future of SQL-IN-Hadoop.

But only if you participate.

September 3, 2013

…Integrate Tableau and Hadoop…

Filed under: Hadoop,Hortonworks,Tableau — Patrick Durusau @ 7:34 pm

How To Integrate Tableau and Hadoop with Hortonworks Data Platform by Kim Truong.

From the post:

Chances are you’ve already used Tableau Software if you’ve been involved with data analysis and visualization solutions for any length of time. Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Hadoop with Hortonworks Data Platform via Hive and the Hortonworks Hive ODBC driver.

If you want to get hands on with Tableau as quickly as possible, we recommend using the Hortonworks Sandbox and the ‘Visualize Data with Tableau’ tutorial.

(…)

Kim has a couple of great resources from Tableau to share with you so jump to her post now.

That’s right. I want you to look at someone else’s blog. Won’t catch on at capture sites with advertising but then that’s not me.

June 29, 2013

Hortonworks Data Platform 2.0 Community…

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 3:38 pm

Hortonworks Data Platform 2.0 Community Preview Now Available

June 26, 2013—Hortonworks, a leading contributor and provider to enterprise Apache™ Hadoop®, today announced the availability of the Hortonworks Data Platform (HDP) 2.0 Community Preview and the launch of the Hortonworks Certification Program for Apache Hadoop YARN to accelerate the availability of YARN-based partner solutions. Based on the next evolution of Apache Hadoop, including the first functional Apache YARN framework that has been more than four years in the making, the 100-percent open source HDP 2.0 features the latest advancements from the open source community that are igniting a new wave of Hadoop innovation.

[Jumping to the chase]

Please join Hortonworks for a webinar on HDP 2.0 on Wednesday, July 10 at 10 a.m. PT / 1:00 p.m. ET. To register for the webinar, please visit: http://bit.ly/1226vAP.

Availability

Hortonworks Data Platform 2.0 Community Preview is available today as a downloadable single-node instance that runs inside a virtual machine, and also as a complete installation for deployment to distributed infrastructure. To download HDP 2.0, please visit: http://bit.ly/15DBbd1.

New in this release: Apache YARN, Apache Tez, and Stinger.

June 26, 2013

Hadoop YARN

Filed under: Hadoop YARN,Hortonworks,MapReduce — Patrick Durusau @ 10:15 am

Hadoop YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

A next-generation framework for Hadoop data processing.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was borne of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.

As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. Many organizations are already building applications on YARN in order to bring them IN to Hadoop.

(…)

One of the more accessible explanations of the importance of Hadoop YARN.

Likely not anything new to you but may be helpful when talking to others.
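
When talking to others, it can help to have the model YARN generalizes in front of you. The MapReduce pattern that YARN factors resource management out of can be sketched in a few lines of plain Python (toy documents, no Hadoop involved):

```python
from collections import defaultdict

docs = ["hadoop stores data", "yarn schedules hadoop jobs"]

# Map: emit (word, 1) for every word in every document.
mapped = [(w, 1) for doc in docs for w in doc.split()]

# Shuffle: group values by key, as the framework does between phases.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: sum each word's counts.
counts = {word: sum(vals) for word, vals in shuffled.items()}
print(counts["hadoop"])  # 2
```

With YARN, the map/shuffle/reduce pattern becomes just one of many application types sharing the cluster's resource management.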

Better Content on Memory Stick?

Filed under: Hadoop,Hortonworks,Marketing — Patrick Durusau @ 9:40 am

Sandbox on Memory Stick (pic)

There was talk over at LinkedIn about marketing for topic maps.

Here’s an idea.

No mention of topic maps on the outside, but without an install, configuring paths, etc., the user gets a topic map engine plus content.

Topical content for the forum where the sticks are being distributed.

Plug and compare results to your favorite search sewer.

Limited range of data.

But if I am supposed to be searching SEC mandated financial reports and related data, not being able to access Latvian lingerie ads is probably ok. With management at least.*

I first saw this in a tweet by shaunconnolly.

Suggestions for content?


* Just an aside, but curated content could not only provide better search results but also eliminate results that may distract staff from the task at hand.

Better than filters, etc. Other content would simply not be an option.

June 15, 2013

Hortonworks Sandbox (1.3): Stinger, Visualizations and Virtualization

Filed under: BigData,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:13 pm

Hortonworks Sandbox: Stinger, Visualizations and Virtualization by Cheryle Custer.

From the post:

A couple of weeks ago, we released several new Hadoop tutorials showcasing real-life use cases, and you can read about them here. Today, we're delighted to bring to you the newest release of the Hortonworks Sandbox 1.3. The Hortonworks Sandbox allows you to go from Zero to Big Data in 15 Minutes through step-by-step hands-on Hadoop tutorials. The Sandbox is a fully functional single-node personal Hadoop environment, where you can add your own data sets, validate your Hadoop use cases and build a small proof-of-concept.

Update of your favorite way to explore Hadoop!

Get the sandbox here.

May 30, 2013

Hadoop Tutorials: Real Life Use Cases in the Sandbox

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:56 pm

Hadoop Tutorials: Real Life Use Cases in the Sandbox by Cheryle Custer.

Six (6) new tutorials from Hortonworks:

  • Tutorial 6 – Loading Data into the Hortonworks Sandbox
  • Tutorials 7 & 11 – Installing the ODBC Driver in the Hortonworks Sandbox (Windows and Mac)
  • Tutorials 8 & 9 – Accessing and Analyzing Data in Excel
  • Tutorial 10 – Visualizing Clickstream Data

You have done the first five (5).

Yes?

Hortonworks Data Platform 1.3 Release

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:51 pm

Hortonworks Data Platform 1.3 Release: The community continues to power innovation in Hadoop by Jeff Sposetti.

From the post:

HDP 1.3 release delivers on community-driven innovation in Hadoop with SQL-IN-Hadoop, and continued ease of enterprise integration and business continuity features.

Almost one year ago (50 weeks to be exact) we released Hortonworks Data Platform 1.0, the first 100% open source Hadoop platform into the marketplace. The past year has been dynamic to say the least! However, one thing has remained constant: the steady, predictable cadence of HDP releases. In September 2012 we released 1.1, this February gave us 1.2 and today we’re delighted to release HDP 1.3.

HDP 1.3 represents yet another significant step forward and allows customers to harness the latest innovation around Apache Hadoop and its related projects in the open source community. In addition to providing a tested, integrated distribution of these projects, HDP 1.3 includes a primary focus on enhancements to Apache Hive, the de-facto standard for SQL access in Hadoop as well as numerous improvements that simplify ease of use.

Whatever the magic dust is for a successful open source project, the Hadoop community has it in abundance.

May 21, 2013

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Filed under: Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 4:54 pm

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA! by John Kreisa.

From the post:

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away by the number and size of organizations that have downloaded the beta bits of this 100% open source, native-to-Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft's HDInsight Service.

Two lessons here:

First, Hadoop is a very popular way to address enterprise big data.

Second, going where users are, not where they ought to be, is a smart business move.

April 16, 2013

Hadoop, The Perfect App for OpenStack

Filed under: Cloud Computing,Hadoop,Hortonworks,OpenStack — Patrick Durusau @ 6:03 pm

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrives, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

April 9, 2013

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 9:27 am

Apache Hadoop Patterns of Use: Refine, Enrich and Explore by Jim Walter.

From the post:

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?” Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic. The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

Refine: Collect data and apply a known algorithm to it in a trusted operational process.
Enrich: Collect data, analyze and present salient results for online apps.
Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

If you are looking for detailed patterns of use, you will be disappointed.

Runs about nine (9) pages in very high-level summary mode.

What remains to be written (to my knowledge) is a collection of use patterns with a realistic amount of detail from a cross-section of Hadoop users.

That would truly be a compelling resource for the community.

February 27, 2013

Microsoft and Hadoop, Sitting in a Tree…*

Filed under: Hadoop,Hortonworks,MapReduce,Microsoft — Patrick Durusau @ 2:55 pm

Putting the Elephant in the Window by John Kreisa.

From the post:

For several years now Apache Hadoop has been fueling the fast growing big data market and has become the defacto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting but in order to do so with Hadoop they have had to use servers running the Linux operating system. That left a large number of organizations who standardize on Windows (According to IDC, Windows Server owned 73 percent of the market in 2012 – IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

We are very pleased to announce the availability of Hortonworks Data Platform for Windows providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows. HDP is the first and only Hadoop-based platform available on both Windows and Linux and provides interoperability across Windows, Linux and Windows Azure. With this release we are enabling a massive expansion of the Hadoop ecosystem: new participants in the community of developers, data scientists, data management professionals and Hadoop fans can build and run applications for Apache Hadoop natively on Windows. This is great news for Windows-focused enterprises, service providers, software vendors and developers; in particular, they can get going today with Hadoop simply by visiting our download page.

This release would not be possible without a strong partnership and close collaboration with Microsoft. Through the process of creating this release, we have remained true to our approach of community-driven enterprise Apache Hadoop by collecting enterprise requirements, developing them in open source and applying enterprise rigor to produce a 100-percent open source enterprise-grade Hadoop platform.

Now there is a very smart marketing move!

A smaller share of a larger market is always better than a large share of a small market.

(You need to be writing down these quips.) 😉

Seriously, take note of how Hortonworks used the open source model.

They did not build Hadoop in their image and try to sell it to the world.

Hortonworks gathered requirements from others and built Hadoop to meet their needs.

Open source model in both cases, very different outcomes.

* I didn’t remember the rhyme beyond the opening line. Consulting the oracle (Wikipedia), I discovered Playground song. 😉

February 20, 2013

The Stinger Initiative: Making Apache Hive 100 Times Faster

Filed under: Hive,Hortonworks — Patrick Durusau @ 9:23 pm

The Stinger Initiative: Making Apache Hive 100 Times Faster by Alan Gates.

From the post:

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or though a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases. These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.
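
The "human-time" workloads Stinger targets are ordinary SQL reports. A hypothetical example, run here against sqlite3 purely for illustration; the GROUP BY query itself is equally valid HiveQL over a Hive table:

```python
import sqlite3

# Stand-in for a Hive table; schema and rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 10.0), ("west", 25.0), ("east", 5.0)])

# An interactive, report-style aggregation of the kind Stinger aims to
# answer in the 5-30 second range on Hadoop-scale tables.
rows = con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 15.0), ('west', 25.0)]
```

The point of Stinger is that queries like this keep their HiveQL form; only the latency changes.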

Leveraging on existing skills and infrastructure.

Who knows? Hortonworks may be about to start a trend!

February 5, 2013

Doing More with the Hortonworks Sandbox

Filed under: Data,Dataset,Hadoop,Hortonworks — Patrick Durusau @ 2:01 pm

Doing More with the Hortonworks Sandbox by Cheryle Custer.

From the post:

The Hortonworks Sandbox was recently introduced, garnering an incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest on-ramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials. We will continue to announce new tutorials via the Hortonworks blog, opt-in email and Twitter (@hortonworks).

While you wait for more tutorials, Cheryle points to some data sets to keep you busy.

For advice, see the Sandbox Forums.

BTW, while you are munging across different data sets, be sure to notice any semantic impedance if you try to merge some data sets.

If you don’t want everyone in your office doing that merging one-off, you might want to consider topic maps.

Design and document a merge between data sets once, run many times.

Even if your merging requirements change. Just change that part of the map, don’t re-create the entire map.
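
A toy sketch of the "design the merge once, run many times" idea, with invented record layouts; the alignment table is the only thing that changes when requirements change:

```python
# Two data sets that disagree on field names (invented examples).
crm = [{"cust_id": 1, "name": "Acme"}]
billing = [{"customerNumber": 1, "balance": 42.0}]

# The "map": a once-written, declarative alignment of fields that
# identify the same subject. Edit this dict, not the merge code below.
alignment = {"cust_id": "id", "customerNumber": "id"}

def normalize(record):
    # Rename aligned fields; pass everything else through unchanged.
    return {alignment.get(k, k): v for k, v in record.items()}

merged = {}
for record in crm + billing:
    r = normalize(record)
    merged.setdefault(r["id"], {}).update(r)

print(merged[1])  # {'id': 1, 'name': 'Acme', 'balance': 42.0}
```

A topic map engine does far more than field renaming, but the economics are the same: the alignment is declared once and reused, rather than re-coded in every one-off merge.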

What if mapping companies recreated their maps for every new street?

Or would it be better to add the new street to an existing map?

If that looks obvious, try the extra-bonus question:

Which model, new map or add new street, do you use for schema migration?

January 25, 2013

Hadoop – “State of the Union” – Notes

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 8:15 pm

I took notes on Shaun Connolly’s “Hortonworks State of the Union and Vision for Apache Hadoop in 2013.”

Unless you are an open source/Hadoop innocent, you aren’t going to gain much new information from the webinar.

But there is another reason for watching the webinar.

That is the strategy Hortonworks is pursuing in developing the Hadoop ecosystem.

Shaun refers to it in various ways (warning, some paraphrasing): “Investing in making Hadoop work with existing infrastructure,” add Hadoop to (not replace) traditional data architectures, “customers don’t want more data silos.”

Rather than a rip-and-replace technology, Hortonworks is building a Hadoop ecosystem that interacts with and complements existing data architectures.

Think about that for a moment.

Works with existing data architectures.

Which means everyone from ordinary users to power users and sysadmins, can work from what they know and gain the benefits of a Hadoop ecosystem.

True enough, over time some (all?) of their traditional data architectures may become more Hadoop based but that will be a gradual process.

In the meantime, the benefits of Hadoop will be made manifest in the context of familiar tooling.

When a familiar tool runs faster or acquires new capabilities, users notice the change, along with the lack of a learning curve.

Watch the webinar for the strategy and think about how to apply it to your favorite semantic technology.

January 22, 2013

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 2:42 pm

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop by Cheryle Custer.

From the post:

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with the fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop. The Hortonworks Sandbox provides a powerful combination of hands-on, step-by-step tutorials paired with an easy to use Web interface designed to lower the learning curve for people who just want to explore and evaluate Hadoop, as quickly as possible.

BTW, the tutorials can be refreshed to load new tutorials as they are released.

A marketing/teaching strategy that merits imitation by others.

Mahout on Windows Azure…

Filed under: Azure Marketplace,Hadoop,Hortonworks,Machine Learning,Mahout — Patrick Durusau @ 2:42 pm

Mahout on Windows Azure – Machine Learning Using Microsoft HDInsight by Istvan Szegedi.

From the post:

Our last post was about Microsoft and Hortonworks' joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com's features. These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on the Apache Mahout wiki.

Are you hearing Hadoop, Mahout, HBase, Hive, etc., as often as I am?

Does it make you wonder about Apache becoming the locus of transferable IT skills?

Something to think about as you are developing topic map ecosystems.

You can hand roll your own solutions.

Or build upon solutions that have widespread vendor support.

PS: Another great post from Istvan.

January 20, 2013

Hadoop Bingo [The More Things Change…, First Game 22nd Jan. 2013]

Filed under: BigData,Hadoop,Hortonworks — Patrick Durusau @ 8:03 pm

Don’t be Tardy for This Hadoop BINGO Party! by Kim Truong.

It had to happen. Virtual “door prizes” for attending webinars, now bingo games.

I’m considering a contest to guess which 1950s/’60s marketing tool will appear next. 😉

From the post:

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends, co-workers together and be prepared to geek out to Hadoop!

This 4-part series will have a mixture of amazing guest speakers covering topics such as Hortonworks’ 2013 vision and roadmaps for Apache Hadoop and Big Data, what’s new with Hortonworks Data Platform v1.2, how Luminar (an Entravision company) adopted Apache Hadoop, and a use case on Hadoop, R and GoogleVis. This series will provide organizations with an opportunity to gain a better understanding of the Apache Hadoop and Big Data landscape and practical guidance on how to leverage Hadoop as part of your Big Data strategy.

How is that a party?

Don’t be confused. The True Value of Apache Hadoop is the series name and Hortonworks State of the Union and Vision for Apache Hadoop in 2013 is the first webinar title. My note on the “State of the Union.”

Don’t get me wrong. Entirely appropriate to recycle 1950’s/60’s techniques (or older).

We are people and people haven’t changed in terms of motivations, virtues or vices in recorded history.

If the past works, use it.

January 19, 2013

Hadoop “State of the Union” [Webinar]

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 7:07 pm

Hortonworks State of the Union and Vision for Apache Hadoop in 2013 by Kim Rose.

From the post:

Who: Shaun Connolly, Vice President of Corporate Strategy, Hortonworks

When: Tuesday, January 22, 2013 at 1:00 p.m. ET/10:00am PT

Where: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html

Click to Tweet: #Hortonworks hosting “State of the Union” webinar to discuss 2013 vision for #Hadoop, 1/22 at 1 pm ET. Register here: http://bit.ly/VYJxKX

The “State of the Union” webinar is the first in a four-part Hortonworks webinar series titled, “The True Value of Apache Hadoop,” designed to inform attendees of key trends, future roadmaps, best practices and the tools necessary for the successful enterprise adoption of Apache Hadoop.

During the “State of the Union,” Connolly will look at key company highlights from 2012, including the release of the Hortonworks Data Platform (HDP)—the industry’s only 100-percent open source platform powered by Apache Hadoop—and the further development of the Hadoop ecosystem through partnerships with leading software vendors, such as Microsoft and Teradata. Connolly will also provide insight into upcoming initiatives and projects that the Company plans to focus on this year as well as topical advances in the Apache Hadoop community.

Attendees will learn:

  • How Hortonworks’ focus contributes to innovation within the Apache open source community while addressing enterprise requirements and ecosystem interoperability;
  • About the latest releases in the Hortonworks product offering; and
  • About Hortonworks’ roadmap and major areas of investment across core platform, data and operational services for productive operations and management.

For more information, or to register for the “State of the Union” webinar, please visit: http://info.hortonworks.com/Winterwebinarseries_TheTrueValueofHadoop.html.

You will learn more from this “State of the Union” address than any similarly titled presentations with Congressional responses and the sycophantic choruses that accompany them.

January 18, 2013

Hortonworks Data Platform 1.2 Available Now!

Filed under: Apache Ambari,Hadoop,HBase,Hortonworks,MapReduce — Patrick Durusau @ 7:18 pm

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete 100-percent open source platform powered by Apache Hadoop is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.
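As a rough sketch of what the ODBC integration above looks like in practice, a Hive data source is typically registered in an odbc.ini entry along these lines. The driver path, host, and key names below are illustrative assumptions on my part, not details from the post; check the Hortonworks driver documentation before relying on them:

```ini
; Hypothetical odbc.ini entry for Hive via the Hortonworks ODBC driver.
; Driver path, host, and key names are illustrative assumptions.
[HortonworksHive]
Description = Hive on the Hortonworks Sandbox
Driver      = /usr/lib/hive/lib/native/libhortonworkshiveodbc64.so
Host        = sandbox.example.com
Port        = 10000
Database    = default
```

With a DSN like this in place, BI and visualization tools can query Hive tables much as they would an ordinary SQL database.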

There goes the weekend!

December 13, 2012

Big Graph Data on Hortonworks Data Platform

Filed under: Aurelius Graph Cluster,Faunus,Gremlin,Hadoop,Hortonworks,Titan — Patrick Durusau @ 5:24 pm

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.
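The point about joins can be sketched in a few lines of plain Python. Nothing below uses the Aurelius stack itself; it is only a hypothetical illustration of why a two-hop question is a self-join over flat key/value pairs but a simple traversal once the same data is viewed as a graph:

```python
# Hypothetical "follows" data, GitHub-style, stored two ways.

# 1. Flat key/value pairs, as Hadoop/HBase would hold them.
kv_pairs = [("alice", "bob"), ("bob", "carol"),
            ("alice", "dave"), ("carol", "erin")]

# Answering "who is two hops from alice?" over flat pairs is a self-join:
two_hops_join = {dst2 for src, dst in kv_pairs if src == "alice"
                 for src2, dst2 in kv_pairs if src2 == dst}

# 2. The same data as a graph (adjacency list): the join disappears.
graph = {}
for src, dst in kv_pairs:
    graph.setdefault(src, set()).add(dst)

two_hops_graph = {friend2 for friend in graph.get("alice", ())
                  for friend2 in graph.get(friend, ())}

assert two_hops_join == two_hops_graph == {"carol"}
```

Graph databases like Titan make that second, traversal-style view the native one, which is exactly the abstraction layer the post argues for.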

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

November 9, 2012

Why not RAID-0? It’s about Time and Snowflakes

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 11:51 am

Why not RAID-0? It’s about Time and Snowflakes by Steve Loughran.

From the post:

A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?”

Steve uses empirical data on disk storage to explain why to avoid RAID-0 when using Hadoop.

As nice a summary as you are likely to find.
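The core of Steve’s argument can be sketched numerically. The disk speeds below are made up for illustration: a RAID-0 stripe reads at roughly N times the speed of its slowest member, so one degraded “snowflake” drive drags down every read, while independent disks only slow the tasks scheduled on that one disk:

```python
# Made-up per-disk throughputs (MB/s); disk 4 is a slow "snowflake".
disk_speeds = [100, 100, 100, 40]

# RAID-0: every read is striped across all disks, so the stripe
# finishes at the pace of the slowest member.
raid0_throughput = len(disk_speeds) * min(disk_speeds)

# Separate disks (JBOD): each disk serves its own tasks at full speed,
# so aggregate throughput is simply the sum.
jbod_throughput = sum(disk_speeds)

assert jbod_throughput > raid0_throughput
print(raid0_throughput, jbod_throughput)  # 160 340
```

Hadoop’s scheduler already spreads work across disks, so striping buys little and the weakest drive taxes everything.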

October 25, 2012

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight…(Hortonworks Inside)

Filed under: Hadoop,HDInsight,Hortonworks,Microsoft — Patrick Durusau @ 4:02 pm

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside) by Russell Jurney.

From the post:

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out but to give you the idea of what the first proper interface on Hadoop feels like…

My thoughts on reading Russell’s post:

  • A live product demo that did not fail? Really?
  • Is that tattoo copyrighted?
  • Oh, yes, +1!, big data has become real for millions of users.

How’s that for a big data book, tutorial, consulting, semantic market explosion?

Why Microsoft is committed to Hadoop and Hortonworks

Filed under: BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:53 pm

Why Microsoft is committed to Hadoop and Hortonworks (a guest post at Hortonworks by Microsoft’s Dave Campbell).

From the post:

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure. To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.

With this expanded partnership, the Hadoop community will reap the following benefits of Hadoop on Windows:

  • Insights to all users from all data:….
  • Enterprise-ready Hadoop with HDInsight:….
  • Simplicity of Windows for Hadoop:….
  • Extend your data warehouse with Hadoop:….
  • Seamless Scale and Elasticity of the Cloud:….

This is a very exciting milestone, and we hope you’ll join us for the ride as we continue partnering with Hortonworks to democratize big data. Download HDInsight today at Microsoft.com/BigData.

See Dave’s post for the details on “benefits of Hadoop on Windows” and then like the man says:

Download HDInsight today at Microsoft.com/BigData.

Enabling Big Data Insight for Millions of Windows Developers [Your Target Audience?]

Filed under: Azure Marketplace,BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:39 pm

Enabling Big Data Insight for Millions of Windows Developers by Shaun Connolly.

From the post:

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

You don’t have to wonder long what Shaun is reacting to:

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premise and cloud deployment that is fully compatible with Apache Hadoop.

Enabling big data insight isn’t the same as capturing those insights for later use or re-use.

May just be me, but that sounds like a great opportunity for topic maps.

Bringing semantics to millions of Windows developers that is.

October 22, 2012

HBase Futures

Filed under: Hadoop,HBase,Hortonworks,Semantics — Patrick Durusau @ 2:28 pm

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

HBase at Hortonworks: An Update [Features, Consumer Side?]

Filed under: Hadoop,HBase,Hortonworks — Patrick Durusau @ 3:37 am

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low-latency Hadoop use cases: as a publishing platform, HBase exposes data refined in Hadoop to outside systems; as an online column store, HBase supports the blending of random-access data reads/writes with application workloads whose data is directly accessible to Hadoop MapReduce.
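The “online column store” model the post leans on can be pictured as a row key mapping to an open-ended set of column/value pairs. The plain-Python sketch below merely stands in for HBase’s data model (the column names are invented, and real HBase stores versioned byte arrays per cell, not strings):

```python
# Toy model of an HBase-style wide row: rowkey -> {"family:qualifier": value}.
# Column names here are invented; real HBase cells are versioned byte arrays.
table = {}

def put(rowkey, column, value):
    table.setdefault(rowkey, {})[column] = value

def get(rowkey, column):
    return table.get(rowkey, {}).get(column)

# A row can grow new columns at any time -- no fixed schema.
put("user42", "info:name", "Ada")
put("user42", "clicks:2012-10-22", "17")
put("user42", "clicks:2012-10-23", "25")

assert get("user42", "info:name") == "Ada"
assert sorted(c for c in table["user42"] if c.startswith("clicks:")) == \
       ["clicks:2012-10-22", "clicks:2012-10-23"]
```

Random reads and writes hit individual cells like this, while MapReduce jobs can scan the same rows in bulk.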

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for lattes on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?
