Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 18, 2012

Bigger Than A Bread Box

Filed under: Analytics,BigData,Hortonworks — Patrick Durusau @ 10:38 am

Hortonworks & Teradata: More Than Just an Elephant in a Box by Jim Walker.

I’m not going to wake up Christmas morning to find:

a Teradata appliance.

But in case you are in the market for a big analytics hardware/software appliance, Jim writes:

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and bring big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach but it lacks integration of metadata and metadata exchange. With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.
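HCatalog's role here is essentially that of a shared table registry. As a toy illustration (all names below are invented, not the HCatalog or SQL-H API), here is why two engines reading one catalog beats each one hard-coding its own schema:

```python
# Toy illustration of why a shared metadata catalog (the role HCatalog
# plays) matters: two engines agree on a table's schema by reading one
# registry instead of each hard-coding it. All names are invented.

catalog = {}  # table name -> schema, like a miniature HCatalog

def register_table(name, schema):
    catalog[name] = schema

def read_schema(name):
    return catalog[name]

# The "Hadoop side" registers the table once...
register_table("clickstream", {"user_id": "string", "url": "string",
                               "ts": "bigint"})

# ...and an external "SQL side" can discover the columns without being
# coupled to the Hadoop environment.
print(sorted(read_schema("clickstream")))  # ['ts', 'url', 'user_id']
```

Schema changes made on one side are picked up by the other on the next read, which is the "metadata exchange" the post is describing.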

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.
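For a feel of what a pre-built segmentation function does, here is a minimal sketch in plain Python; the thresholds, field names, and segment labels are invented for illustration and are not Aster's actual function library:

```python
# Hypothetical sketch of what a pre-built segmentation function does;
# the names and thresholds are illustrative, not the Teradata Aster API.

def segment_customers(records, spend_threshold=100.0, visit_threshold=5):
    """Assign each (id, total_spend, visits) record to a coarse segment."""
    segments = {}
    for customer_id, total_spend, visits in records:
        if total_spend >= spend_threshold and visits >= visit_threshold:
            segments[customer_id] = "loyal-high-value"
        elif total_spend >= spend_threshold:
            segments[customer_id] = "high-value"
        elif visits >= visit_threshold:
            segments[customer_id] = "frequent-low-spend"
        else:
            segments[customer_id] = "occasional"
    return segments

data = [("a", 250.0, 8), ("b", 300.0, 2), ("c", 20.0, 9), ("d", 15.0, 1)]
print(segment_customers(data))
```

The appliance's value proposition is that functions of this sort come pre-built and run against data already sitting in Hadoop.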

Just as well.

Red doesn’t really go with my office decor. Runs more towards the hulking black server tower, except for the artificial pink tree in the corner. 😉

Apache Hadoop YARN Meetup at Hortonworks

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 10:36 am

Apache Hadoop YARN Meetup at Hortonworks – Recap by Vinod Kumar Vavilapalli.

Just in case you missed the Apache Hadoop YARN meetup, summaries and slides are available for:

  • Chris Riccomini’s talk on “Building Applications on YARN”
  • YARN API Discussion
  • Efforts Underway

Enjoy!

October 16, 2012

Ready to Contribute to Apache Hadoop 2.0?

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:08 am

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!

Details:

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

October 4, 2012

YARN Meetup at Hortonworks on Friday, Oct 12

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:35 pm

YARN Meetup at Hortonworks on Friday, Oct 12 by Russell Jurney.

From the post:

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open source and otherwise, such as Storm and S4, are being ported to run on YARN, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status, and its medium-term and long-term roadmap.

OK, it’s probably too late to get cheap tickets but if you are in the Bay Area on the 12th of October, take advantage of the opportunity!

And please blog about the meeting, with a note to yours truly! I will post a link to your posting.

September 23, 2012

Pig Out to Hadoop (Replay) [Restore Your Faith in Webinars]

Filed under: Hadoop,Hortonworks,Pig — Patrick Durusau @ 3:08 pm

Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)

From the description:

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.
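If bloom filters are new to you, the data structure Pig 0.10 exposes can be sketched in a few lines of Python. This is a minimal illustration of the idea (no false negatives, a small chance of false positives), not Pig's implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: probabilistic set membership.
    Added items are always found; items never added are *almost*
    always rejected (rare false positives are the trade-off)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for key in ["alice", "bob", "carol"]:
    bf.add(key)
print(bf.might_contain("alice"))    # True: added keys are always found
print(bf.might_contain("mallory"))  # almost certainly False
```

In Pig the point is to build a filter over the keys of a small relation and use it to cheaply discard non-matching records from a huge one before a join.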

I should have been watching more closely for this webinar recording to get posted.

Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.

I have suffered through several lately where the introductions took more time than the actual technical content of the webinar.

Unfortunately, that’s hard to know until you have already registered and spent the time expecting substantive content.

Is there a public tally board for webinars on search, semantics, big data, etc.?

September 12, 2012

Welcome Hortonworks Data Platform 1.1

Filed under: Flume,Hadoop,Hortonworks — Patrick Durusau @ 10:30 am

Welcome Hortonworks Data Platform 1.1 by Jim Walker.

From the post:

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

New features include high availability, capturing data streams (Flume), improved operations management and performance increases.

For the details, see the post, documentation or even download Hortonworks Data Platform 1.1 for a spin.

Unlike Odo’s Klingon days, a day with several items from Hortonworks is a good day. Enjoy!

How To Take Big Data to the Cloud [Webinar – 13 Sept 2012 – 10 AM PDT]

Filed under: BigData,Cloud Computing,Hortonworks — Patrick Durusau @ 10:17 am

How To Take Big Data to the Cloud by Lisa Sensmeier.

From the post:

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One practical and economical option is to use the cloud. The cloud can also provide extra capacity for an existing cluster or a place to test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks.

OK, it is a vendor/partner presentation but most of us work for vendors and use vendor created tools.

Yes?

The real question is whether tool X does what is necessary at a cost project Y can afford?

Whether vendor sponsored tool, service, home grown or otherwise.

Yes?

Looking forward to it!

Apache Hadoop YARN – NodeManager

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 10:06 am

Apache Hadoop YARN – NodeManager by Vinod Kumar Vavilapalli

From the post:

In the previous post, we briefly covered the internals of Apache Hadoop YARN’s ResourceManager. In this post, which is the fourth in the multi-part YARN blog series, we are going to dig deeper into the NodeManager internals and some of the key-features that NodeManager exposes. Part one, two and three are available.

Introduction

The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications.

Administration isn’t high on the “exciting” list, although without good administration, things can get very “exciting.”

NodeManager gives you the monitoring tools to help avoid the latter form of excitement.
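One of those duties, enforcing per-container resource limits, can be mimicked in a toy sketch. The class and method names below are invented for illustration and are not YARN's API:

```python
# Illustrative sketch only: a toy per-node "agent" that mimics one
# NodeManager duty, enforcing per-container memory limits. All names
# here are invented, not YARN's actual classes.

class Container:
    def __init__(self, container_id, mem_limit_mb):
        self.container_id = container_id
        self.mem_limit_mb = mem_limit_mb
        self.mem_used_mb = 0

class ToyNodeManager:
    def __init__(self):
        self.containers = {}

    def launch(self, container_id, mem_limit_mb):
        self.containers[container_id] = Container(container_id, mem_limit_mb)

    def report_usage(self, container_id, mem_used_mb):
        self.containers[container_id].mem_used_mb = mem_used_mb

    def over_limit(self):
        """Containers the real NodeManager would flag (and kill) for
        exceeding their allocated memory."""
        return [c.container_id for c in self.containers.values()
                if c.mem_used_mb > c.mem_limit_mb]

nm = ToyNodeManager()
nm.launch("container_01", mem_limit_mb=1024)
nm.launch("container_02", mem_limit_mb=2048)
nm.report_usage("container_01", 1500)  # exceeds its 1024 MB limit
nm.report_usage("container_02", 900)
print(nm.over_limit())  # ['container_01']
```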

September 6, 2012

Meet the Committer, Part One: Alan Gates

Filed under: Hadoop,Hortonworks,MapReduce,Pig — Patrick Durusau @ 7:52 pm

Meet the Committer, Part One: Alan Gates by Kim Truong.

From the post:

Series Introduction

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment, a 4-webinar series, on September 12 with “Pig out to Hadoop” with Alan Gates (twitter: @alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

My only complaint is that the interview is too short!

Looking forward to the Pig webinar!

September 5, 2012

New ‘The Future of Apache Hadoop’ Season!

Filed under: Hadoop,Hadoop YARN,Hortonworks,Zookeeper — Patrick Durusau @ 3:37 pm

OK, the real title is: Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

From the post:

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache Zookeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

Coming to a computer near you:

  • Pig Out on Hadoop (Alan Gates): Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
  • Deployment and Management of Hadoop Clusters with Ambari (Matt Foley): Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
  • Scaling Apache Zookeeper for the Next Generation of Hadoop Applications (Mahadev Konar): Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
  • YARN: The Future of Data Processing with Apache Hadoop ( Arun C. Murthy): Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET

Registration is open so get it on your calendar!

September 4, 2012

Pig Performance and Optimization Analysis

Filed under: Hortonworks,Pig — Patrick Durusau @ 3:57 pm

Pig Performance and Optimization Analysis by Li Jie.

From the post:

In this post, Hortonworks Intern Li Jie talks about his work this summer on performance analysis and optimization of Apache Pig. Li is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.

If you need to optimize Pig operations, this is a very good starting place.

Be sure to grab a copy of Running TPC-H on Pig by Li Jie, Koichi Ishida, Xuan Wang and Muzhi Zhao, with its “Six Rules of Writing Efficient Pig Scripts.”

Expect to see all of these authors in DBLP sooner rather than later.

DBLP: Shivnath Babu

August 31, 2012

Apache Hadoop YARN – ResourceManager

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 3:45 pm

Apache Hadoop YARN – ResourceManager by Arun Murthy

From the post:

This is the third post in the multi-part series to cover important aspects of the newly formed Apache Hadoop YARN sub-project. In our previous posts (part one, part two), we provided the background and an overview of Hadoop YARN, and then covered the key YARN concepts and walked you through how diverse user applications work within this new system.

In this post, we are going to delve deeper into the heart of the system – the ResourceManager.

In case your data processing needs run towards the big/large end of the spectrum.

August 30, 2012

Recap of the August Pig Hackathon at Hortonworks

Filed under: Hortonworks,Pig — Patrick Durusau @ 2:27 pm

Recap of the August Pig Hackathon at Hortonworks by Russell Jurney.

From the post:

The August Pig Hackathon brought Pig users from Hortonworks, Yahoo, Cloudera, Visa, Kaiser Permanente, and LinkedIn to Hortonworks HQ in Sunnyvale, CA to talk and work on Apache Pig.

If you weren’t at this hackathon, Russell’s summary and pointers will make you want to attend the next one!

BTW, someone needs to tell Michael Sperberg-McQueen that Pig is being used to build generic DAG structures. Don’t worry, he’ll understand.

July 28, 2012

The Coming Majority: Mainstream Adoption and Entrepreneurship [Cloud Gift Certificates?]

Filed under: Cloud Computing,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 6:22 pm

The Coming Majority: Mainstream Adoption and Entrepreneurship by James Locus.

From the post:

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential. In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate. Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation. Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility). Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.
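For readers new to the programming model being made easier here, a word count in the MapReduce style can be run locally in plain Python. The map, shuffle, and reduce phases below are conceptually the same ones a real Hadoop job distributes across a cluster:

```python
# A minimal word count in the MapReduce style, run locally in plain
# Python to show the shape of the programming model that HDP packages.
# A real Hadoop job would distribute these phases across a cluster.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (key, value) pairs: one (word, 1) per word.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Combine all values for one key into a single result.
    return (word, sum(counts))

lines = ["Hadoop makes big data small", "big data big value"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle phase: bring equal keys together
pairs.sort(key=itemgetter(0))
# Reduce phase
result = dict(reducer(k, (c for _, c in g))
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result["big"])  # 3
```

Lowering the barrier means letting developers write the mapper and reducer while the platform handles distribution, scheduling, and failure recovery.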

What kinds of opportunities lie ahead when more barriers are eliminated?

You really do need a local installation of Hadoop for experimenting.

But at the same time, having a minimal cloud account where you can whistle up some serious computing power isn’t a bad idea either.

That would make an interesting “back to school” or “for your favorite geek” holiday present: a “gift certificate” for so many hours/cycles a month on a cloud platform.

BTW, what projects would you undertake if barriers of access and capacity were diminished if not removed?

July 25, 2012

Thinking about the HDFS vs. Other Storage Technologies

Filed under: Hadoop,HDFS,Hortonworks — Patrick Durusau @ 3:11 pm

Thinking about the HDFS vs. Other Storage Technologies by Eric Baldeschwieler.

Just to whet your interest (see Eric’s post for the details):

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS. Here is why…

To compare HDFS to other technologies one must first ask the question, what is HDFS good at:

  • Extreme low cost per byte….
  • Very high bandwidth to support MapReduce workloads….
  • Rock solid data reliability….
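The “extreme low cost per byte” claim is easy to sanity-check with back-of-envelope arithmetic. The dollar figures below are illustrative assumptions, not numbers from Eric's post:

```python
# Back-of-envelope arithmetic for "low cost per byte" under HDFS-style
# 3x replication. The hardware price and disk sizes are assumptions
# chosen for illustration, not figures from the post.

replication = 3            # default HDFS block replication factor
raw_tb_per_node = 12       # e.g. 12 x 1 TB commodity disks per node
cost_per_node = 4000       # assumed commodity server price, USD

# Replication divides raw capacity into usable capacity.
usable_tb_per_node = raw_tb_per_node / replication
cost_per_usable_tb = cost_per_node / usable_tb_per_node
print(cost_per_usable_tb)  # 1000.0 USD per usable TB
```

Even after paying the 3x replication tax for reliability, commodity hardware keeps the per-byte cost far below traditional SAN/NAS storage, which is the heart of the argument.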

A lively storage competition is a good thing.

A good opportunity to experiment with different storage strategies.

July 16, 2012

Happy Birthday Hortonworks!

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:04 pm

Happy Birthday Hortonworks! by Eric Baldeschwieler.

From the post:

Last week was an important milestone for Hortonworks: our one year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Read the post in full to see Eric’s take on:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. …

We are “committed to open source” and commit that “all core code will remain open source”. …

We will “make Apache Hadoop easier to install, manage and use”. …

We will “make Apache Hadoop more robust”. …

We will “make Apache Hadoop easier to integrate and extend”. …

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. …

This has been a banner year for Hortonworks, the Hadoop ecosystem and everyone concerned with this rapidly developing area!

We are looking forward to the next year being more of same, except more so!

June 28, 2012

Data Integration Services & Hortonworks Data Platform

Filed under: Data Integration,HCatalog,Hortonworks,Pig,Talend — Patrick Durusau @ 6:30 pm

Data Integration Services & Hortonworks Data Platform by Jim Walker

From the post:

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target?Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Jim covers four advantages of using Talend:

  • Bridge the skills gap
  • HCatalog Integration
  • Connect to the entire enterprise
  • Graphic Pig Script Creation

Definitely something to keep in mind.
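Returning to the Sqoop option Jim mentions: a bulk load is a single, repeatable command rather than a one-off script. Here is a small sketch that assembles such a command; the connection details and paths are placeholders:

```python
# Sketch of the repeatable bulk load Sqoop gives you instead of custom
# scripts. This only builds the command line; the JDBC URL, table and
# HDFS path are placeholder values for illustration.

def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Assemble a `sqoop import` command as an argument list."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

cmd = sqoop_import_cmd(
    "jdbc:mysql://dbhost/sales", "orders", "/user/etl/orders")
print(" ".join(cmd))
```

Tools like Talend generate and schedule this kind of job graphically, which is exactly the abstraction of integration complexity the post is after.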

June 12, 2012

Introducing Hortonworks Data Platform v1.0

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 3:27 pm

Introducing Hortonworks Data Platform v1.0

John Kreisa writes:

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore, Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering into, and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step forward for achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Hortonworks Data Platform was created to make it easier for organizations and solution providers to install, integrate, manage and use Apache Hadoop. It includes the latest stable versions of the essential Hadoop components in an integrated and tested package. Here is a diagram that shows the Apache Hadoop components included in Hortonworks Data Platform:

And I thought this was going to be a slow news week. 😉

Excellent news!
