Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 12, 2013

Rapid hadoop development with progressive testing

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:45 pm

Rapid hadoop development with progressive testing by Abe Gong.

From the post:

Debugging Hadoop jobs can be a huge pain. The cycle time is slow, and error messages are often uninformative — especially if you’re using Hadoop streaming, or working on EMR.

I once found myself trying to debug a job that took a full six hours to fail. It took more than a week — a whole week! — to find and fix the problem. Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity. It was a Very Bad Week.

[Image: crushed by elephant]

Painful experiences like this have taught me to follow a test-driven approach to hadoop development. Whenever I’m working on a new hadoop-based data pipe, my goal is to isolate six distinct kinds of problems that arise in hadoop development.

(…)

See Abe’s post for the six steps and suggestions for how to do them.

Reformatted a bit with local tool preferences, Abe’s list will make a nice quick reference for Hadoop development.
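
Abe works in Hadoop streaming, but the same "test small before you test big" idea applies to Java jobs. As a hedged illustration (my example, not Abe's: it assumes the Apache MRUnit library and a hypothetical word-count mapper), a local unit test like this surfaces mapper logic errors in seconds rather than after a six-hour cluster run:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

  // Hypothetical mapper under test: splits a line into words, emits (word, 1).
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), ONE);
        }
      }
    }
  }

  @Test
  public void emitsOneCountPerWord() throws IOException {
    // MRUnit runs the mapper in-process; no cluster, no HDFS, no waiting.
    MapDriver.newMapDriver(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("hello hadoop hello"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("hello"), new IntWritable(1))
        .runTest();
  }
}
```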

Introducing Morphlines:…

Filed under: Cloudera,ETL,Hadoop,Morphlines — Patrick Durusau @ 3:07 pm

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop by Wolfgang Hoschek.

From the post:

Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.

Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.

The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.

Since the launch of Cloudera Search, Morphlines development has graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The CDK is hosted on GitHub and encourages involvement by the community.

(…)
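
To make the "embeddable in any Java codebase" claim concrete, here is a minimal sketch of driving a morphline from Java. Treat it as an assumption-laden illustration rather than gospel: the package names are the CDK-era ones (com.cloudera.cdk.morphline.*), the morphline.conf file and morphline1 id are invented, and the final child command is left null for brevity.

```java
import java.io.File;
import com.cloudera.cdk.morphline.api.Command;
import com.cloudera.cdk.morphline.api.MorphlineContext;
import com.cloudera.cdk.morphline.api.Record;
import com.cloudera.cdk.morphline.base.Compiler;
import com.cloudera.cdk.morphline.base.Fields;
import com.cloudera.cdk.morphline.base.Notifications;

public class MorphlineEmbeddingSketch {
  public static void main(String[] args) throws Exception {
    // Compile the HOCON config into an in-memory chain of commands.
    // "morphline.conf" and "morphline1" are hypothetical names.
    MorphlineContext context = new MorphlineContext.Builder().build();
    Command morphline = new Compiler().compile(
        new File("morphline.conf"), "morphline1", context, null);

    // A record is just named values, optionally with a blob attachment.
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, "id,name\n1,Alice".getBytes("UTF-8"));

    // Push the record through the transformation chain.
    Notifications.notifyStartSession(morphline);
    if (!morphline.process(record)) {
      System.err.println("Morphline failed to process record: " + record);
    }
  }
}
```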

The sidebar promises: Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.

Sounds great!

But how do I search one or more morphlines for the semantics of the records/fields that are being processed or the semantics of that processing?

If I want to save “cost and effort,” shouldn’t I be able to search for existing morphlines that have transformed particular records/fields?

True, morphlines have “#” comments but that seems like a poor way to document transformations.

How would you test for field documentation?

Or make sure transformations of particular fields always use the same semantics?

Ponder those questions while you are reading:

Cloudera Morphlines Reference Guide

and,

Syntax – HOCON github page.

If we don’t capture semantics at the point of authoring, subsequent searches are mechanized guessing.

Hadoop Summit 2013

Filed under: Hadoop,MapReduce — Patrick Durusau @ 8:55 am

Hadoop Summit 2013

Videos and slides from Hadoop Summit 2013!

Forty-two (42) presentations on day one and forty-one (41) on day two.

Just this week I got news that ISO is hunting down “rogue” copies of ISO standards, written by volunteers, that aren’t behind its paywall.

While others, like the presenters at the Hadoop Summit 2013, are sharing their knowledge in hopes of creating more knowledge.

Which group do you think will be relevant in a technology driven future?

July 9, 2013

How To Unlock Business Value from your Big Data with Hadoop

Filed under: BigData,Hadoop,Marketing,Topic Maps — Patrick Durusau @ 6:36 pm

How To Unlock Business Value from your Big Data with Hadoop by Jim Walker.

From the post:

By now, you’re probably well aware of what Hadoop does: low-cost processing of huge amounts of data. But more importantly, what can Hadoop do for you?

We work with many customers across many industries with many different specific data challenges, but in talking to so many customers, we are also able to see patterns emerge on certain types of data and the value that could bring to a business.

We love to share these kinds of insights, so we built a series of video tutorials covering some of those scenarios:

The tutorials cover social media, server logs, clickstream data, geolocation data, and others.

This is a brilliant marketing move.

Hadoop may be the greatest invention since sliced bread but if it isn’t shown to help you, what good is it?

These tutorials answer that question for several different areas of potential customer interest.

We should do something very similar for topic maps.

Something that focuses on a known need or interest of customers.

The Blur Project: Marrying Hadoop with Lucene

Filed under: Hadoop,Lucene — Patrick Durusau @ 3:40 pm

The Blur Project: Marrying Hadoop with Lucene by Aaron McCurry.

From the post:

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

(…)

Blur was initially released on Github as an Apache Licensed project and was then accepted into the Apache Incubator project in July 2012, with Patrick Hunt as its champion. Since then, Blur as a software project has matured and become much more stable. One of the major milestones over the past year has been the upgrade to Lucene 4, which has brought many new features and massive performance gains.

Recently there has been some interest in folding some of Blur’s code (HDFSDirectory and BlockCache) back into the Lucene project for others to utilize. This is an exciting development that legitimizes some of the approaches that we have taken to date. We are in conversations with some members of the Lucene community, such as Mark Miller, to figure out how we can best work together to benefit both the fledgling Blur project as well as the much larger and more well known/used Lucene project.

Blur’s community is small but growing. Our project goals are to continue to grow our community and graduate from the Incubator project. Our technical goals are to continue to add features that perform well at scale while maintaining the fault tolerance that is required of any modern distributed system.

We welcome your contributions at http://incubator.apache.org/blur/!

Another exciting Apache project that needs contributors!

Friend Recommendations using MapReduce

Filed under: Hadoop,MapReduce,Recommendation — Patrick Durusau @ 3:26 pm

Friend Recommendations using MapReduce by John Berryman.

From the post:

So Jonathan, one of our interns this summer, asked an interesting question today about MapReduce. He said, “Let’s say you download the entire data set of who’s following who from Twitter. Can you use MapReduce to make recommendations about who any particular individual should follow?” And as Jonathan’s mentor this summer, and as one of the OpenSource Connections MapReduce experts I dutifully said, “uuuhhhhh…”

And then in a stroke of genius … I found a way to stall for time. “Well, young Padawan,” I said to Jonathan, “first you must more precisely define your problem… and only then will the answer be revealed to you.” And then darn it if he didn’t ask me what I meant! Left with no viable alternatives, I squeezed my brain real hard, and this is what came out:

This is a post to work through carefully while waiting for the second post to drop!

Particularly the custom partitioning, grouping and sorting in MapReduce.
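
To make the question concrete while you wait, here is a hedged sketch of the usual first cut at the problem: the classic "friends of friends" counting job. It is not John's solution (his post goes further, into custom partitioning, grouping, and sorting), just the naive two-stage approach, with the second aggregation stage left as a comment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FollowRecommendationSketch {

  // Input lines look like "follower<TAB>followee"; group edges by follower.
  public static class EdgeMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  // Every pair of accounts followed by the same person shares that person
  // as a "common friend"; emit each ordered pair with a count of 1.
  public static class CandidateReducer
      extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void reduce(Text follower, Iterable<Text> followees, Context context)
        throws IOException, InterruptedException {
      List<String> f = new ArrayList<String>();
      for (Text t : followees) {
        f.add(t.toString());
      }
      for (String a : f) {
        for (String b : f) {
          if (!a.equals(b)) {
            context.write(new Text(a + "\t" + b), ONE);
          }
        }
      }
    }
  }

  // A second job would sum the counts per (a, b) pair, drop pairs that
  // already follow each other, and keep the top-N counts per user as
  // the recommendations.
}
```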

June 27, 2013

Trying to get the coding Pig, er – monkey off your back?

Filed under: BigData,Hadoop,Mahout,MapReduce,Pig,Talend — Patrick Durusau @ 3:00 pm

Trying to get the coding Pig, er – monkey off your back?

From the webpage:

Are you struggling with the basic ‘WordCount’ demo, or which Mahout algorithm you should be using? Forget hand-coding and see what you can do with Talend Studio.

In this on-demand webinar we demonstrate how you could become MUCH more productive with Hadoop and NoSQL. Talend Big Data allows you to develop in Eclipse and run your data jobs 100% natively on Hadoop… and become a big data guru overnight. Rémy Dubois, big data specialist and Talend Lead developer, shows you in real time:

  • How to visually create the ‘WordCount’ example in under 5 minutes
  • How to graphically build a big data job to perform sentiment analysis
  • How to archive NoSQL and optimize data warehouse usage

A content-filled webinar! Who knew?

Be forewarned that the demos presume familiarity with the Talend interface and the demo presenter is difficult to understand.

From what I got out of the earlier parts of the webinar, very much a step in the right direction to empower users with big data.

Think of the distance between stacks of punch cards (Hadoop/MapReduce a few years ago) and the personal computer (Talend and others).

That was a big shift. This one is likely to be as well.

Looks like I need to spend some serious time with the latest Talend release!

What a Great Year for Hue Users! [Semantics?]

Filed under: Hadoop,Hue — Patrick Durusau @ 1:34 pm

What a Great Year for Hue Users! by Eva Andreasson.

From the post:

With the recent release of CDH 4.3, which contains Hue 2.3, I’d like to report on the fantastic progress of Hue in the past year.

For those who are unfamiliar with it, Hue is a very popular, end-user focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners — as well as a valuable tool for seasoned Hadoop users (and users generally in an enterprise environment) – and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.

We have reached where we are today – 1,000+ commits later – thanks to the talented Cloudera Hue team (special kudos needed to Romain, Enrico, and Abe) and our customers and users in the community. Therefore it is time to celebrate with a classy new logo and community website at gethue.com!

See Eva’s post for her reflections but I have to say, I do like the new logo:

[Image: the new Hue logo]

If Hue has the capability to document the semantics of structures or data, I have overlooked it.

Seems like a golden area for a topic map contribution.

June 26, 2013

Better Content on Memory Stick?

Filed under: Hadoop,Hortonworks,Marketing — Patrick Durusau @ 9:40 am

Sandbox on Memory Stick (pic)

There was talk over at LinkedIn about marketing for topic maps.

Here’s an idea.

No mention of topic maps on the outside, but without an install, configuring paths, etc., the user gets a topic map engine plus content.

Topical content for the forum where the sticks are being distributed.

Plug and compare results to your favorite search sewer.

Limited range of data.

But if I am supposed to be searching SEC mandated financial reports and related data, not being able to access Latvian lingerie ads is probably ok. With management at least.*

I first saw this in a tweet by shaunconnolly.

Suggestions for content?


* Just an aside, but curated content could not only provide better search results but also eliminate results that may distract staff from the task at hand.

Better than filters, etc. Other content would simply not be an option.

June 25, 2013

Hadoop for Everyone: Inside Cloudera Search

Filed under: Cloudera,Hadoop,Search Engines,Searching — Patrick Durusau @ 12:26 pm

Hadoop for Everyone: Inside Cloudera Search by Eva Andreasson.

From the post:

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

You have heard the buzz about Cloudera Search, now get a quick list of facts and pointers to more resources!

The most significant fact?

Cloudera Search uses Apache Solr.

If you are looking for search capabilities, what more need I say?
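
Because it is ordinary Solr underneath, the standard SolrJ client should work against a Cloudera Search collection just as it does against any other Solr 4.x deployment. A hedged sketch follows; the host, port, collection name, and field names are assumptions for illustration.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ClouderaSearchQuerySketch {
  public static void main(String[] args) throws Exception {
    // Assumed Solr endpoint exposed by Cloudera Search; adjust for your cluster.
    HttpSolrServer solr =
        new HttpSolrServer("http://search-node.example.com:8983/solr/collection1");

    // Keyword query over an assumed "body" field, ten results at a time.
    SolrQuery query = new SolrQuery("body:hadoop");
    query.setRows(10);

    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("title"));
    }
  }
}
```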

June 23, 2013

How to Contribute to HBase and Hadoop 2.0

Filed under: Hadoop,HBase — Patrick Durusau @ 1:58 pm

How to Contribute to HBase and Hadoop 2.0 by Nick Dimiduk.

From the post:

In case you haven’t heard, Hadoop 2.0 is on the way! There are loads more new features than I can begin to enumerate, including lots of interesting enhancements to HDFS for online applications like HBase. One of the most anticipated new features is YARN, an entirely new way to think about deploying applications across your Hadoop cluster. It’s easy to think of YARN as the infrastructure necessary to turn Hadoop into a cloud-like runtime for deploying and scaling data-centric applications. Early examples of such applications are rare, but two noteworthy examples are Knitting Boar and Storm on YARN. Hadoop 2.0 will also ship a MapReduce implementation built on top of YARN that is binary compatible with applications written for MapReduce on Hadoop-1.x.

The HBase project is raring to get onto this new platform as well. Hadoop2 will be a fully supported deployment environment for the HBase 0.96 release. There are still lots of bugs to squish and the build lights aren’t green yet. That’s where you come in!

To really “know” software you can:

  • Teach it.
  • Write (good) documentation about it.
  • Squash bugs.

Nick is inviting you to squash bugs for HBase and Hadoop 2.0.

Memories of sun-drenched debauchery will fade.

Being a contributor to an Apache project over the summer won’t.

June 22, 2013

The New Search App in Hue 2.4

Filed under: Hadoop,Hue,Interface Research/Design,Solr,UX — Patrick Durusau @ 3:59 pm

The New Search App in Hue 2.4

From the post:

In version 2.4 of Hue, the open source Web UI that makes Apache Hadoop easier to use, a new app was added in addition to more than 150 fixes: Search!

Using this app, which is based on Apache Solr, you can now search across Hadoop data just like you would do keyword searches with Google or Yahoo! In addition, a wizard lets you tweak the result snippets and tailors the search experience to your needs.

The new Hue Search app uses the regular Solr API underneath the hood, yet adds a remarkable list of UI features that makes using search over data stored in Hadoop a breeze. It integrates with the other Hue apps like File Browser for looking at the index file in a few clicks.

Here’s a video demoing queries and results customization. The demo is based on Twitter Streaming data collected with Apache Flume and indexed in real time:

Even allowing for the familiarity of the presenter with the app, this is impressive!

More features are reported to be on the way!

Definitely sets a higher bar for search UIs.

AWS: Your Next Utility Bill?

Filed under: Amazon Web Services AWS,Hadoop,MapReduce — Patrick Durusau @ 3:08 pm

Netflix open sources its Hadoop manager for AWS by Derrick Harris.

From the post:

Netflix runs a lot of Hadoop jobs on the Amazon Web Services cloud computing platform, and on Friday the video-streaming leader open sourced its software to make running those jobs as easy as possible. Called Genie, it’s a RESTful API that makes it easy for developers to launch new MapReduce, Hive and Pig jobs and to monitor longer-running jobs on transient cloud resources.

In the blog post detailing Genie, Netflix’s Sriram Krishnan makes clear a lot more about what Genie is and is not. Essentially, Genie is a platform as a service running on top of Amazon’s Elastic MapReduce Hadoop service. It’s part of a larger suite of tools that handles everything from diagnostics to service registration.

It is not a cluster manager or workflow scheduler for building ETL processes (e.g., processing unstructured data from a web source, adding structure and loading into a relational database system). Netflix uses a product called UC4 for the latter, but it built the other components of the Genie system.

It’s not very futuristic to say that AWS (or something very close to it) will be your next utility bill.

Like paying for water, gas, cable, electricity, it will be an auto-pay setup on your bank account.

What will you say when clients ask if the service you are building for them is hosted on AWS?

Are you going to say your servers are more reliable? That you don’t “trust” Amazon?

Both of which may be true but how will you make that case?

Without sounding like you are selling something the client doesn’t need?

As the price of cloud computing drops, those questions are going to become common.

June 20, 2013

Glimmer

Filed under: Hadoop,RDF — Patrick Durusau @ 6:36 pm

Glimmer: An RDF Search Engine

New RDF search engine from Yahoo, built on Hadoop (0.23) and MG4J.

I first saw this in a tweet by Yves Raimond.

The best part being pointed to the MG4J project, which I haven’t looked at in a year or more.

More news on that tomorrow!

June 17, 2013

I Mapreduced a Neo store [Legacy of CPU Shortages?]

Filed under: Graphs,Hadoop,MapReduce,Neo4j — Patrick Durusau @ 8:40 am

I Mapreduced a Neo store by Kris Geusebroek.

From the post:

Lately I’ve been busy talking at conferences to tell people about our way to create large Neo4j databases. Large means some tens of millions of nodes and hundreds of millions of relationships and billions of properties.

Although the technical description is already on the Xebia blog part 1 and part 2, I would like to give a more functional view on what we did and why we started doing it in the first place.

Our use case consisted of exploring our data to find interesting patterns. The data we want to explore is about financial transactions between people, so the Neo4j graph model is a good fit for us. Because we don’t know upfront what we are looking for we need to create a Neo4j database with some parts of the data and explore that. When there is nothing interesting to find we go enhance our data to contain new information and possibly new connections and create a new Neo4j database with the extra information.

This means it’s not about a one-time load of the current data and keeping that up to date by adding some more nodes and edges. It’s really about building a new database from the ground up every time we think of some new way to look at the data.

Deeply interesting work, particularly for its investigation of the internal file structure of Neo4j.

Curious about the

…building a new database from the ground up every time we think of some new way to look at the data.

To what extent are static database structures a legacy of a shortage of CPU cycles?

With limited CPU cycles, it was necessary to create a static structure, against which query languages could be developed and optimized (again because of a shortage of CPU cycles), and the persisted data structure avoided the overhead of rebuilding the data structure for each user.

It may be that cellphones and tablets need the convenience of static data structures or at least representations of static data structures.

But what of server farms populated by TBs of 3D memory?

Isn’t it time to start thinking beyond the limitations imposed by decades of CPU cycle shortages?

June 15, 2013

Hortonworks Sandbox (1.3): Stinger, Visualizations and Virtualization

Filed under: BigData,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:13 pm

Hortonworks Sandbox: Stinger, Visualizations and Virtualization by Cheryle Custer.

From the post:

A couple of weeks ago, we released several new Hadoop tutorials showcasing real-life use cases and you can read about them here. Today, we’re delighted to bring to you the newest release of the Hortonworks Sandbox 1.3. The Hortonworks Sandbox allows you to go from Zero to Big Data in 15 Minutes through step-by-step hands-on Hadoop tutorials. The Sandbox is a fully functional single node personal Hadoop environment, where you can add your own data sets, validate your Hadoop use cases and build a small proof-of-concept.

Update of your favorite way to explore Hadoop!

Get the sandbox here.

June 11, 2013

Kiji

Filed under: Hadoop,HBase,KIji Project — Patrick Durusau @ 8:40 am

What’s Next for HBase? Big Data Applications Using Frameworks Like Kiji by Michael Stack.

From the post:

Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the nice application model your Big Data Application builder designed and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.

Kiji provides a model and a set of libraries that allow developers to get up and running quickly. Intuitive Java APIs and Kiji’s rich data model allow developers to build business logic and machine learning algorithms without having to worry about bytes, serialization, schema evolution, and lower-level aspects of the system. The Kiji framework is modularized into separate components to support a wide range of usage and encourage clean separation of functionality. Kiji’s main components include KijiSchema, KijiMR, KijiHive, KijiExpress, KijiREST, and KijiScoring. KijiSchema, for example, helps team members collaborate on long-lived Big Data management projects, and does away with common incompatibility issues, and helps developers build more integrated systems across the board. All of these components are available in a single download called a BentoBox.

When mainstream news only has political scandals, wars and rumors of wars, tech news can brighten your day!

Be sure to visit the Kiji Project website.

Turn-key tutorials to get you started.

June 5, 2013

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

Filed under: Cloudera,Hadoop,Lucene,Solr — Patrick Durusau @ 2:41 pm

Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.

From the post:

One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:

(…)

Did you catch the line:

Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

Does that make you feel better about scale issues?

Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.

A serious step up in capabilities.

June 2, 2013

Hadoop REST API – WebHDFS

Filed under: Hadoop,HDFS — Patrick Durusau @ 6:57 pm

Hadoop REST API – WebHDFS by Istvan Szegedi.

From the post:

Hadoop provides a Java native API to support file system operations such as create, rename or delete files and directories, open, read or write files, set permissions, etc. A very basic example can be found on Apache wiki about how to read and write files from Hadoop.

This is great for applications running within the Hadoop cluster but there may be use cases where an external application needs to manipulate HDFS like it needs to create directories and write files to that directory or read the content of a file stored on HDFS. Hortonworks developed an additional API to support these requirements based on standard REST functionalities.

Something to add to your Hadoop toolbelt.
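
For a feel of the API, here is a minimal sketch of two WebHDFS calls from plain Java. The NameNode host, port, path, and user.name value are assumptions; only the URL pattern (/webhdfs/v1/<path>?op=...) and the LISTSTATUS and OPEN operations come from the WebHDFS documentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsSketch {
  // Assumed NameNode HTTP endpoint; adjust host, port, path, and user.
  private static final String BASE = "http://namenode.example.com:50070/webhdfs/v1";

  public static void main(String[] args) throws Exception {
    // List a directory: GET /webhdfs/v1/<path>?op=LISTSTATUS
    System.out.println(get(BASE + "/user/demo?op=LISTSTATUS&user.name=demo"));

    // Read a file: GET /webhdfs/v1/<path>?op=OPEN
    // The NameNode replies with a redirect to a DataNode, which
    // HttpURLConnection follows automatically for GET requests.
    System.out.println(get(BASE + "/user/demo/sample.txt?op=OPEN&user.name=demo"));
  }

  private static String get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("GET");
    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    StringBuilder body = new StringBuilder();
    for (String line; (line = in.readLine()) != null; ) {
      body.append(line).append('\n');
    }
    in.close();
    return body.toString();
  }
}
```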

Take DMX-h ETL Pre-Release for a Test Drive!

Filed under: ETL,Hadoop,Integration — Patrick Durusau @ 6:49 pm

Take DMX-h ETL Pre-Release for a Test Drive! by Keith Kohl.

From the post:

Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition. Several Blog posts last week included why I thought the announcement was cool and also some Hadoop benchmarks on both TeraSort and also running ETL.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive. The test drive is a trial download of our DMX-h ETL software. We have installed our software on our partner Cloudera’s VM (VMware) image complete with the use case accelerators, sample data, documentation and even videos. While the download is a little large — ok, it’s over 3GB — it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).

Cool!

Then Keith asks later in the post:

The test drive is not your normal download. This is actually a pre-release of our DMX-h ETL product offering. While we have announced our product, it is not generally available (GA) yet…scheduled for end of June. We are offering a download of a product that isn’t even available yet…how many vendors do that?!

Err, lots of them? It’s called a beta/candidate/etc. release?

😉

Marketing quibbles aside, it does sound quite interesting.

In some ways I would like to see the VM release model become more common.

Test driving software should not be a install/configuration learning experience.

That should come after users are interested in the software.

BTW, interesting approach, at least reading the webpages/documentation.

Doesn’t generate code for conversion/ETL so there is no code to maintain. Written against the DMX-h engine.

Need to think about what that means in terms of documenting semantics.

Or reconciling different ETL approaches in the same enterprise.

More to follow.

May 30, 2013

Hadoop Tutorials: Real Life Use Cases in the Sandbox

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:56 pm

Hadoop Tutorials: Real Life Use Cases in the Sandbox by Cheryle Custer.

Six (6) new tutorials from Hortonworks:

  • Tutorial 6 – Loading Data into the Hortonworks Sandbox
  • Tutorials 7 & 11 – Installing the ODBC Driver in the Hortonworks Sandbox (Windows and Mac)
  • Tutorials 8 & 9 – Accessing and Analyzing Data in Excel
  • Tutorial 10 – Visualizing Clickstream Data

You have done the first five (5).

Yes?

Hortonworks Data Platform 1.3 Release

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:51 pm

Hortonworks Data Platform 1.3 Release: The community continues to power innovation in Hadoop by Jeff Sposetti.

From the post:

HDP 1.3 release delivers on community-driven innovation in Hadoop with SQL-IN-Hadoop, and continued ease of enterprise integration and business continuity features.

Almost one year ago (50 weeks to be exact) we released Hortonworks Data Platform 1.0, the first 100% open source Hadoop platform into the marketplace. The past year has been dynamic to say the least! However, one thing has remained constant: the steady, predictable cadence of HDP releases. In September 2012 we released 1.1, this February gave us 1.2 and today we’re delighted to release HDP 1.3.

HDP 1.3 represents yet another significant step forward and allows customers to harness the latest innovation around Apache Hadoop and its related projects in the open source community. In addition to providing a tested, integrated distribution of these projects, HDP 1.3 includes a primary focus on enhancements to Apache Hive, the de-facto standard for SQL access in Hadoop as well as numerous improvements that simplify ease of use.

Whatever the magic dust is for a successful open source project, the Hadoop community has it in abundance.

May 25, 2013

Apache Pig Editor in Hue 2.3

Filed under: Cloudera,Hadoop,Hue,Pig — Patrick Durusau @ 1:38 pm

Apache Pig Editor in Hue 2.3

From the post:

In the previous installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how to analyze data with Hue using Apache Hive via Hue’s Beeswax and Catalog applications. In this installment, we’ll focus on using the new editor for Apache Pig in Hue 2.3.

Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:

  • UDFs and parameters (with default value) support
  • Autocompletion of Pig keywords, aliases, and HDFS paths
  • Syntax highlighting
  • One-click script submission
  • Progress, result, and logs display
  • Interactive single-page application

Here’s a short video demoing its capabilities and ease of use:

(…)

How are you editing your Pig scripts now?

How are you documenting the semantics of your Pig scripts?

How do you search across your Pig scripts?

May 21, 2013

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA!

Filed under: Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 4:54 pm

Hadoop, Hadoop, Hurrah! HDP for Windows is Now GA! by John Kreisa.

From the post:

Today we are very excited to announce that Hortonworks Data Platform for Windows (HDP for Windows) is now generally available and ready to support the most demanding production workloads.

We have been blown away with the number and size of organizations who have downloaded the beta bits of this 100% open source, and native to Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.

With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.

Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.

Two lessons here:

First, Hadoop is a very popular way to address enterprise big data.

Second, going where users are, not where they ought to be, is a smart business move.

May 18, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered

Filed under: Hadoop,Hive,SQL,STINGER — Patrick Durusau @ 3:47 pm

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.

[Image: Stinger]

Welcome news for the Hive and SQL communities alike!

May 17, 2013

Hadoop SDK and Tutorials for Microsoft .NET Developers

Filed under: .Net,Hadoop,MapReduce,Microsoft — Patrick Durusau @ 3:39 pm

Hadoop SDK and Tutorials for Microsoft .NET Developers by Marc Holmes.

From the post:

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

Hadoop is coming to an office/data center near you.

Will you be ready?

Hadoop Toolbox: When to Use What

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:39 pm

Hadoop Toolbox: When to Use What by Mohammad Tariq.

From the post:

Eight years ago not even Doug Cutting would have thought that the tool which he’s naming after his kid’s soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonyms to each other. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a powerpacked combo.

Having said that, one must be careful while choosing these tools for their specific use case as one size doesn’t fit all. What is working for someone might not be that productive for you. So, here I will show you which tool should be picked in which scenario. It’s not a big comparative study but a short intro to some very useful tools. And, this is based totally on my experience so there is always some scope of suggestions. Please feel free to comment or suggest if you have any. I would love to hear from you. Let’s get started :

Not shallow enough to be useful for the c-suite types, not deep enough for decision making.

Nice to use in a survey context, where users need an overview of the Hadoop ecosystem.

May 16, 2013

How-to: Configure Eclipse for Hadoop Contributions

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

May 10, 2013

Graph processing platform Apache Giraph reaches 1.0

Filed under: Bulk Synchronous Parallel (BSP),Giraph,Graphs,Hadoop,Pregel — Patrick Durusau @ 4:23 am

Graph processing platform Apache Giraph reaches 1.0

From the post:

Used by Facebook and Yahoo, the Apache Giraph project for distributed graph processing has released version 1.0. This is the first new version since the project left incubation and became a top-level project in May 2012, though for some reason it has yet to make it to the Apache index of top level projects.

Giraph allows social graphs and other richly interconnected data structures with many billions of edges to be analysed using hundreds of machines. It is inspired by the Bulk Synchronous Parallel abstract computer model and the Google Pregel system for large scale graph-processing. The developers of Giraph say that unlike those systems, Giraph is an open source, scalable platform built atop of the Apache Hadoop infrastructure which has no single point of failure by design. The documentation includes an introduction to Giraph’s iterative graph processing and how to implement graph processing functions in Java. The Giraph project has seen contributions from Yahoo!, Twitter, Facebook and LinkedIn and from academic institutions around the world.
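
The “graph processing functions in Java” mentioned in the documentation are Pregel-style compute methods that run once per vertex per superstep. Below is a hedged sketch of the canonical single-source shortest paths example; it follows the shape of the Giraph examples, but package locations and the exact base class shifted between Giraph releases, so check the 1.0 API before treating it as a drop-in job.

```java
// Pregel-style single-source shortest paths, modeled on the Giraph examples.
// Imports and the Vertex base class are approximations; they moved between
// Giraph releases.
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsVertex
    extends Vertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L; // assumed source vertex id

  @Override
  public void compute(Iterable<DoubleWritable> messages) {
    if (getSuperstep() == 0) {
      setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // The best distance seen so far is the minimum over incoming messages.
    double minDist = (getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If a shorter path was found, record it and tell the neighbors.
    if (minDist < getValue().get()) {
      setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    // Sleep until woken by a new message; the job ends when every vertex
    // has voted to halt and no messages are in flight.
    voteToHalt();
  }
}
```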

It’s a little early to be downloading software for the weekend but why not? 😉

Enjoy!

May 8, 2013

Spatially Visualize and Analyze Vast Data Stores…

Filed under: GIS,Graphics,Hadoop,Visualization — Patrick Durusau @ 2:39 pm

Spatially Visualize and Analyze Vast Data Stores with Esri’s GIS Tools for Hadoop

From the post:

Perhaps the greatest untapped IT resource available today is the ability to spatially analyze and visualize Big Data. As part of its continuing effort to expand the use of geographic information system (GIS) technology among web, mobile, and other developers, Esri has launched GIS Tools for Hadoop. The toolkit removes the obstacles of building map applications for developers to truly capitalize on geoenabling Big Data within Hadoop—the popular open source data management framework. Developers now will be able to answer the where questions in their large data stores.

“Hadoop’s method of processing volumes of information directly addresses the most significant challenge facing IT today,” says Marwa Mabrouk, product manager at Esri. “Enabling Hadoop with spatial capabilities is part of Esri’s continued effort to derive more value from Big Data through spatial analysis.”

Processing and displaying Big Data on maps requires functionality that core Hadoop lacks. GIS Tools for Hadoop extends the Hadoop platform with a series of libraries and utilities that connect Esri ArcGIS to the Hadoop environment. It allows ArcGIS users to export map data in HDFS format—Hadoop’s native file system—and intersect it with billions of records stored in Hadoop. Results can be either directly saved to the Hadoop database or reimported back to ArcGIS for higher-level geoprocessing and visualization.

GIS Tools for Hadoop includes the following:

  • Sample tools and templates that demonstrate the power of GIS
  • Spatial querying inside Hadoop using Hive—Hadoop’s ad hoc querying module
  • Geometry Library to build spatial applications in Hadoop

“GIS Tools for Hadoop not only introduces spatial analysis to Hadoop but creates a looping workflow that pulls Big Data into the ArcGIS environment,” says Mansour Raad, senior software architect at Esri. “It provides tools for Hadoop users who need to visualize Big Data on maps.”

Esri recognizes Big Data as a challenge that community-level involvement can help solve. As such, Esri provides GIS Tools for Hadoop as an open source product available on GitHub. Esri encourages users to download the toolkit, report issues, and actively contribute to improving the tools through the GitHub system.

To download GIS Tools for Hadoop, visit http://esri.github.com/gis-tools-for-hadoop.

Once you have where, your topic map can merge in who, what, why and how.
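
The Geometry Library in the list above is the Esri Geometry API for Java, which the Hive UDFs and any custom MapReduce code build on. Here is a hedged point-in-polygon sketch of the kind of test the ST_Contains UDF performs; the coordinates and names are invented, and the method names are from the geometry-api-java project as I recall them, so verify against the GitHub repository before relying on them.

```java
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.Polygon;
import com.esri.core.geometry.SpatialReference;

public class PointInPolygonSketch {
  public static void main(String[] args) {
    // WGS84 lat/lon coordinates (WKID 4326).
    SpatialReference wgs84 = SpatialReference.create(4326);

    // A rough bounding polygon around a hypothetical region of interest.
    Polygon region = new Polygon();
    region.startPath(-84.5, 33.6);
    region.lineTo(-84.5, 34.0);
    region.lineTo(-84.2, 34.0);
    region.lineTo(-84.2, 33.6);

    // A record's geolocation, e.g. parsed from a log line in a mapper.
    Point location = new Point(-84.39, 33.75);

    boolean inside = GeometryEngine.contains(region, location, wgs84);
    System.out.println("record falls inside region: " + inside);
  }
}
```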
