Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 12, 2014

The Road to Summingbird:…

Filed under: Hadoop,MapReduce,Summingbird,Tweets — Patrick Durusau @ 8:37 pm

The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie.

Description:

Twitter’s Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at “scale” yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it’s not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup’s code from day one. You’ll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

A slide deck that will make you regret missing the presentation.

I wasn’t able to find a video of Sam’s presentation at Data Day Texas 2014, but I did find a collection of his presentations, including some videos, at: http://sritchie.github.io/.

Valuable lessons for startups and others.
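
If “monoids and semigroups” sounds like overkill for day one, the underlying point is small: make your aggregations associative and give them an identity, and the same combine step works for batch, streaming, or a mix of both. A minimal sketch of that property in plain Python (mine, not Summingbird code):

from functools import reduce

def plus(a, b):
    """Monoid 'plus' for word-count dicts: associative, with {} as the identity."""
    out = dict(a)
    for key, count in b.items():
        out[key] = out.get(key, 0) + count
    return out

partials = [{"hadoop": 2}, {"storm": 1, "hadoop": 1}, {"storm": 4}]

# Associativity means any grouping of partial results gives the same answer,
# so batch jobs and streaming updates can share one combine function.
assert reduce(plus, partials, {}) == plus(partials[0], plus(partials[1], partials[2]))
assert reduce(plus, partials, {}) == {"hadoop": 3, "storm": 5}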

January 10, 2014

SIMR – Spark on top of Hadoop

Filed under: Hadoop,Spark — Patrick Durusau @ 3:58 pm

SIMR – Spark on top of Hadoop by Danny Bickson.

From the post:

Just learned from my collaborator Aapo Kyrola that the Spark team have now released a plugin which allows running Spark on top of Hadoop, without installing anything and without administrator privileges. This will probably encourage many more companies to try out Spark, which significantly improves on Hadoop performance.

The tools for data are getting easier to use every day.

Which moves the semantic wall a little closer with each improvement.

Efficiently processing terabytes of data, only to confess it isn’t clear what the data may or may not mean, isn’t going to win IT any friends.

The Cloudera Developer Newsletter: It’s For You!

Filed under: BigData,Cloudera,Hadoop — Patrick Durusau @ 12:01 pm

The Cloudera Developer Newsletter: It’s For You! by Justin Kestelyn.

From the post:

Developers and data scientists, we realize you’re special – as are operators and analysts, in their own particular ways.

For that reason, we are very happy to kick off 2014 with a new free service designed for you and other technical end-users in the Cloudera ecosystem: the Cloudera Developer Newsletter.

This new email-based newsletter contains links to a curated list of new how-to’s, docs, tools, engineer and community interviews, training, projects, conversations, videos, and blog posts to help you get a new Apache Hadoop-based enterprise data hub deployment off the ground, or get the most value out of an existing deployment. Look for a new issue every month!

All you have to do is click the button below, provide your name and email address, tick the “Developer Community” check-box, and submit. Done! (Of course, you can also opt-in to several other communication channels if you wish.)

The first newsletter is due to appear at the end of January, 2014.

Given the quality of other Cloudera resources I look forward to this newsletter with anticipation!

January 3, 2014

Hadoop Map Reduce For Google web graph

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 2:56 pm

Hadoop Map Reduce For Google web graph

A good question from Stackoverflow:

we have been given as an assignment the task of creating map reduce functions that will output for each node n in the google web graph list the nodes that you can go from node n in 3 hops. (The actual data can be found here: http://snap.stanford.edu/data/web-Google.html)

The answer on Stackoverflow does not provide a solution (it is a homework assignment) but does walk through an explanation of using MapReduce for graph computations.

If you are thinking about using Hadoop for large graph processing, you are likely to find this very useful.
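
Without doing the homework for anyone, here is a toy sketch (single process, plain Python, no Hadoop) of the pattern such an answer describes: each round emits a node’s adjacency list alongside “who has reached this node” markers, the reducer joins the two, and three rounds of that give you 3-hop reachability. The function names and the tiny example graph are mine.

from collections import defaultdict

def map_hop(adjacency, frontier):
    # Emit each node's out-edges, plus markers saying which origins have reached it.
    for origin, nodes in frontier.items():
        for node in nodes:
            yield node, ("ORIGIN", origin)
    for node, neighbors in adjacency.items():
        yield node, ("ADJ", neighbors)

def reduce_hop(pairs):
    grouped = defaultdict(list)
    for node, value in pairs:
        grouped[node].append(value)
    nxt = defaultdict(set)
    for node, values in grouped.items():
        adjacency_lists = [v for tag, v in values if tag == "ADJ"]
        origins = [v for tag, v in values if tag == "ORIGIN"]
        for origin in origins:
            for neighbors in adjacency_lists:
                nxt[origin].update(neighbors)   # origin reaches node's neighbors next hop
    return nxt

adjacency = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
reached = {n: set() for n in adjacency}
frontier = {n: {n} for n in adjacency}          # hop 0: every node has reached itself
for _ in range(3):                              # three rounds = three hops
    frontier = reduce_hop(map_hop(adjacency, frontier))
    for origin, nodes in frontier.items():
        reached[origin] |= nodes
print(reached)                                  # A reaches {'B', 'C', 'D'} within 3 hops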

December 30, 2013

Ready to learn Hadoop?

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 5:59 pm

Ready to learn Hadoop?

From the webpage:

Sign up for the challenge of learning the basics of Hadoop in two weeks! You will get one email every day for the next 14 days.

  • Hello World: Overview of Hadoop
  • Data Processing Using Apache Hadoop
  • Setting up ODBC Connections
  • Connecting to Enterprise Applications
  • Data Integration and ETL
  • Data Analytics
  • Data Visualization
  • Hadoop Use Cases: Web
  • Hadoop Use Cases: Business
  • Recap

You could do this entirely on your own but the daily email may help.

If nothing else, it will be a reminder that something fun is waiting for you after work.

Enjoy!

December 24, 2013

‘Hadoop Illuminated’ Book

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:41 am

‘Hadoop Illuminated’ Book by Mark Kerzner and Sujee Maniyam.

From the webpage:

Gentle Introduction of Hadoop and Big Data

Get the book…

HTML – multipage

HTML : single page

PDF

We are writing a book on Hadoop with the following goals and principles.

More of a great outline for a Hadoop book than a great Hadoop book at present.

However, it is also the perfect opportunity for you to try your hand at clear, readable introductory prose on Hadoop. (That isn’t as easy as it sounds.)

As a special treat, there is a Hadoop Coloring Book for Kids. (Send more art for the coloring book as well.)

I especially appreciate the coloring book because I don’t have any coloring books. Did I mention I have a small child coming to visit during the holidays? 😉

PS: Has anyone produced a sort algorithm coloring book?

December 21, 2013

Accumulo Comes to CDH

Filed under: Accumulo,Cloudera,Hadoop,NSA — Patrick Durusau @ 7:11 pm

Accumulo Comes to CDH by Sean Busbey, Bill Havanki, and Mike Drob.

From the post:

Cloudera is pleased to announce the immediate availability of its first release of Accumulo packaged to run under CDH, our open source distribution of Apache Hadoop and related projects and the foundational infrastructure for Enterprise Data Hubs.

Accumulo is an open source project that provides the ability to store data in massive tables (billions of rows, millions of columns) for fast, random access. Accumulo was created and contributed to the Apache Software Foundation by the National Security Agency (NSA), and it has quickly gained adoption as a Hadoop-based key/value store for applications that require access to sensitive data sets. Cloudera provides enterprise support with the RTD Accumulo add-on subscription for Cloudera Enterprise.

This release provides Accumulo 1.4.3 tested for use under CDH 4.3.0. The release includes a significant number of backports and fixes to allow use with CDH 4’s highly available, production-ready packaging of HDFS. As a part of our commitment to the open source community, these changes have been submitted back upstream.

At least with Accumulo, you know you are getting NSA vetted software.

Can’t say the same thing for RSA software.

Enterprise customers need to demand open source software that reserves commercial distribution rights to its source.

For self-preservation if no other reason.

CDK Becomes “Kite SDK”

Filed under: Cloudera,Hadoop — Patrick Durusau @ 1:44 pm

Cloudera Development Kit is Now “Kite SDK” by Ryan Blue.

From the post:

CDK has a new monicker, but the goals remain the same.

We are pleased to announce a new name for the Cloudera Development Kit (CDK): Kite. We’ve just released Kite version 0.10.0, which is purely a rename of CDK 0.9.0.

The new repository and documentation are here:

Why the rename?

The original goal of CDK was to increase accessibility to the Apache Hadoop platform by developers. That goal isn’t Cloudera-specific, and we want the name to more forcefully reflect the open, community-driven character of the project.

Will this change break anything?

The rename mainly affects dependencies and package names. Once imports and dependencies are updated, almost everything should work the same. However, there are a couple of configuration changes to make for anyone using Apache Flume or Morphlines. The changes are detailed on our migration page.

The continuation of Kite SDK version 0.10.0 alongside Cloudera Development Kit 0.9.0 should make some aspects of the name transition easier.

However, when you search for CDK 0.9.0, are you going to get “hits” for the Kite SDK 0.10.0? Such as blog posts, tutorials, code, etc.

I suspect not. The reverse won’t work either.

So we have relevant material that is indexed under two different names, names a user will have to remember in order to get all the relevant results.

Defining a synonym table works for cases like this but does have one shortfall.

Will the synonym table make sense to us in ten (10) years? Or in twenty (20) years?

There is no guarantee that even a synonym mapping based on disclosed properties will remain intelligible for X number of years.

But if long term data access is mission critical, something more than blind synonym mappings needs to be done.
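
For the sake of concreteness, here is what the difference looks like in miniature (all names and values below are mine): a bare synonym table versus the same mapping with the properties that justify it disclosed alongside.

# A blind synonym table: it works today, but carries no reasons.
synonyms = {"CDK": "Kite SDK", "CDK 0.9.0": "Kite SDK 0.10.0"}

# The same mapping with its identifying properties disclosed, so a reader
# in ten or twenty years can see *why* the two names were treated as one subject.
subject = {
    "names": ["Cloudera Development Kit (CDK)", "Kite SDK"],
    "versions": {"CDK": "0.9.0", "Kite SDK": "0.10.0"},
    "basis": "Kite 0.10.0 is purely a rename of CDK 0.9.0 (see the post above)",
}

def expand(query, table=synonyms):
    """Return the query plus any synonyms, for use in search expansion."""
    hits = {query}
    hits.update(v for k, v in table.items() if k == query)
    hits.update(k for k, v in table.items() if v == query)
    return hits

print(expand("CDK 0.9.0"))   # {'CDK 0.9.0', 'Kite SDK 0.10.0'}

The first still works in ten years only if someone remembers why it was written; the second carries its own reasons.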

December 20, 2013

3rd Annual Federal Big Data Apache Hadoop Forum

Filed under: BigData,Cloudera,Conferences,Hadoop — Patrick Durusau @ 6:59 pm

3rd Annual Federal Big Data Apache Hadoop Forum

From the webpage:

Registration is now open for the third annual Federal Big Data Apache Hadoop Forum! Join us on Thurs., Feb. 6, as leaders from government and industry convene to share Big Data best practices. This is a must attend event for any organization or agency looking to be information-driven and give access to more data to more resources and applications. During this informative event you will learn:

  • Key trends in government today and the role Big Data plays in driving transformation;
  • How leading agencies are putting data to good use to uncover new insight, streamline costs, and manage threats;
  • The role of an Enterprise Data Hub, and how it is a game changing data management platform central to any Big Data strategy today.

Get the most from all your data assets, analytics, and teams to enable your mission, efficiently and on budget. Register today and discover how Cloudera and an Enterprise Data Hub can empower you and your teams to do more with Big Data.

A Cloudera fest but I don’t think they will be searching people for business cards at the door. 😉

An opportunity for you to meet and greet, make contacts, etc.

I first saw this in a tweet by Bob Gourley.

December 13, 2013

Storm Technical Preview Available Now!

Filed under: Hadoop,Hortonworks,Storm — Patrick Durusau @ 5:09 pm

Storm Technical Preview Available Now! by Himanshu Bari.

From the post:

In October, we announced our intent to include and support Storm as part of Hortonworks Data Platform. With this commitment, we also outlined and proposed an open roadmap to improve the enterprise readiness of this key project. We are committed to doing this with a 100% open source approach and your feedback is immensely valuable in this process.

Today, we invite you to take a look at our Storm technical preview. This preview includes the latest release of Storm with instructions on how to install Storm on Hortonworks Sandbox and run a sample topology to familiarize yourself with the technology. This is the final pre-Apache release of Storm.

You know this but I wanted to emphasize how your participation in alpha/beta/candidate/preview releases benefits not only the community but yourself as well.

Bugs that are found and squashed now won’t bother you (or anyone else) later in production.

Not to mention you get to exercise your skills before the software, and your use of it, becomes routine.

Enjoy the weekend!

December 11, 2013

…Graph Analytics

Filed under: Graphs,Gremlin,Hadoop,Titan — Patrick Durusau @ 8:42 pm

Big Data in Security – Part III: Graph Analytics by Levi Gundert.

In interview form with Michael Howe and Preetham Raghunanda.

You will find two parts of the exchange particularly interesting:

You mention very large technology companies, obviously Cisco falls into this category as well — how is TRAC using graph analytics to improve Cisco Security products?

Michael: How we currently use graph analytics is an extension of the work we have been doing for some time. We have been pulling data from different sources like telemetry and third-party feeds in order to look at the relationships between them, which previously required a lot of manual work. We would do analysis on one source and analysis on another one and then pull them together. Now because of the benefits of graph technology we can shift that work to a common view of the data and give people the ability to quickly access all the data types with minimal overhead using one tool. Rather than having to query multiple databases or different types of data stores, we have a polyglot store that pulls data in from multiple types of databases to give us a unified view. This allows us two avenues of investigation: one, security investigators now have the ability to rapidly analyze data as it arrives in an ad hoc way (typically used by security response teams) and the response times dramatically drop as they can easily view related information in the correlations. Second are the large-scale data analytics. Folks with traditional machine learning backgrounds can apply algorithms that did not work on previous data stores and now they can apply those algorithms across a well-defined data type – the graph.

For intelligence analysts, being able to pivot quickly across multiple disparate data sets from a visual perspective is crucial to accelerating the process of attribution.

Michael: Absolutely. Graph analytics is enabling a much more agile approach from our research and analysis teams. Previously when something of interest was identified there was an iterative process of query, analyze the results, refine the query, wash, rinse, and repeat. This process moves from taking days or hours down to minutes or seconds. We can quickly identify the known information, but more importantly, we can identify what we don’t know. We have a comprehensive view that enables us to identify data gaps to improve future use cases.

Did you catch the “…to a common view of the data…” caveat in the third sentence of Michael’s first reply?

Not to deny the usefulness of Titan (the graph solution being discussed) but to point out that current graphs require normalization of data.

For Cisco, that is a winning solution.

But then Cisco can use a closed solution based on normalized data.

Importing, analyzing and then returning results to heterogeneous clients could require a different approach.

Or if you have legacy data that spans centuries.

Or even agencies, departments, or work groups.
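
To make the normalization point concrete, here is a toy sketch (mine, not Cisco’s pipeline, with made-up field names) of what “a common view of the data” implies: records from two feeds have to be mapped onto one vertex vocabulary before any graph query can see them together.

# Two feeds describing the same host, in different vocabularies.
telemetry = {"host_ip": "10.0.0.5", "alert": "beaconing", "ts": "2013-12-10T12:00Z"}
third_party = {"ip_addr": "10.0.0.5", "category": "botnet-c2"}

# Normalization: everything becomes vertices in one agreed schema.
def normalize(record, mapping):
    return {common: record[local] for common, local in mapping.items() if local in record}

vertices = [
    normalize(telemetry, {"ip": "host_ip", "observation": "alert"}),
    normalize(third_party, {"ip": "ip_addr", "observation": "category"}),
]

# Only after the keys agree can a graph store treat both records as the same vertex.
by_ip = {}
for v in vertices:
    by_ip.setdefault(v["ip"], []).append(v["observation"])
print(by_ip)   # {'10.0.0.5': ['beaconing', 'botnet-c2']}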

December 10, 2013

How to analyze 100 million images for $624

Filed under: Hadoop,Image Processing,OpenCV — Patrick Durusau @ 3:47 pm

How to analyze 100 million images for $624 by Pete Warden.

From the post:

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

[photo omitted]

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap.

I use m1.xlarge servers on Amazon EC2, which are beefy enough to process two million Instagram-sized photos a day, and only cost $12.48! I’ve used some open source frameworks to distribute the work in a completely scalable way, so this works out to $624 for a 50-machine cluster that can process 100 million pictures in 24 hours. That’s just 0.000624 cents per photo! (I seriously do not have enough exclamation points for how mind-blowingly exciting this is.)
….

There are a couple of other components that are necessary to reach the same results as Pete.

See HIPI for processing photos on Hadoop, and OpenCV and the rest of Pete’s article for some very helpful tips.
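
Pete’s arithmetic is also worth checking against your own workload; a quick back-of-the-envelope version:

# Back-of-the-envelope check of Pete's numbers (m1.xlarge costs as quoted in the post).
machines = 50
cost_per_machine_per_day = 12.48        # USD per day, as quoted
photos_per_machine_per_day = 2000000

total_cost = machines * cost_per_machine_per_day          # 624.0 USD
total_photos = machines * photos_per_machine_per_day      # 100,000,000 photos in 24 hours
cents_per_photo = total_cost * 100 / total_photos
print(total_cost, total_photos, cents_per_photo)          # 624.0 100000000 0.000624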

December 3, 2013

Of Algebirds, Monoids, Monads, …

Filed under: BigData,Data Analysis,Functional Programming,Hadoop,Scala,Storm — Patrick Durusau @ 2:50 pm

Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll.

From the post:

Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.

Goal of this article

The main goal of this article is to spark your curiosity and motivation for Algebird and the concepts of monoids, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”

You can call this a “blog post” but I rarely see blog posts with a table of contents! 😉

The post should come with a warning: May require substantial time to read, digest, understand.

Just so you know, I was hooked by this paragraph early on:

So let me use a different example because adding Int values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?

Are you motivated?
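
If you want a preview of the answer to that money question: two Bloom filters built with the same size and hash functions combine by bit-wise OR, and that operation is associative with the empty filter as identity, which is exactly what lets Hadoop or Storm merge partial filters in any order. A stripped-down sketch, not Algebird’s implementation:

import hashlib

SIZE = 64  # tiny bit array, for illustration only

def bloom_add(bits, item, hashes=3):
    # Set one bit per (derived) hash function for the item.
    for i in range(hashes):
        digest = hashlib.sha1((str(i) + ":" + item).encode()).hexdigest()
        bits |= 1 << (int(digest, 16) % SIZE)
    return bits

def bloom_plus(a, b):
    """The monoid 'plus': the union of two filters is just bit-wise OR."""
    return a | b

empty = 0   # the identity element

shard1 = bloom_add(bloom_add(empty, "hadoop"), "storm")
shard2 = bloom_add(empty, "algebird")

combined = bloom_plus(shard1, shard2)
assert bloom_plus(combined, empty) == combined                     # identity
assert bloom_plus(bloom_plus(shard1, shard2), empty) == combined   # grouping doesn't matter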

I first saw this in a tweet by CompSciFact.

December 2, 2013

Modern Healthcare Architectures Built with Hadoop

Filed under: Hadoop,Health care,Hortonworks — Patrick Durusau @ 7:03 pm

Modern Healthcare Architectures Built with Hadoop by Justin Sears.

From the post:

We have heard plenty in the news lately about healthcare challenges and the difficult choices faced by hospital administrators, technology and pharmaceutical providers, researchers, and clinicians. At the same time, consumers are experiencing increased costs without a corresponding increase in health security or in the reliability of clinical outcomes.

One key obstacle in the healthcare market is data liquidity (for patients, practitioners and payers) and some are using Apache Hadoop to overcome this challenge, as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture and how Hadoop can ease the pain caused by poor data liquidity.

As you would guess, I like the phrase data liquidity. 😉

And Justin lays out the areas where we are going to find “poor data liquidity.”

Source data comes from:

  • Legacy Electronic Medical Records (EMRs)
  • Transcriptions
  • PACS
  • Medication Administration
  • Financial
  • Laboratory (e.g. SunQuest, Cerner)
  • RTLS (for locating medical equipment & patient throughput)
  • Bio Repository
  • Device Integration (e.g. iSirona)
  • Home Devices (e.g. scales and heart monitors)
  • Clinical Trials
  • Genomics (e.g. 23andMe, Cancer Genomics Hub)
  • Radiology (e.g. RadNet)
  • Quantified Self Sensors (e.g. Fitbit, SmartSleep)
  • Social Media Streams (e.g. FourSquare, Twitter)

But then I don’t see what part of the Hadoop architecture addresses the problem of “poor data liquidity.”

Do you?

I thought I had found it when Charles Boicey (in the UCIH case study) says:

“Hadoop is the only technology that allows healthcare to store data in its native form. If Hadoop didn’t exist we would still have to make decisions about what can come into our data warehouse or the electronic medical record (and what cannot). Now we can bring everything into Hadoop, regardless of data format or speed of ingest. If I find a new data source, I can start storing it the day that I learn about it. We leave no data behind.”

But that’s not “data liquidity,” not in any meaningful sense of the word. Dumping your data to paper would be just as effective and probably less costly.

To be useful, “data liquidity” must have a sense of being integrated with data from diverse sources. To present the clinician, researcher, health care facility, etc. with all the data about a patient, not just some of it.
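
The difference is easy to see in miniature. A toy sketch (field names below are invented, not from any of the sources Justin lists): storing records in their native form is the easy half; liquidity only shows up once someone maps those forms onto a common patient view.

# Native-form records from two hypothetical sources, exactly as ingested.
emr_record = {"MRN": "000123", "dx_code": "E11.9", "attending": "Dr. Smith"}
lab_record = {"patient_id": "000123", "test": "HbA1c", "result": 8.1}

# "Data liquidity" is the mapping work: a common key and a common vocabulary.
def to_common(record, key_field, field_map):
    common = {"patient": record[key_field]}
    common.update({new: record[old] for old, new in field_map.items() if old in record})
    return common

patient_view = [
    to_common(emr_record, "MRN", {"dx_code": "diagnosis", "attending": "physician"}),
    to_common(lab_record, "patient_id", {"test": "lab_test", "result": "lab_result"}),
]
print(patient_view)  # both records now speak about the same patient in the same terms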

I also checked the McKinsey & Company report “The ‘Big Data’ Revolution in Healthcare.” I didn’t expect them to miss the data integration question and they didn’t.

The second exhibit in the McKinsey and Company report (the full report):

[exhibit omitted: big data integration]

The part in red reads:

Integration of data pools required for major opportunities.

I take that to mean that in order to have meaningful healthcare reform, integration of health care data pools is the first step.

Do you disagree?

And if that’s true, that we need integration of health care data pools first, do you think Hadoop can accomplish that auto-magically?

I don’t either.

December 1, 2013

Hadoop on a Raspberry Pi

Filed under: Hadoop,Programming — Patrick Durusau @ 8:52 pm

Hadoop on a Raspberry Pi by Isaac Lopez

From the post:

Looking for a fun side project this winter? Jamie Whitehorn has an idea for you. He put Hadoop on a cluster of Raspberry Pi mini-computers. Sound ridiculous? For a student trying to learn Hadoop, it could be ridiculously cool.

For those who don’t know what a Raspberry Pi is, think of it as a computer on a credit card meets Legos. They’re little chunks of computing technology, complete with a Linux operating system, a 700MHz ARM11 processor, a low-power video processor and up to 512MB of Memory. Tinkerers can use it as the computing brains behind any number of applications that they design to their heart’s content. In a recent example, a Raspberry Pi enthusiast built a Raspberry Pi mini PC, which he used to control a mini CNC Laser engraver made out of an old set of salvaged DVD drives and $10 in parts from eBay. Ideas range from building a web server, a weather station, home automation systems, mini arcades – the list of projects is endless.

At the Strata + Hadoop World conference last month, Jamie Whitehorn shared his Hadoop Raspberry Pi creation with an audience. He discussed the challenges a student has in learning the Hadoop system. Chiefly, it’s a distributed architecture that requires multiple computers to operate. Someone looking to build Hadoop skills in a test environment would need several machines, and quite an electricity bill to get a cluster up – a prospect that can be very expensive for a student.

Whitehorn makes the point that while it’s true that this can all be avoided using a Hadoop cloud service, he says that defeats the point, which is understanding the interaction between the software and the hardware. The whole point of the exercise, he explains, is to face the complexity of the project and overcome it.

Whitehorn says that he’s learned a lot about Hadoop from attempting the project, and encourages others to get in on the action. For anyone who is interested in doing that, he has posted a blog entry that discusses his approach and some of the nuances that can be found here.

If you want to learn Hadoop close to the metal, or closer than usual, this is the project for you!

November 26, 2013

CDH 4.5, Manager 4.8, Impala 1.2.1, Search 1.1

Filed under: Cloudera,Hadoop,Impala,MapReduce — Patrick Durusau @ 3:13 pm

Announcing: CDH 4.5, Cloudera Manager 4.8, Cloudera Impala 1.2.1, and Cloudera Search 1.1

Before your nieces and nephews (at least in the U.S.) start chewing up your bandwidth over the Thanksgiving Holidays, you may want to grab the most recent releases from Cloudera.

If you are traveling, it will give you something to do during airport delays. 😉

November 21, 2013

Setting up a Hadoop cluster

Filed under: Documentation,Hadoop,Topic Maps — Patrick Durusau @ 6:36 pm

Setting up a Hadoop cluster – Part 1: Manual Installation by Lars Francke.

From the post:

In the last few months I was tasked several times with setting up Hadoop clusters. Those weren’t huge – two to thirteen machines – but from what I read and hear this is a common use case especially for companies just starting with Hadoop or setting up a first small test cluster. While there is a huge amount of documentation in form of official documentation, blog posts, articles and books most of it stops just where it gets interesting: Dealing with all the stuff you really have to do to set up a cluster, cleaning logs, maintaining the system, knowing what and how to tune etc.

I’ll try to describe all the hoops we had to jump through and all the steps involved to get our Hadoop cluster up and running. Probably trivial stuff for experienced Sysadmins but if you’re a Developer and finding yourself in the “Devops” role all of a sudden I hope it is useful to you.

While working at GBIF I was asked to set up a Hadoop cluster on 15 existing and 3 new machines. So the first interesting thing about this setup is that it is a heterogeneous environment: Three different configurations at the moment. This is where our first goal came from: We wanted some kind of automated configuration management. We needed to try different cluster configurations and we need to be able to shift roles around the cluster without having to do a lot of manual work on each machine. We decided to use a tool called Puppet for this task.

While Hadoop is not currently in production at GBIF there are mid- to long-term plans to switch parts of our infrastructure to various components of the HStack. Namely MapReduce jobs with Hive and perhaps Pig (there is already strong knowledge of SQL here) and also storing of large amounts of raw data in HBase to be processed asynchronously (~500 million records until next year) and indexed in a Lucene/Solr solution possibly using something like Katta to distribute indexes. For good measure we also have fairly complex geographic calculations and map-tile rendering that could be done on Hadoop. So we have those 18 machines and no real clue how they’ll be used and which services we’d need in the end.

Dated, 2011, but illustrates some of the issues I raised in: Hadoop Ecosystem Configuration Woes?

Do you keep this level of documentation on your Hadoop installs?

I first saw this in a tweet by Marko A. Rodriguez.

Putting Spark to Use:…

Filed under: Hadoop,MapReduce,Spark — Patrick Durusau @ 5:43 pm

Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications by Justin Kestelyn.

From the post:

Apache Hadoop has revolutionized big data processing, enabling users to store and process huge amounts of data at very low costs. MapReduce has proven to be an ideal platform to implement complex batch applications as diverse as sifting through system logs, running ETL, computing web indexes, and powering personal recommendation systems. However, its reliance on persistent storage to provide fault tolerance and its one-pass computation model make MapReduce a poor fit for low-latency applications and iterative computations, such as machine learning and graph algorithms.

Apache Spark addresses these limitations by generalizing the MapReduce computation model, while dramatically improving performance and ease of use.

Fast and Easy Big Data Processing with Spark

At its core, Spark provides a general programming model that enables developers to write applications by composing arbitrary operators, such as mappers, reducers, joins, group-bys, and filters. This composition makes it easy to express a wide array of computations, including iterative machine learning, streaming, complex queries, and batch.

In addition, Spark keeps track of the data that each of the operators produces, and enables applications to reliably store this data in memory. This is the key to Spark’s performance, as it allows applications to avoid costly disk accesses. As illustrated in the figure below, this feature enables:

I would not use the following example to promote Spark:

One of Spark’s most useful features is the interactive shell, bringing Spark’s capabilities to the user immediately – no IDE and code compilation required. The shell can be used as the primary tool for exploring data interactively, or as means to test portions of an application you’re developing.

The screenshot below shows a Spark Python shell in which the user loads a file and then counts the number of lines that contain “Holiday”.

[screenshot omitted: Spark Python shell example]
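
Roughly, the shell session in that screenshot amounts to the following (the HDFS path is my guess):

# In the pyspark shell, sc (the SparkContext) is already defined for you.
file = sc.textFile("hdfs:///user/cloudera/WarAndPeace.txt")
file.filter(lambda line: "Holiday" in line).count()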

Isn’t that just:

grep Holiday WarAndPeace.txt | wc -l
15

?

Grep doesn’t require an IDE or compilation either. Of course, grep isn’t reading from an HDFS file.

The “file.filter(lambda line: "Holiday" in line).count()” works but some of us prefer the terseness of Unix.

Unix text tools for HDFS?

November 20, 2013

Learning MapReduce:…[Of Ethics and Self-Interest]

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:57 pm

Learning MapReduce: Everywhere and For Everyone

From the post:

Tom White, author of Hadoop: The Definitive Guide, recently celebrated his five-year anniversary at Cloudera with a blog post reflecting on the early days of Big Data and what has changed and remained since 2008. Having just seen Tom in New York at the biggest and best Hadoop World to date, I’m struck by the poignancy of his earliest memories. Even then, Cloudera’s projects were focused on broadening adoption and building the community by writing effective training material, integrating with other systems, and building on the core open source. The founding team had a vision to make Apache Hadoop the focal point of an accessible, powerful, enterprise-ready Big Data platform.

Today, Cloudera is working harder than ever to help companies deploy Hadoop as part of an Enterprise Data Hub. We’re just as committed to a healthy and vibrant open-source community, have a lively partner ecosystem over 700 strong, and have contributed innovations that make data access and analysis faster, more secure, more relevant, and, ultimately, more profitable.

However, with all these successes in driving Hadoop towards the mainstream and providing a new and dynamic data engine, the fact remains that broadening adoption at the end-user level remains job one. Even as Cloudera unifies the Big Data stack, the availability of talent to drive operations and derive full value from massive data falls well short of the enormous demand. As more companies across industries adopt Hadoop and build out their Big Data strategies focused on the Enterprise Data Hub, Cloudera has expanded its commitment to educating technologists of all backgrounds on Hadoop, its applications, and its systems.

A Partnership to Cultivate Hadoop Talent

We at Cloudera University are proud to announce a new partnership with Udacity, a leader in open, online professional education. We believe in Udacity’s vision to democratize professional development by making technical training affordable and accessible to everyone, and this model will enable us to reach aspiring Big Data practitioners around the world who want to expand their skills into Hadoop.

Our first Udacity course, Introduction to Hadoop and MapReduce, guides learners from an understanding of Big Data to the basics of Hadoop, all the way through writing your first MapReduce program. We partnered directly with Udacity’s development team to build the most engaging online Hadoop course available, including demonstrative instruction, interactive quizzes, an interview with Hadoop co-founder Doug Cutting, and a hands-on project using live data. Most importantly, the lessons are self-paced, open, and based on Cloudera’s insights into industry best practices and professional requirements.

Cloudera, and to be fair, others, have adopted a strategy of self-interest that is also ethical.

They are literally giving away the knowledge and training to use a free product. Think of it as a rising tide that floats all boats higher.

The more popular and widely used Hadoop/MapReduce become, the greater the demand for professional training and services from Cloudera (and others).

You may experiment or even run a local cluster, but if you are a Hadoop newbie, who are you going to call when it is a mission-critical application? (Hopefully professionals but there’s no guarantee on that.)

You don’t have to build silos or closed communities to be economically viable.

Delivering professional services for a popular technology seems to do the trick.

November 15, 2013

In Praise of “Modest Data”

Filed under: BigData,Data Mining,Hadoop — Patrick Durusau @ 8:22 pm

From Big Data to Modest Data by Chuck Hollis.

Mea culpa.

Several years ago, I became thoroughly enamored with the whole notion of Big Data.

I, like many, saw a brave new world of petabyte-class data sets, gleaned through by trained data science professionals using advanced algorithms — all in the hopes of bringing amazing new insights to virtually every human endeavor.

It was pretty heady stuff — and still is.

While that vision certainly is coming to pass in many ways, there’s an interesting, distinct and separate offshoot: use of big data philosophies and toolsets — but being applied to much smaller use cases with far less ambitious goals.

Call it Modest Data for lack of a better term.

No rockstars, no glitz, no glam, no amazing keynote speeches — just ordinary people getting closer to their data more efficiently and effectively than before.

That’s the fun part about technology: you put the tools in people’s hands, and they come up with all sorts of interesting ways to use it — maybe quite differently than originally intended.

Master of the metaphor, Chuck manages to talk about “big data,” “teenage sex,” “rock stars,” “Hadoop,” “business data,” and “modest data,” all in one entertaining and useful post.

While the Hadoop eco-system can handle “big data,” it also brings new capabilities to processing less than “big data,” or what Chuck calls “modest data.”

Very much worth your while to read Chuck’s post and see if your “modest” data can profit from “big data” tools.

November 14, 2013

Cloudera + Udacity = Hadoop MOOC!

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 1:54 pm

Cloudera and Udacity partner to offer Data Science training courses by Lauren Hockenson.

From the post:

After launching the Open Education Alliance with some of the biggest tech companies in Silicon Valley, Udacity has forged a partnership with Cloudera to bring comprehensive Data Science curriculum to a massively open online course (MOOC) format in a program called Cloudera University — allowing anyone to learn the intricacies of Hadoop and other Data Science methods.

“Recognizing the growing demand for skilled data professionals, more students are seeking instruction in Hadoop and data science in order to prepare themselves to take advantage of the rapidly expanding data economy,” said Sebastian Thrun, founder of Udacity, in a press release. “As the leader in Hadoop solutions, training, and services, Cloudera’s insights and technical guidance are in high demand, so we are pleased to be leveraging that experience and expertise as their partner in online open courseware,”

The first offering to come via Cloudera University will be “Introduction to Hadoop and MapReduce,” a three-lesson course that serves as a precursor to the program’s larger, track-based training already in place. While Cloudera already offers many of these courses in Data Science, as well as intensive certificate training programs, in an in-person setting, it seems that the partnership with Udacity will translate curriculum that Cloudera has developed into a more palatable format for online learning.

Looking forward to Cloudera University reflecting all of the Hadoop eco-system.

In the meantime, there are a number of online training resources already available at Cloudera.

November 12, 2013

Using Solr to Search and Analyze Logs

Filed under: Hadoop,Log Analysis,logstash,Lucene,Solr — Patrick Durusau @ 4:07 pm

Using Solr to Search and Analyze Logs by Radu Gheorghe.

From the description:

Since we’ve added Solr output for Logstash, indexing logs via Logstash has become a possibility. But what if you are not using (only) Logstash? Are there other ways you can index logs in Solr? Oh yeah, there are! The following slides are from the Lucene Revolution conference that just took place in Dublin, where we talked about indexing and searching logs with Solr.

Slides but a very good set of slides.

Radu’s post reminds me I overlooked logs in the Hadoop eco-system when describing semantic diversity (Hadoop Ecosystem Configuration Woes?).

Or for that matter, how do you link up the logs with particular configuration or job settings?

Emails to the support desk and sticky notes don’t seem equal to the occasion.
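
One hedged sketch of what “linking up” could look like: when indexing a log event into Solr, carry a configuration fingerprint along as fields, so a later search can pivot from an error back to the exact settings in force. The pysolr client, the core name, and the field names here are my choices, not Radu’s.

import hashlib
import json

import pysolr   # assumes a Solr core named "logs" is running locally

solr = pysolr.Solr("http://localhost:8983/solr/logs", timeout=10)

job_config = {"mapreduce.map.memory.mb": 2048, "io.sort.mb": 512}
config_id = hashlib.sha1(json.dumps(job_config, sort_keys=True).encode()).hexdigest()

solr.add([{
    "id": "job_201311120001-attempt-7",
    "level": "ERROR",
    "message": "Container killed on request. Exit code is 143",
    "config_id": config_id,                 # ties the log line to a config snapshot
    "config_json": json.dumps(job_config),  # or store the snapshot elsewhere and reference it
}])
solr.commit()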

November 11, 2013

Hadoop – 100x Faster… [With NO ETL!]

Filed under: ETL,Hadoop,HDFS,MapReduce,Topic Maps — Patrick Durusau @ 8:32 pm

Hadoop – 100x Faster. How we did it… by Nikita Ivanov.

From the post:

Almost two years ago, Dmitriy and I stood in front of a white board at GridGain’s office thinking: “How can we deliver the real-time performance of GridGain’s in-memory technology to Hadoop customers without asking them to rip and replace their systems and without asking them to move their datasets off Hadoop?”.

Given Hadoop’s architecture – the task seemed daunting; and it proved to be one of the more challenging engineering puzzles we have had to solve.

After two years of development, tens of thousands of lines of Java, Scala and C++ code, multiple design iterations, several releases and dozens of benchmarks later, we finally built a product that can deliver real-time performance to Hadoop customers with seamless integration and no tedious ETL. Actual customers deployments can now prove our performance claims and validate our product’s architecture.

Here’s how we did it.

The Idea – In-Memory Hadoop Accelerator

Hadoop is based on two primary technologies: HDFS for storing data, and MapReduce for processing these data in parallel. Everything else in Hadoop and the Hadoop ecosystem sits atop these foundation blocks.

Originally, neither HDFS nor MapReduce were designed with real-time performance in mind. In order to deliver real-time processing without moving data out of Hadoop onto another platform, we had to improve the performance of both of these subsystems. (emphasis added)

The highlighted phrase is the key, isn’t it?

In order to deliver real-time processing without moving data out of Hadoop onto another platform

ETL is down time, expense and risk of data corruption.

Given a choice between making your current data platform (of whatever type) more robust or risking a migration to a new data platform, which one would you choose?

Bear in mind those 2.5 million spreadsheets that Felienne mentions in her presentation.

Are you really sure you want to ETL all your data?

As opposed to making your most critical data more robust and enhanced by other data? All while residing where it lives right now.

Are you ready to get off the ETL merry-go-round?

November 9, 2013

Hue: New Search feature: Graphical facets

Filed under: Hadoop,Hue — Patrick Durusau @ 4:54 pm

Hue: New Search feature: Graphical facets

A very short video demonstrating graphical facets in Hue.

If you aren’t already interested in Hue, you will be!

November 8, 2013

How to use R … in MapReduce and Hive

Filed under: Hadoop,Hive,Hortonworks,R — Patrick Durusau @ 7:28 pm

How to use R and other non-Java languages in MapReduce and Hive by Tom Hanlon.

From the post:

I teach for Hortonworks and in class just this week I was asked to provide an example of using the R statistics language with Hadoop and Hive. The good news was that it can easily be done. The even better news is that it is actually possible to use a variety of tools: Python, Ruby, shell scripts and R to perform distributed fault tolerant processing of your data on a Hadoop cluster.

In this blog post I will provide an example of using R, http://www.r-project.org with Hive. I will also provide an introduction to other non-Java MapReduce tools.

If you wanted to follow along and run these examples in the Hortonworks Sandbox you would need to install R.

The Hortonworks Sandbox just keeps getting better!
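
The mechanism that makes this language-agnosticism possible is Hadoop Streaming (and Hive’s TRANSFORM), which pipes records through any executable over stdin/stdout. Tom’s example uses R; the same shape in Python looks roughly like this (the jar path, file name, and invocation are assumptions, not from the post):

#!/usr/bin/env python
# Minimal Hadoop Streaming word count; the same script plays mapper or reducer.
# Example invocation (paths assumed):
#   hadoop jar hadoop-streaming.jar -input in -output out \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

def reducer():
    # Streaming sorts map output by key, so identical words arrive grouped together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(current + "\t" + str(total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()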

Sqooping Data with Hue

Filed under: Cloudera,Hadoop,Hue — Patrick Durusau @ 4:47 pm

Sqooping Data with Hue by Abraham Elmahrek.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.

Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with. Also, any client can communicate with the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen instead of its predecessors because of its client-server design.

I knew I was missing one or more Hadoop ecosystem components yesterday! Hadoop Ecosystem Configuration Woes? I left out Hue, but also some others.

The Hadoop “ecosystem” varies depending on which open source supporter you read. I didn’t take the time to cross-check my list against all the major supporters. Will be correcting that over the weekend.

This will give you something “practical” to do over the weekend. 😉

November 7, 2013

Hadoop Ecosystem Configuration Woes?

Filed under: Documentation,Hadoop,Topic Maps — Patrick Durusau @ 3:15 pm

After listening to Kathleen Ting (Cloudera) describe how 44% of support tickets for the Hadoop ecosystem arise from misconfiguration (Dealing with Data in the Hadoop Ecosystem…), I started to wonder how many opportunities there are for misconfiguration in the Hadoop ecosystem?

That’s probably not an answerable question, but we can look at how configurations are documented in the Hadoop ecosystem:

Comment syntax in the Hadoop ecosystem:

  • Accumulo – XML <!-- comment -->
  • Avro – Schemas defined in JSON (no comment facility)
  • Cassandra – “#” comment indicator
  • Chukwa – XML <!-- comment -->
  • Falcon – XML <!-- comment -->
  • Flume – “#” comment indicator
  • Hadoop – XML <!-- comment -->
  • Hama – XML <!-- comment -->
  • HBase – XML <!-- comment -->
  • Hive – XML <!-- comment -->
  • Knox – XML <!-- comment -->
  • Mahout – XML <!-- comment -->
  • PIG – C style comments
  • Sqoop – “#” comment indicator
  • Tez – XML <!-- comment -->
  • ZooKeeper – text but no apparent ability to comment (Zookeeper Administrator’s Guide)

I read that to mean:

1 Component, Pig uses C style comments

2 Components, Avro and ZooKeeper, have no ability for comments at all.

3 Components, Cassandra, Flume and Sqoop use “#” for comments

10 Components, Accumulo, Chukwa, Falcon, Hama, Hadoop, HBase, Hive, Knox, Mahout and Tez presumably support XML comments

A full one third of the Hadoop ecosystem uses non-XML comments, if comments are permitted at all. The other two-thirds of the ecosystem uses XML comments in some files and not others.

The entire ecosystem lacks a standard way to associate values or settings in one component with values or settings in another component.

To say nothing of associating values or settings with releases of different components.

Without looking at the details of the possible settings for each component, does that seem problematic to you?
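
As a small illustration of the cross-component problem, here is a hedged sketch of the kind of check you end up writing by hand: pull the same logical setting (the ZooKeeper quorum, say) out of two differently formatted config files and compare them. The file names and property keys are only examples.

import xml.etree.ElementTree as ET

def xml_props(path):
    """Read a Hadoop-style *-site.xml file into a dict of name -> value."""
    root = ET.parse(path).getroot()
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

def hash_props(path):
    """Read a '#'-commented key=value file (Flume/Sqoop style) into a dict."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props

# Example: does everyone agree on the ZooKeeper quorum? (paths and keys are illustrative)
hbase = xml_props("hbase-site.xml").get("hbase.zookeeper.quorum")
flume = hash_props("flume.conf").get("agent.channels.c1.zookeeperConnect")
if hbase != flume:
    print("Mismatch:", hbase, "vs", flume)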

Dealing with Data in the Hadoop Ecosystem…

Filed under: Cloudera,Data,Hadoop — Patrick Durusau @ 1:15 pm

Dealing with Data in the Hadoop Ecosystem – Hadoop, Sqoop, and ZooKeeper by Rachel Roumeliotis.

From the post:

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram … [Discussed at 0:22]

  • ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
  • Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
  • Sqoop is a bulk data transfer tool [Discussed at 2:47]
  • Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
  • ZooKeeper is not for storage, but coordination, reliability, availability [Discussed at 4:44]

A conference interview, so not deep, but interesting.

For example, the interview reports that 44% of production errors could be traced to misconfiguration errors.

November 5, 2013

Hadoop for Data Science: A Data Science MD Recap

Filed under: Data Science,Hadoop — Patrick Durusau @ 2:02 pm

Hadoop for Data Science: A Data Science MD Recap by Matt Motyka.

From the post:

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the hadoop framework can help. To start the presentation, Don was very clear about one thing: hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real world value. He also spoke about data cleanliness in a very Baz Luhrmann Wear Sunscreen sort of way, offering that as his biggest piece of advice.

What?

Hadoop is not a panacea for every data problem????

😉

Don’t panic when you start the video. The ads, etc., take almost seven (7) minutes but Dr. Miner is on the way.

Update: Slides for Hadoop for Data Science. Enjoy!

October 29, 2013

Hadoop Weekly – October 28, 2013

Filed under: Hadoop,HBase,Hive,Parquet,Pig,Zookeeper — Patrick Durusau @ 7:06 pm

Hadoop Weekly – October 28, 2013 by Joe Crobak.

A weekly blog post that tracks all things in the Hadoop ecosystem.

I will keep posting on Hadoop things of particular interest for topic maps but will also be pointing to this blog for those who want/need more Hadoop coverage.
