Archive for the ‘Cloudera’ Category
Wednesday, June 5th, 2013
Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.
From the post:
One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.
Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.
Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.
In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:
(…)
Did you catch the line:
Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.
Does that make you feel better about scale issues?
Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.
A serious step up in capabilities.
Posted in Cloudera, Hadoop, Lucene, Solr | No Comments »
Saturday, May 25th, 2013
Apache Pig Editor in Hue 2.3
From the post:
In the previous installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how to analyze data with Hue using Apache Hive via Hue’s Beeswax and Catalog applications. In this installment, we’ll focus on using the new editor for Apache Pig in Hue 2.3.
Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:
- UDFs and parameters (with default value) support
- Autocompletion of Pig keywords, aliases, and HDFS paths
- Syntax highlighting
- One-click script submission
- Progress, result, and logs display
- Interactive single-page application
Here’s a short video demoing its capabilities and ease of use:
(…)
How are you editing your Pig scripts now?
How are you documenting the semantics of your Pig scripts?
How do you search across your Pig scripts?
Posted in Cloudera, Hadoop, Hue, Pig | No Comments »
Thursday, May 16th, 2013
How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.
From the post:
Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.
This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)
A post to ease your way towards contributing to the Hadoop project!
Or if you simply want to know the code you are running cold.
Or something in between!
Posted in Cloudera, Hadoop, MapReduce | No Comments »
Tuesday, May 7th, 2013
Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.
From the post:
At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.
So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.
That’s why we’re really excited to announce the Cloudera Development Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.
The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.
What’s In There Today
Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.
Here’s to hoping that vendor support as shown for Hadoop, Lucene/Solr, R, (who am I missing?), continues and spreads to other areas of software development.
Posted in Cloudera, Hadoop, MapReduce | No Comments »
Wednesday, May 1st, 2013
Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution
From the post:
Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.
The bigger data pools get, the more opportunity there is for semantic confusion.
Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.
Posted in Cloudera, Hadoop, Impala | No Comments »
Friday, March 22nd, 2013
Cloudera ML: New Open Source Libraries and Tools for Data Scientists by Josh Wills.
From the post:
Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.
…[details about clustering omitted]
If you were paying at least somewhat close attention, you may have noticed that the algorithms I’m describing above are essentially clever sampling techniques. With all of the hype surrounding big data, sampling has gotten a bit of a bad rap, which is unfortunate, since most of the work of a data scientist involves finding just the right way to turn a large data set into a small one. Of course, it usually takes a few hundred tries to find that right way, and Hadoop is a powerful tool for exploring the space of possible features and how they should be weighted in order to achieve our objectives.
Wherever possible, we want to minimize the amount of parameter tuning required for any model we create. At the very least, we should try to provide feedback on the quality of the model that is created by different parameter settings. For k-means, we want to help data scientists choose a good value of K, the number of clusters to create. In Cloudera ML, we integrate the process of selecting a value of K into the data sampling and cluster fitting process by allowing data scientists to evaluate multiple values of K during a single run of the tool and reporting statistics about the stability of the clusters, such as the prediction strength.
Finally, we want to investigate the anomalous events in our clustering: those points that don’t fit well into any of the larger clusters. Cloudera ML includes a tool for using the clusters that were identified by the scalable k-means algorithm to compute an assignment of every point in our large data set to a particular cluster center, including the distance from that point to its assigned center. This information is created via a MapReduce job that outputs a CSV file that can be analyzed interactively using Cloudera Impala or your preferred analytical application for processing data stored in Hadoop.
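The assignment step described above can be sketched in a few lines. This is an illustrative, in-memory Python sketch, not Cloudera ML's code (which is Java running as a MapReduce job): the function names, the CSV column names, the sample points, and the outlier threshold are all assumptions made for the example.

```python
import csv
import io
import math

def assign_to_centers(points, centers):
    """Return (point, index of nearest center, distance to it) for every point."""
    rows = []
    for point in points:
        distances = [math.dist(point, center) for center in centers]
        best = min(range(len(centers)), key=distances.__getitem__)
        rows.append((point, best, distances[best]))
    return rows

def to_csv(rows):
    """Render the assignments as CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["point", "cluster_id", "distance"])
    for point, cluster_id, distance in rows:
        coords = " ".join(str(x) for x in point)
        writer.writerow([coords, cluster_id, f"{distance:.3f}"])
    return buf.getvalue()

# Made-up data: two cluster centers and three points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.0), (9.0, 10.0), (3.0, 4.0)]

rows = assign_to_centers(points, centers)

# Points far from their assigned center are candidates for a closer look.
# The threshold of 3.0 is arbitrary and only for illustration.
outliers = [row for row in rows if row[2] > 3.0]
```

Sorting or filtering on the distance column is one simple way to surface the points that fit their cluster poorly, which is the kind of question the CSV output is meant to support.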
Cloudera ML is under active development, and we are planning to add support for pivot tables, Hive integration via HCatalog, and tools for building ensemble classifiers over the next few weeks. We’re eager to get feedback on bug fixes and things that you would like to see in the tool, either by opening an issue or a pull request on our GitHub repository. We’re also having a conversation about training a new generation of data scientists next Tuesday, March 26th, at 2pm ET/11am PT, and I hope that you will be able to join us.
Another great project by Cloudera!
Posted in Cloudera, Clustering, Machine Learning | No Comments »
Thursday, March 21st, 2013
Training a New Generation of Data Scientists by Ryan Goldman.
From the post:
Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.
Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.
This could be fun!
And if nothing else, it will give you the tools to distinguish legitimate training, like Cloudera’s, from the “How to make $millions in real estate” sort of training, offered by the guy who makes his money selling lectures and books.
As “hot” as data science is, you don’t have to look far to find that sort of training.
Posted in CS Lectures, Cloudera, Data Science | No Comments »
Tuesday, March 12th, 2013
How-to: Use the Apache HBase REST Interface, Part 1 by Jesse Anderson.
From the post:
There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.
There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.
This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.
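To make "HTTP verbs perform an action" concrete, here is a minimal Python sketch that only builds requests against a table's schema resource. It is not taken from Jesse's samples. The host, port, and table name are hypothetical, and the `/<table>/schema` path reflects my understanding of the HBase REST server, so check it against the documentation for your version.

```python
import urllib.request

# Hypothetical location of an HBase REST server; adjust for your cluster.
BASE_URL = "http://localhost:8080"

def schema_request(table, method="GET"):
    """Build (but do not send) a request for a table's schema resource.

    The HTTP verb selects the action: GET describes the table,
    DELETE drops it.
    """
    return urllib.request.Request(
        url=f"{BASE_URL}/{table}/schema",
        method=method,
        headers={"Accept": "application/json"},
    )

# "messagestable" is a made-up table name for illustration.
describe = schema_request("messagestable")
drop = schema_request("messagestable", method="DELETE")
```

Passing either object to `urllib.request.urlopen()` would actually send it, which requires a running REST server. Keeping construction separate from sending makes it easy to inspect the verb and URL first.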
Post also has a reminder about HBaseCon 2013 (June 13, San Francisco).
Posted in Cloudera, HBase | No Comments »
Wednesday, January 9th, 2013
Cloudera Impala: A Modern SQL Engine for Hadoop
From the post:
Join us for this technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, will begin with an overview of Impala from the user’s perspective, followed by an overview of Impala’s architecture and implementation, and will conclude with a comparison of Impala with Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.
Looking forward to the comparison part. Picking the right tool for a job is an important first step.
Posted in Cloudera, Hadoop, Impala | No Comments »
Friday, December 14th, 2012
How-To: Run a MapReduce Job in CDH4 by Sandy Ryza.
From the post:
This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.
We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.
The Use Case
There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.
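The three steps (invert, group, de-duplicate) can be tried out without a cluster. The following is a plain-Python sketch of the logic, not the tutorial's Java code. The log format (one whitespace-separated "attacker victim" pair per line) and the pirate names are assumptions made for the example.

```python
from collections import defaultdict

def map_punch(line):
    """Map step: invert one 'attacker victim' line into (victim, attacker)."""
    attacker, victim = line.split()
    return victim, attacker

def reduce_attackers(victim, attackers):
    """Reduce step: drop duplicate attackers for one victim."""
    return victim, sorted(set(attackers))

def run_job(lines):
    """Stand-in for the shuffle: group mapped pairs by victim, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        victim, attacker = map_punch(line)
        groups[victim].append(attacker)
    return dict(reduce_attackers(v, a) for v, a in groups.items())

# Made-up punch log: attacker first, victim second.
log = [
    "blackbeard redbeard",
    "anne blackbeard",
    "blackbeard redbeard",
    "redbeard anne",
    "anne redbeard",
]

result = run_job(log)
```

In a real job the grouping is done by Hadoop's shuffle between the map and reduce phases. Here a dictionary plays that role.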
Cool!
Imagine using the same technique while you watch the evening news!
On second thought, that would take too much data entry and be depressing.
Stick to the pirates!
Posted in Cloudera, Hadoop, MapReduce | No Comments »
Tuesday, December 11th, 2012
Solving real world analytics problems with Apache Hadoop
Thursday December 13, 2012 at 8:30 a.m. PST/11:30 a.m. EST
From the registration page:
Agenda:
- Defining big data
- What are the most critical components of a big data solution?
- The business and technical challenges of delivering a solution
- How Cloudera accelerates big data value?
- Why partner with HP?
- The HP AppSystem powered by Cloudera
Doesn’t look heavy on the technical side but on the other hand, attending means you will be entered in a raffle for an HP Mini Notebook.
Posted in Cloudera, Hadoop | No Comments »
Wednesday, December 5th, 2012
Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2 Now Available by Vinithra Varadharajan.
If you are keeping your Hadoop ecosystem skills up to date, drop by Cloudera for the latest Impala beta and a new release of Cloudera Manager.
Vinithra reports that new releases of Impala are going to drop every two to four weeks.
You can either wait for the final release of Impala or read along and contribute to the final product with your testing and comments.
Posted in Cloudera, Hadoop, Impala, MapReduce | No Comments »
Tuesday, December 4th, 2012
New to Hadoop
Cloudera has organized a seven step program for learning Hadoop!
- Read Up on Background
- Install Locally, Install a VM, or Spin Up on Cloud
- Explore Tutorials
- Get Trained Up
- Read Books
- Contribute!
- Participate!
It doesn’t list every possible resource but all the ones listed are high quality.
Following this program will build a solid basis for exploring the Hadoop ecosystem on your own.
Posted in Cloudera, Hadoop, MapReduce | No Comments »
Monday, December 3rd, 2012
Cloudera – Videos from Strata + Hadoop World 2012
The link is to the main resources page, where you can find many other videos and other materials.
If you want Strata + Hadoop World 2012 videos specifically, search on Hadoop World 2012.
As of today, that pulls up 41 entries. Should be enough to keep you occupied for a day or so.
Posted in Cloudera, Hadoop, MapReduce | No Comments »
Tuesday, November 27th, 2012
SINAInnovation: Innovation and Data by Jeffrey Hammerbacher.
From the description:
Cloudera Co-founder Jeff Hammerbacher speaks about data and innovation in the biology and medicine fields.
Interesting presentation, particularly on creating structures for innovation.
One of his insights I would summarize as “break early, rebuild fast.” His term for it was “lower batch size.” Try new ideas and when they fail, try a new one.
I do wonder about his goal to: “Lower the cost of data storage and processing to zero.”
It may get to be “too cheap to meter” but that isn’t the same thing as being zero. Somewhere in the infrastructure, someone is paying bills for storage and processing.
I mention that because some political parties think that infrastructure can exist without ongoing maintenance and care.
Failing infrastructures don’t lead to innovation.
SINAInnovation description:
SINAInnovations was a three-day conference at The Mount Sinai Medical Center that examined all aspects of innovation and therapeutic discovery within academic medical centers, from how it can be taught and fostered within academia, to how it can accelerate drug discovery and the commercialization of emerging biotechnologies.
Posted in Bioinformatics, Cloudera, Data | No Comments »
Monday, November 19th, 2012
The “Ask Bigger Questions” Contest! by Ryan Goldman. (Deadline, Feb. 1 2013)
From the post:
Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.
Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.
Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?
The most compelling stories chosen from all entrants will receive prizes like Amazon gift cards, discounted Cloudera University training, autographed copies of Hadoop books from O’Reilly Media, and Cloudera swag. We may even turn your story into a case study!
Sign up to participate here. Submissions must be received by Friday, Feb. 1, 2013 to qualify for a prize.
A good marketing technique that might bear imitation.
Don’t have to seek out success stories. Incentives for people to bring them to you.
You get good marketing material that is likely to resonate with other users.
Something to think about.
Posted in Cloudera, Contest, Hadoop | No Comments »
Monday, November 19th, 2012
BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.
From the post:
Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.
“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)
Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.
A sponsored piece by Cloudera, but it walks you through using Impala with the FDA data on adverse drug reactions.
Demonstrates getting started with Impala isn’t hard. Which is true.
What’s lacking is a measure of the difficulty of good results.
Any old result, good or bad, probably isn’t of interest to most users.
Posted in Bioinformatics, Cloudera, Hadoop, Impala | No Comments »
Thursday, November 15th, 2012
Cloudera Glossary
A goodly collection of terms used with Cloudera (Hadoop and related) technology.
I have a weakness for dictionaries, lexicons, grammars and the like so your mileage may vary.
Includes links to additional resources.
Posted in Cloudera, Hadoop | No Comments »
Wednesday, November 14th, 2012
Cloudera Impala – Fast, Interactive Queries with Hadoop by Istvan Szegedi.
From the post:
As discussed in the previous post about Twitter’s Storm, Hadoop is a batch oriented solution that has a lack of support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in Big Data/Hadoop domain has just recently launched Cloudera Impala that addresses this gap.
As Cloudera Engineering team described in their blog, their work was inspired by Google Dremel paper which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for wide variety of SELECT statements with WHERE, GROUP BY, HAVING clauses, with ORDER BY – though currently LIMIT is mandatory with ORDER BY -, joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS but it does not use mapreduce, instead it is based on its own distributed query engine.
The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE), all the table creation/modification/deletion functions have to be executed via Hive and then refreshed in Impala shell.
Cloudera Impala is open-source under Apache Licence, the code can be retrieved from Github. Its components are written in C++, Java and Python.
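For a feel of the query shape described above (WHERE, GROUP BY, HAVING, ORDER BY with a LIMIT), here is a small Python sketch. It uses the standard library's `sqlite3` as a stand-in engine, not Impala, so it only shows the general SQL form. The table, columns, and data are made up, and the `CREATE TABLE` runs in SQLite only to get some rows to query.

```python
import sqlite3

# In-memory SQLite database standing in for a real query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (site TEXT, country TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("a.example", "US", 120),
        ("a.example", "DE", 30),
        ("b.example", "US", 45),
        ("b.example", "FR", 10),
        ("c.example", "US", 5),
    ],
)

# ORDER BY is paired with LIMIT, as the excerpt says the beta required.
query = """
    SELECT site, SUM(hits) AS total_hits, COUNT(*) AS n_rows
    FROM page_views
    WHERE hits BETWEEN 10 AND 1000
    GROUP BY site
    HAVING SUM(hits) > 50
    ORDER BY total_hits DESC
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
```

Whether a given statement runs unchanged on a particular Impala release is something to check against that release's documentation.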
Will get you off to a good start with Impala.
Posted in Cloudera, Hadoop, Impala | No Comments »
Wednesday, November 7th, 2012
Impala: Real-time Queries in Hadoop
From the description:
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
You can also grab the slides here.
Almost fifty-nine minutes.
Speedup over Hive on MapReduce reported to be 4-30X.
I’m not sure about #2 above but then I lack a skull-jack. Maybe this next Christmas.
Posted in Cloudera, Hadoop, Impala | No Comments »
Wednesday, October 31st, 2012
One To Watch: Apache Crunch by Chris Mayer.
From the post:
Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.
Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).
That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.
A project to definitely keep in sight.
Posted in Apache Crunch, Cloudera, Hadoop, MapReduce | No Comments »
Thursday, October 25th, 2012
Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.
From the post:
In traditional circles, Hadoop is viewed as a bright but unruly problem child.
Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.
Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.
Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.
“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.
Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks to SQL and works with pre-existing drivers.
Legacy data is a well known concept.
Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?
Or at least not replaced quickly?
The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?
Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.
Some strategies for a semantic “mosh pit:”
- Prohibit it (we know the success rate on that option)
- Ignore it (costly but more “successful” than #1)
- Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
- Sample it (but what are you missing?)
- Map it (being mindful of cost/benefit)
Which one are you going to choose?
Posted in Cloudera, Hadoop, Impala | No Comments »
Saturday, October 20th, 2012
Sneak Peek into Skybox Imaging’s Cloudera-powered Satellite System by Justin Kestelyn (@kestelyn)
This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.
One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.
Skybox is developing a low cost imaging satellite system and web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost nature of the satellite opens the possibility of deploying tens of satellites which, when integrated together, have the potential to image any spot on Earth within an hour.
Skybox satellites are designed to capture light in the harsh environment of outer space. Each satellite captures multiple images of a given spot on Earth. Once the images are transferred from the satellite to the ground, the data needs to be processed and combined to form a single image, similar to those seen within online mapping portals.
With any sensor network, capturing raw data is only the beginning of the story. We at Skybox are building a system to ingest and process the raw data, allowing data scientists and end users to ask arbitrary questions of the data, then publish the answers in an accessible way and at a scale that grows with the number of satellites in orbit. We selected Cloudera to support this deployment.
Now is the time to start planning topic map based products that can incorporate this type of data.
There are lots of folks who are “curious” about what is happening next door, in the next block, a few “klicks” away, across the border, etc.
Not all of them have the funds for private “keyhole” satellites and vacuum data feeds. But they may have money to pay you for efficient and effective collation of intelligence data.
Topic maps empowering “Intelligence as a Service (InaaS)”?
Posted in BigData, Cloudera, Geographic Data, Geography, Intelligence | No Comments »
Friday, October 19th, 2012
What’s New in CDH4.1 Pig by Cheolsoo Park.
From the post:
Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.
Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.
Cheolsoo covers these new features:
- Boolean Data Type
- Nested FOREACH and CROSS
- Ruby UDFs
- LIMIT / SPLIT by Expression
- Default SPLIT Destination
- Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
- AvroStorage Improvements
Enjoy!
Posted in Cloudera, Hadoop, Pig | No Comments »
Thursday, October 18th, 2012
Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.
From the post:
Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.
The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.
Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”
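To make the idea of scripting against Cloudera Manager's REST API a little more concrete, here is a minimal Python sketch. It is not Axemblr's Java client. The port (7180), the `/api/v1/clusters` path, HTTP basic authentication, and the `{"items": [...]}` reply shape are assumptions to check against the API documentation for your release, and the host name and credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_clusters_request(host, user, password, port=7180, api_version="v1"):
    """Build (but do not send) a GET request for the list of clusters."""
    # Assumed endpoint layout: http://<host>:<port>/api/<version>/clusters
    url = f"http://{host}:{port}/api/{api_version}/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


def parse_cluster_names(body):
    """Extract cluster names from a JSON reply (assumed shape: {"items": [{"name": ...}]})."""
    payload = json.loads(body)
    return [item["name"] for item in payload.get("items", [])]


def list_cluster_names(host, user, password):
    """Send the request to a live Cloudera Manager server and return the cluster names."""
    request = build_clusters_request(host, user, password)
    with urllib.request.urlopen(request) as response:
        return parse_cluster_names(response.read())
```

Building the request and parsing the reply are kept apart from the network call so that each piece can be checked without a running server.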
Another goodie to ease your way to Hadoop deployment on your favorite cloud.
Do you remember the lights at radio stations that would show “On Air”?
I need an “On Cloud” that lights up. More realistic than the data appliance.
Posted in Cloud Computing, Cloudera, Hadoop | No Comments »
Wednesday, October 10th, 2012
What is Hadoop Metrics2? by Ahmed Radwan.
I’ve been wondering about that. How about you?
From the post:
Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.
This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.
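As a rough illustration of what metrics filtering means, here is a small Python sketch that keeps or drops metrics by glob pattern. It is not Hadoop's implementation: the metric names are hypothetical, and the precedence rule (an exclude match always wins, and a non-empty include list acts as a whitelist) is a simplifying assumption rather than the rule Metrics2 applies.

```python
from fnmatch import fnmatchcase


def accepts(name, include=(), exclude=()):
    """Return True if a metric name passes the include/exclude glob patterns."""
    # Simplifying assumption: any exclude match rejects the metric outright.
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    # A non-empty include list acts as a whitelist.
    if include:
        return any(fnmatchcase(name, pattern) for pattern in include)
    return True


def filter_metrics(metrics, include=(), exclude=()):
    """Return only the metrics whose names pass the filter."""
    return {
        name: value
        for name, value in metrics.items()
        if accepts(name, include, exclude)
    }


# Hypothetical data node metrics, named after the examples in the excerpt.
sample = {"blocks_replicated": 12, "read_requests": 340, "heartbeats": 7}
replication_only = filter_metrics(sample, include=("blocks_*",))
```

In a real deployment the patterns would live in the metrics configuration rather than in code; the sketch only shows the accept or reject decision.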
However cool the software, you can’t ever really get away from managing it.
And it isn’t a bad skill to have. Read on!
Posted in Cloudera, Hadoop, Systems Administration | No Comments »
Wednesday, October 3rd, 2012
CDH4.1 Now Released! by Charles Zedlewski.
From the post:
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
- Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
- Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
- Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
- Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
- FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
- Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
- Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.
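For readers who have not met the term, the sessionization mentioned in the DataFu item above can be sketched in a few lines of Python. This is not DataFu's UDF: the function name, the `(user, timestamp)` input shape, the session id format, and the 30-minute timeout are all assumptions made for illustration.

```python
from collections import defaultdict


def sessionize(events, timeout_seconds=1800):
    """Assign a session id to each (user, timestamp_in_seconds) event.

    A new session starts whenever the gap between two consecutive events
    from the same user exceeds timeout_seconds.
    Returns a list of (user, timestamp, session_id), ordered by user and time.
    """
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    result = []
    for user in sorted(by_user):
        session_number = 0
        previous = None
        for timestamp in sorted(by_user[user]):
            if previous is not None and timestamp - previous > timeout_seconds:
                session_number += 1
            result.append((user, timestamp, f"{user}-{session_number}"))
            previous = timestamp
    return result


# Hypothetical click events: ann's third click arrives after a long pause.
clicks = [("ann", 0), ("bob", 100), ("ann", 600), ("ann", 4000)]
sessions = sessionize(clicks)
```

Here ann's first two clicks share a session and the third opens a new one, because the gap of 3400 seconds exceeds the 1800-second timeout.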
It’s releases like this that make me wish I spent more time writing documentation for software, if only to try out all the cool features with no real goal other than trying them out.
Enjoy!
Posted in Cloudera, Flume, HBase, HDFS, Hadoop, Hive, Pig | No Comments »
Monday, September 24th, 2012
Schedule This! Strata + Hadoop World Speakers from Cloudera by Justin Kestelyn.
Oct. 23-25, 2012, New York City
From the post:
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
Just in case the Clouderans aren’t enough incentive to attend (they should be), consider the full schedule for the conference.
Posted in Cloudera, Conferences, Hadoop | No Comments »