Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 21, 2013

Accumulo Comes to CDH

Filed under: Accumulo,Cloudera,Hadoop,NSA — Patrick Durusau @ 7:11 pm

Accumulo Comes to CDH by Sean Busbey, Bill Havanki, and Mike Drob.

From the post:

Cloudera is pleased to announce the immediate availability of its first release of Accumulo packaged to run under CDH, our open source distribution of Apache Hadoop and related projects and the foundational infrastructure for Enterprise Data Hubs.

Accumulo is an open source project that provides the ability to store data in massive tables (billions of rows, millions of columns) for fast, random access. Accumulo was created and contributed to the Apache Software Foundation by the National Security Agency (NSA), and it has quickly gained adoption as a Hadoop-based key/value store for applications that require access to sensitive data sets. Cloudera provides enterprise support with the RTD Accumulo add-on subscription for Cloudera Enterprise.

This release provides Accumulo 1.4.3 tested for use under CDH 4.3.0. The release includes a significant number of backports and fixes to allow use with CDH 4’s highly available, production-ready packaging of HDFS. As a part of our commitment to the open source community, these changes have been submitted back upstream.
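
For readers who have not used Accumulo, cell-level visibility labels are the feature that sets it apart from an ordinary key/value store. Below is a minimal sketch of a write against the 1.4-era Java client; the instance name, ZooKeeper quorum, credentials, table, and visibility expression are all placeholders, and the table is assumed to already exist.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;
    import org.apache.hadoop.io.Text;

    public class AccumuloCellWrite {
        public static void main(String[] args) throws Exception {
            // Placeholder instance name and ZooKeeper quorum.
            ZooKeeperInstance inst = new ZooKeeperInstance("cdh-accumulo", "zk1:2181");
            Connector conn = inst.getConnector("user", "secret".getBytes());

            // Accumulo 1.4-era writer: table, max memory, max latency (ms), threads.
            BatchWriter writer = conn.createBatchWriter("records", 1000000L, 60000L, 2);

            Mutation m = new Mutation(new Text("row-0001"));
            m.put(new Text("meta"), new Text("owner"),
                  new ColumnVisibility("analyst&audit"),   // cell-level label
                  new Value("alice".getBytes()));
            writer.addMutation(m);
            writer.close();
        }
    }

Only a scan whose authorizations satisfy that label will return the cell.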

At least with Accumulo, you know you are getting NSA vetted software.

Can’t say the same thing for RSA software.

Enterprise customers need to demand open source software that reserves commercial distribution rights to its source.

For self-preservation if no other reason.

CDK Becomes “Kite SDK”

Filed under: Cloudera,Hadoop — Patrick Durusau @ 1:44 pm

Cloudera Development Kit is Now “Kite SDK” by Ryan Blue.

From the post:

CDK has a new moniker, but the goals remain the same.

We are pleased to announce a new name for the Cloudera Development Kit (CDK): Kite. We’ve just released Kite version 0.10.0, which is purely a rename of CDK 0.9.0.

The new repository and documentation are here:

Why the rename?

The original goal of CDK was to increase accessibility to the Apache Hadoop platform by developers. That goal isn’t Cloudera-specific, and we want the name to more forcefully reflect the open, community-driven character of the project.

Will this change break anything?

The rename mainly affects dependencies and package names. Once imports and dependencies are updated, almost everything should work the same. However, there are a couple of configuration changes to make for anyone using Apache Flume or Morphlines. The changes are detailed on our migration page.

The continuation of Kite SDK version 0.10.0 alongside the Cloudera Development Kit 0.9.0 should make some aspects of the name transition easier.

However, when you search for CDK 0.9.0, are you going to get “hits” for the Kite SDK 0.10.0? Such as blog posts, tutorials, code, etc.

I suspect not. The reverse won’t work either.

So we have relevant material that is indexed under two different names, names a user will have to remember in order to get all the relevant results.

Defining a synonym table works for cases like this but does have one shortcoming.

Will the synonym table make sense to us in ten (10) years? Or in twenty (20) years?

There is no guarantee that even a synonym mapping based on disclosed properties will remain intelligible for X number of years.

But if long term data access is mission critical, something more than blind synonym mappings needs to be done.
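
To make the point concrete, here is a toy sketch (my own, not from any product) contrasting a blind synonym table with a mapping that carries its disclosed properties as data, so a future reader has something to judge the equivalence by.

    import java.util.HashMap;
    import java.util.Map;

    public class SynonymMapping {
        // A blind synonym table: it asserts the names are equivalent,
        // but records nothing about why, or for how long to trust it.
        static final Map<String, String> BLIND = new HashMap<String, String>();
        static {
            BLIND.put("CDK 0.9.0", "Kite SDK 0.10.0");
        }

        // A mapping that discloses the properties behind the equivalence.
        static class Mapping {
            final String from, to, basis, source, date;
            Mapping(String from, String to, String basis, String source, String date) {
                this.from = from; this.to = to;
                this.basis = basis; this.source = source; this.date = date;
            }
        }

        static final Mapping CDK_TO_KITE = new Mapping(
            "CDK 0.9.0", "Kite SDK 0.10.0",
            "Kite 0.10.0 is purely a rename of CDK 0.9.0; same code, new name",
            "Cloudera blog post announcing the rename",
            "2013-12-21");
    }

Whether a structure like Mapping stays intelligible for decades is exactly the open question, but at least it gives a future reader more than a bare pair of names.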

December 20, 2013

3rd Annual Federal Big Data Apache Hadoop Forum

Filed under: BigData,Cloudera,Conferences,Hadoop — Patrick Durusau @ 6:59 pm

3rd Annual Federal Big Data Apache Hadoop Forum

From the webpage:

Registration is now open for the third annual Federal Big Data Apache Hadoop Forum! Join us on Thurs., Feb. 6, as leaders from government and industry convene to share Big Data best practices. This is a must attend event for any organization or agency looking to be information-driven and give access to more data to more resources and applications. During this informative event you will learn:

  • Key trends in government today and the role Big Data plays in driving transformation;
  • How leading agencies are putting data to good use to uncover new insight, streamline costs, and manage threats;
  • The role of an Enterprise Data Hub, and how it is a game changing data management platform central to any Big Data strategy today.

Get the most from all your data assets, analytics, and teams to enable your mission, efficiently and on budget. Register today and discover how Cloudera and an Enterprise Data Hub can empower you and your teams to do more with Big Data.

A Cloudera fest but I don’t think they will be searching people for business cards at the door. 😉

An opportunity for you to meet and greet, make contacts, etc.

I first saw this in a tweet by Bob Gourley.

November 26, 2013

CDH 4.5, Manager 4.8, Impala 1.2.1, Search 1.1

Filed under: Cloudera,Hadoop,Impala,MapReduce — Patrick Durusau @ 3:13 pm

Announcing: CDH 4.5, Cloudera Manager 4.8, Cloudera Impala 1.2.1, and Cloudera Search 1.1

Before your nieces and nephews (at least in the U.S.) start chewing up your bandwidth over the Thanksgiving Holidays, you may want to grab the most recent releases from Cloudera.

If you are traveling, it will give you something to do during airport delays. 😉

November 25, 2013

Integrating R with Cloudera Impala…

Filed under: Cloudera,Impala,R — Patrick Durusau @ 8:00 pm

Integrating R with Cloudera Impala for Real-Time Queries on Hadoop by Istvan Szegedi.

From the post:

Cloudera Impala supports low-latency, interactive queries on Hadoop data sets either stored in Hadoop Distributed File System (HDFS) or HBase, the distributed NoSQL database for Hadoop. Impala’s notion is to use Hadoop as a storage engine but move away from MapReduce algorithms. Instead, Impala uses distributed queries, a concept inherited from massively parallel processing databases. As a result, Impala supports a SQL-like query language (in the same way as Apache Hive), but can execute the queries 10-100 times faster than Hive, which converts them into MapReduce. You can find more details on Impala in one of the previous posts.

R is one of the most popular open source statistical computing and graphics packages. It can work with various data sources, from comma-separated files to web content referenced by URLs, relational databases, NoSQL stores (e.g., MongoDB or Cassandra), and Hadoop.

Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. The solution will provide fast, interactive queries running on top of Hadoop data sets, and the data can then be further processed or visualized within R.
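
R is not the only client that benefits from Impala speaking a standard connector protocol. For JVM users, roughly the same pattern works over JDBC; Impala exposes a HiveServer2-compatible port (21050 by default), so the Hive JDBC driver can be pointed at it. A sketch, assuming an unsecured cluster; the host, port, and table being queried are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver against Impala's HiveServer2-compatible port.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:hive2://impala-host:21050/;auth=noSasl");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT year, COUNT(*) FROM flights GROUP BY year");
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }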

Have you noticed that newer technologies (Hadoop) are becoming accessible to more traditional tools (R)?

Which will move traditional tool users towards newer technologies.

The combining of the new with the traditional has a distinct odor.

I think it is called success. 😉

November 20, 2013

Learning MapReduce:…[Of Ethics and Self-Interest]

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:57 pm

Learning MapReduce: Everywhere and For Everyone

From the post:

Tom White, author of Hadoop: The Definitive Guide, recently celebrated his five-year anniversary at Cloudera with a blog post reflecting on the early days of Big Data and what has changed and remained since 2008. Having just seen Tom in New York at the biggest and best Hadoop World to date, I’m struck by the poignancy of his earliest memories. Even then, Cloudera’s projects were focused on broadening adoption and building the community by writing effective training material, integrating with other systems, and building on the core open source. The founding team had a vision to make Apache Hadoop the focal point of an accessible, powerful, enterprise-ready Big Data platform.

Today, Cloudera is working harder than ever to help companies deploy Hadoop as part of an Enterprise Data Hub. We’re just as committed to a healthy and vibrant open-source community, have a lively partner ecosystem of over 700 partners, and have contributed innovations that make data access and analysis faster, more secure, more relevant, and, ultimately, more profitable.

However, with all these successes in driving Hadoop towards the mainstream and providing a new and dynamic data engine, the fact remains that broadening adoption at the end-user level remains job one. Even as Cloudera unifies the Big Data stack, the availability of talent to drive operations and derive full value from massive data falls well short of the enormous demand. As more companies across industries adopt Hadoop and build out their Big Data strategies focused on the Enterprise Data Hub, Cloudera has expanded its commitment to educating technologists of all backgrounds on Hadoop, its applications, and its systems.

A Partnership to Cultivate Hadoop Talent

We at Cloudera University are proud to announce a new partnership with Udacity, a leader in open, online professional education. We believe in Udacity’s vision to democratize professional development by making technical training affordable and accessible to everyone, and this model will enable us to reach aspiring Big Data practitioners around the world who want to expand their skills into Hadoop.

Our first Udacity course, Introduction to Hadoop and MapReduce, guides learners from an understanding of Big Data to the basics of Hadoop, all the way through writing your first MapReduce program. We partnered directly with Udacity’s development team to build the most engaging online Hadoop course available, including demonstrative instruction, interactive quizzes, an interview with Hadoop co-founder Doug Cutting, and a hands-on project using live data. Most importantly, the lessons are self-paced, open, and based on Cloudera’s insights into industry best practices and professional requirements.

Cloudera, and to be fair, others, have adopted a strategy of self-interest that is also ethical.

They are literally giving away the knowledge and training to use a free product. Think of it as a rising tide that floats all boats higher.

The more popular and widely used Hadoop/MapReduce become, the greater the demand for professional training and services from Cloudera (and others).

You may experiment or even run a local cluster, but if you are a Hadoop newbie, who are you going to call when it is a mission-critical application? (Hopefully professionals but there’s no guarantee on that.)

You don’t have to build silos or closed communities to be economically viable.

Delivering professional services for a popular technology seems to do the trick.

November 14, 2013

Cloudera + Udacity = Hadoop MOOC!

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 1:54 pm

Cloudera and Udacity partner to offer Data Science training courses by Lauren Hockenson.

From the post:

After launching the Open Education Alliance with some of the biggest tech companies in Silicon Valley, Udacity has forged a partnership with Cloudera to bring comprehensive Data Science curriculum to a massively open online course (MOOC) format in a program called Cloudera University — allowing anyone to learn the intricacies of Hadoop and other Data Science methods.

“Recognizing the growing demand for skilled data professionals, more students are seeking instruction in Hadoop and data science in order to prepare themselves to take advantage of the rapidly expanding data economy,” said Sebastian Thrun, founder of Udacity, in a press release. “As the leader in Hadoop solutions, training, and services, Cloudera’s insights and technical guidance are in high demand, so we are pleased to be leveraging that experience and expertise as their partner in online open courseware.”

The first offering to come via Cloudera University will be “Introduction to Hadoop and MapReduce,” a three-lesson course that serves as a precursor to the program’s larger, track-based training already in place. While Cloudera already offers many of these courses in Data Science, as well as intensive certificate training programs, in an in-person setting, it seems that the partnership with Udacity will translate the curriculum that Cloudera has developed into a more palatable format for online learning.

Looking forward to Cloudera University reflecting all of the Hadoop ecosystem.

In the meantime, there are a number of online training resources already available at Cloudera.

November 12, 2013

Oryx [Alphaware]

Filed under: Cloudera,Machine Learning — Patrick Durusau @ 4:29 pm

Oryx [Alphaware] (Cloudera)

From the webpage:

The Oryx open source project provides simple, real-time large-scale machine learning infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering. It can continuously build models from a stream of data at large scale using Apache Hadoop‘s MapReduce. It also serves queries of those models in real-time via an HTTP REST API, and can update models approximately in response to new data. Models are exchanged in PMML format.

It is not a library, visualization tool, exploratory analytics tool, or environment. Oryx represents a unified continuation of the Myrrix and cloudera/ml projects.

Oryx should be considered alpha software; it may have bugs and will change in incompatible ways.
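
The HTTP REST API is what makes Oryx easy to poke at from any language. The sketch below queries a locally running serving layer for recommendations; the /recommend path and the port are my assumptions about the API, so check the project README before relying on them.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class OryxRecommendSketch {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint: serving layer on localhost:8080, user id "123".
            URL url = new URL("http://localhost:8080/recommend/123");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in =
                new BufferedReader(new InputStreamReader(conn.getInputStream()));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);   // recommended item and score
            }
            in.close();
            conn.disconnect();
        }
    }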

I’m sure management has forgotten about that incident where you tanked the production servers. Not to mention those beady-eyed government agents that slowly track you in a car when you grab lunch. 😉

Just teasing. Keep Oryx off the production servers and explore!

Sorry, no advice for the beady-eyed government agents.

November 8, 2013

Sqooping Data with Hue

Filed under: Cloudera,Hadoop,Hue — Patrick Durusau @ 4:47 pm

Sqooping Data with Hue by Abraham Elmahrek.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.

Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with. Also, any client can communicate with the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen instead of its predecessors because of its client-server design.
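
That JSON-REST protocol is worth underlining: Hue is just one possible client. Any HTTP client can talk to the Sqoop 2 server; the sketch below asks it for its version information, assuming the default port (12000) and a /sqoop/version resource, both of which are assumptions to verify against the Sqoop 2 documentation.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SqoopVersionSketch {
        public static void main(String[] args) throws Exception {
            // Assumed defaults: Sqoop 2 server on port 12000, "sqoop" context
            // path, and a "version" resource that returns JSON.
            URL url = new URL("http://sqoop2-host:12000/sqoop/version");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(conn.getInputStream()));
            StringBuilder json = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                json.append(line);
            }
            in.close();
            System.out.println(json);   // build and protocol versions as JSON
        }
    }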

I knew I was missing one or more Hadoop ecosystem components in yesterday’s Hadoop Ecosystem Configuration Woes? post. I left out Hue, and some others as well.

The Hadoop “ecosystem” varies depending on which open source supporter you read. I didn’t take the time to cross-check my list against all the major supporters. Will be correcting that over the weekend.

This will give you something “practical” to do over the weekend. 😉

November 7, 2013

Dealing with Data in the Hadoop Ecosystem…

Filed under: Cloudera,Data,Hadoop — Patrick Durusau @ 1:15 pm

Dealing with Data in the Hadoop Ecosystem – Hadoop, Sqoop, and ZooKeeper by Rachel Roumeliotis.

From the post:

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram […] [Discussed at 0:22]

  • ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
  • Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
  • Sqoop is a bulk data transfer tool [Discussed at 2:47]
  • Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
  • ZooKeeper is not for storage, but coordination, reliability, availability [Discussed at 4:44]

    Conference interview, so not deep, but interesting.

    For example, it was reported that 44% of production errors could be traced to misconfiguration.

    November 5, 2013

    Email Indexing Using Cloudera Search and HBase

    Filed under: Cloudera,HBase,Solr — Patrick Durusau @ 6:38 pm

    Email Indexing Using Cloudera Search and HBase by Jeff Shmain.

    From the post:

    In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)

    Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:

    • Is HBase an optimal storage medium for the given use case?
    • Is the data already ingested into HBase?
    • Is there any access pattern that will require the files to be stored in a format other than HFiles?
    • If HBase is not currently running, will there be enough hardware resources to bring it up?

    There are two ways to configure Cloudera Search to index documents stored in HBase: to alter the configuration files directly and start Lily HBase Indexer manually or as a service, or to configure everything using Cloudera Manager. This post will focus on the latter, because it is by far the easiest way to enable Search on HBase — or any other service on CDH, for that matter.
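
    As a point of reference, here is roughly what “data already ingested into HBase” might look like for email, using the plain HBase client of that era. The table name, column family, and qualifiers are my own placeholders; once rows like these exist, the Lily HBase Indexer described above keeps the Solr index in step with them.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class EmailToHBase {
            public static void main(String[] args) throws Exception {
                // Assumed layout: table "emails", one column family "m",
                // row key = message id. Adjust to your own schema.
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "emails");

                Put put = new Put(Bytes.toBytes("msg-20131105-0001"));
                put.add(Bytes.toBytes("m"), Bytes.toBytes("from"),
                        Bytes.toBytes("alice@example.com"));
                put.add(Bytes.toBytes("m"), Bytes.toBytes("subject"),
                        Bytes.toBytes("Quarterly numbers"));
                put.add(Bytes.toBytes("m"), Bytes.toBytes("body"),
                        Bytes.toBytes("Full message text goes here."));
                table.put(put);
                table.close();
            }
        }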

    This rocks!

    Including the reminder to fit the solution to your requirements, not the other way around.

    The phrase “…near real time…” reminds me that HBase can operate in “…near real time…” but no analyst using HBase can.

    Think about it. A search result comes back, the analyst reads it, perhaps compares it to their memory of other results and/or looks for other results to make the comparison. Then the analyst has to decide what if anything the results mean in a particular context and then communicate those results to others or take action based on those results.

    That doesn’t sound even close to “…near real time…” to me.

    You?

    October 30, 2013

    Use MADlib Pre-built Analytic Functions….

    Filed under: Analytics,Cloudera,Impala,Machine Learning,MADlib — Patrick Durusau @ 6:53 pm

    How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.

    From the post:

    Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

    Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

    As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

    Victor walks through several examples of data analytics but for those of you who want to cut to the chase:

    This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.

    Usual cautions apply: The results of your script or model may or may not have any resemblance to “facts” as experienced by others.

    October 25, 2013

    Collection Aliasing:…

    Filed under: BigData,Cloudera,Lucene,Solr — Patrick Durusau @ 7:29 pm

    Collection Aliasing: Near Real-Time Search for Really Big Data by Mark Miller.

    From the post:

    The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.

    Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.

    We wanted to make sure that one simple scaling strategy that has been commonly used in the past for large amounts of time-series data would be fairly simple to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on — data that is continuously being generated and is easily associated with a current timestamp.

    One of the keys to this strategy is a feature that Cloudera recently contributed to Solr: collection aliasing. The approach involves using collection aliases to juggle collections in a very scalable little “dance.” The architecture has some limitations, but for the right use cases, it’s an extremely scalable option. I also think there are some areas of the dance that we can still add value to, but you can already do quite a bit with the current functionality.
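
    The Collections API calls behind that “dance” are small. Here is a sketch of the time-series pattern from Java: add a collection for the new day, then re-point a read alias at the most recent collections. The node address, collection names, and alias name are placeholders.

        import java.net.HttpURLConnection;
        import java.net.URL;

        public class AliasDanceSketch {
            // Issues a Solr Collections API call and returns the HTTP status.
            static int call(String query) throws Exception {
                URL url = new URL("http://solr-node:8983/solr/admin/collections?" + query);
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                int status = conn.getResponseCode();
                conn.disconnect();
                return status;
            }

            public static void main(String[] args) throws Exception {
                // Add today's collection, then re-point the alias that readers use.
                call("action=CREATE&name=logs_2013_10_25&numShards=2");
                call("action=CREATEALIAS&name=logs_recent"
                     + "&collections=logs_2013_10_24,logs_2013_10_25");
            }
        }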

    A great post if you have really big data. 😉

    Seriously, it is a great post and introduction to collection aliases.

    On the other hand, I do wonder what routine Abbott and Costello would do with the variations on big, bigger, really big, etc., data.

    Suggestions welcome!

    October 1, 2013

    Cloudera now supports Accumulo…

    Filed under: Accumulo,Cloudera,NSA — Patrick Durusau @ 6:43 pm

    Cloudera now supports Accumulo, the NSA’s take on HBase by Derrick Harris.

    From the post:

    Cloudera will be integrating with the Apache Accumulo database and, according to a press release, “devoting significant internal engineering resources to speed Accumulo’s development.” The National Security Agency created Accumulo and built in fine-grained authentication to ensure only authorized individuals could see any given piece of data. Cloudera’s support could be bittersweet for Sqrrl, an Accumulo startup comprised of former NSA engineers and intelligence experts, which should benefit from a bigger ecosystem but whose sales might suffer if Accumulo makes its way into Cloudera’s Hadoop distribution.

    I would think the bittersweet part would be the NSA supporting a design that leaves them with document-level security.

    It’s great that they can control access to how many saucers are stolen from White House dinners every year, but document security, other than at the grossest level, goes wanting.

    Maybe they haven’t heard of SGML or XML?

    If you don’t mind, mention XML in your phone calls every now and again. Maybe if enough people say it, then it will come up on the “big board.”

    September 26, 2013

    Email Indexing Using Cloudera Search [Stepping Beyond “Hello World”]

    Filed under: Cloudera,Email,Indexing — Patrick Durusau @ 6:00 pm

    Email Indexing Using Cloudera Search by Jeff Shmain

    From the post:

    Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?

    Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.

    A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.

    In this post, you’ll learn how to set up Apache Flume for near-real-time indexing and MapReduce for batch indexing of email documents. Note that although this post focuses on email data, there is no reason why the same concepts could not be applied to instant messages, voice transcripts, or any other data (both structured and unstructured).
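
    Whether the documents arrive in batch via MapReduce or in near real time via Flume, what lands in Solr is the same kind of document. Below is a SolrJ-flavored sketch of a single indexed email; the URL, collection, and field names are placeholders, since the morphline configuration in the post decides the real schema.

        import org.apache.solr.client.solrj.impl.HttpSolrServer;
        import org.apache.solr.common.SolrInputDocument;

        public class IndexOneEmail {
            public static void main(String[] args) throws Exception {
                // Solr 4.x-era SolrJ client; placeholder URL and collection.
                HttpSolrServer solr =
                    new HttpSolrServer("http://search-node:8983/solr/email_collection");

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "msg-20130926-0001");
                doc.addField("from", "alice@example.com");
                doc.addField("subject", "Quarterly numbers");
                doc.addField("body", "Full message text goes here.");

                solr.add(doc);
                solr.commit();
                solr.shutdown();
            }
        }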

    If you want a beyond “Hello World” introduction to: Flume, Solr, Cloudera Morphlines, HDFS, Hue’s Search application, and Cloudera Search, this is the post for you.

    With the added advantage that you can apply the basic principles in this post as you expand your knowledge of the Hadoop ecosystem.

    September 5, 2013

    Introducing Cloudera Search

    Filed under: Cloudera,Hadoop,Search Engines — Patrick Durusau @ 6:15 pm

    Introducing Cloudera Search

    Cloudera Search 1.0 has hit the streets!

    Download

    Prior coverage of Cloudera Search: Hadoop for Everyone: Inside Cloudera Search.

    Enjoy!

    August 2, 2013

    How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM

    Filed under: Cloudera,Eclipse,Hadoop,MapReduce — Patrick Durusau @ 6:26 pm

    How-to: Use Eclipse with MapReduce in Cloudera’s QuickStart VM by Jesse Anderson.

    From the post:

    One of the common questions I get from students and developers in my classes relates to IDEs and MapReduce: How do you create a MapReduce project in Eclipse and then debug it?

    To answer that question, I have created a screencast showing you how, using Cloudera’s QuickStart VM:

    The QuickStart VM helps developers get started writing MapReduce code without having to worry about software installs and configuration. Everything is installed and ready to go. You can download the image type that corresponds to your preferred virtualization platform.

    Eclipse is installed on the VM and there is a link on the desktop to start it.
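
    If you want something concrete to set a breakpoint in while following the screencast, a minimal mapper like the one below (standard word-count shape, not taken from the screencast) is enough to exercise the whole edit-run-debug loop in Eclipse.

        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Emits (word, 1) for every token in the input line.
        public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }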

    Nice illustration of walking through the MapReduce process.

    I continue to be impressed by the use of VMs.

    Would be a nice way to distribute topic map tooling.

    July 29, 2013

    New Community Forums for Cloudera Customers and Users

    Filed under: Cloudera,Hadoop,MapReduce,Taxonomy — Patrick Durusau @ 4:34 pm

    New Community Forums for Cloudera Customers and Users by Justin Kestelyn.

    From the post:

    This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera’s Platform for Big Data! (Login using your existing credentials or click the link to register.)

    Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.

    The new service brings key additions to the conversation: functionality, search, structure and scalability. It is now considerably easier to ask questions, find answers (or questions to answer), follow and share threads, and create a visible and sustainable reputation in the community. And for Cloudera customers, there’s a bonus: your questions will be escalated as bonafide support cases under certain circumstances (see below).

    Another way for you to participate in the Hadoop ecosystem!

    BTW, the discussion taxonomy:

    What is the reasoning behind your taxonomy?

    We made a sincere effort to balance the requirements of simplicity and thoroughness. Of course, we’re always open to suggestions for improvements.

    I don’t doubt the sincerity of the taxonomy authors. Not one bit.

    But all taxonomies represent the “intuitive” view of some small group. There is no means to escape the narrow view of all taxonomies.

    What we can do, at least with topic maps, is to allow groups to have their own taxonomies and to view data through those taxonomies.

    Mapping between taxonomies means that addition via any of the taxonomies results in new data appearing as appropriate in other taxonomies.
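
    A toy sketch of that idea (my own illustration, not a product feature): two groups keep their own headings, a mapping connects the headings, and a thread filed under one taxonomy becomes visible under the other without anyone changing how they file.

        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class TaxonomyViews {
            public static void main(String[] args) {
                // Each group's taxonomy: heading -> threads filed under it.
                Map<String, List<String>> groupA = new HashMap<String, List<String>>();
                Map<String, List<String>> groupB = new HashMap<String, List<String>>();

                // Mapping between headings in the two taxonomies.
                Map<String, String> aToB = new HashMap<String, String>();
                aToB.put("Batch Processing", "MapReduce Jobs");

                // Group A files a thread its own way...
                groupA.put("Batch Processing",
                           Arrays.asList("Thread: speculative execution tuning"));

                // ...and the same thread appears under Group B's heading.
                for (Map.Entry<String, List<String>> e : groupA.entrySet()) {
                    String bHeading = aToB.get(e.getKey());
                    if (bHeading != null) {
                        groupB.put(bHeading, e.getValue());
                    }
                }
                System.out.println(groupB); // {MapReduce Jobs=[Thread: speculative execution tuning]}
            }
        }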

    Perhaps it was necessary to champion one taxonomy when information systems were fixed, printed representations of data and access systems.

    But the need for a single taxonomy, if it ever existed, does not exist now. We are free to have any number of taxonomies for any data set, visible or invisible to other users/taxonomies.

    More than thirty (30) years after the invention of the personal computer, we are still laboring under the traditions of printed information systems.

    Isn’t it time to move on?

    July 24, 2013

    …Sentry: Fine-Grained Authorization for Impala and Apache Hive

    Filed under: BigData,Cloudera,Cybersecurity,Hive,Impala,Security — Patrick Durusau @ 2:19 pm

    …Sentry: Fine-Grained Authorization for Impala and Apache Hive

    From the post:

    Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today unveiled the next step in the evolution of enterprise-class big data security, introducing Sentry: a new Apache licensed open source project that delivers the industry’s first fine-grained authorization framework for Hadoop. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise datasets. This level of granular control, available for the first time in Hadoop, is imperative to meet enterprise Role Based Access Control (RBAC) requirements of highly regulated industries, like healthcare, financial services and government. Sentry alleviates the security concerns that have prevented some organizations from opening Hadoop data systems to a more diverse set of users, extending the power of Hadoop and making it suitable for new industries, organizations and enterprise use cases. Concurrently, the company confirmed it plans to submit the Sentry security module to the Apache Incubator at the Apache Software Foundation later this year.

    Welcome news but I could not bring myself to include all the noise words in the press release title. 😉

    For technical details, see: http://cloudera.com/content/cloudera/en/Campaign/introducing-sentry.html.

    Just a word of advice: This doesn’t “solve” big data security issues. It is one aspect of big data security.

    Another aspect of big data security is not allowing people to bring magnetic media into your facility, or to leave with it. Ever.

    Not to mention using glue to permanently close all USB ports and CD/DVD drives.

    There is always tension between how much security do you need versus the cost and inconvenience.

    Another form of security: Have your supervisor’s approval in writing for deviations from known “good” security practices.

    July 12, 2013

    Introducing Morphlines:…

    Filed under: Cloudera,ETL,Hadoop,Morphlines — Patrick Durusau @ 3:07 pm

    Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop by Wolfgang Hoschek.

    From the post:

    Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

    A “morphline” is a rich configuration file that makes it easy to define a transformation chain that consumes any kind of data from any kind of data source, processes the data, and loads the results into a Hadoop component. It replaces Java programming with simple configuration steps, and correspondingly reduces the cost and integration effort associated with developing, maintaining, or integrating custom ETL projects.

    Morphlines is a library, embeddable in any Java codebase. A morphline is an in-memory container of transformation commands. Commands are plugins to a morphline that perform tasks such as loading, parsing, transforming, or otherwise processing a single record. A record is an in-memory data structure of name-value pairs with optional blob attachments or POJO attachments. The framework is extensible and integrates existing functionality and third-party systems in a simple and straightforward manner.

    The Morphlines library was developed as part of Cloudera Search. It powers a variety of ETL data flows from Apache Flume and MapReduce into Solr. Flume covers the real time case, whereas MapReduce covers the batch processing case.

    Since the launch of Cloudera Search, Morphlines development has graduated into the Cloudera Development Kit (CDK) in order to make the technology accessible to a wider range of users, contributors, integrators, and products beyond Search. The CDK is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem (and hence a perfect home for Morphlines). The CDK is hosted on GitHub and encourages involvement by the community.

    (…)
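
    To make the “commands are plugins to a morphline” idea concrete, here is a toy version of the record-and-command-chain pattern in plain Java. It is a conceptual sketch of the design described above, not the Morphlines API itself.

        import java.util.HashMap;

        public class MorphlineSketch {
            // A record is an in-memory structure of named fields.
            static class Record extends HashMap<String, Object> {}

            // A command processes one record and hands it to the next command.
            interface Command {
                boolean process(Record record);
            }

            // Example command: pull the sender out of a raw message line.
            static Command extractFrom(final Command next) {
                return new Command() {
                    public boolean process(Record r) {
                        String raw = (String) r.get("message");
                        if (raw != null && raw.startsWith("From: ")) {
                            r.put("from", raw.substring("From: ".length()));
                        }
                        return next.process(r);
                    }
                };
            }

            // Terminal command: "load" the record (here, just print it).
            static final Command LOAD = new Command() {
                public boolean process(Record r) {
                    System.out.println(r);
                    return true;
                }
            };

            public static void main(String[] args) {
                Command chain = extractFrom(LOAD);   // a two-command "morphline"
                Record rec = new Record();
                rec.put("message", "From: alice@example.com");
                chain.process(rec);
            }
        }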

    The sidebar promises: Morphlines replaces Java programming with simple configuration steps, reducing the cost and effort of doing custom ETL.

    Sounds great!

    But how do I search one or more morphlines for the semantics of the records/fields that are being processed or the semantics of that processing?

    If I want to save “cost and effort,” shouldn’t I be able to search for existing morphlines that have transformed particular records/fields?

    True, morphlines have “#” comments but that seems like a poor way to document transformations.

    How would you test for field documentation?

    Or make sure transformations of particular fields always use the same semantics?

    Ponder those questions while you are reading:

    Cloudera Morphlines Reference Guide

    and,

    Syntax – HOCON github page.

    If we don’t capture semantics at the point of authoring, subsequent searches are mechanized guessing.

    July 9, 2013

    …Apache HBase REST Interface, Part 3

    Filed under: Cloudera,HBase — Patrick Durusau @ 1:51 pm

    How-to: Use the Apache HBase REST Interface, Part 3 by Jesse Anderson.

    From the post:

    This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.

    Jesse is an instructor with Cloudera University. I checked but Cloudera doesn’t offer a way to search for courses by instructor. 🙁

    I will drop them a note.

    June 25, 2013

    Hadoop for Everyone: Inside Cloudera Search

    Filed under: Cloudera,Hadoop,Search Engines,Searching — Patrick Durusau @ 12:26 pm

    Hadoop for Everyone: Inside Cloudera Search by Eva Andreasson.

    From the post:

    CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

    However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

    Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

    As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

    In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

    You have heard the buzz about Cloudera Search, now get a quick list of facts and pointers to more resources!

    The most significant fact?

    Cloudera Search uses Apache Solr.

    If you are looking for search capabilities, what more need I say?

    June 5, 2013

    Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers

    Filed under: Cloudera,Hadoop,Lucene,Solr — Patrick Durusau @ 2:41 pm

    Cloudera Search: The Newest Hadoop Framework for CDH Users and Developers by Doug Cutting.

    From the post:

    One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.

    Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.

    Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

    In the context of our platform, CDH (Cloudera’s Distribution including Apache Hadoop), Cloudera Search is another framework much like MapReduce and Cloudera Impala. It’s another way for users to interact with Hadoop data and for developers to build Hadoop applications. Each framework in our platform is designed to cater to different families of applications and users:

    (…)

    Did you catch the line:

    Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.

    Does that make you feel better about scale issues?

    Also see: Cloudera Search Webinar, Wednesday, June 19, 2013 11AM-12PM PT.

    A serious step up in capabilities.

    May 25, 2013

    Apache Pig Editor in Hue 2.3

    Filed under: Cloudera,Hadoop,Hue,Pig — Patrick Durusau @ 1:38 pm

    Apache Pig Editor in Hue 2.3

    From the post:

    In the previous installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how to analyze data with Hue using Apache Hive via Hue’s Beeswax and Catalog applications. In this installment, we’ll focus on using the new editor for Apache Pig in Hue 2.3.

    Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:

    • UDFs and parameters (with default value) support
    • Autocompletion of Pig keywords, aliases, and HDFS paths
    • Syntax highlighting
    • One-click script submission
    • Progress, result, and logs display
    • Interactive single-page application

    Here’s a short video demoing its capabilities and ease of use:

    (…)

    How are you editing your Pig scripts now?

    How are you documenting the semantics of your Pig scripts?

    How do you search across your Pig scripts?

    May 16, 2013

    How-to: Configure Eclipse for Hadoop Contributions

    Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

    How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

    From the post:

    Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

    This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

    A post to ease your way towards contributing to the Hadoop project!

    Or if you simply want to know the code you are running cold.

    Or something in between!

    May 13, 2013

    Analyzing Twitter: An End-to-End Data Pipeline Recap

    Filed under: BigData,Cloudera,Mahout,Tweets — Patrick Durusau @ 10:32 am

    Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

    Jason reviews presentations at a recent Data Science MD meeting:

    Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

    (…)

    Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together by common features, while collaborative filtering allows new candidates to be suggested.

    Great summaries, links to additional resources and the complete slides.

    Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

    May 7, 2013

    Cloudera Development Kit (CDK)…

    Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:01 pm

    Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.

    From the post:

    At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

    So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.

    That’s why we’re really excited to announce the Cloudera Developer Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

    The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.

    The CDK is a collection of libraries, tools, examples, and docs engineered to simplify common tasks.

    What’s In There Today

    Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.
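
    The “automatic serialization and deserialization of Java POJOs as Avro Records” is worth seeing in miniature. The sketch below uses the plain Avro reflect API directly, which is roughly the step the Data module automates behind its dataset abstraction; it is a comparison sketch, not CDK code.

        import java.io.File;
        import org.apache.avro.Schema;
        import org.apache.avro.file.DataFileWriter;
        import org.apache.avro.reflect.ReflectData;
        import org.apache.avro.reflect.ReflectDatumWriter;

        public class PojoToAvro {
            // A plain Java object; Avro reflection derives the schema from its fields.
            public static class User {
                String name;
                int visits;
                User() {}
                User(String name, int visits) { this.name = name; this.visits = visits; }
            }

            public static void main(String[] args) throws Exception {
                Schema schema = ReflectData.get().getSchema(User.class);
                DataFileWriter<User> out =
                    new DataFileWriter<User>(new ReflectDatumWriter<User>(schema));
                out.create(schema, new File("users.avro"));
                out.append(new User("alice", 3));
                out.append(new User("bob", 7));
                out.close();
            }
        }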

    Here’s to hoping that vendor support, as shown for Hadoop, Lucene/Solr, and R (who am I missing?), continues and spreads to other areas of software development.

    May 1, 2013

    Impala 1.0

    Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 7:30 pm

    Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution

    From the post:

    Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.

    The bigger data pools get, the more opportunity there is for semantic confusion.

    Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.

    😉

    March 22, 2013

    Cloudera ML:…

    Filed under: Cloudera,Clustering,Machine Learning — Patrick Durusau @ 10:57 am

    Cloudera ML: New Open Source Libraries and Tools for Data Scientists by Josh Wills.

    From the post:

    Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.

    …[details about clustering omitted]

    If you were paying at least somewhat close attention, you may have noticed that the algorithms I’m describing above are essentially clever sampling techniques. With all of the hype surrounding big data, sampling has gotten a bit of a bad rap, which is unfortunate, since most of the work of a data scientist involves finding just the right way to turn a large data set into a small one. Of course, it usually takes a few hundred tries to find that right way, and Hadoop is a powerful tool for exploring the space of possible features and how they should be weighted in order to achieve our objectives.

    Wherever possible, we want to minimize the amount of parameter tuning required for any model we create. At the very least, we should try to provide feedback on the quality of the model that is created by different parameter settings. For k-means, we want to help data scientists choose a good value of K, the number of clusters to create. In Cloudera ML, we integrate the process of selecting a value of K into the data sampling and cluster fitting process by allowing data scientists to evaluate multiple values of K during a single run of the tool and reporting statistics about the stability of the clusters, such as the prediction strength.

    Finally, we want to investigate the anomalous events in our clustering- those points that don’t fit well into any of the larger clusters. Cloudera ML includes a tool for using the clusters that were identified by the scalable k-means algorithm to compute an assignment of every point in our large data set to a particular cluster center, including the distance from that point to its assigned center. This information is created via a MapReduce job that outputs a CSV file that can be analyzed interactively using Cloudera Impala or your preferred analytical application for processing data stored in Hadoop.

    Cloudera ML is under active development, and we are planning to add support for pivot tables, Hive integration via HCatalog, and tools for building ensemble classifiers over the next few weeks. We’re eager to get feedback on bug fixes and things that you would like to see in the tool, either by opening an issue or a pull request on our github repository. We’re also having a conversation about training a new generation of data scientists next Tuesday, March 26th, at 2pm ET/11am PT, and I hope that you will be able to join us.
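
    The step described above, assigning every point to its nearest cluster center and recording the distance, is simple enough to sketch outside of MapReduce. Below is a toy version with made-up centers and points; Cloudera ML does this at scale as a MapReduce job that writes a CSV.

        public class NearestCenterSketch {
            // Squared Euclidean distance between two points of equal dimension.
            static double dist2(double[] a, double[] b) {
                double sum = 0.0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    sum += d * d;
                }
                return sum;
            }

            public static void main(String[] args) {
                double[][] centers = { {0.0, 0.0}, {5.0, 5.0} };
                double[][] points  = { {0.5, 1.0}, {4.0, 6.0}, {2.5, 2.5} };

                // For each point: point index, assigned center, distance to it
                // (the CSV-style output described in the post).
                for (int p = 0; p < points.length; p++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < centers.length; c++) {
                        double d = dist2(points[p], centers[c]);
                        if (d < bestDist) { bestDist = d; best = c; }
                    }
                    System.out.println(p + "," + best + "," + Math.sqrt(bestDist));
                }
            }
        }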

    Another great project by Cloudera!

    March 21, 2013

    Training a New Generation of Data Scientists

    Filed under: Cloudera,CS Lectures,Data Science — Patrick Durusau @ 2:26 pm

    Training a New Generation of Data Scientists by Ryan Goldman.

    From the post:

    Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

    Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

    This could be fun!

    And if nothing else, it will give you the tools to distinguish legitimate training, like Cloudera’s, from the “How to make $millions in real estate” sort of training, sold by the guy who makes his money selling lectures and books.

    As “hot” as data science is, you don’t have to look far to find that sort of training.

