Archive for the ‘HDFS’ Category
Thursday, April 18th, 2013
How Hadoop Works? HDFS case study by Dane Dennis.
From the post:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The Hadoop library contains two major components, HDFS and MapReduce. In this post we will go inside each part of HDFS and discover how it works internally.
Knowing how to use Hadoop is one level of expertise.
Knowing how Hadoop works takes you to the next level.
One where you can better adapt Hadoop to your needs.
Understanding HDFS is a step in that direction.
Posted in HDFS, Hadoop | No Comments »
Wednesday, April 17th, 2013
Tachyon by UC Berkeley AMP Lab.
From the webpage:
Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.
Since we aren’t quite at in-memory computing just yet, you may want to review Tachyon.
The numbers are very impressive.
Posted in HDFS, Storage, Tachyon | No Comments »
Monday, April 8th, 2013
HDFS File Operations Made Easy with Hue by Romain Rigaux.
From the post:
Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.
The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.
Very nice 2:18 video.
Brings the usual graphical file interface to Hadoop (no small feat) but reminds me of every other graphical file interface.
To step beyond the common graphical file interface, why not:
- Links to scripts that call a file
- File ownership – show all files owned by a user
- Navigation of files by content type(s)
- Grouping of files by common scripts
- Navigation of files by content
- Grouping of files by script owners calling the files
are just a few of the possibilities that come to mind.
I would make the roles in those relationships explicit but that is probably my topic map background showing through.
Posted in HDFS, Hadoop, Hue | No Comments »
Wednesday, March 27th, 2013
Apache Tajo
From the webpage:
Introduction
Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution and its optimizer.
Features
- Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
- Rudiment ETL that transforms one data format to another data format.
- Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
- Command line interface to allow users to submit SQL queries
- Java API to enable clients to submit SQL queries to Tajo
If you ever wanted to get in on the ground floor of a data warehouse project, this could be your chance!
I first saw this at Apache Incubator: Tajo – a Relational and Distributed Data Warehouse for Hadoop by Alex Popescu.
Posted in Apache Tajo, HDFS, SQL | No Comments »
Wednesday, March 6th, 2013
PolyBase
From the webpage:
PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.
Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.
I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”
But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).
So, fair to say it was short on details.
The closest thing I found to a clue was in the PolyBase datasheet, which says (under PolyBase Use Cases, if you are reading along):
PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.
I assume that means that the data in HDFS could have multiple external tables for the same data? Depending upon the query?
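To make the question concrete, here is a minimal sketch in plain Python of two schemas applied to the same raw data at read time. This is not PolyBase syntax (the datasheet excerpt shows none); `RAW_LINES`, `ExternalTable`, and both column lists are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for a delimited file residing in HDFS.
RAW_LINES = [
    "1001|2013-03-06|42.50",
    "1002|2013-03-07|17.25",
]


@dataclass
class ExternalTable:
    """A schema (column names and types) kept apart from the data it describes."""
    name: str
    columns: List[Tuple[str, Callable[[str], object]]]

    def rows(self, lines: List[str], sep: str = "|") -> Iterator[Dict[str, object]]:
        # The schema is applied only when the data is read; the data is untouched.
        for line in lines:
            fields = line.split(sep)
            yield {col: cast(value) for (col, cast), value in zip(self.columns, fields)}


# Two different "external tables" over the very same bytes.
orders_typed = ExternalTable("orders_typed", [("id", int), ("day", str), ("amount", float)])
orders_text = ExternalTable("orders_text", [("id", str), ("day", str), ("amount", str)])

typed = list(orders_typed.rows(RAW_LINES))
text = list(orders_text.rows(RAW_LINES))
```

In this sketch nothing stops a third or fourth schema from being layered over the same lines, which is the situation the question above is asking about.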
Curious if the external tables and/or data types are going to have mapreduce capabilities built-in? To take advantage of parallel processing of the data?
BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables. In case you want to merge data.
Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?
I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.
Posted in HDFS, Hadoop, MapReduce, PolyBase, SQL, SQL Server | No Comments »
Wednesday, February 13th, 2013
Data deduplication tactics with HDFS and MapReduce
From the post:
As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, deduplication does not change the data; it eliminates the storage capacity consumed by identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.
From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.
Covers five (5) tactics:
- Using HDFS and MapReduce only
- Using HDFS and HBase
- Using HDFS, MapReduce and a Storage Controller
- Using Streaming, HDFS and MapReduce
- Using MapReduce with Blocking techniques
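As a rough illustration of chunk-level deduplication in a map/reduce shape, the sketch below runs in a single Python process. It is not taken from the post: the fixed chunk size, the use of SHA-256 as the chunk fingerprint, and the names `map_phase` and `reduce_phase` are all assumptions of mine.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Location = Tuple[str, int]  # (file name, chunk index)


def chunks(data: bytes, size: int) -> Iterator[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def map_phase(files: Dict[str, bytes], size: int) -> Iterator[Tuple[str, Location]]:
    """Emit (chunk fingerprint, location) for every chunk of every file."""
    for name, data in files.items():
        for index, chunk in enumerate(chunks(data, size)):
            yield hashlib.sha256(chunk).hexdigest(), (name, index)


def reduce_phase(pairs: Iterable[Tuple[str, Location]]) -> Dict[str, List[Location]]:
    """Group locations by fingerprint; identical chunks land in the same group."""
    groups: Dict[str, List[Location]] = defaultdict(list)
    for digest, location in pairs:
        groups[digest].append(location)
    return dict(groups)


files = {"a.log": b"AAAABBBBCCCC", "b.log": b"AAAAXXXXCCCC"}
index = reduce_phase(map_phase(files, size=4))

# Any fingerprint seen at more than one location needs to be stored only once.
duplicates = {digest: locs for digest, locs in index.items() if len(locs) > 1}
```

Six chunks go in, four distinct fingerprints come out, and the two shared chunks are the candidates for single-instance storage.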
In these times of “Great Sequestration,” how much are you spending on duplicated contractor documentation?
You do get electronic forms of documentation. Yes?
Not that difficult to document prior contractor self-plagiarism. Teasing out what you “mistakenly” paid for it may be harder.
Question: Would you rather find out now and correct or have someone else find out?
PS: For the ambitious in government employment. You might want to consider how discovery of contractor self-plagiarism reflects on your initiative and dedication to “good” government.
Posted in Deduplication, HDFS, MapReduce, Plagiarism | No Comments »
Friday, November 16th, 2012
Exploring Apache Hama
From the post:
Apache Hama is one of the under-hyped projects in the Hadoop ecosystem but gaining a lot of traction steadily with the efforts of its committers. “Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.”
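For readers new to BSP: a computation proceeds in supersteps, each made of local computation, message sending, and a barrier after which the messages become visible. The sketch below simulates that pattern in a single Python process. It does not use the Hama API; the function `bsp_max` and the ring-shaped message pattern are my own illustration.

```python
from typing import Dict, List


def bsp_max(values: List[int]) -> List[int]:
    """Every peer learns the global maximum through BSP-style supersteps."""
    n = len(values)
    known = list(values)  # known[p] is the local state of peer p
    while True:
        # Compute and send: each peer sends its current value to the next
        # peer in a ring. All sends use the state from before this superstep.
        inbox: Dict[int, List[int]] = {peer: [] for peer in range(n)}
        for peer in range(n):
            inbox[(peer + 1) % n].append(known[peer])

        # Barrier: only after every peer has sent are messages delivered
        # and local states updated.
        changed = False
        for peer in range(n):
            best = max([known[peer]] + inbox[peer])
            if best != known[peer]:
                known[peer] = best
                changed = True

        # Stop once a full superstep changes nothing.
        if not changed:
            return known


result = bsp_max([3, 9, 4, 1])
```

The maximum travels one hop per superstep, so a ring of n peers settles after at most n - 1 productive supersteps plus one quiet superstep that detects the halt.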
A summary of resources on Apache Hama.
You won’t learn Hama over the weekend but you can get a start towards a new skill to list at LinkedIn.
Posted in Bulk Synchronous Parallel (BSP), HDFS, Hama | No Comments »
Wednesday, October 3rd, 2012
CDH4.1 Now Released! by Charles Zedlewski.
From the post:
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
- Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
- Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
- Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
- Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
- FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
- Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
- Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.
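The quorum-based storage item in the list above rests on a simple idea: an edit counts as written once a majority of storage nodes have acknowledged it. Here is a minimal sketch of that rule. It is not CDH code; `JournalNode` and `write_edit` are hypothetical names, and a real implementation also has to handle recovery and ordering, which this ignores.

```python
from typing import List


class JournalNode:
    """Hypothetical stand-in for one node that stores edit-log entries."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.log: List[str] = []

    def append(self, edit: str) -> bool:
        if not self.up:
            return False  # an unreachable node never acknowledges
        self.log.append(edit)
        return True


def write_edit(nodes: List[JournalNode], edit: str) -> bool:
    """Return True when a strict majority of nodes acknowledged the edit."""
    acks = sum(1 for node in nodes if node.append(edit))
    return acks > len(nodes) // 2


# With three nodes, one failure is tolerated; two failures are not.
healthy = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", up=False)]
degraded = [JournalNode("jn1"), JournalNode("jn2", up=False), JournalNode("jn3", up=False)]

committed = write_edit(healthy, "mkdir /data")
rejected = write_edit(degraded, "mkdir /data")
```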
It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.
Enjoy!
Posted in Cloudera, Flume, HBase, HDFS, Hadoop, Hive, Pig | No Comments »
Monday, September 10th, 2012
Automating Your Cluster with Cloudera Manager API
From the post:
API access was a new feature introduced in Cloudera Manager 4.0 (download free edition here.). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.
Cloudera Manager API Basics
The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.
You can read the full API documentation here.
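Since the excerpt describes the API as HTTP REST with JSON and HTTP Basic Authentication, a client needs very little. The sketch below builds such a request with Python's standard library. The host name, the port 7180, the path `/api/v1/clusters`, and the credentials are assumptions for illustration; check them against the API documentation for your installation.

```python
import base64
import json
import urllib.request


def cm_request(host: str, port: int, path: str,
               user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying HTTP Basic credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"http://{host}:{port}{path}")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


def fetch_json(request: urllib.request.Request) -> object:
    """Send the request and decode the JSON body (needs a reachable server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical host, port, path and credentials; nothing is sent here.
req = cm_request("cm.example.com", 7180, "/api/v1/clusters", "admin", "admin")
```

Calling `fetch_json(req)` against a real Cloudera Manager host would perform the request; it is left out of the example so the sketch runs without a cluster.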
We are nearing mid-September so the holiday season will be here before long. It isn’t too early to start planning on price/hardware break points.
This will help configure a HDFS and MapReduce cluster on your holiday hardware.
Posted in Clustering (servers), HDFS, MapReduce | No Comments »
Monday, August 13th, 2012
CDH3 update 5 is now available by Arvind Prabhakar
From the post:
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
- Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
- Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
- WebHDFS – A full read/write REST API to HDFS.
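For a feel of what the WebHDFS item means in practice: file system operations are expressed as HTTP requests whose path starts with `/webhdfs/v1` and whose `op` query parameter names the operation. The sketch below only builds such URLs. The host name and port are assumptions (50070 was a common NameNode HTTP port in this era), and `webhdfs_url` is a helper name of my own.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, hdfs_path: str, op: str, **params: object) -> str:
    """Build a WebHDFS URL for an HDFS path and operation."""
    query = urlencode({"op": op, **params})
    # quote() leaves "/" alone by default, so the HDFS path keeps its shape.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# List a directory, then read the first kilobyte of a file (hypothetical paths).
list_url = webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", 50070, "/user/demo/part-00000",
                       "OPEN", offset=0, length=1024)
```

Either URL can then be fetched with any HTTP client, which is the appeal of a REST interface over a Java-only one.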
Maintenance release. Installation is good practice before major releases.
Posted in Avro, Cloudera, Flume, HDFS, Hadoop, Hive | No Comments »
Friday, August 3rd, 2012
Introducing Apache Hadoop YARN by Arun Murthy.
From the post:
I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!
Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of the Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on its own as a sub-project of Hadoop.
In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.
As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.
Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?
Posted in HDFS, Hadoop, Hadoop YARN, MapReduce | No Comments »
Thursday, July 26th, 2012
Why we build our platform on HDFS by Charles Zedlewski
Charles Zedlewski pushes the number of Hadoop competitors up to twelve:
It’s not often the case that I have a chance to concur with my colleague E14 over at Hortonworks but his recent blog post gave the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.
A recent GigaOm article presented 8 alternatives to HDFS. They actually missed at least 4 others. For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi). Appistry continues to market its HDFS alternative. I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well. HP Ibrix has also supported the HDFS API for some years now.
The GigaOm article implies that the presence of twelve other vendors promoting alternatives must speak to some deficiencies in HDFS for what else would motivate so many offerings? This really draws the incorrect conclusion. I would ask this:
What can we conclude from the fact that there are:
Best links I have for Hadoop competitors (for your convenience and additions):
- Appistry
- Cassandra (DataStax)
- Ceph (Inktank)
- Clustered Filesystem (CFS)
- Dispersed Storage Network (Cleversafe)
- GPFS (IBM)
- Ibrix
- Isilon (EMC)
- Lustre
- MapR File System
- NetApp Open Solution for Hadoop
- Parascale
Posted in Cloudera, HDFS, Hadoop | No Comments »
Wednesday, July 25th, 2012
Thinking about the HDFS vs. Other Storage Technologies by Eric Baldeschwieler.
Just to whet your interest (see Eric’s post for the details):
As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS. Here is why…
To compare HDFS to other technologies one must first ask the question, what is HDFS good at:
- Extreme low cost per byte….
- Very high bandwidth to support MapReduce workloads….
- Rock solid data reliability….
A lively storage competition is a good thing.
A good opportunity to experiment with different storage strategies.
Posted in HDFS, Hadoop, Hortonworks | 1 Comment »
Thursday, June 21st, 2012
Hortonworks Data Platform v1.0 Download Now Available
From the post:
If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.
Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.
Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.
I’m a member of the general public! And you probably are too!
See the rest of the post for more goodies that are included with this release.
Posted in HBase, HDFS, Hadoop, Hive, MapReduce, Oozie, Pig, Sqoop, Zookeeper | No Comments »
Wednesday, June 20th, 2012
HBase Write Path by Jimmy Xiang.
From the post:
Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.
The write path is how HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore understanding the write path can provide insight into HBase’s native data loss prevention mechanism.
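A toy model may help show how updates and immutable files can coexist. The sketch below follows the general shape of the HBase write path as I understand it: log the edit first, update an in-memory store, and flush that store to a new immutable file when it fills. The class `RegionServerSketch`, its flush threshold, and its data layout are illustrative assumptions, not HBase code.

```python
from typing import Dict, List, Optional, Tuple


class RegionServerSketch:
    """Toy write path: write-ahead log, in-memory store, immutable files."""

    def __init__(self, flush_threshold: int = 2) -> None:
        self.wal: List[Tuple[str, str]] = []           # append-only log of edits
        self.memstore: Dict[str, str] = {}             # recent writes, in memory
        self.hfiles: List[List[Tuple[str, str]]] = []  # sorted, never modified
        self.flush_threshold = flush_threshold

    def put(self, row: str, value: str) -> None:
        self.wal.append((row, value))  # 1. log first, so a crash can be replayed
        self.memstore[row] = value     # 2. then update the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # 3. write the in-memory store out as a new sorted, immutable file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row: str) -> Optional[str]:
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):  # newest file wins
            for key, value in hfile:
                if key == row:
                    return value
        return None


server = RegionServerSketch(flush_threshold=2)
server.put("r2", "b")
server.put("r1", "a")   # second write triggers a flush to a new file
server.put("r1", "a2")  # an "update" is just a newer write
```

No file is ever rewritten in this model; an update is a newer entry that shadows the older one at read time.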
Whether you intend to use Hadoop for topic map processing or not, this will be a good introduction to updating data in HBase. Not all applications using Hadoop are topic maps so this may serve you in other contexts as well.
Posted in HBase, HDFS, Hadoop | No Comments »
Monday, June 4th, 2012
Cloudera Manager 3.7.6 released! by Jon Zuanich.
Jon writes:
We are pleased to announce that Cloudera Manager 3.7.6 is now available! The most notable updates in this release are:
- Support for multiple Hue service instances
- Separating RPC queue and processing time metrics for HDFS
- Performance tuning of the Resource Manager components
- Several bug fixes and performance improvements
The detailed Cloudera Manager 3.7.6 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes
Cloudera Manager 3.7.6 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads
Only fair since I mentioned the Cray earlier that I get a post about Cloudera out today as well.
Posted in Cloudera, HDFS, Hadoop, MapReduce | No Comments »
Friday, May 25th, 2012
Apache Hadoop 2.0 (Alpha) Released by Arun Murthy.
From the post:
As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:
In addition to these new capabilities, there are several planned enhancements that are on the way from the community, including HDFS Snapshots and auto-failover for HA NameNode, along with further improvements to the stability and performance with the next generation of MapReduce (YARN). There are definitely good times ahead.
Let the good times roll!
Posted in HDFS, Hadoop, MapReduce | No Comments »
Tuesday, February 14th, 2012
Cloudera Manager | Service and Configuration Management Demo Videos by Jon Zuanich.
From the post:
Service and Configuration Management (Part I & II)
We’ve recently recorded a series of demo videos intended to highlight the extensive set of features and functions included with Cloudera Manager, the industry’s first end-to-end management application for Apache Hadoop. These demo videos showcase the newly enhanced Cloudera Manager interface and reveal how to use this powerful application to simplify the administration of Hadoop clusters, optimize performance and enhance the quality of service.
In the first two videos of this series, Philip Langdale, a software engineer at Cloudera, walks through Cloudera Manager’s Service and Configuration Management module. He demonstrates how simple it is to set up and configure the full range of Hadoop services in CDH (including HDFS, MR and HBase); enable security; perform configuration rollbacks; and add, delete and decommission nodes.
Interesting that Vimeo detects the “embedding” of these videos in my RSS reader and displays a blocked message. At the Cloudera site, all is well.
Management may not be as romantic as the latest graph algorithms but it is a pre-condition to widespread enterprise adoption.
Posted in Cloudera, HBase, HDFS, Hadoop, MapReduce | No Comments »
Tuesday, February 14th, 2012
Introducing CDH4 by Charles Zedlewski.
From the post:
I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.
There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:
- Availability – a high availability namenode, better job isolation, hard drive failure handling, and multi-version support
- Utilization – multiple namespaces, co-processors and a slot-less resource management model
- Performance – improvements in HBase, HDFS, MapReduce and compression performance
- Usability – broader BI support, expanded API access, unified file formats & compression codecs
- Security – scheduler ACL’s
Some items of note about this beta:
This is the first beta for CDH4. We plan to do a second beta some weeks after the first beta. The second beta will roll in updates to Apache Flume, Apache Sqoop, Hue, Apache Oozie and Apache Whirr that did not make the first beta. It will also broaden the platform support back out to our normal release matrix of Red Hat, Centos, Suse, Ubuntu and Debian. Our plan is for this second beta to have the last significant component changes before CDH goes GA.
Some CDH components are getting substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them.
The only better time to be in data mining, information retrieval, data analysis is next week.
Posted in Cloudera, HBase, HDFS, Hadoop, MapReduce | No Comments »
Sunday, November 20th, 2011
Jeff Hammerbacher on Experiences Evolving a New Analytical Platform
Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.
In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:
Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.
Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:
- In or between which of these elements, would human analysis/judgment have the greatest impact?
- Would human analysis/judgment be best made by experts or crowds?
- What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
- Performance with feedback or homeostasis mechanisms?
That is a very crude and uninformed starter set of questions.
Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)
Posted in Crunch, Dremel, Dryad, Flume, Giraph, HBase, HDFS, Hive, JDBC, MapReduce, ODBC, Oozie, Pregel | No Comments »