Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 17, 2012

Apache HBase 0.94 is now released

Filed under: Cloudera,HBase — Patrick Durusau @ 10:40 am

Apache HBase 0.94 is now released by Himanshu Vashishtha.

Some of the new features:

  • More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck”, adds these missing features to the first aid box.
  • Simplified Region Sizing: Deciding a region size is always tricky as it varies on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic where it increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single-row transactions, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
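If you are curious what the new call looks like from the client side, here is a minimal sketch against the 0.94 client API (my own illustration, not from the Cloudera post; the table and column names are made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicPutDelete {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");   // hypothetical table with family "d"
        byte[] row = Bytes.toBytes("row-1");

        Put put = new Put(row);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("new_col"), Bytes.toBytes("value"));

        Delete delete = new Delete(row);
        delete.deleteColumns(Bytes.toBytes("d"), Bytes.toBytes("old_col"));

        // Both mutations are applied under a single row lock, in one call.
        RowMutations mutations = new RowMutations(row);
        mutations.add(put);
        mutations.add(delete);
        table.mutateRow(mutations);

        table.close();
      }
    }

Before HBASE-3584 you would have issued the Put and the Delete separately, with the row lock taken and released for each.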

BTW, also from the post:

Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focus was on performance enhancements and the addition of new features (along with several major bug fixes).

Less than four (4) months, as I count it, between HBase 0.92 and 0.94.

Sounds like a lot of people have been working very hard.

And making serious progress.

May 12, 2012

CDH3 update 4 is now available

Filed under: Flume,Hadoop,HBase,MapReduce — Patrick Durusau @ 3:24 pm

CDH3 update 4 is now available by David Wang.

From the post:

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

April 29, 2012

HBase Real-time Analytics & Rollbacks via Append-based Updates

Filed under: Analytics,HBase — Patrick Durusau @ 3:21 pm

HBase Real-time Analytics & Rollbacks via Append-based Updates by Alex Baranau.

From the post:

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

  • new data are continuously streaming in
  • data are processed and stored in HBase, usually as time-series data
  • processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

  • increase record update throughput. Ideally, despite a high volume of incoming data, changes can be applied in real-time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using a batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately seen. It is seen only after it has been processed.
  • ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that system serves.
  • ability to fetch data interactively (i.e. fast enough for impatient humans). When one navigates through a small amount of recent data, as well as when the selected time interval spans years, the retrieval should be fast.

Here is what we consider an “update”:

  • addition of a new record if no record with the same key exists
  • update of an existing record with a particular key
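For readers who have not hit this limitation before: the “normal HBase update” mentioned above is a client-side read-modify-write, roughly like the sketch below (my own illustration with made-up table and column names, not code from the post). It costs two round trips per record, which is exactly what the append-only approach avoids:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetPlusPutUpdate {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "metrics");        // hypothetical table/column names
        byte[] row = Bytes.toBytes("sensor-42:2012-04");
        byte[] fam = Bytes.toBytes("d");
        byte[] qual = Bytes.toBytes("total");

        // 1. Read the current value (first RPC)...
        Result current = table.get(new Get(row));
        byte[] prev = current.getValue(fam, qual);
        long total = (prev == null) ? 0L : Bytes.toLong(prev);

        // 2. ...merge, then write the new value back (second RPC).
        Put put = new Put(row);
        put.add(fam, qual, Bytes.toBytes(total + 1));
        table.put(put);

        table.close();
      }
    }

As I understand it, HBaseHUT sidesteps this by writing every update as a new record and merging the records when they are read (or compacted); the post and the project page have the details.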

See anything familiar? Does that resemble your use cases?

The proffered solution may not fit your use case(s), but it is an example of exploring a solution, not fitting a problem to a solution. Not the same thing.

HBase Real-time Analytics & Rollbacks via Append-based Updates Part 2 is available. The solution uses HBaseHUT. Really informative graphics in Part 2 as well.

Very interested in seeing Part 3!

April 17, 2012

HBaseCon 2012: A Glimpse into the Development Track

Filed under: Conferences,HBase,NoSQL — Patrick Durusau @ 7:11 pm

HBaseCon 2012: A Glimpse into the Development Track by Jon Zuanich.

Jon posted a reminder about the development track at HBaseCon 2012:

  • Learning HBase Internals – Lars Hofhansl, Salesforce.com
  • Lessons learned from OpenTSDB – Benoit Sigoure, StumbleUpon
  • HBase Schema Design – Ian Varley, Salesforce.com
  • HBase and HDFS: Past, Present, and Future – Todd Lipcon, Cloudera
  • Lightning Talk | Relaxed Transactions for HBase – Francis Liu, Yahoo!
  • Lightning Talk | Living Data: Applying Adaptable Schemas to HBase – Aaron Kimball, WibiData

Non-developers can check out the rest of the Agenda. 😉

Conference: May 22, 2012 InterContinental San Francisco Hotel.

April 10, 2012

HBase Hackathon at Cloudera

Filed under: Cloudera,HBase — Patrick Durusau @ 6:45 pm

HBase Hackathon at Cloudera by David S. Wang

From the post:

Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.

More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!

If you get the opportunity, attend.

Studies (American Library Association) show that building social relationships, and then continuing them, helps sustain virtual communities.

Here is your chance to get to know other HBase folks.

April 4, 2012

Apache Bigtop 0.3.0 (incubating) has been released

Filed under: Bigtop,Flume,Hadoop,HBase,Hive,Mahout,Oozie,Sqoop,Zookeeper — Patrick Durusau @ 2:33 pm

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

March 2012 Bay Area HBase User Group meetup summary

Filed under: HBase,Hive — Patrick Durusau @ 2:31 pm

March 2012 Bay Area HBase User Group meetup summary by David S. Wang.

Let’s see:

  • …early access program – HBase In Action
  • …recent HBase releases
  • …Moving HBase RPC to protobufs
  • …Comparing the native HBase client and asynchbase
  • …Using Apache Hive with HBase: Recent improvements
  • …backups and snapshots in HBase
  • …Apache HBase PMC meeting

Do you need any additional reasons to live in the Bay Area? 😉

Seriously, if you do, take advantage of the opportunity that meetings like this one offer.

If you don’t, it might be cheaper than airfare to create your own HBase/Hive ecosystem.

March 24, 2012

HBaseCon 2012

Filed under: Conferences,HBase — Patrick Durusau @ 7:35 pm

HBaseCon 2012

Early Bird Registration ends 6 April 2012

May 22, 2012
InterContinental San Francisco Hotel
888 Howard Street
San Francisco, CA 94103

From the webpage:

Real-Time Your Hadoop

Join us for HBaseCon 2012, the first industry conference for Apache HBase users, contributors, administrators and application developers.

Network. Share ideas with colleagues and others in the rapidly growing HBase community. See who is speaking

Learn. Attend sessions and lightning talks about what’s new in HBase, how to contribute, best practices on running HBase in production, use cases and applications. View the agenda

Train. Make the most of your week and attend Cloudera training for Apache HBase, in the 2 days following the conference. Sign up

BTW, if you attend, you get a voucher for a free ebook: HBase: The Definitive Guide from O’Reilly.

As rapidly as solutions are developing, conferences look like a major source of up to date information.

Apache HBase 0.92.1 now available

Filed under: Cloudera,Hadoop,HBase — Patrick Durusau @ 7:35 pm

Apache HBase 0.92.1 now available by Shaneal Manek

From the post:

Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.

Apache HBase 0.92.1 is a bug fix release covering 61 issues – including 6 blockers and 6 critical issues.

March 19, 2012

Apache HBase 0.90.6 is now available

Filed under: HBase — Patrick Durusau @ 6:54 pm

Apache HBase 0.90.6 is now available

Jimmy Xiang writes:

Apache HBase 0.90.6 is now available. It is a bug fix release covering 31 bugs and 5 improvements. Among them, 3 are blockers and 3 are critical, such as:

  • HBASE-5008: HBase can not provide services to a region when it can’t flush the region, but considers it stuck in flushing,
  • HBASE-4773: HBaseAdmin may leak ZooKeeper connections,
  • HBASE-5060: HBase client may be blocked forever when there is a temporary network failure.

This release has improved system robustness and availability by fixing bugs that cause potential data loss, system unavailability, possible deadlocks, read inconsistencies and resource leakage.

The 0.90.6 release is backward compatible with 0.90.5. The fixes in this release will be included in CDH3u4.

March 14, 2012

HBase + Hadoop + Xceivers

Filed under: Hadoop,HBase — Patrick Durusau @ 7:35 pm

HBase + Hadoop + Xceivers by Lars George.

From the post:

Introduction

Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called “dfs.datanode.max.xcievers”, and belongs to the HDFS subproject. It defines the number of server side threads and – to some extent – sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.

The Problem

Since HBase is storing everything it needs inside HDFS, the hard upper boundary imposed by the ”dfs.datanode.max.xcievers” configuration property can result in too few resources being available to HBase, manifesting itself as IOExceptions on either side of the connection.

This is a true sysadmin type post.

Error messages say “DataXceiver,” but you set the “dfs.datanode.max.xcievers” property. The post notes that “xcievers” is misspelled.
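If you just want the setting itself, it lives in hdfs-site.xml on each datanode. The value below is only a commonly suggested starting point, not a recommendation from the post; size it for your own cluster:

    <!-- hdfs-site.xml, on every datanode; note the intentionally misspelled property name -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

The datanodes have to be restarted for the new limit to take effect.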

Detailed coverage of the nature of the problem, complete with sample log entries. Along with suggested solutions.

And there is word of current work to improve the situation.

If you are using HBase and Hadoop, put a copy of this with your sysadmin stuff.

March 8, 2012

On the Power of HBase Filters

Filed under: BigData,Filters,HBase — Patrick Durusau @ 8:49 pm

On the Power of HBase Filters

From the post:

Filters are a powerful feature of HBase to delegate the selection of rows to the servers rather than moving rows to the Client. We present the filtering mechanism as an illustration of the general data locality principle and compare it to the traditional select-and-project data access pattern.

Dealing with massive amounts of data changes the way you think about data processing tasks. In a standard business application context, people use a Relational Database System (RDBMS) and consider this system as a service in charge of providing data to the client application. How this data is processed, manipulated, shown to the user, is considered to be the full responsibility of the application. In other words, the role of the data server is restricted to what it does best: efficient, safe and consistent storage and access.
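To make the “delegate selection to the servers” idea concrete, here is a hedged sketch of a filtered scan using the client API of that era (the table and column names are mine, not from the post). The comparison runs on the region servers, so only matching rows are shipped to the client:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FilteredScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");          // hypothetical table and columns

        // Keep only rows whose info:country column equals "DE".
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
            Bytes.toBytes("info"), Bytes.toBytes("country"),
            CompareOp.EQUAL, Bytes.toBytes("DE"));

        Scan scan = new Scan();
        scan.setFilter(filter);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }

The select-and-project alternative would ship every row to the client and discard the non-matching ones there, which is exactly the data movement the post argues against.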

The post goes on to observe:

When you deal with BigData, the data center is your computer.

True, but that isn’t the lesson I would draw from HBase Filters.

The lesson I would draw is: it is only big data until you can find the relevant data.

I may have to sift several haystacks of data but at the end of the day I want the name, photo, location, target, time frame for any particular evil-doer. That “big data” was part of the process is a fact, not a goal. Yes?

March 7, 2012

Integrating Lucene with HBase

Filed under: Geographic Information Retrieval,HBase,Lucene,Spatial Index — Patrick Durusau @ 5:40 pm

Integrating Lucene with HBase by Boris Lublinsky and Mike Segel.

You have to get to the conclusion for the punch line:

The simple implementation described in this paper fully supports all of the Lucene functionality, as validated by many unit tests from both Lucene core and contrib modules. It can be used as a foundation for building a very scalable search implementation, leveraging the inherent scalability of HBase and its fully symmetric design, which allows for adding any number of processes serving HBase data. It also avoids the necessity of closing an open Lucene index reader to incorporate newly indexed data, which will be automatically available to users with a possible delay controlled by the cache time-to-live parameter. In the next article we will show how to extend this implementation to incorporate geospatial search support.

Put why your article is important in the introduction as well.

The second article does better:

Implementing Lucene Spatial Support

In our previous article [1], we discussed how to integrate Lucene with HBase for improved scalability and availability. In this article I will show how to extend this implementation with spatial support.

Lucene spatial contribution package [2, 3, 4, 5] provides powerful support for spatial search, but is limited to finding the closest point. In reality spatial search often has significantly more requirements, for example: which points belong to a given shape (circle, bounding box, polygon), which shapes intersect with a given shape, and so on. The solution presented in this article allows solving all of the above problems.

March 5, 2012

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees)

Filed under: Bloom Filters,Cassandra,HBase,Merkle Trees — Patrick Durusau @ 7:51 pm

“Modern” Algorithms and Data Structures (Bloom Filters, Merkle Trees) by Lorenzo Alberton.

From the description:

The first part of a series of talks about modern algorithms and data structures, used by NoSQL databases like HBase and Cassandra. An explanation of Bloom Filters and several derivatives, and Merkle Trees.
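If you want the gist of one of those structures before watching: a Bloom filter answers “definitely not present” or “probably present” by setting and testing k hashed positions in a bit array, which is why stores like HBase and Cassandra use them to skip files that cannot contain a requested key. A toy sketch of the mechanics (illustration only, not the implementation used by either database):

    import java.util.Arrays;
    import java.util.BitSet;

    public class ToyBloomFilter {
      private final BitSet bits;
      private final int size;
      private final int hashes;

      public ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
      }

      // k probe positions derived from two base hashes (double hashing).
      private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | 1;      // force odd so the probes spread
        return Math.abs((h1 + i * h2) % size);
      }

      public void add(byte[] key) {
        for (int i = 0; i < hashes; i++) {
          bits.set(index(key, i));
        }
      }

      // false => definitely never added; true => probably added (false positives possible).
      public boolean mightContain(byte[] key) {
        for (int i = 0; i < hashes; i++) {
          if (!bits.get(index(key, i))) {
            return false;
          }
        }
        return true;
      }
    }

A Merkle tree plays a different role: hashes of data blocks are combined pairwise up to a single root, so two replicas can compare roots (and then subtrees) to find exactly which ranges differ, which is how Cassandra drives anti-entropy repair.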

Looking forward to more of this series!

February 14, 2012

Cloudera Manager | Service and Configuration Management Demo Videos

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:11 pm

Cloudera Manager | Service and Configuration Management Demo Videos by Jon Zuanich.

From the post:

Service and Configuration Management (Part I & II)

We’ve recently recorded a series of demo videos intended to highlight the extensive set of features and functions included with Cloudera Manager, the industry’s first end-to-end management application for Apache Hadoop. These demo videos showcase the newly enhanced Cloudera Manager interface and reveal how to use this powerful application to simplify the administration of Hadoop clusters, optimize performance and enhance the quality of service.

In the first two videos of this series, Philip Langdale, a software engineer at Cloudera, walks through Cloudera Manager’s Service and Configuration Management module. He demonstrates how simple it is to set up and configure the full range of Hadoop services in CDH (including HDFS, MR and HBase); enable security; perform configuration rollbacks; and add, delete and decommission nodes.

Interesting that Vimeo detects the “embedding” of these videos in my RSS reader and displays a blocked message. At the Cloudera site, all is well.

Management may not be as romantic as the latest graph algorithms but it is a pre-condition to widespread enterprise adoption.

Introducing CDH4

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:10 pm

Introducing CDH4 by Charles Zedlewski.

From the post:

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

  • Availability – a high availability namenode, better job isolation, hard drive failure handling, and multi-version support
  • Utilization – multiple namespaces, co-processors and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce and compression performance
  • Usability – broader BI support, expanded API access, unified file formats & compression codecs
  • Security – scheduler ACLs

Some items of note about this beta:

This is the first beta for CDH4. We plan to do a second beta some weeks after the first beta. The second beta will roll in updates to Apache Flume, Apache Sqoop, Hue, Apache Oozie and Apache Whirr that did not make the first beta. It will also broaden the platform support back out to our normal release matrix of Red Hat, Centos, Suse, Ubuntu and Debian. Our plan is for this second beta to have the last significant component changes before CDH goes GA.

Some CDH components are getting substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them.

The only better time to be in data mining, information retrieval, data analysis is next week. 😉

February 1, 2012

[HBase] Coprocessor Introduction

Filed under: HBase,HBase Coprocessor — Patrick Durusau @ 4:39 pm

[HBase] Coprocessor Introduction by Trend Micro Hadoop Group: Mingjie Lai, Eugene Koontz and Andrew Purtell.

From the post:

HBase has very effective MapReduce integration for distributed computation over data stored within its tables, but in many cases – for example simple additive or aggregating operations like summing, counting, and the like – pushing the computation up to the server where it can operate on the data directly without communication overheads can give a dramatic performance improvement over HBase’s already good scanning performance.

Also, before 0.92, it was not possible to extend HBase with custom functionality except by extending the base classes. Due to Java’s lack of multiple inheritance this required extension plus base code to be refactored into a single class providing the full implementation, which quickly becomes brittle when considering multiple extensions. Who inherits from whom? Coprocessors allow a much more flexible mixin extension model.

In this article I will introduce the new Coprocessors feature of HBase, a framework for both flexible and generic extension, and of distributed computation directly within the HBase server processes. I will talk about what it is, how it works, and how to develop coprocessor extensions.
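As a taste of the observer half of the framework, here is a rough sketch of a region observer (my own illustration, not code from the article; method signatures shifted between releases, so treat this as approximately the 0.92-era API and check the article and the Javadocs before relying on it):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

    // Runs inside the region server process, so it sees every Put without an extra RPC.
    public class PutCountingObserver extends BaseRegionObserver {

      private final AtomicLong puts = new AtomicLong();

      @Override
      public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        puts.incrementAndGet();   // e.g. keep a per-region write counter
      }
    }

If memory serves, such a class is loaded for every region by listing it in the hbase.coprocessor.region.classes property in hbase-site.xml, or per table through the table descriptor; the article walks through both options.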

If you are using HBase, this looks like a must read article. It also covers how to write extensions to the coprocessor.

I first saw this at myNoSQL.

January 25, 2012

Berlin Buzzwords 2012

Filed under: BigData,Conferences,ElasticSearch,Hadoop,HBase,Lucene,MongoDB,Solr — Patrick Durusau @ 3:24 pm

Berlin Buzzwords 2012

Important Dates (all dates in GMT +2)

Submission deadline: March 11th 2012, 23:59 MEZ
Notification of accepted speakers: April 6th, 2012, MEZ
Publication of final schedule: April 13th, 2012
Conference: June 4/5, 2012

The call:

Call for Submission Berlin Buzzwords 2012 – Search, Store, Scale — June 4 / 5. 2012

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  • NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Large Data Processing – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are more than welcome. We are looking for presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.

…(moved dates to top)…

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Here is your chance to experience summer in Berlin (Berlin Buzzwords 2012) and in Montreal (Balisage).

Seriously, both conferences are very strong and worth your attention.

January 24, 2012

Apache HBase 0.92.0 has been released

Filed under: HBase — Patrick Durusau @ 3:34 pm

Apache HBase 0.92.0 has been released by Jonathan Hsieh

More than 670 issues were addressed in this release but Jonathan highlights nine changes/improvements for your attention:

User Features:

  • HFile v2, a new more efficient storage format
  • Faster recovery via distributed log splitting
  • Lower latency region-server operations via new multi-threaded and asynchronous implementations.

Operator Features:

  • An enhanced web UI that exposes more internal state
  • Improved logging for identifying slow queries
  • Improved corruption detection and repair tools

Developer Features:

  • Coprocessors
  • Build support for Hadoop 0.20.20x, 0.22, 0.23.
  • Experimental: offheap slab cache and online table schema change

January 19, 2012

All Your HBase Are Belong to Clojure

Filed under: Clojure,Hadoop,HBase — Patrick Durusau @ 7:41 pm

All Your HBase Are Belong to Clojure

I’m sure you’ve heard a variation on this story before…

So I have this web crawler and it generates these super-detailed log files, which is great ‘cause then we know what it’s doing but it’s also kind of bad ‘cause when someone wants to know why the crawler did this thing but not that thing I have, like, literally gajigabytes of log files and I’m using grep and awk and, well, it’s not working out. Plus what we really want is a nice web application the client can use.

I’ve never really had a good solution for this. One time I crammed this data into a big Lucene index and slapped a web interface on it. One time I turned the data into JSON and pushed it into CouchDB and slapped a web interface on that. Neither solution left me with a great feeling although both worked okay at the time.

This time I already had a Hadoop cluster up and running, I didn’t have any experience with HBase but it looked interesting. After hunting around the internet, thought this might be the solution I had been seeking. Indeed, loading the data into HBase was fairly straightforward and HBase has been very responsive. I mean, very responsive now that I’ve structured my data in such a way that HBase can be responsive.

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.
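The “structured my data in such a way that HBase can be responsive” remark usually comes down to row key design. A common pattern for time-ordered log data like this (purely my illustration, not something from the post) is a composite key with a reversed timestamp, so the newest entries for one crawler sort first under its prefix:

    import org.apache.hadoop.hbase.util.Bytes;

    public class CrawlLogKeys {
      // <crawler-id>:<Long.MAX_VALUE - timestamp> sorts newest-first within each crawler's prefix.
      public static byte[] rowKey(String crawlerId, long eventTimeMillis) {
        byte[] prefix = Bytes.toBytes(crawlerId + ":");
        byte[] reversedTs = Bytes.toBytes(Long.MAX_VALUE - eventTimeMillis);
        return Bytes.add(prefix, reversedTs);
      }
    }

A Scan whose start row is the crawler's prefix then reads that crawler's entries newest-first (with a stop row to stay inside the prefix), instead of grinding through gajigabytes of unrelated rows.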

An excellent tutorial on Hadoop, HBase and Clojure!

First seen at myNoSQL, but the URL is no longer working in my Google Reader.

January 7, 2012

Caching in HBase: SlabCache

Filed under: Cloud Computing,Hadoop,HBase — Patrick Durusau @ 4:06 pm

Caching in HBase: SlabCache by Li Pi.

From the post:

The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in-memory caching to boost read performance.

However, despite the availability of high memory servers, the garbage collection algorithms available on production quality JDK’s have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency sensitive user applications.

Introduces off-heap caching (the SlabCache) for those with loads and memory to justify enabling it.

Quite interesting work, particularly if you are ignoring the nay-sayers about the adoption of Hadoop and the Cloud in the coming year.

What the nay-sayers are missing is that yes, unimaginative mid-level managers and admins have no interest in Hadoop or the Cloud. What Hadoop and the Cloud present are opportunities that imaginative re-packagers and re-processing startups are going to use to provide new data streams and services.

Can’t ask startups that don’t exist yet why they have chosen to go with Hadoop and the Cloud.

That goes unnoticed by unimaginative commentators who reflect the opinions of uninformed managers, whose opinions are confirmed by the publication of the columns by unimaginative commentators. One of those feedback loops I mentioned earlier today.

January 1, 2012

Gora Graduates!

Filed under: Cassandra,Hadoop,HBase,Hive,Lucene,MapReduce,Pig,Solr — Patrick Durusau @ 5:54 pm

Gora Graduates! (Incubator location)

Over Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!

Congratulations to all involved.

Oh, the project:

What is Gora?

Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.

Why Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to Column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc; SQL databases, such as MySQL, HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

December 30, 2011

Hadoop Hits 1.0!

Filed under: Hadoop,HBase — Patrick Durusau @ 6:13 pm

Hadoop Hits 1.0!

From the news:

After six years of gestation, Hadoop reaches 1.0.0! This release is from the 0.20-security code line, and includes support for:

  • security
  • HBase (append/hsynch/hflush, and security)
  • webhdfs (with full support for security)
  • performance enhanced access to local files for HBase
  • other performance enhancements, bug fixes, and features

Please see the complete Hadoop 1.0.0 Release Notes for details.

With the release prior to this one being 0.22.0, I was reminded of a publication by the Union of Concerned Scientists that had a clock on the cover, showing how close or how far away the world was from a nuclear “midnight.” Always counting towards midnight, except for one or more occasions when more time was added. The explanation I remember was that these were nuclear scientists, not clock experts. 😉

I am sure there will be some explanation for the jump in revisions that will pass into folklore and then into publications about Hadoop.

In the meantime, I would suggest that we all download copies and see what 2012 holds with Hadoop 1.0 under our belts.

December 28, 2011

Apache HBase 0.90.5 is now available

Filed under: Hadoop,HBase — Patrick Durusau @ 9:31 pm

Apache HBase 0.90.5 is now available

From Jonathan Hsieh at Cloudera:

Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers and 11 considered critical. This release addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version includes some new supporting features including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.

The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.

I like the HBase page:

Welcome to Apache HBase!

HBase is the Hadoop database. Think of it as a distributed scalable Big Data store.

When Would I Use HBase?

Use HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Concise, to the point, either you are interested or you are not. Doesn’t waste time on hand wringing about “big data,” “oh, what shall we do?,” or parades of data horrors.

Do you think something similar for topic maps would need an application area approach? That is, to focus on a particular problem deeply rather than on all the possible uses of topic maps?

December 19, 2011

NoSQL Screencast: HBase Schema Design

Filed under: HBase,NoSQL — Patrick Durusau @ 8:11 pm

NoSQL Screencast: HBase Schema Design

From Alex Popescu’s post:

In this O’Reilly webcast, long time HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to model data in HBase for the best possible performance.

You may know George from HBase: The Definitive Guide.

December 11, 2011

Installing HBase over HDFS on a Single Ubuntu Box

Filed under: HBase — Patrick Durusau @ 9:23 pm

Installing HBase over HDFS on a Single Ubuntu Box

From the post:

I faced some issues making HBase run over HDFS on my Ubuntu box. This is an informal step-by-step guide, from setting up HDFS to running HBase on a single Ubuntu machine.
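Once HDFS itself is up, the heart of such a setup is usually just pointing HBase at it in conf/hbase-site.xml, roughly as below. The NameNode port is an assumption on my part (match whatever fs.default.name says in your core-site.xml); the guide above has the authoritative steps:

    <configuration>
      <!-- Store HBase data in the single-node HDFS instead of the local filesystem -->
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
      </property>
      <!-- Pseudo-distributed mode: master, region server and ZooKeeper run as separate processes -->
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
    </configuration>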

I am going to be doing this fairly soon so let me know if this sounds about right. 😉

If I get to it before you do, I will return the favor.

December 1, 2011

Seven Databases in Seven Weeks now in Beta

Filed under: CouchDB,HBase,MongoDB,Neo4j,PostgreSQL,Redis,Riak — Patrick Durusau @ 7:41 pm

Seven Databases in Seven Weeks now in Beta

From the webpage:

Redis, Neo4J, Couch, Mongo, HBase, Riak, and Postgres: with each database, you’ll tackle a real-world data problem that highlights the concepts and features that make it shine. You’ll explore the five data models employed by these databases: relational, key/value, columnar, document, and graph. See which kinds of problems are best suited to each, and when to use them.

You’ll learn how MongoDB and CouchDB, both JavaScript-powered, document-oriented datastores, are strikingly different. Learn about the Dynamo heritage at the heart of Riak and Cassandra. Understand MapReduce and how to use it to solve Big Data problems.

Build clusters of servers using scalable services like Amazon’s Elastic Compute Cloud (EC2). Discover the CAP theorem and its implications for your distributed data. Understand the tradeoffs between consistency and availability, and when you can use them to your advantage. Use multiple databases in concert to create a platform that’s more than the sum of its parts, or find one that meets all your needs at once.

Seven Databases in Seven Weeks will give you a broad understanding of the databases, their strengths and weaknesses, and how to choose the ones that fit your needs.

Now in beta, in non-DRM PDF, epub, and mobi from pragprog.com/book/rwdata.

If you know the Seven Languages in Seven Weeks by Bruce Tate, no further recommendation is necessary for the approach.

I haven’t read the book, yet, but will be getting the electronic beta tonight. More to follow.

November 20, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Filed under: Crunch,Dremel,Dryad,Flume,Giraph,HBase,HDFS,Hive,JDBC,MapReduce,ODBC,Oozie,Pregel — Patrick Durusau @ 4:21 pm

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including to a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions, how to make this faster, more storage, etc., all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

November 8, 2011

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

Filed under: Conferences,Hadoop,HBase,Hive,MySQL,Oracle,Toad — Patrick Durusau @ 7:46 pm

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

From the website:

24 hours of Toad is here! Join us on 11.11.11, and take an around the world journey with Toad and database experts who will share database development and administration best practices. This is your chance to see new products and new features in action, virtually collaborate with other users – and Quest’s own experts, and get a first-hand look at what’s coming in the world of Toad.

If you are not going to see the Immortals on 11.11.11, or are looking for something to do after the movie, drop in on the Toad Virtual Expo! 😉 (It doesn’t look like a “chick” movie anyway.)

Times:

Register today for Quest Software’s 24-hour Toad Virtual Expo and learn why the best just got better.

  1. Tokyo Friday, November 11, 2011 6:00 a.m. JST – Saturday, November 12, 2011 6:00 a.m. JST
  2. Sydney Friday, November 11, 2011 8:00 a.m. EDT – Saturday, November 12, 2011 8:00 a.m. EDT
  3. Tel Aviv Thursday, November 10, 2011 11:00 p.m. IST – Friday, November 11, 2011 11:00 p.m. IST
  4. Central Europe Thursday, November 10, 2011 10:00 p.m. CET – Friday, November 11, 2011 10:00 p.m. CET
  5. London Thursday, November 10, 2011 9:00 p.m. GMT – Friday, November 11, 2011 9:00 p.m. GMT
  6. New York Thursday, November 10, 2011 4:00 p.m. EST – Friday, November 11, 2011 4:00 p.m. EST
  7. Los Angeles Thursday, November 10, 2011 1:00 p.m. PST – Friday, November 11, 2011 1:00 p.m. PST

The site wasn’t long on specifics but this could be fun!

Toad for Cloud Databases (Quest Software)

Filed under: BigData,Cloud Computing,Hadoop,HBase,Hive,MySQL,Oracle,SQL Server — Patrick Durusau @ 7:45 pm

Toad for Cloud Databases (Quest Software)

From the news release:

The data management industry is experiencing more disruption than at any other time in more than 20 years. Technologies around cloud, Hadoop and NoSQL are changing the way people manage and analyze data, but the general lack of skill sets required to manage these new technologies continues to be a significant barrier to mainstream adoption. IT departments are left without a clear understanding of whether development and DBA teams, whose expertise lies with traditional technology platforms, can effectively support these new systems. Toad® for Cloud Databases addresses the skill-set shortage head-on, empowering database professionals to directly apply their existing skills to emerging Big Data systems through an easy-to-use and familiar SQL-based interface for managing non-relational data. 

News Facts:

  • Toad for Cloud Databases is now available as a fully functional, commercial-grade product, for free, at www.quest.com/toad-for-cloud-databases.  Toad for Cloud Databases enables users to generate queries, migrate, browse, and edit data, as well as create reports and tables in a familiar SQL view. By simplifying these tasks, Toad for Cloud Databases opens the door to a wider audience of developers, allowing more IT teams to experience the productivity gains and cost benefits of NoSQL and Big Data.
  • Quest first released Toad for Cloud Databases into beta in June 2010, making the company one of the first to provide a SQL-based database management tool to support emerging, non-relational platforms. Over the past 18 months, Quest has continued to drive innovation for the product, growing its list of supported platforms and integrating a UI for its bi-directional data connector between Oracle and Hadoop.
  • Quest’s connector between Oracle and Hadoop, available within Toad for Cloud Databases, delivers a fast and scalable method for data transfer between Oracle and Hadoop in both directions. The bidirectional characteristic of the utility enables organizations to take advantage of Hadoop’s lower cost of storage and analytical capabilities. Quest also contributed the connector to the Apache Hadoop project as an extension to the existing SQOOP framework, and it is also available as part of Cloudera’s Distribution Including Apache Hadoop.
  • Toad for Cloud Databases today supports:
    • Apache Hive
    • Apache HBase
    • Apache Cassandra
    • MongoDB
    • Amazon SimpleDB
    • Microsoft Azure Table Services
    • Microsoft SQL Azure, and
    • All Open Database Connectivity (ODBC)-enabled relational databases (Oracle, SQL Server, MySQL, DB2, etc)

 

Anything that eases the transition to cloud computing is going to be welcome. Toad being free will increase the ranks of DBAs who will at least experiment on their own.
