Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 18, 2013

Hortonworks Data Platform 1.2 Available Now!

Filed under: Apache Ambari,Hadoop,HBase,Hortonworks,MapReduce — Patrick Durusau @ 7:18 pm

Hortonworks Data Platform 1.2 Available Now! by Kim Rose.

From the post:

Hortonworks Data Platform (HDP) 1.2, the industry’s only complete, 100-percent open source platform powered by Apache Hadoop, is available today. The enterprise-grade Hortonworks Data Platform includes the latest version of Apache Ambari for comprehensive management, monitoring and provisioning of Apache Hadoop clusters. By also introducing additional new capabilities for improving security and ease of use, HDP delivers an enterprise-class distribution of Apache Hadoop that is endorsed and adopted by some of the largest vendors in the IT ecosystem.

Hortonworks continues to drive innovation through a range of Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform. Powered by an Apache open source community, Ambari represents the forefront of innovation in Apache Hadoop management. Built on Apache Hadoop 1.0, the most stable and reliable code available today, HDP 1.2 improves the ease of enterprise adoption for Apache Hadoop with comprehensive management and monitoring, enhanced connectivity to high-performance drivers, and increased enterprise-readiness of Apache HBase, Apache Hive and Apache HCatalog projects.

The Hortonworks Data Platform 1.2 features a number of new enhancements designed to improve the enterprise viability of Apache Hadoop, including:

  • Simplified Hadoop Operations—Using the latest release of Apache Ambari, HDP 1.2 now provides both cluster management and the ability to zoom into cluster usage and performance metrics for jobs and tasks to identify the root cause of performance bottlenecks or operations issues. This enables Hadoop users to identify issues and optimize future job processing.
  • Improved Security and Multi-threaded Query—HDP 1.2 provides an enhanced security architecture and pluggable authentication model that controls access to Hive tables and the metastore. In addition, HDP 1.2 improves scalability by supporting multiple concurrent query connections to Hive from business intelligence tools and Hive clients.
  • Integration with High-performance Drivers Built for Big Data—HDP 1.2 empowers organizations with a trusted and reliable ODBC connector that enables the integration of current systems with high-performance drivers built for big data. The ODBC driver enables integration with reporting or visualization components through a SQL engine built into the driver. Hortonworks has partnered with Simba to deliver a trusted, reliable high-performance ODBC connector that is enterprise ready and completely free.
  • HBase Enhancements—By including and testing HBase 0.94.2, HDP 1.2 delivers important performance and operational improvements for customers building and deploying highly scalable interactive applications using HBase.

There goes the weekend!

January 16, 2013

ANN: HBase 0.94.4 is available for download

Filed under: Hadoop,HBase — Patrick Durusau @ 7:55 pm

ANN: HBase 0.94.4 is available for download by Lars Hofhansl.

Bug fix release with 81 issues resolved plus performance enhancements!

First seen in a tweet by Stack.

December 10, 2012

Apache Gora

Filed under: BigData,Gora,Hadoop,HBase,MapReduce — Patrick Durusau @ 5:26 pm

Apache Gora

From the webpage:

What is Apache Gora?

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Why Apache Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence framework for big data, with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading.
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
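
The roadmap above is abstract, so here is a minimal sketch of what the DataStore API looks like in Java. It assumes a Persistent class (Pageview below, as in the Gora tutorial) generated from an Avro schema with Gora’s compiler, and a backend selected in gora.properties; the exact factory signature varies a little across Gora releases, so treat this as an illustration rather than a tested recipe.

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Key class, generated Persistent class, Hadoop configuration; the backing
    // store (HBase, Cassandra, ...) is chosen in gora.properties, not in code.
    DataStore<Long, Pageview> store =
        DataStoreFactory.getDataStore(Long.class, Pageview.class, conf);

    Pageview view = new Pageview();  // Pageview: a Persistent bean generated by the Gora compiler (assumed here)
    // ... set fields on the generated bean ...
    store.put(42L, view);            // same API regardless of the backing store
    store.flush();

    Pageview fetched = store.get(42L);
    store.close();
  }
}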

When writing about the Nutch 2.X development path, I discovered my omission of Gora from this blog. Apologies for having overlooked it until now.

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Filed under: Gora,HBase,Nutch,Search Engines,Solr — Patrick Durusau @ 10:45 am

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs, this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in Elasticsearch. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison, but roughly: Nutch 1.x keeps its crawl data in Hadoop segments and uses Solr for indexing, while Nutch 2.x stores crawl data through Gora, typically backed by HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.

November 29, 2012

Streaming data into Apache HBase using Apache Flume

Filed under: Flume,HBase — Patrick Durusau @ 2:37 pm

Streaming data into Apache HBase using Apache Flume

From the post:

Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.
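
The post walks through the details; as a rough orientation only, an agent configuration for the synchronous HBase sink looks something like the sketch below. The agent, table and column family names are placeholders (the table and family must already exist in HBase), and the exact property set should be checked against the Flume User Guide for your version.

agent.sources = nc-src
agent.channels = mem-ch
agent.sinks = hbase-sink

# A trivial source for experimentation: netcat on localhost:44444.
agent.sources.nc-src.type = netcat
agent.sources.nc-src.bind = localhost
agent.sources.nc-src.port = 44444
agent.sources.nc-src.channels = mem-ch

agent.channels.mem-ch.type = memory

# Synchronous HBase sink; events land in table 'flume_events', column family 'cf'.
agent.sinks.hbase-sink.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.hbase-sink.table = flume_events
agent.sinks.hbase-sink.columnFamily = cf
agent.sinks.hbase-sink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbase-sink.channel = mem-ch

The custom serializers the post describes replace SimpleHbaseEventSerializer with your own implementation, so you control how each event maps onto rows and columns.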

Streaming data is great, but being able to capture it when needed is even better!

November 14, 2012

Kiji Project [Framework for HBase]

Filed under: Entities,Hadoop,HBase,Kiji Project — Patrick Durusau @ 1:22 pm

Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase by Aaron Kimball.

From the post:

Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.

While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.

Today, we are proud to announce the launch of the Kiji project (www.kiji.org), as well as the first Kiji component: KijiSchema. The Kiji project was developed to host a suite of open source components built on top of Apache HBase and Apache Hadoop that make it easier for developers to:

  1. Use HBase as a real-time data storage and serving layer for applications
  2. Maximize HBase performance using data management best practices
  3. Get started building data applications quickly with easy startup and configuration

Kiji is open source and licensed under the Apache 2.0 license. The Kiji project is modularized into separate components to simplify adoption and encourage clean separation of functionality. Our approach emphasizes interoperability with other systems, leveraging the open source HBase, Avro and MapReduce projects, enabling you to easily fit Kiji into your development process and applications.

KijiSchema: Schema Management for HBase

The first component within the Kiji project is KijiSchema, which provides layout and schema management on top of HBase. KijiSchema gives developers the ability to easily store both structured and unstructured data within HBase using Avro serialization. It supports a variety of rich schema features, including complex, compound data types, HBase column key and time-series indexing, as well as cell-level evolving schemas that dynamically encode version information.

KijiSchema promotes the use of entity-centric data modeling, where all information about a given entity (user, mobile device, ad, product, etc.), including dimensional and transaction data, is encoded within the same row. This approach is particularly valuable for user-based analytics such as targeting, recommendations, and personalization.

This looks important!

Reading further about their “entity-centric” approach:

Entity-Centric Data Model

KijiSchema’s data model is entity-centric. Each row typically holds information about a single entity in your information scheme. As an example, a consumer e-commerce web site may have a row representing each user of their site. The entity-centric data model enables easier analysis of individual entities. For example, to recommend products to a user, information such as the user’s past purchases, previously viewed items, search queries, etc. all need to be brought together. The entity-centric model stores all of these attributes of the user in the same row, allowing for efficient access to relevant information.

The entity-centric data model stands in comparison to a more typical log-based approach to data collection. Many MapReduce systems import log files for analysis. Logs are action-centric; each action performed by a user (adding an item to a shopping cart, checking out, performing a search, viewing a product) generates a new log entry. Collecting all the data required for a per-user analysis thus requires a scan of many logs. The entity-centric model is a “pivoted” form of this same information. By pivoting the information as the data is loaded into KijiSchema, later analysis can be run more efficiently, either in a MapReduce job operating over all users, or in a more narrowly-targeted fashion if individual rows require further computation.
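
Setting the Kiji API itself aside, the raw HBase calls below sketch what “everything about one entity in one row” means in practice. This is plain HBase 0.92-era client code, not KijiSchema; the table and column names are invented for the illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityCentricRow {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");

    // Everything about user42 lives in one row: profile data and behavioral data side by side.
    Put p = new Put(Bytes.toBytes("user42"));
    p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("user42@example.com"));
    p.add(Bytes.toBytes("actions"), Bytes.toBytes("last_search"), Bytes.toBytes("hbase schema design"));
    p.add(Bytes.toBytes("actions"), Bytes.toBytes("cart_add"), Bytes.toBytes("sku-1234"));
    table.put(p);

    // A recommendation job reads one row instead of scanning many log files.
    Result user = table.get(new Get(Bytes.toBytes("user42")));
    table.close();
  }
}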

I’m already convinced about a single representative for an entity. 😉

Need to work through the documentation on capturing diverse information about a single entity in one row.

I suspect that the structures that capture data aren’t entities for purposes of this model.

Still, it will be an interesting exploration.

October 22, 2012

HBase Futures

Filed under: Hadoop,HBase,Hortonworks,Semantics — Patrick Durusau @ 2:28 pm

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

HBase at Hortonworks: An Update [Features, Consumer Side?]

Filed under: Hadoop,HBase,Hortonworks — Patrick Durusau @ 3:37 am

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low latency Hadoop use-cases; as a publishing platform, HBase exposes data refined in Hadoop to outside systems; as an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for lattes on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?

October 21, 2012

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Filed under: Hadoop,HBase,Systems Administration — Patrick Durusau @ 2:22 pm

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1, with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that HBase would be functional during the time the NameNode would be down. It’d only affect those operations that require a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller time but résumé time.

October 11, 2012

Hadoop/HBase Cluster ~ 1 Hour/$10 (What do you have to lose?)

Filed under: Hadoop,HBase — Patrick Durusau @ 3:08 pm

Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour by George London.

From the post:

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Hadoop/HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to be running our cluster on Amazon EC2, and launching the cluster using Apache Whirr and configuring it using Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on Github that will parse data and load it into Apache HBase.

All together, this tutorial will take a bit over one hour and cost about $10 in server costs.

This is the sort of tutorial that I long to write for topic maps.

There is a longer version of this tutorial here.

October 3, 2012

CDH4.1 Now Released!

Filed under: Cloudera,Flume,Hadoop,HBase,HDFS,Hive,Pig — Patrick Durusau @ 8:28 pm

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink, metrics for monitoring, and a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.

Enjoy!

October 2, 2012

HAcid: A lightweight transaction system for HBase

Filed under: HBase — Patrick Durusau @ 4:22 pm

HAcid: A lightweight transaction system for HBase

From the post:

HAcid is a client library that applications can use for operating multi-row transactions in HBase. Seems to be motivated by Google’s Percolator.

Link to the original paper

September 26, 2012

Understanding User Authentication and Authorization in Apache HBase

Filed under: Accumulo,HBase,Security — Patrick Durusau @ 12:57 pm

Understanding User Authentication and Authorization in Apache HBase by Matteo Bertozzi.

From the post:

With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.

Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.

In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.
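
For orientation only, the server-side switches involved look roughly like the hbase-site.xml fragment below (HBase 0.92+). It assumes Kerberos principals, keytabs and the matching secure Hadoop configuration are already in place, which is where most of the real work lives; see the post and the HBase security documentation for the full setup.

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

With authorization on, table-level permissions are then handed out from the HBase shell, e.g. grant 'bob', 'RW', 'sales' (user and table names invented here).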

When you think about security, remember: Accumulo: Why The World Needs Another NoSQL Database. Accumulo was written to provide cell level security.

Nice idea, but the burden of administering cell level authorizations is going to lead to sloppy security practices, or to granting higher level permissions, inadvisedly, to some users.

Not to mention the truck-sized security hole in Accumulo for imported data changing access tokens.

You can get a lot of security mileage out of HBase and Kerberos, long before you get to cell level security permissions.

August 27, 2012

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra

Filed under: Hadoop,HBase,JRuby,Pig — Patrick Durusau @ 2:01 pm

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.

From the post:

Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.

When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.

August 16, 2012

HBase Replication: Operational Overview

Filed under: Hadoop,HBase — Patrick Durusau @ 2:16 pm

HBase Replication: Operational Overview by Himanshu Vashishtha

From the post:

This is the second blogpost about HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.
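
As a rough sketch of the moving parts the post walks through (check it for the full procedure and version caveats): replication is switched on in hbase-site.xml on both clusters, a peer is registered on the master cluster, and the column families to ship are marked with a replication scope. Cluster, table and peer names below are placeholders.

<!-- hbase-site.xml, on both the master and slave clusters -->
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>

# HBase shell, on the master cluster
add_peer '1', 'slave-zk1,slave-zk2,slave-zk3:2181:/hbase'
disable 'my_table'
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'my_table'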

The sort of post that makes you long for one or more mini-clusters. 😉

August 11, 2012

HBase FuzzyRowFilter: Alternative to Secondary Indexes

Filed under: HBase — Patrick Durusau @ 3:43 pm

HBase FuzzyRowFilter: Alternative to Secondary Indexes by Alex Baranau.

From the post:

In this post we’ll explain the usage of FuzzyRowFilter, which can help in many situations where secondary index solutions seem to be the only choice to avoid full table scans.

Background

When it comes to HBase, the way you design your row key affects everything. It is a common pattern to have a composite row key which consists of several parts, e.g. userId_actionId_timestamp. This allows for fast fetching of rows (or a single row) based on start/stop row keys, which have to be a prefix of the row keys you want to select. E.g. one may select the last time userX logged in by specifying the row key prefix “userX_login_”. Or the last action of userX by fetching the first row with prefix “userX_”. These partial row key scans work very fast and do not require scanning the whole table: HBase storage is optimized to make them fast.

Problem

However, there are cases when you need to fetch data based on key parts which happen to be in the middle of the row key. In the example above you may want to find the last logged in users. When you don’t know the first parts of the key, a partial row key scan turns into a full table scan, which might be very slow and resource intensive.

Although Alex notes the solution he presents is no “silver bullet,” it illustrates:

  • The impact of key design on later usage.
  • The importance of knowing all your options for query performance.
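
For the record, a minimal sketch of how the filter is wired into a scan, assuming the userId_actionId_timestamp layout above with a fixed-width 4-byte userId. FuzzyRowFilter ships with HBase 0.94.1 and later; the table name and key layout here are illustrative only.

import java.util.Arrays;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "user_actions");

    // "Any userId, action = login": 1 in the mask means "this byte may be anything",
    // 0 means "this byte must match the corresponding byte of the fuzzy key".
    byte[] fuzzyKey  = Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_");
    byte[] fuzzyMask = new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0};

    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, fuzzyMask))));

    ResultScanner results = table.getScanner(scan);
    // ... iterate over results ...
    results.close();
    table.close();
  }
}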

I would capture the availability of the “FuzzyRowFilter,” the key structure and the cardinality of the data using a topic map. Saves the next HBase administrator time and effort.

True, they can always work out the details for themselves, but then they may not have your analytical skills.

August 6, 2012

Apache HBase (DZone Refcard)

Filed under: HBase — Patrick Durusau @ 8:50 am

Apache HBase (DZone Refcard) by Otis Gospodnetic and Alex Baranau.

From the webpage:

The Essential HBase Cheat Sheet

HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. Just as Google Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

If you aren’t familiar with the DZone Refcards, they are a valuable resource.

July 30, 2012

HBase Replication Overview

Filed under: Cloudera,HBase — Patrick Durusau @ 3:22 pm

HBase Replication Overview by Himanshu Vashishtha.

From the post:

HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.

This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.

Use cases

HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.

Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.

So there is a non-romantic, sysadmin side to “big data.” I understand, no one ever even speaks unless something has gone wrong with the system. Sysadmins either get no contacts (a good thing) or pages, tweets, emails, phone calls and physical visits from irate users, managers, etc.

This post is a start towards always having the first case, no contacts. Leaves you more time for things that interest sysadmins. I won’t tell if you don’t.

July 18, 2012

U.S. Senate vs. Apache Accumulo: Whose side are you on?

Filed under: Accumulo,Cassandra,HBase — Patrick Durusau @ 10:47 am

Jack Park sent a link to NSA Mimics Google, Pisses Off Senate this morning. If you are unfamiliar with the software, see: Apache Accumulo.

Long story made short:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

At issue is:

The bill indicates that Accumulo may violate OMB Circular A-130, a government policy that bars agencies from building software if it’s less expensive to use commercial software that’s already available. And according to one congressional staffer who worked on the bill, this is indeed the case. He asked that his name not be used in this story, as he’s not authorized to speak with the press.

On its face, OMB Circular A-130 sounds like a good idea. Don’t build your own if it is cheaper to buy commercial.

But here the Senate is trying to play favorites.

I have a suggestion: Let’s disappoint them.

Let’s contribute to all three projects:

Apache Accumulo

Apache Cassandra

Apache HBase

Would you look at that! All three of these projects are based at Apache!

Let’s push all three projects forward in terms of working on releases, documentation, testing, etc.

But more than that, let’s build applications based on all three projects that analyze political contributions, patterns of voting, land transfers, stock purchases, virtually every fact than can be known about members of the Senate and the Senate Armed Services Committee in particular.

They are accustomed to living in a gold fish bowl.

Let’s move them into a frying pan.

PS: +1 if the NSA is ordered to contribute to open source projects, if the projects are interested. Direction from the U.S. Senate is not part of the Apache governance model.

July 17, 2012

Configuring HBase Memstore: What You Should Know

Filed under: HBase — Patrick Durusau @ 4:30 pm

Configuring HBase Memstore: What You Should Know by Alex Baranau.

Alex gives three good reasons to care about HBase Memstore:

There are a number of reasons HBase users and/or administrators should be aware of what Memstore is and how it is used:

  • There are a number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system.
  • The way Memstore flushes work may affect your schema design.

Which reason is yours?
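
Whatever your reason, the knobs Alex discusses live in hbase-site.xml. The values below are roughly the 0.92/0.94-era defaults, shown only to make the three settings concrete; they are not recommendations for your cluster.

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
  <!-- flush a region's memstore to an HFile once it reaches 128MB -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
  <!-- block updates and force flushes when all memstores together exceed 40% of heap -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
  <!-- keep flushing until memstore usage drops back below 35% of heap -->
</property>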

July 13, 2012

HBase Log Splitting

Filed under: HBase — Patrick Durusau @ 4:53 pm

HBase Log Splitting by Jimmy Xiang.

When I was being certified in NetWare many years ago, the #1 reason for dismissal of sysadmins was failure to maintain backups. I don’t know what the numbers are like now but suspect it is probably about the same. But in any event, you are less likely to be fired if you don’t lose data when your servers fail. That’s just a no-brainer.

If you are running HBase, take the time to review this post and make sure the job that gets lost isn’t yours.

From the post:

In the recent blog post about the HBase Write Path, we talked about the write-ahead-log (WAL), which plays an important role in preventing data loss should an HBase region server failure occur. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.

A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.

Log splitting is done by HMaster as the cluster starts or by ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, affected regions are unavailable until data is restored. So we need to recover and replay all WAL edits before letting those regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.

July 10, 2012

Intro to HBase Internals and Schema Design

Filed under: HBase,NoSQL,Schema — Patrick Durusau @ 9:21 am

Intro to HBase Internals and Schema Design by Alex Baranau.

You will be disappointed by the slide that reads:

HBase will not adjust cluster settings to optimal based on usage patterns automatically.

Sorry, but we just haven’t quite reached drag-n-drop software that optimizes to arbitrary data without user intervention.

Not sure we could keep that secret from management very long in any case so perhaps all for the best.

Once you get over your chagrin at having to still work a little, you will find Alex’s presentation a high-level peek at the internals of HBase. It should be enough to get you motivated to learn more on your own. Not guaranteeing that, but that should be the average result.

Intro to HBase [Augmented Marketing Opportunities]

Filed under: HBase,NoSQL — Patrick Durusau @ 8:37 am

Intro to HBase by Alex Baranau.

Slides from a presentation Alex gave at an HBase meetup in New York City.

Fairly high level overview but one of the better ones.

Should leave you with a good orientation to HBase and its capabilities.

Just in case you are looking for a project, ;-), it would be interesting to link from a slide deck like this one into the tutorials and documentation for the product.

Thinking of the old “hub” document concept from HyTime so you would not have to hard code links in the source but could update them as newer material comes along.

Just in case you need some encouragement, think of every slide deck as an augmented marketing opportunity. Where you are leveraging not only the presentation but the other documentation and materials created by your group.

July 1, 2012

HBase I/O – HFile

Filed under: Hadoop,HBase,HFile — Patrick Durusau @ 4:46 pm

HBase I/O – HFile by Matteo Bertozzi.

From the post:

Introduction

Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access.

Wait wait? random, realtime read/write access?

How is that possible? Is not Hadoop just a sequential read/write, batch processing system?

Yes, we’re talking about the same thing, and in the next few paragraphs, I’m going to explain to you how HBase achieves the random I/O, how it stores data and the evolution of the HBase’s HFile format.

Hadoop I/O file formats

Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs, but due to the HDFS append-only capability, the file format does not allow modification or removal of an inserted value. The only operation allowed is append, and if you want to look up a specified key, you have to read through the file until you find your key.

As you can see, you’re forced to follow the sequential read/write pattern… but how is it possible to build a random, low-latency read/write access system like HBase on top of this?

To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index. This allows for quite a fast lookup, since instead of scanning all the records you scan the index, which has fewer entries. Once you’ve found your block, you can then jump into the real data file.
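
A tiny sketch of the MapFile API being described, using Hadoop 1.x classes (the constructor signatures changed in later Hadoop releases); the path is arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/tmp/example-mapfile";   // becomes a directory holding /data and /index

    // Keys must be appended in sorted order.
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
    writer.append(new Text("row-001"), new Text("first value"));
    writer.append(new Text("row-002"), new Text("second value"));
    writer.close();

    // Lookup: binary-search the small index, then seek into the data file.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new Text("row-002"), value);
    reader.close();
  }
}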

A couple of important lessons:

First, file formats evolve. They shouldn’t be entombed by programming code, no matter how clever your code may be. That is what “versions” are for.

Second, the rapid evolution of the Hadoop ecosystem makes boundary observations strictly temporary. Wait a week or so, they will change!

Cascading map-side joins over HBase for scalable join processing

Filed under: HBase,Joins,Linked Data,LOD,MapReduce,RDF,SPARQL — Patrick Durusau @ 4:45 pm

Cascading map-side joins over HBase for scalable join processing by Martin Przyjaciel-Zablocki, Alexander Schätzle, Thomas Hornung, Christopher Dorner, and Georg Lausen.

Abstract:

One of the major challenges in large-scale data processing with MapReduce is the smart computation of joins. Since Semantic Web datasets published in RDF have increased rapidly over the last few years, scalable join techniques become an important issue for SPARQL query processing as well. In this paper, we introduce the Map-Side Index Nested Loop Join (MAPSIN join) which combines scalable indexing capabilities of NoSQL storage systems like HBase, that suffer from an insufficient distributed processing layer, with MapReduce, which in turn does not provide appropriate storage structures for efficient large-scale join processing. While retaining the flexibility of commonly used reduce-side joins, we leverage the effectiveness of map-side joins without any changes to the underlying framework. We demonstrate the significant benefits of MAPSIN joins for the processing of SPARQL basic graph patterns on large RDF datasets by an evaluation with the LUBM and SP2Bench benchmarks. For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.

Some topic map applications include Linked Data/RDF processing capabilities.

The salient comment here being: “For most queries, MAPSIN join based query execution outperforms reduce-side join based execution by an order of magnitude.”

June 21, 2012

Hortonworks Data Platform v1.0 Download Now Available

Filed under: Hadoop,HBase,HDFS,Hive,MapReduce,Oozie,Pig,Sqoop,Zookeeper — Patrick Durusau @ 3:36 pm

Hortonworks Data Platform v1.0 Download Now Available

From the post:

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

I’m a member of the general public! And you probably are too! 😉

See the rest of the post for more goodies that are included with this release.

June 20, 2012

HBase Write Path

Filed under: Hadoop,HBase,HDFS — Patrick Durusau @ 4:41 pm

HBase Write Path by Jimmy Xiang.

From the post:

Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.

The write path is how HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore, understanding the write path can provide insight into HBase’s native data loss prevention mechanism.
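
One client-visible corner of the write path is the WAL flag on a Put. A small sketch using the HBase 0.92/0.94 API (later releases replaced setWriteToWAL with setDurability); table and column names are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteAheadLogSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo");

    Put durable = new Put(Bytes.toBytes("row1"));
    durable.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
    table.put(durable);   // appended to the WAL first, then the memstore: survives a region server crash

    Put risky = new Put(Bytes.toBytes("row2"));
    risky.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v2"));
    risky.setWriteToWAL(false);   // skips the WAL: faster, but lost if the server dies before a flush
    table.put(risky);

    table.close();
  }
}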

Whether you intend to use Hadoop for topic map processing or not, this will be a good introduction to updating data in HBase. Not all applications using Hadoop are topic map applications, so this may serve you in other contexts as well.

June 14, 2012

Ontopia 5.2.1 and HBase [Port 8080 Conflicts]

Filed under: HBase,Ontopia — Patrick Durusau @ 12:58 pm

I have HBase 0.92.1 and Ontopia 5.2.1 installed on the same Ubuntu box. (An obvious “gotcha” and its correction follows.)

The Catalina server will not start with the following error message in {$basedir}/apache-tomcat/logs/catalina.(date).log:

SEVERE: Error initializing endpoint
java.net.BindException: Address already in use <null>:8080

To discover what is using port 8080, use:

sudo lsof -i :8080

Result (partial):

COMMAND PID USER
java 5314 hbase

You can either reset the port on HBase or open up {$basedir}/apache-tomcat/conf/server.xml and change:

<Connector port="8080" protocol="HTTP/1.1"

to:

<!-- Redefined port to avoid conflict with HBase -->
<Connector port="8090" protocol="HTTP/1.1"

The comment about avoiding conflict with HBase isn’t necessary but is good practice.

(Cloudera has a note about the 8080 issue in its HBase Installation documentation.)
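
If you would rather move HBase off port 8080 instead of moving Tomcat, and the listener really is the HBase REST gateway (as in the Cloudera note), its port can be changed in hbase-site.xml; 8085 below is an arbitrary choice.

<property>
  <name>hbase.rest.port</name>
  <value>8085</value>
</property>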

June 11, 2012

HBase Con2012

Filed under: Conferences,HBase — Patrick Durusau @ 4:26 pm

HBase Con2012

Slides and in some cases videos from HBase Con2012.

Mostly slides at this point; I counted six (6) videos as of June 11, 2012, in the late afternoon on the East Coast.

Will keep checking back as there is a lot of good content.

Real-time Analytics with HBase [Aggregation is a form of merging.]

Filed under: Aggregation,Analytics,HBase — Patrick Durusau @ 4:21 pm

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.
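
A concrete miniature: HBase counters pre-aggregate at write time, which is exactly where the merging (and the information loss) happens. The rolled-up number survives; the individual events behind it do not. Plain HBase client code, with table and column names invented for the example.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class PreAggregationSketch {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "metrics");

    // Each page view bumps a per-user, per-day counter as it arrives.
    // Which page, from where, with what referrer: all of that is merged away.
    table.incrementColumnValue(
        Bytes.toBytes("user42_2012-06-11"),   // row: entity + day
        Bytes.toBytes("counts"),
        Bytes.toBytes("pageviews"),
        1L);

    table.close();
  }
}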

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)
