Archive for the ‘Systems Administration’ Category

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Sunday, October 21st, 2012

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that HBase would be functional during the time NameNode would be down. It’d only affect those operations that require a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller but résumé time.

Big Data Security Part One: Introducing PacketPig

Thursday, October 11th, 2012

Big Data Security Part One: Introducing PacketPig by Michael Baker.

From the post:

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

If you are a bit rusty on packets, TCP/IP, I could just wave my hands and say: “See the various tutorials.” and off you go to hunt something down.

Let me be more helpful than that and suggest: TCP/IP Tutorial and Technical Overview from the IBM RedBooks we were talking about earlier.

It’s not short (almost a thousand pages), but then it isn’t W. Richard Stevens (in three volumes) either. ;-)

You won’t need all of either resource but it is better to start with too much than too little.

What is Hadoop Metrics2?

Wednesday, October 10th, 2012

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? ;-)

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.
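To make "multiple output plugins in parallel" and "metrics filtering" concrete, here is a minimal sketch of the idea in Python. It is not the Hadoop Metrics2 API: the class names (`ListSink`, `MetricsSystem`) and the metric names are made up for illustration, and real sinks would write to files, JMX, or a monitoring system rather than a list.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class ListSink:
    """Hypothetical sink that keeps the records it accepts in memory."""

    def __init__(self, accept: Callable[[str], bool] = lambda name: True):
        self.accept = accept          # per-sink metric filter
        self.records: List[Record] = []

    def put(self, record: Record) -> None:
        # Keep only the metrics this sink's filter lets through.
        filtered = {k: v for k, v in record.items() if self.accept(k)}
        if filtered:
            self.records.append(filtered)


class MetricsSystem:
    """Hypothetical registry that fans each record out to every sink."""

    def __init__(self) -> None:
        self.sinks: Dict[str, ListSink] = {}

    def register(self, name: str, sink: ListSink) -> None:
        self.sinks[name] = sink

    def unregister(self, name: str) -> None:
        # Sinks can be removed while the system keeps running.
        self.sinks.pop(name, None)

    def publish(self, record: Record) -> None:
        for sink in list(self.sinks.values()):
            sink.put(record)


ms = MetricsSystem()
all_sink = ListSink()
reads_only = ListSink(lambda name: name.startswith("read"))
ms.register("all", all_sink)
ms.register("reads", reads_only)
ms.publish({"blocks_replicated": 3, "read_requests": 10})
```

After the `publish` call, `all_sink` holds both metrics while `reads_only` holds only `read_requests`. Unregistering a sink at runtime is a rough stand-in for the dynamic reconfiguration the post mentions.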

However cool the software, you can’t ever really get away from managing it.

And it isn’t a bad skill to have. Read on!

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.
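If "publish-subscribe" is fuzzy for you, a toy version may help. The sketch below is an in-memory, append-only log where each subscriber reads at its own pace. It is only an illustration of the general pattern, not Kafka's API or its implementation; `TopicLog` and the subscriber names are invented for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List


class TopicLog:
    """Hypothetical append-only message log with one read position per subscriber."""

    def __init__(self) -> None:
        self._messages: List[Any] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: Any) -> int:
        """Append a message and return its position in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber: str, max_messages: int = 10) -> List[Any]:
        """Return the next unread messages for this subscriber and advance its position."""
        start = self._offsets[subscriber]
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


log = TopicLog()
for event in ("page_view", "search", "profile_click"):
    log.publish(event)

first_batch = log.poll("search-indexer", max_messages=2)
analytics_batch = log.poll("analytics")
```

Because messages are written once and each subscriber only tracks a position, adding another subscribing system does not add another write. That is one way to see how total deliveries can be several times the number of writes, as in the figures quoted in the abstract.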

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.

Groundhog: Hadoop Fork Testing

Thursday, August 9th, 2012

Groundhog: Hadoop Fork Testing by Anupam Seth.

From the post:

Hadoop is widely used at Yahoo! to do all kinds of processing. It is used for everything from counting ad clicks to optimizing what is shown on the front page for each individual user. Deploying a major release of Hadoop to all 40,000+ nodes at Yahoo! is a long and painful process that impacts all users of Hadoop. It involves doing a staged rollout onto different clusters of increasing importance (e.g. QA, sandbox, research, production) and asking all teams that use Hadoop to verify that their applications work with this new version. This is to harden the new release before it is deployed on clusters that directly impact revenue, but it comes at the expense of the users of these clusters because they have to share the pain of stabilizing a newer version. Further, this process can take over 6 months. Waiting 6 months to get a new feature, which users have asked for, onto a production system is way too long. It stifles innovation both for Hadoop and for the code running on Hadoop. Other software systems avoid these problems by more closely following continuous integration techniques.

Groundhog is an automated testing tool to help ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before deploying a new release onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they are run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results. The test cluster can take inevitable downtime and still help ensure that the latest version of Hadoop has not introduced any new regressions. It is called groundhog because that way Hadoop can relive a pig script over and over again until it gets it right, like the movie Groundhog Day. There is similarity in concept to traditional fork/T testing in that jobs are duplicated and run at another location. However, Hadoop fork testing differs in that the testing will not occur in real-time but instead the original job with all needed inputs and outputs will be captured and archived. Then at any later date, the archived job can be re-run.

The main idea is to reduce the deployment cycle of a new Hadoop release by making it easier to get user oriented testing started sooner and at a larger scope. Specifically, get testing running to quickly discover regressions and backwards incompatibility issues. Past efforts to bring up a test cluster and have Hadoop users run their jobs on the test cluster have been less successful than desired. Therefore, fork testing is a method for reducing the human effort needed to get user oriented testing run against a Hadoop cluster. Additionally, if the level of effort to capture and run tests is reduced, then testing can be performed more often and experiments can also be run. All of this must happen while following data governance policies though.

Thus, Fork testing is a form of end to end testing. If there was a complete suite of end to end tests for Hadoop, the need for fork testing might not exist. Alas, the end to end suite does not exist and creating fork testing is deemed a faster path to achieving the testing goal.

Groundhog currently is limited to work only with pig jobs. The majority of user jobs run on Hadoop at Yahoo! are written in pig. This is what allows Groundhog to nevertheless have a good sampling of production jobs.
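The capture-and-replay idea in the post is simple enough to sketch in a few lines of Python. This is not Groundhog's code: `CapturedJob`, `capture`, and `replay` are hypothetical names, and the "jobs" are plain functions standing in for pig scripts run on different Hadoop versions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Job = Callable[[List[int]], List[int]]


@dataclass(frozen=True)
class CapturedJob:
    """Archived inputs and outputs of one job run (hypothetical structure)."""
    name: str
    inputs: Tuple[int, ...]
    expected_output: Tuple[int, ...]


def capture(name: str, run_job: Job, inputs: Sequence[int]) -> CapturedJob:
    """Run the job once on the current version and archive what went in and what came out."""
    output = run_job(list(inputs))
    return CapturedJob(name, tuple(inputs), tuple(output))


def replay(job: CapturedJob, run_job: Job) -> bool:
    """Re-run the archived job on another version and compare with the archived output."""
    return tuple(run_job(list(job.inputs))) == job.expected_output


def current_version(rows: List[int]) -> List[int]:
    return sorted(rows)


def candidate_ok(rows: List[int]) -> List[int]:
    return sorted(rows, key=lambda r: r)


def candidate_regressed(rows: List[int]) -> List[int]:
    return sorted(set(rows))  # silently drops duplicates: a regression


archived = capture("sort_clicks", current_version, [3, 1, 3, 2])
```

Here `replay(archived, candidate_ok)` passes and `replay(archived, candidate_regressed)` fails, and neither needs the original cluster to be available. The replay can happen at any later date, which is the difference from real-time fork testing that the post describes.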

This is way cool!

Discovering problems, even errors, before they show up in live installations is always a good thing.

When you make changes to merging rules, how do you test the impact on your topic maps?

I first saw this at: Alex Popescu’s myNoSQL under Groundhog: Hadoop Automated Testing at Yahoo!

…Creating Reliable Billion Page View Web Services

Thursday, August 9th, 2012

High Scalability reports in 3 Tips and Tools for Creating Reliable Billion Page View Web Services an article by Amir Salihefendic that suggests:

  • Realtime monitor everything
  • Be proactive
  • Be notified when crashes happen

Those are three tips to follow on the hunt for a reliable billion page view web service.

I’m a few short of that number but it was still an interesting post. ;-)

And you never can tell, I might snag a client that is more likely to reach those numbers.

Announcing Scalable Performance Monitoring (SPM) for JVM

Tuesday, August 7th, 2012

Announcing Scalable Performance Monitoring (SPM) for JVM (Sematext)

From the post:

Up until now, SPM existed in several flavors for monitoring Solr, HBase, ElasticSearch, and Sensei. Besides metrics specific to a particular system type, all these SPM flavors also monitor OS and JVM statistics. But what if you want to monitor any Java application? Say your custom Java application run either in some container, application server, or from a command line? You don’t really want to be forced to look at blank graphs that are really meant for stats from one of the above mentioned systems. This was one of our own itches, and we figured we were not the only ones craving to scratch that itch, so we put together a flavor of SPM for monitoring just the JVM and (Operating) System metrics.

Now SPM lets you monitor OS and JVM performance metrics of any Java process through the following 5 reports, along with all other SPM functionality like integrated Alerts, email Subscriptions, etc. If you are one of many existing SPM users these graphs should look very familiar.

JVM monitoring isn’t like radio station management where you can listen for dead air. It’s a bit more complicated than that.

SPM may help with it.

Beyond the JVM and OS, how do you handle monitoring of topic map applications?

Chaos Monkey released into the wild

Monday, July 30th, 2012

Chaos Monkey released into the wild by Cory Bennett and Ariel Tseitlin

From the post:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.

We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.

Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
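The scheduling rule quoted above (non-holiday weekdays between 9am and 3pm) is easy to express in code. The Python sketch below is mine, not Netflix's: `in_window` and `pick_victims` are hypothetical names, nothing here talks to AWS, and picking exactly one instance per group is my assumption, since the post only says instances are terminated per group.

```python
import random
from datetime import date, datetime
from typing import Dict, List, Set


def in_window(now: datetime, holidays: Set[date]) -> bool:
    """True on non-holiday weekdays from 09:00 up to, but not including, 15:00."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and now.date() not in holidays and 9 <= now.hour < 15


def pick_victims(groups: Dict[str, List[str]], rng: random.Random) -> Dict[str, str]:
    """Choose one instance id per non-empty group (assumption: one per group)."""
    return {
        name: rng.choice(instances)
        for name, instances in sorted(groups.items())
        if instances
    }


groups = {"web": ["i-1", "i-2"], "batch": ["i-9"], "empty": []}
now = datetime(2012, 7, 30, 10, 0)  # a Monday morning
victims = pick_victims(groups, random.Random(0)) if in_window(now, set()) else {}
```

Passing the random generator and the holiday set in as arguments keeps the sketch testable, and a real run would replace the returned dictionary with actual terminate calls against your cloud provider.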

At first I was unsure if Netflix is hopeful its competitors will run Chaos Monkey or if they really run it internally. ;-)

It certainly is a way to test your infrastructure. And quite possibly a selling point to clients who want more than projected or historical robustness.

Makes me curious, allowing for different infrastructures, how would you stress test a topic map installation?

And do so on a regular basis?

I first saw this at Alex Popescu’s myNoSQL.

Puppet

Saturday, June 9th, 2012

Puppet

From “What is Puppet?”:

Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to patch management and compliance. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.

Puppet is available as both open source and commercial software. You can see the differences here and decide which is right for your organization.

How Puppet Works

Puppet uses a declarative, model-based approach to IT automation.

  1. Define the desired state of the infrastructure’s configuration using Puppet’s declarative configuration language.
  2. Simulate configuration changes before enforcing them.
  3. Enforce the deployed desired state automatically, correcting any configuration drift.
  4. Report on the differences between actual and desired states and any changes made enforcing the desired state.
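The four steps above (define, simulate, enforce, report) can be sketched in Python. This is an illustration of the declarative idea only, not Puppet's configuration language or its internals; the `plan` and `enforce` functions, the `noop` parameter, and the resource names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, str]
Change = Tuple[str, Optional[str], str]  # (resource, actual value, desired value)


def plan(desired: State, actual: State) -> List[Change]:
    """Report every resource whose actual value differs from the desired one."""
    return [
        (key, actual.get(key), value)
        for key, value in sorted(desired.items())
        if actual.get(key) != value
    ]


def enforce(desired: State, actual: State, noop: bool = False) -> List[Change]:
    """Apply the plan to `actual` unless noop is set, and return the changes either way."""
    changes = plan(desired, actual)
    if not noop:
        for key, _old, new in changes:
            actual[key] = new  # correct the drift
    return changes


desired = {"ntp": "running", "sshd_port": "22"}
actual = {"ntp": "stopped", "sshd_port": "22"}

simulated = enforce(desired, actual, noop=True)  # step 2: simulate only
state_after_simulation = dict(actual)
applied = enforce(desired, actual)               # steps 3 and 4: enforce and report
```

The simulated run and the real run return the same list of changes, but only the real one modifies `actual`. Running `enforce` again afterwards returns an empty list, which is what "correcting configuration drift" looks like when there is no drift left.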

Topic maps seem like a natural for systems administration.

They can capture the experience and judgement of sysadmins that aren’t ever part of printed documentation.

Make sysadmins your allies when introducing topic maps. Part of that will be understanding their problems and concerns.

Being able to intelligently discuss software like Puppet will be a step in the right direction. (Not to mention giving you ideas about topic map applications for systems administration.)