Archive for the ‘Systems Administration’ Category

Doom as a tool for system administration (1999) – Pen Testing?

Saturday, April 23rd, 2016

Doom as a tool for system administration by Dennis Chao.

From the webpage:

As I was listening to Anil talk about daemons spawning processes and sysadmins killing them, I thought, “What a great user interface!” Imagine running around with a shotgun blowing away your daemons and processes, never needing to type kill -9 again.

In Doom: The Aftermath you will find some later references, the most recent being from 2004.

You will have better luck at the ACM Digital library entry for Doom as an interface for process management that lists 29 subsequent papers citing Chao’s work on Doom. Latest is 2015.

If system administration with a Doom interface sounds cool, imagine a Doom hacking interface.

I can drive a car but I don’t set the timing, adjust the fuel injection, program the exhaust controls to beat inspectors, etc.

A higher level of abstraction for tools carries a cost but advantages as well.

Imagine cadres of junior high/high school students competing in pen testing contests.

Learning a marketable skill and helping cash-strapped IT departments with security testing.

Isn’t that a win-win situation?

You Can Backup OrientDB Databases on Ubuntu 14.04

Wednesday, February 17th, 2016

How To Back Up Your OrientDB Databases on Ubuntu 14.04

From the post:

OrientDB is a multi-model, NoSQL database with support for graph and document databases. It is a Java application and can run on any operating system; it’s also fully ACID-complaint with support for multi-master replication.

An OrientDB database can be backed up using a backup script and also via the command line interface, with built-in support for compression of backup files using the ZIP algorithm.

By default, backing up an OrientDB database is a blocking operation — writes to be database are locked until the end of the backup operation, but if the operating system was installed on an LVM partitioning scheme, the backup script can perform a non-blocking backup. LVM is the Linux Logical Volume Manager.

In this article, you’ll learn how to backup your OrientDB database on an Ubuntu 14.04 server.

I don’t know if it is still true, given the rate of data breaches, but failure to maintain useful backups was the #1 cause for sysadmins being fired.

If that is still true today (and it should be), pay attention to proper backup processes! Yes, its unimaginative, tedious, routine, etc. but a life saver when the system crashes.

Don’t buy into the replicas, RAID5, etc., rant. Yes, do all those things plus have physical backups that are store off-site on a regular rotation schedule.

The job you save may well be your own.

Why you should understand (a little) about TCP

Sunday, November 22nd, 2015

Why you should understand (a little) about TCP by Julia Evans.

From the post:

This isn’t about understanding everything about TCP or reading through TCP/IP Illustrated. It’s about how a little bit of TCP knowledge is essential. Here’s why.

When I was at the Recurse Center, I wrote a TCP stack in Python (and wrote about what happens if you write a TCP stack in Python). This was a fun learning experience, and I thought that was all.

A year later, at work, someone mentioned on Slack “hey I’m publishing messages to NSQ and it’s taking 40ms each time”. I’d already been thinking about this problem on and off for a week, and hadn’t gotten anywhere.

A little background: NSQ is a queue that you send to messages to. The way you publish a message is to make an HTTP request on localhost. It really should not take 40 milliseconds to send a HTTP request to localhost. Something was terribly wrong. The NSQ daemon wasn’t under high CPU load, it wasn’t using a lot of memory, it didn’t seem to be a garbage collection pause. Help.

Then I remembered an article I’d read a week before called In search of performance – how we shaved 200ms off every POST request. In that article, they talk about why every one of target=”_blank” their POST requests were taking 200 extra milliseconds. That’s.. weird. Here’s the key paragraph from the post

Julia’s posts are generally useful and entertaining to read and this one is no exception.

As Julia demonstrates in this post, TCP isn’t as low-level as you might think. 😉

The other lesson to draw here is the greater your general knowledge of how things work, the more likely you can fix (or cause) problems with a minimal amount of effort.

Learn a little TCP with Julia and keep bookmarked deeper resources should the need arise.


Friday, February 13th, 2015


From the webpage:

An Apache Storm topology that will, by design, trigger failures at run-time.

The purpose of this bolt-of-death topology is to help testing Storm cluster stability. It was originally created to identify the issues surrounding the Storm defects described at STORM-329 and STORM-404.

This reminds me of PANIC! UNIX System Crash Dump Analysis Handbook by Chris Drake. Has it really been twenty (20) years since that came out?

If you need something a bit more up to date, Linux Kernel Crash Book: Everything you need to know by Igor Ljubuncic aka Dedoimedo, is available as both free and $ PDF files (to support the website).

Everyone needs a hobby, perhaps analyzing clusters and core dumps will be yours!


I first saw storm-bolt-of-death in a tweet by Michael G. Noll.

Simple Testing Can Prevent Most Critical Failures:…

Thursday, October 9th, 2014

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, et al.


Large, production quality distributed systems still fail periodically,and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop Map Reduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code–the last line of defense–even with out an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

If you aren’t already convinced you need to read this paper, consider one more quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. (emphasis added)

How will catastrophic system failure reflect on your product or service? Hint: It doesn’t reflect well on topic maps or any other service or technology.

I say “read” this paper, perhaps putting it on a 90-day reading rotation would be better.

Approaches to Backup and Disaster Recovery in HBase

Saturday, November 23rd, 2013

Approaches to Backup and Disaster Recovery in HBase by Clint Heath.

From the post:

With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

In this post, you will get a high-level overview of the available mechanisms for backing up data stored in HBase, and how to restore that data in the event of various data recovery/failover scenarios. After reading this post, you should be able to make an educated decision on which BDR strategy is best for your business needs. You should also understand the pros, cons, and performance implications of each mechanism. (The details herein apply to CDH 4.3.0/HBase 0.94.6 and later.)

Note: At the time of this writing, Cloudera Enterprise 4 offers production-ready backup and disaster recovery functionality for HDFS and the Hive Metastore via Cloudera BDR 1.0 as an individually licensed feature. HBase is not included in that GA release; therefore, the various mechanisms described in this blog are required. (Cloudera Enterprise 5, currently in beta, offers HBase snapshot management via Cloudera BDR.)

The critical line in this post reads:

As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

Note the emphasis on provide.

Great backup mechanisms don’t help much unless someone is making, testing and logging the backups.

Ask in writing about backups before performing any changes to a client’s system or data. Make the answer part of your documentation.

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover

Sunday, October 21st, 2012

Full stack HA in Hadoop 1: HBase’s Resilience to Namenode Failover by Devaraj Das.

From the post:

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadooop 1 with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that, HBase would be functional during the time NameNode would be down. It’d only affect those operations that requires a trip to the NameNode (for example, rolling of the WAL, or compaction, or flush), and those would affect only the relevant end users (a user using the HBase get API may not be affected if that get didn’t require a new file open, for example).

A non-reliable cluster is just that, a non-reliable cluster. Not as bad as a backup that may or may not restore your data, but almost.

Regularly and routinely test any alleged HA capability along with backup restore capability. Document that testing.

As opposed to “testing” when either has to work or critical operations will fail or critical data will be lost.*

*Not Miller but résumé time.

Big Data Security Part One: Introducing PacketPig

Thursday, October 11th, 2012

Big Data Security Part One: Introducing PacketPig by Michael Baker.

From the post:

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

If you are a bit rusty on packets, TCP/IP, I could just wave my hands and say: “See the various tutorials.” and off you go to hunt something down.

Let me be more helpful than that and suggest: TCP/IP Tutorial and Technical Overview from the IBM RedBooks we were talking about earlier.

It’s not short (almost a thousand pages) but it isn’t W. Richards Stevens on the other hand (in three volumes). 😉

You won’t need all of either resource but it is better to start with too much than too little.

What is Hadoop Metrics2?

Wednesday, October 10th, 2012

What is Hadoop Metrics2? by Ahmed Radwan.

I’ve been wondering about that. How about you? 😉

From the post:

Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Hadoop services and an indispensable tool for debugging system problems.

This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.

However cool the software, can’t ever really get away from managing the software.

And it isn’t a bad skill to have. Read on!

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)


One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processing each day. We discuss the origins of this systems, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.

Groundhog: Hadoop Fork Testing

Thursday, August 9th, 2012

Groundhog: Hadoop Fork Testing by Anupam Seth.

From the post:

Hadoop is widely used at Yahoo! to do all kinds of processing. It is used for everything from counting ad clicks to optimizing what is shown on the front page for each individual user. Deploying a major release of Hadoop to all 40,000+ nodes at Yahoo! is a long and painful process that impacts all users of Hadoop. It involves doing a staged rollout onto different clusters of increasing importance (e.g. QA, sandbox, research, production) and asking all teams that use Hadoop to verify that their applications work with this new version. This is to harden the new release before it is deployed on clusters that directly impact revenue, but it comes at the expense of the users of these clusters because they have to share the pain of stabilizing a newer version. Further, this process can take over 6 months. Waiting 6 months to get a new feature, which users have asked for, onto a production system is way too long. It stifles innovation both for Hadoop and for the code running on Hadoop. Other software systems avoid these problems by more closely following continuous integration techniques.

Groundhog is an automated testing tool to help ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before deploying a new release onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they are run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results. The test cluster can take inevitable downtime and still help ensure that the latest version of Hadoop has not introduced any new regressions. It is called groundhog because that way Hadoop can relive a pig script over and over again until it gets it right, like the movie Groundhog Day. There is similarity in concept to traditional fork/T testing in that jobs are duplicated and ran on another location. However, Hadoop fork testing differs in that the testing will not occur in real-time but instead the original job with all needed inputs and outputs will be captured and archived. Then at any later date, the archived job can be re-ran.

The main idea is to reduce the deployment cycle of a new Hadoop release by making it easier to get user oriented testing started sooner and at a larger scope. Specifically, get testing running to quickly discover regressions and backwards incompatibility issues. Past efforts to bring up a test cluster and have Hadoop users run their jobs on the test cluster has been less successful than desired. Therefore, fork testing is a method for reducing the human effort needed to get user oriented testing ran against a Hadoop cluster. Additionally, if the level of effort to capture and run tests is reduced, then testing can be performed more often and experiments can also be run. All of this must happen while following data governance policies though.

Thus, Fork testing is a form of end to end testing. If there was a complete suite of end to end tests for Hadoop, the need for fork testing might not exist. Alas, the end to end suite does not exist and creating fork testing is deemed a faster path to achieving the testing goal.

Groundhog currently is limited to work only with pig jobs. The majority of user jobs run on Hadoop at Yahoo! are written in pig. This is what allows Groundhog to nevertheless have a good sampling of production jobs.

This is way cool!

Discovering problems, even errors, before they show up in live installations is always a good thing.

When you make changes to merging rules, how do you test the impact on your topic maps?

I first saw this at: Alex Popescu’s myNoSQL under Groundhog: Hadoop Automated Testing at Yahoo!

…Creating Reliable Billion Page View Web Services

Thursday, August 9th, 2012

High Scalability reports in 3 Tips and Tools for Creating Reliable Billion Page View Web Services an article by Amir Salihefendic that suggests:

  • Realtime monitor everything
  • Be proactive
  • Be notified when crashes happen

Are three tips to follow on the hunt to a reliable billion page view web service.

I’m a few short of that number but it was still an interesting post. 😉

And you can’t ever tell, might snag a client that is more likely to reach those numbers.

Announcing Scalable Performance Monitoring (SPM) for JVM

Tuesday, August 7th, 2012

Announcing Scalable Performance Monitoring (SPM) for JVM (Sematext)

From the post:

Up until now, SPM existed in several flavors for monitoring Solr, HBase, ElasticSearch, and Sensei. Besides metrics specific to a particular system type, all these SPM flavors also monitor OS and JVM statistics. But what if you want to monitor any Java application? Say your custom Java application run either in some container, application server, or from a command line? You don’t really want to be forced to look at blank graphs that are really meant for stats from one of the above mentioned systems. This was one of our own itches, and we figured we were not the only ones craving to scratch that itch, so we put together a flavor of SPM for monitoring just the JVM and (Operating) System metrics.

Now SPM lets you monitor OS and JVM performance metrics of any Java process through the following 5 reports, along with all other SPM functionality like integrated Alerts, email Subscriptions, etc. If you are one of many existing SPM users these graphs should look very familiar.

JVM monitoring isn’t like radio station management where you can listen for dead air. It a bit more complicated than that.

SPM may help with it.

Beyond the JVM and OS, how do you handle monitoring of topic map applications?

Chaos Monkey released into the wild

Monday, July 30th, 2012

Chaos Monkey released into the wild by Cory Bennett and Ariel Tseitlin

From the post:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.

We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.

Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.

At first I was unsure if NetFlix is hopeful its competitors will run Chaos Monkey or if they really run it internally. 😉

It certainly is a way to test your infrastructure. And quite possibly a selling point to clients who want more than projected or historical robustness.

Makes me curious, allowing for different infrastructures, how would you stress test a topic map installation?

And do so on a regular basis?

I first saw this at Alex Popescu’s myNoSQL.


Saturday, June 9th, 2012


From “What is Puppet?”:

Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to patch management and compliance. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.

Puppet is available as both open source and commercial software. You can see the differences here and decide which is right for your organization.

How Puppet Works

Puppet uses a declarative, model-based approach to IT automation.

  1. Define the desired state of the infrastructure’s configuration using Puppet’s declarative configuration language.
  2. Simulate configuration changes before enforcing them.
  3. Enforce the deployed desired state automatically, correcting any configuration drift.
  4. Report on the differences between actual and desired states and any changes made enforcing the desired state.

Topic maps seem like a natural for systems administration.

They can capture the experience and judgement of sysadmins that aren’t ever part of printed documentation.

Make sysadmins your allies when introducing topic maps. Part of that will be understanding their problems and concerns.

Being able to intelligently discuss software like Puppet will be a step in the right direction. (Not to mention giving you ideas about topic map applications for systems administration.)