Archive for the ‘Amazon Web Services AWS’ Category

Accessing IRS 990 Filings (Old School)

Monday, July 25th, 2016

Like many others, I was glad to see: IRS 990 Filings on AWS.

From the webpage:

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available within this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint; no authentication is required to download data over HTTPS. Both the index file and the example filing mentioned above can be fetched directly by URL (emphasis in original).

I open a terminal window and type (the index URL here follows S3’s standard path-style HTTPS endpoint for the irs-form-990 bucket):

wget https://s3.amazonaws.com/irs-form-990/index.json
ls -l index.json

which, as of today, results in:

-rw-rw-r-- 1 patrick patrick 1036711819 Jun 16 10:23 index.json

A trial grep:

grep "NATIONAL RIFLE" index.json > nra.txt

Which produces:

{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "396056607", "SubmittedOn": "2011-05-12", "TaxPeriod": "201012", "FormType": "990EZ", "LastUpdated": "2016-06-14T01:22:09.915971Z", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB", "IsElectronic": false, "IsAvailable": false},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},

We have one errant result, the “EAU CLAIRE NATIONAL RIFLE CLUB,” so let’s delete it and re-order by year. The NATIONAL RIFLE ASSOCIATION OF AMERICA results then read (most recent to oldest):
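That clean-up can be scripted; here is a minimal sketch, demoed on stand-in records (abbreviated versions of the entries above) so it is runnable as-is:

```shell
# Stand-in records: abbreviated versions of the real index entries shown above.
printf '%s\n' \
  '{"EIN": "530116130", "SubmittedOn": "2014-11-25", "FormType": "990"}' \
  '{"EIN": "396056607", "SubmittedOn": "2011-05-12", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB"}' \
  '{"EIN": "530116130", "SubmittedOn": "2016-01-11", "FormType": "990"}' > nra-sample.txt

# Drop the errant club, then sort newest-first on the SubmittedOn value,
# which is the 8th field when splitting on double quotes.
grep -v 'EAU CLAIRE' nra-sample.txt | sort -t'"' -k8,8r
```

The same pipeline pointed at nra.txt does the real job; the sort key works because SubmittedOn is always the second key in these records.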

{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},

Of course, now you want the XML 990 returns, so extract the URLs for the 990s to a file, here nra-urls.txt (I would use awk if there were more than a handful):
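For the extraction itself, one awk sketch; the sample record and its URL value are hypothetical, since the URL fields print as empty strings in the listing above:

```shell
# Split each record on double quotes and print the value two fields
# after the literal key "URL". The sample line is a made-up stand-in.
sample='{"EIN": "530116130", "URL": "https://s3.amazonaws.com/irs-form-990/201532599349300503_public.xml", "FormType": "990"}'
echo "$sample" | awk -F'"' '{ for (i = 1; i <= NF; i++) if ($i == "URL") print $(i + 2) }'
```

Run against the real nra.txt and redirected to nra-urls.txt, this produces the list that wget consumes below.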

Back to wget:

wget -i nra-urls.txt

and an ls -l then shows:

-rw-rw-r-- 1 patrick patrick 111798 Mar 21 16:12 201132709349300508_public.xml
-rw-rw-r-- 1 patrick patrick 123490 Mar 21 19:47 201203119349301120_public.xml
-rw-rw-r-- 1 patrick patrick 116786 Mar 21 22:12 201302609349300520_public.xml
-rw-rw-r-- 1 patrick patrick 122071 Mar 21 15:20 201423099349300417_public.xml
-rw-rw-r-- 1 patrick patrick 132081 Apr 29 10:10 201532599349300503_public.xml

Ooooh, it’s in XML! 😉

For the XML you are going to need: Current Valid XML Schemas and Business Rules for Exempt Organizations Modernized e-File, not to mention a means of querying the data (may I suggest XQuery?).

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).


IRS 990 Filing Data (2011 to date)

Thursday, June 16th, 2016

IRS 990 Filing Data Now Available as an AWS Public Data Set

From the post:

We are excited to announce that over one million electronic IRS 990 filings are available via Amazon Simple Storage Service (Amazon S3). Filings from 2011 to the present are currently available and the IRS will add new 990 filing data each month.

(image omitted)

Form 990 is the form used by the United States Internal Revenue Service (IRS) to gather financial information about nonprofit organizations. By making electronic 990 filing data available, the IRS has made it possible for anyone to programmatically access and analyze information about individual nonprofits or the entire nonprofit sector in the United States. This also makes it possible to analyze the data in the cloud without having to download or store it, which lowers the cost of product development and accelerates analysis.

Each electronic 990 filing is available as a unique XML file in the “irs-form-990” S3 bucket in the AWS US East (N. Virginia) region. Information on how the data is organized and what it contains is available on the IRS 990 Filings on AWS Public Data Set landing page.

Some of the forms and instructions that will help you make sense of the data reported:

990 – Form 990 Return of Organization Exempt from Income Tax, Annual Form 990 Requirements for Tax-Exempt Organizations

990-EZ – 2015 Form 990-EZ, Instructions for IRS 990 EZ – Internal Revenue Service

990-PF – 2015 Form 990-PF, 2015 Instructions for Form 990-PF

As always, use caution with law related data as words may have unusual nuances and/or unexpected meanings.

These forms and instructions are only a tiny part of a vast iceberg of laws, regulations, rulings, court decisions and the like.

990* disclosures aren’t detailed enough to pinch but when combined with other data, say leaked data, the results can be remarkable.

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

Spinning up a Spark Cluster on Spot Instances: Step by Step by Austin Ouyang.

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

The following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple of hours, but if we allot 6 hours for your 4-node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down your cluster after you are finished using it.
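The two figures hang together; a quick back-of-the-envelope check, where the per-node-hour spot price is my inference from the quoted $0.69 total, not a number from the post:

```shell
# 4 nodes for 6 hours at the implied spot price, then the same rate for 30 days.
awk 'BEGIN {
  rate = 0.69 / (4 * 6)                     # implied price per node-hour, ~$0.029
  printf "6 hours:  $%.2f\n", 4 * rate * 6
  printf "30 days:  $%.2f\n", 4 * rate * 24 * 30
}'
```

The monthly figure comes out near $83, which matches the "around $80" warning.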

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

Deep learning for… chess

Monday, December 15th, 2014

Deep learning for… chess by Erik Bernhardsson.

From the post:

I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

Chess sets are a common holiday gift so why not do something different this year?

Pretty print a copy of this post and include a gift certificate from AWS for a GPU instance for say a week to ten days.

I don’t think AWS sells gift certificates, but they certainly should. Great stocking stuffer, anniversary/birthday/graduation present, etc. Not so great for Valentine’s Day.

If you ask AWS for a gift certificate, mention my name. They don’t know who I am so I could use the publicity. 😉

I first saw this in a tweet by Onepaperperday.

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS

Friday, November 14th, 2014

Amazon Aurora – New Cost-Effective MySQL-Compatible Database Engine for Amazon RDS by Jeff Barr.

From the post:

We launched the Amazon Relational Database Service (RDS) service way back in 2009 to help you to set up, operate, and scale a MySQL database in the cloud. Since that time, we have added a multitude of options to RDS including extensive console support, three additional database engines (Oracle, SQL Server, and PostgreSQL), high availability (multiple Availability Zones) and dozens of other features.

We have come a long way in five years, but there’s always room to do better! The database engines that I listed above were designed to function in a constrained and somewhat simplistic hardware environment — a constrained network, a handful of processors, a spinning disk or two, and limited opportunities for parallel processing or a large number of concurrent I/O operations.

The RDS team decided to take a fresh look at the problem and to create a relational database designed for the cloud. Starting from a freshly scrubbed white board, they set as their goal a material improvement in the price-performance ratio and the overall scalability and reliability of existing open source and commercial database engines. They quickly realized that they had a unique opportunity to create an efficient, integrated design that encompassed the storage, network, compute, system software, and database software, purpose-built to handle demanding database workloads. This new design gave them the ability to take advantage of modern, commodity hardware and to eliminate bottlenecks caused by I/O waits and by lock contention between database processes. It turned out that they were able to increase availability while also driving far more throughput than before.

In preview now but you can sign up at the end of Jeff’s post.

Don’t become confused between Apache Aurora (“a service scheduler that runs on top of Mesos”) and Amazon Aurora, the MySQL compatible database from Amazon. (I guess all the good names have been taken for years.)

What am I missing?

Oh, following announcement of open source from Microsoft, Intel, Mapillary (to name the ones I noticed this week), I can’t find any reference to the source code for Amazon Aurora.

Do you think Amazon Aurora is closed source? One of those hiding places for government surveillance/malware? Hopefully not.

Perhaps Jeff just forgot to mention the GitHub repository with the Amazon Aurora source code.

It’s Friday (my location) so let’s see what develops by next Monday, 17 November 2014. If there is no announcement that Amazon Aurora is open source, …, well, at least everyone can factor that into their database choices.

PS: Open source does not mean bug or malware free. Open source means that you have a sporting chance at finding (and correcting) bugs and malware. Non-open source software may have bugs and malware which you will experience but not be able to discover/fix/correct.

Virtual Workshop and Challenge (NASA)

Tuesday, June 24th, 2014

Open NASA Earth Exchange (NEX) Virtual Workshop and Challenge 2014

From the webpage:

President Obama has announced a series of executive actions to reduce carbon pollution and promote sound science to understand and manage climate impacts for the U.S.

Following the President’s call for developing tools for climate resilience, OpenNEX is hosting a workshop that will feature:

  1. Climate science through lectures by experts
  2. Computational tools through virtual labs, and
  3. A challenge inviting participants to compete for prizes by designing and implementing solutions for climate resilience.

Whether you win any of the $60K in prize money or not, this looks like a great way to learn about climate data, approaches to processing climate data and the Amazon cloud all at one time!

Processing in the virtual labs is on the OpenNEX (Open NASA Earth Exchange) nickel. You can experience cloud computing without fear of the bill for computing services. Gain valuable cloud experience and possibly make a contribution to climate science.


Wikipedia Usage Statistics

Sunday, June 22nd, 2014

Wikipedia Usage Statistics by Paul Houle.

From the post:

The Wikimedia Foundation publishes page view statistics for Wikimedia projects here; this server is rate-limited so it took roughly a month to transfer this 4 TB data set into S3 Storage in the AWS cloud. The photo on the left is of a hard drive containing a copy of the data that was produced with AWS Import/Export.

Once in S3, it is easy to process this data with Amazon Map/Reduce using the Open Source telepath software.

The first product developed from this is SubjectiveEye3D.

It’s your turn

Future projects require that this data be integrated with semantic data from :BaseKB and that has me working on tools such as RDFeasy. In the meantime, a mirror of the Wikipedia pagecounts from Jan 2008 to Feb 2014 is available in a requester pays bucket in S3, which means you can use it in the Amazon Cloud for free and download data elsewhere for the cost of bulk network transfer.

Interesting isn’t it?

That “open” data can be so difficult to obtain and manipulate that it may as well not be “open” at all for the average user.

Something to keep in mind when big players talk about privacy. Do they mean private from their prying eyes or yours?

I think you will find in most cases that “privacy” means private from you and not the big players.

If you want to do a good deed for this week, support this data set at Gittip.

I first saw this in a tweet by Gregory Piatetsky.

Creating A Galactic Plane Atlas With Amazon Web Services

Saturday, February 15th, 2014

Creating A Galactic Plane Atlas With Amazon Web Services by Bruce Berriman, et al.


This paper describes by example how astronomers can use cloud-computing resources offered by Amazon Web Services (AWS) to create new datasets at scale. We have created from existing surveys an atlas of the Galactic Plane at 16 wavelengths from 1 μm to 24 μm with pixels co-registered at spatial sampling of 1 arcsec. We explain how open source tools support management and operation of a virtual cluster on AWS platforms to process data at scale, and describe the technical issues that users will need to consider, such as optimization of resources, resource costs, and management of virtual machine instances.

In case you are interested in taking your astronomy hobby to the next level with AWS.

And/or gaining experience with AWS and large datasets.

Elastic Mesos

Wednesday, January 8th, 2014

Mesosphere Launches Elastic Mesos, Makes Setting Up A Mesos Cluster A 3-Step Process by Frederic Lardinois.

From the post:

Mesosphere, a startup that focuses on developing Mesos, a technology that makes running complex distributed applications easier, is launching Elastic Mesos today. This new product makes setting up a Mesos cluster on Amazon Web Services a basic three-step process that asks you for the size of the cluster you want to set up, your AWS credentials and an email where you want to get notifications about your cluster’s state.

Given the complexity of setting up a regular Mesos cluster, this new project will make it easier for developers to experiment with Mesos and the frameworks Mesosphere and others have created around it.

As Mesosphere’s founder Florian Leibert describes it, for many applications, the data center is now the computer. Most applications now run on distributed systems, but connecting all of the distributed parts is often still a manual process. Mesos’ job is to abstract away all of these complexities and to ensure that an application can treat the data center and all your nodes as a single computer. Instead of setting up various server clusters for different parts of your application, Mesos creates a shared pool of servers where resources can be allocated dynamically as needed.

Remote computing isn’t as secure as my NATO SDIP-27 Level A (formerly AMSG 720B) and USA NSTISSAM Level I conformant office but there is a trade-off between maintenance/upgrade of local equipment and the convenience of remote computing.

In the near future, all forms of digital communication will be secure from the NSA and others. Before Snowden, it was widely known in a vague sense that the NSA and others were spying on U.S. citizens and others. Post-Snowden, user demand will result in vendors developing secure communications with two settings, secure and very secure.

Ironic that overreaching by the NSA will result in greater privacy for everyone of interest to the NSA.

PS: See Learn how to use Apache Mesos as well.

Using AWS to Build a Graph-based…

Friday, November 22nd, 2013

Using AWS to Build a Graph-based Product Recommendation System by Andre Fatala and Renato Pedigoni.

From the description:

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge Graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

I first saw this in a tweet by Marko A. Rodriguez.

Amazon Hosting 20 TB of Climate Data

Wednesday, November 13th, 2013

Amazon Hosting 20 TB of Climate Data by Isaac Lopez.

From the post:

Looking to save the world through data? Amazon, in conjunction with the NASA Earth Exchange (NEX) team, today released over 20 terabytes of NASA-collected climate data as part of its OpenNEX project. The goal, they say, is to make important datasets accessible to a wide audience of researchers, students, and citizen scientists in order to facilitate discovery.

“Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size,” writes Amazon’s Jeff Barr in the Amazon blog. “Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical. Today we are publishing an initial collection of datasets available (over 20 TB), along with Amazon Machine Images (AMIs), and tutorials.”

The OpenNEX project aims to give open access to resources to aid earth science researchers, including data, virtual labs, lectures, computing and more.


Isaac also reports that NASA will be hosting workshops on the data.

Anyone care to wager on the presence of semantic issues in the data sets? 😉

AWS: Your Next Utility Bill?

Saturday, June 22nd, 2013

Netflix open sources its Hadoop manager for AWS by Derrick Harris.

From the post:

Netflix runs a lot of Hadoop jobs on the Amazon Web Services cloud computing platform, and on Friday the video-streaming leader open sourced its software to make running those jobs as easy as possible. Called Genie, it’s a RESTful API that makes it easy for developers to launch new MapReduce, Hive and Pig jobs and to monitor longer-running jobs on transient cloud resources.

In the blog post detailing Genie, Netflix’s Sriram Krishnan makes clear a lot more about what Genie is and is not. Essentially, Genie is a platform as a service running on top of Amazon’s Elastic MapReduce Hadoop service. It’s part of a larger suite of tools that handles everything from diagnostics to service registration.

It is not a cluster manager or workflow scheduler for building ETL processes (e.g., processing unstructured data from a web source, adding structure and loading into a relational database system). Netflix uses a product called UC4 for the latter, but it built the other components of the Genie system.

It’s not very futuristic to say that AWS (or something very close to it) will be your next utility bill.

Like paying for water, gas, cable, electricity, it will be an auto-pay setup on your bank account.

What will you say when clients ask if the service you are building for them is hosted on AWS?

Are you going to say your servers are more reliable? That you don’t “trust” Amazon?

Both of which may be true but how will you make that case?

Without sounding like you are selling something the client doesn’t need?

As the price of cloud computing drops, those questions are going to become common.


Elephant

Thursday, March 28th, 2013


From the webpage:

Elephant is an S3-backed key-value store with querying powered by Elastic Search. Your data is persisted on S3 as simple JSON documents, but you can instantly query it over HTTP.

Suddenly, your data becomes as durable as S3, as portable as JSON, and as queryable as HTTP. Enjoy!

I don’t recall seeing Elephant on the Database Landscape Map – February 2013. Do you?

Every database is thought, at least by its authors, to be different from all the others.

What dimensions would be the most useful ones for distinction/comparison?


I first saw this in Nat Torkington’s Four short links: 27 March 2013.

Amazon Web Services Announces Amazon Redshift

Saturday, February 16th, 2013

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re: Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resource to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. 😉

Sigh, semantic confusion continues unabated.

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

Saturday, January 19th, 2013

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

From the post:

In addition to delivering great services and features to our customers, we are constantly working towards helping customers so that they can build highly-scalable, highly-available cost-effective cloud solutions using our services. We not only provide technical documentation for each service but also provide guidance on economics, cross-service architectures, reference implementations, best practices and details on how to get started so customers and partners can use the services effectively.

In this post, let’s review all the content that we published in 2012 so you can help build and prioritize our content roadmap for 2013. We are looking for feedback on content topics that you would like us to build this year.

A mother lode of technical content on AWS!

Definitely a page to bookmark even as new content appears in 2013!

Setting Up a Neo4J Cluster on Amazon

Friday, December 14th, 2012

Setting Up a Neo4J Cluster on Amazon by Max De Marzi.

From the post:

There are multiple ways to setup a Neo4j Cluster on Amazon Web Services (AWS) and I want to show you one way to do it.


  1. Create a VPC
  2. Launch 1 Instance
  3. Install Neo4j HA
  4. Clone 2 Instances
  5. Configure the Instances
  6. Start the Coordinators
  7. Start the Neo4j Cluster
  8. Create 2 Load Balancers
  9. Next Steps
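The HA wiring in steps 3 and 5 boils down to a few properties on each instance. This is a minimal sketch based on my reading of the Neo4j 1.9/2.x HA docs rather than the post itself; the server IDs and private addresses are placeholders:

```shell
# Each instance gets a unique ha.server_id; ha.initial_hosts lists the
# cluster members' coordinator endpoints (placeholder private IPs).
mkdir -p conf
cat > conf/neo4j.properties <<'EOF'
ha.server_id=1
ha.initial_hosts=10.0.0.1:5001,10.0.0.2:5001,10.0.0.3:5001
EOF
```

On the cloned instances (step 4), only ha.server_id changes; the host list stays the same everywhere.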

In case you are curious about moving off of your local box to something that can handle more demand.

AWS re:Invent Sold Out – Register for the Live Stream!

Friday, November 9th, 2012

AWS re:Invent Sold Out – Register for the Live Stream! by Jeff Barr.

November 28 and 29, 2012.

From the post:

I’m happy to be able to report that we have sold all of the available seats at AWS re:Invent! The halls here are ablaze with excitement and we’re all working 18 hours per day to bring you a conference that will be fun, informative, and memorable. We’ve lined up an amazing array of speakers and a good time will be had by all.

If you didn’t register in time or if you simply can’t make it to Las Vegas, you can register for the live stream of the re:Invent keynotes. This stream is free and it will be delivered over Amazon CloudFront.

The entire team of AWS evangelists is committed to doing everything possible to bring the excitement of the conference online. We’ll be live-blogging, tweeting (using the #reinvent hashtag), posting pictures, posting videos, and posting the slide decks to the Amazon Web Services SlideShare page.

Way cool! The program is stunning.

I would rather be in Las Vegas but will instead be moving meetings that conflict with the stream.

AWS re:Invent

Sunday, November 4th, 2012

AWS re:Invent

November 27-29, 2012 The Venetian – Las Vegas, NV.

From the webpage:

Amazon Web Services invites you to AWS re: Invent, our first global customer and partner conference. Your whole team can ramp up on everything needed to thrive in the AWS Cloud. AWS re: Invent will feature deep technical content on popular cloud use cases, new AWS services, cloud migration best practices, architecting for scale, operating at high availability and making your cloud apps secure.

Sessions: There are 16 tracks and 150+ sessions. The choices are going to be really hard.

A “streaming” registration was due to appear a month before the conference but as of 4 November 2012, no such option is available.

Unlike some conferences, it looks like conference content is going to be limited to registered attendees who physically attend the conference.

Welcome to the world of the cloud!

Update: Streaming video of keynotes, videos and slides to be posted for free! See: AWS re:Invent Sold Out – Register for the Live Stream!. Be sure to send a nice note to Jeff about this announcement.

Deploying a GraphLab/Spark/Mesos cluster on EC2

Tuesday, October 23rd, 2012

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu) who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

This tutorial is a very new beta release. Please contact me if you are brave enough to try it out.

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

Amazon RDS for Oracle Database – Now Starting at $30/Month

Saturday, September 29th, 2012

Amazon RDS for Oracle Database – Now Starting at $30/Month by Jeff Barr.

From the post:

You can now create Amazon RDS database instances running Oracle Database on Micro instances.

This new option will allow you to build, test, and run your low-traffic database-backed applications at a cost starting at $30 per month ($0.04 per hour) using the License Included option. If you have a more intensive application, the micro instance enables you to get hands-on experience with Amazon RDS before you scale up to a larger instance size. You can purchase Reserved Instances in order to further lower your effective hourly rate.

These instances are available now in all AWS Regions. You can learn more about using Amazon RDS for managing Oracle database instances by attending this webinar.
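
The quoted prices are consistent: at $0.04 per hour, an always-on instance comes to roughly $30 a month. A sketch of the arithmetic (730 is the average number of hours in a month; actual AWS billing details may differ):

```python
# Rough monthly cost of a micro RDS instance at the quoted hourly rate.
hourly_rate = 0.04     # USD per hour, License Included option
hours_per_month = 730  # average month: 8760 hours per year / 12
monthly_cost = hourly_rate * hours_per_month  # about $29.20, i.e. ~$30/month
```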

Oracle databases aren’t for the faint of heart but they are everywhere in enterprise settings.

If you are or aspire to be working with enterprise information systems, the more you know about Oracle databases the more valuable you become.

To your employer and your clients.

Process a Million Songs with Apache Pig

Friday, August 24th, 2012

Process a Million Songs with Apache Pig by Justin Kestelyn.

From the post:

The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere of how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).
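
The one-song-per-line, tab-separated layout described above is exactly what makes the dataset easy to split across machines. A minimal local sketch of the per-line parse and group-by that an Apache Pig job would distribute (Python stands in for Pig here, and the column layout is an assumption for illustration, not the dataset's actual schema):

```python
import csv
import io

# Pretend input: one song per line, tab-separated.
# Assumed columns for this sketch: artist, title, tempo, duration.
raw = (
    "Artist A\tSong One\t120.0\t210.5\n"
    "Artist B\tSong Two\t96.5\t180.0\n"
    "Artist A\tSong Three\t140.2\t305.2\n"
)

# The "map" step: parse one line into a record.
def parse_song(row):
    artist, title, tempo, duration = row
    return {"artist": artist, "title": title,
            "tempo": float(tempo), "duration": float(duration)}

songs = [parse_song(r) for r in csv.reader(io.StringIO(raw), delimiter="\t")]

# The "reduce" step: average tempo per artist,
# what a GROUP BY ... FOREACH ... AVG would do in Pig.
by_artist = {}
for s in songs:
    by_artist.setdefault(s["artist"], []).append(s["tempo"])
avg_tempo = {a: sum(ts) / len(ts) for a, ts in by_artist.items()}
```

Because each line is independent, the parse step parallelizes trivially; only the group-by requires a shuffle, which is the part Hadoop handles for you.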

An example of offering the reader their choice of implementation detail, on or off a cloud. 😉

Suspect that is going to become increasingly common.

Distributed GraphLab: …

Wednesday, August 22nd, 2012

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud by Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein.

Abstract:
While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
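
The graph-parallel abstraction the abstract describes is easiest to see in miniature: each vertex repeatedly runs an update function that gathers from its neighbors and writes its own value. A toy sequential PageRank sketch of that pattern (GraphLab's engine adds asynchronous scheduling, consistency guarantees, and distribution on top):

```python
# Minimal graph-parallel pattern: a per-vertex update function applied
# repeatedly over all vertices, in the spirit of GraphLab's abstraction.
# Here it runs sequentially; the real engine schedules updates
# asynchronously and in parallel while preserving consistency.

# Directed edges: who links to whom.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_nbrs = {v: [u for u, outs in edges.items() if v in outs] for v in edges}

rank = {v: 1.0 / len(edges) for v in edges}
DAMPING = 0.85

def update(vertex):
    """GraphLab-style update: gather from in-neighbors, apply, write back."""
    gathered = sum(rank[u] / len(edges[u]) for u in in_nbrs[vertex])
    rank[vertex] = (1 - DAMPING) / len(edges) + DAMPING * gathered

for _ in range(50):
    for v in edges:
        update(v)
```

Note that each update touches only a vertex and its neighborhood, which is precisely the locality the paper exploits to reduce network congestion in the distributed setting.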

A gem from the first day as a member of the GraphLab and GraphChi group on LinkedIn!

This rocks!

Amazon Glacier: Archival Storage for One Penny Per GB Per Month

Tuesday, August 21st, 2012

Amazon Glacier: Archival Storage for One Penny Per GB Per Month by Jeff Barr.

From the post:

I’m going to bet that you (or your organization) spend a lot of time and a lot of money archiving mission-critical data. No matter whether you’re currently using disk, optical media or tape-based storage, it’s probably a more complicated and expensive process than you’d like which has you spending time maintaining hardware, planning capacity, negotiating with vendors and managing facilities.


If so, then you are going to find our newest service, Amazon Glacier, very interesting. With Glacier, you can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades.

Glacier provides – at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte, per month – extremely low cost archive storage. You can store a little bit, or you can store a lot (Terabytes, Petabytes, and beyond). There’s no upfront fee and you pay only for the storage that you use. You don’t have to worry about capacity planning and you will never run out of storage space. Glacier removes the problems associated with under or over-provisioning archival storage, maintaining geographically distinct facilities and verifying hardware or data integrity, irrespective of the length of your retention periods.
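
At that rate the arithmetic is striking; a quick sketch, counting storage only (Glacier also bills retrieval and request fees, which this ignores):

```python
# Back-of-the-envelope Glacier storage cost at $0.01 per GB per month.
def glacier_monthly_cost(gigabytes, rate_per_gb=0.01):
    return gigabytes * rate_per_gb

TB = 1024  # GB per TB
ten_tb_month = glacier_monthly_cost(10 * TB)  # about $102 per month
ten_tb_year = ten_tb_month * 12               # about $1,229 per year
```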

The caveat is that you don’t have immediate access to your data (it is called “Glacier” for a reason), but it is still an impressive price.

Unless you are monitoring nuclear missile launch signatures or are a day trader, do you really need arbitrary and random access to all your data?

Or is that a requirement because you read some other department or agency was getting “real time” big data?

Deploying Neo4j Graph Database Server across AWS regions with Puppet

Monday, August 20th, 2012

Deploying Neo4j Graph Database Server across AWS regions with Puppet by Jussi Heinonen.

From the post:

It’s been more than a year now since I rolled out Neo4j Graph Database Server image in Amazon EC2.

In May 2011 the version of Neo4j was 1.3 and just recently guys at Neo Technologies published version 1.7.2 so I thought now is the time to revisit this exercise and make fresh AMIs available.

Last year I created Neo4j AMI manually in one region then copied it across to the remaining AWS regions. Due to the size of the AMI and the latency between regions this process was slow.

If you aren’t already familiar with AWS, perhaps this will be your incentive to take the plunge.

Learning Puppet and Neo4j are just a lagniappe.

Titan Provides Real-Time Big Graph Data

Tuesday, August 7th, 2012

Titan Provides Real-Time Big Graph Data

From the post:

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers which are transacting with a 6 machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short summary, Titan performs well under substantial load with a relatively inexpensive cluster and as such, is capable of backing online services requiring real-time Big Graph Data.

Fuller version of the information you will find at: Titan Stress Poster [Government Comparison Shopping?].

BTW, Titan is reported to emerge as 0.1 (from 0.1 alpha) later this (2012) summer.

Titan Stress Poster [Government Comparison Shopping?]

Friday, July 13th, 2012

Titan Stress Poster from Marko A. Rodriguez.

Notice of a poster at GraphLab 2012 with Matthias Broecheler:

This poster presents an overview of Titan along with some excellent stress testing done by Matthias and Dan LaRoque. The stress test uses a 6 machine Titan cluster with 14 read/write servers slamming Titan with various read/writes. The results are presented in terms of the number of bytes being read/written from disk, the average runtime of the queries, the cost of a transaction on Amazon EC2, and an estimate of the number of users concurrently interacting.

Being a poster, you will have to pump up the size for legibility, but I think you will like it.

Impressive numbers. Including the Amazon EC2 cost.

Makes me wonder when governments are going to start requiring cost comparisons for system bids versus use of Amazon EC2?

Asgard for Cloud Management and Deployment

Friday, June 29th, 2012

Asgard for Cloud Management and Deployment

Amazon is tooting the horn of one of its larger customers, Netflix, when they say:

Our friends at Netflix have embraced AWS whole-heartedly. They have shared much of what they have learned about how they use AWS to build, deploy, and host their applications. You can read the Netflix Tech Blog and benefit from what they have learned.

Earlier this week they released Asgard, a web-based cloud management and deployment tool, in open source form on GitHub. According to Norse mythology, Asgard is the home of the god of thunder and lightning, and therefore controls the clouds! This is the same tool that the engineers at Netflix use to control their applications and their deployments.

Asgard layers two additional abstractions on top of AWS — Applications and Clusters.
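
Those two abstractions are built largely out of a naming convention: Netflix names its Auto Scaling groups roughly app-stack-vNNN, and a "cluster" is the set of ASGs sharing the name minus the trailing version push. A rough sketch of that parsing (the convention as I understand it; the authoritative version lives in Netflix's own code):

```python
import re

# Approximate Netflix/Asgard naming: "<app>-<stack>-v<NNN>" for an Auto
# Scaling group; the cluster is the name with the trailing push removed.
ASG_NAME = re.compile(r"^(?P<cluster>.+?)(?:-v(?P<push>\d{3}))?$")

def parse_asg(name):
    m = ASG_NAME.match(name)
    cluster = m.group("cluster")
    app = cluster.split("-")[0]  # first segment is the application name
    return {"app": app, "cluster": cluster, "push": m.group("push")}

info = parse_asg("helloworld-example-v004")
# info["app"] == "helloworld", info["cluster"] == "helloworld-example"
```

Successive pushes ("-v004", "-v005", …) to the same cluster are how Asgard models rolling deployments.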

Even if you are just in the planning (dreaming?) stages of cloud deployment for your topic map application, it would be good to review the Netflix blog. On Asgard and others posts as well.

You know how I hate to complain, ;-), but the Elder Edda does not report “Asgard” as the “home of the god of thunder and lightning.” All the gods resided at Asgard.

Even the link in the quoted part of Jeff’s post gets that much right.

Most of the time old stories told aright are more moving than modern misconceptions.

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Wednesday, June 27th, 2012

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.
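
The first step of the series, reshaping relational rows into documents, is the part that is easy to sketch locally; a toy stdlib version follows (JSON stands in for Avro, and the table and column names are invented for illustration):

```python
import json

# Toy "ETL" from relational rows to one nested document per email,
# the shape of the Avro records the series builds with Pig.
messages = [
    {"id": 1, "sender": "a@enron.com", "subject": "Q3 numbers"},
]
recipients = [
    {"message_id": 1, "kind": "to", "address": "b@enron.com"},
    {"message_id": 1, "kind": "cc", "address": "c@enron.com"},
]

def to_document(msg, recips):
    """Join the recipient rows into their parent message as a nested list."""
    doc = dict(msg)
    doc["recipients"] = [
        {"kind": r["kind"], "address": r["address"]}
        for r in recips if r["message_id"] == msg["id"]
    ]
    return doc

docs = [to_document(m, recipients) for m in messages]
serialized = json.dumps(docs[0])  # Avro serialization plays this role in the series
```

Once emails are self-contained documents, downstream Pig and Hive steps (and HCatalog's shared metadata) no longer need the original MySQL joins.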

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

Sage Bionetworks and Amazon SWF

Friday, June 22nd, 2012

Sage Bionetworks and Amazon SWF

From the post:

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other bio molecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complementary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing costs and improving the efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and Source Forge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically-relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility in where we run computational jobs, which enables Synapse to leverage the right combination of infrastructure for every project.”
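
Kellen's point, state-transition logic kept separate from the activities, is the essence of SWF and easy to see in miniature. A toy in-process sketch (not the SWF API, whose deciders and activity workers are separate distributed processes; the states and activities here are invented for illustration):

```python
# Toy separation of workflow logic from activities, in the spirit of SWF:
# the "decider" only maps (state, event) -> next step; the activities
# hold the actual work and could run on any machine.

TRANSITIONS = {
    ("start", "begin"): "fetch",
    ("fetch", "done"): "analyze",
    ("analyze", "done"): "store",
    ("store", "done"): "complete",
}

def decide(state, event):
    """State-transition logic only: no work happens here."""
    return TRANSITIONS[(state, event)]

# Activities: the scientists' code, swappable without touching the decider.
log = []
ACTIVITIES = {
    "fetch": lambda: log.append("fetched data"),
    "analyze": lambda: log.append("ran pipeline step"),
    "store": lambda: log.append("stored results"),
}

state, event = "start", "begin"
while state != "complete":
    state = decide(state, event)
    if state in ACTIVITIES:
        ACTIVITIES[state]()
        event = "done"
```

Because the decider never performs work itself, the engineers can evolve the transition table while the scientists rewrite activities, exactly the division of labor Kellen describes.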

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations consisting of slides read aloud about the management advantages of Amazon SWF simply can’t. At least not for me.

Take the tip and follow the case studies, then onto the documentation.

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And I have < 0 interest in DRM antics over material that, if piped to /dev/null, would raise a user's IQ.

Graph DB + Bioinformatics: Bio4j,…

Thursday, June 21st, 2012

Graph DB + Bioinformatics: Bio4j, recent applications and future directions by Pablo Pareja.

If you haven’t seen one of Pablo’s slide decks on Bio4j, get ready for a real treat!

Let me quote the numbers from slide 42, which is entitled: “Bio4j + MG7 + 24 Chip-Seq samples”

157 639 502 nodes

742 615 705 relationships

632 832 045 properties

149 relationship types

44 node types

And it works just fine!

Granted, he is not running this on his cellphone, but if you are going to process serious data, you need serious computing power. (OK, he uses Amazon Web Services. Like I said, not his cellphone.)
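
For a sense of scale, here is the simple arithmetic those slide-42 counts imply:

```python
# Average connectivity implied by the Bio4j + MG7 statistics above.
nodes = 157_639_502
relationships = 742_615_705
properties = 632_832_045

rels_per_node = relationships / nodes   # about 4.7 relationships per node
props_per_node = properties / nodes     # about 4.0 properties per node
```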

Did I mention that everything done by Oh no sequences! is 100% open source?

There is much to learn here. Enjoy!