Archive for the ‘Amazon Web Services AWS’ Category

Elephant

Thursday, March 28th, 2013

Elephant

From the webpage:

Elephant is an S3-backed key-value store with querying powered by Elastic Search. Your data is persisted on S3 as simple JSON documents, but you can instantly query it over HTTP.

Suddenly, your data becomes as durable as S3, as portable as JSON, and as queryable as HTTP. Enjoy!

I don’t recall seeing Elephant on the Database Landscape Map – February 2013. Do you?

Every database is thought, at least by its authors, to be different from all the others.

What dimensions would be the most useful ones for distinction/comparison?

Suggestions?

I first saw this in Nat Torkington’s Four short links: 27 March 2013.

Amazon Web Services Announces Amazon Redshift

Saturday, February 16th, 2013

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re: Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resource to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well-known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. ;-)

Sigh, semantic confusion continues unabated.

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

Saturday, January 19th, 2013

2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

From the post:

In addition to delivering great services and features to our customers, we are constantly working towards helping customers so that they can build highly-scalable, highly-available cost-effective cloud solutions using our services. We not only provide technical documentation for each service but also provide guidance on economics, cross-service architectures, reference implementations, best practices and details on how to get started so customers and partners can use the services effectively.

In this post, let’s review all the content that we published in 2012 so you can help build and prioritize our content roadmap for 2013. We are looking for feedback on content topics that you would like us to build this year.

A mother lode of technical content on AWS!

Definitely a page to bookmark even as new content appears in 2013!

Setting Up a Neo4J Cluster on Amazon

Friday, December 14th, 2012

Setting Up a Neo4J Cluster on Amazon by Max De Marzi.

From the post:

There are multiple ways to setup a Neo4j Cluster on Amazon Web Services (AWS) and I want to show you one way to do it.

Overview:

  1. Create a VPC
  2. Launch 1 Instance
  3. Install Neo4j HA
  4. Clone 2 Instances
  5. Configure the Instances
  6. Start the Coordinators
  7. Start the Neo4j Cluster
  8. Create 2 Load Balancers
  9. Next Steps

In case you are curious about moving off of your local box to something that can handle more demand.

AWS re:Invent Sold Out – Register for the Live Stream!

Friday, November 9th, 2012

AWS re:Invent Sold Out – Register for the Live Stream! by Jeff Barr.

November 28 and 29, 2012.

From the post:

I’m happy to be able to report that we have sold all of the available seats at AWS re:Invent! The halls here are ablaze with excitement and we’re all working 18 hours per day to bring you a conference that will be fun, informative, and memorable. We’ve lined up an amazing array of speakers and a good time will be had by all.

If you didn’t register in time or if you simply can’t make it to Las Vegas, you can register for the live stream of the re:Invent keynotes. This stream is free and it will be delivered over Amazon CloudFront.

The entire team of AWS evangelists is committed to doing everything possible to bring the excitement of the conference online. We’ll be live-blogging, tweeting (using the #reinvent hashtag), posting pictures, posting videos, and posting the slide decks to the Amazon Web Services SlideShare page.

Way cool! The program is stunning.

I would rather be in Las Vegas but will instead be moving meetings that conflict with the stream.

AWS re:Invent

Sunday, November 4th, 2012

AWS re:Invent

November 27-29, 2012 The Venetian – Las Vegas, NV.

From the webpage:

Amazon Web Services invites you to AWS re: Invent, our first global customer and partner conference. Your whole team can ramp up on everything needed to thrive in the AWS Cloud. AWS re: Invent will feature deep technical content on popular cloud use cases, new AWS services, cloud migration best practices, architecting for scale, operating at high availability and making your cloud apps secure.

Sessions: There are 16 tracks and 150+ sessions. The choices are going to be really hard.

A “streaming” registration was due to appear a month before the conference but as of 4 November 2012, no such option is available.

Unlike some conferences, it looks like conference content is going to be limited to registered attendees who physically attend the conference.

Welcome to the world of the cloud!


Update: Streaming video of keynotes, videos and slides to be posted for free! See: AWS re:Invent Sold Out – Register for the Live Stream!. Be sure to send a nice note to Jeff about this announcement.

Deploying a GraphLab/Spark/Mesos cluster on EC2

Tuesday, October 23rd, 2012

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu) who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

This tutorial is a very new beta release. Please contact me if you are brave enough to try it out.

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

Amazon RDS for Oracle Database – Now Starting at $30/Month

Saturday, September 29th, 2012

Amazon RDS for Oracle Database – Now Starting at $30/Month by Jeff Barr.

From the post:

You can now create Amazon RDS database instances running Oracle Database on Micro instances.

This new option will allow you to build, test, and run your low-traffic database-backed applications at a cost starting at $30 per month ($0.04 per hour) using the License Included option. If you have a more intensive application, the micro instance enables you to get hands on experience with Amazon RDS before you scale up to a larger instance size. You can purchase Reserved Instances in order to further lower your effective hourly rate.

These instances are available now in all AWS Regions. You can learn more about using Amazon RDS for managing Oracle database instances by attending this webinar.
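As a quick sanity check on the quoted price, the hourly rate can be converted to a monthly figure. This is only a minimal sketch: the 730-hour average month is my assumption, not something stated in the post, and the function name is made up for illustration.

```python
# Rough check of the quoted RDS micro instance pricing.
# Assumption (not from the post): an average month has about 730 hours
# (8,760 hours in a 365-day year divided by 12).
HOURLY_RATE_USD = 0.04
HOURS_PER_MONTH = 730


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Return the cost of running one instance for `hours` hours."""
    return hourly_rate * hours


if __name__ == "__main__":
    print(f"Average month: ${monthly_cost(HOURLY_RATE_USD):.2f}")
    print(f"30-day month:  ${monthly_cost(HOURLY_RATE_USD, 24 * 30):.2f}")
```

Either way of counting the hours lands a little under the advertised $30 per month.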

Oracle databases aren’t for the faint of heart but they are everywhere in enterprise settings.

If you are or aspire to be working with enterprise information systems, the more you know about Oracle databases the more valuable you become.

To your employer and your clients.

Process a Million Songs with Apache Pig

Friday, August 24th, 2012

Process a Million Songs with Apache Pig by Justin Kestelyn.

From the post:

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere how to use it and pay as low as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).
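To get a feel for the sizes quoted above, here is a back-of-the-envelope sketch. It uses only the numbers in the post (339 files, about 3,000 songs per file, around 218GB); treating a gigabyte as 10^9 bytes is my assumption, and the function name is hypothetical.

```python
# Back-of-the-envelope sizes for the Million Song Dataset as described in the post.
FILES = 339
SONGS_PER_FILE = 3_000        # "about 3,000" per the post
TOTAL_BYTES = 218 * 10**9     # assumption: decimal gigabytes


def dataset_summary(files=FILES, songs_per_file=SONGS_PER_FILE,
                    total_bytes=TOTAL_BYTES):
    """Return the approximate song count and average sizes per file and per song."""
    songs = files * songs_per_file
    return {
        "songs": songs,
        "bytes_per_file": total_bytes / files,
        "bytes_per_song": total_bytes / songs,
    }


if __name__ == "__main__":
    summary = dataset_summary()
    print(f"songs:          {summary['songs']:,}")
    print(f"avg file size:  {summary['bytes_per_file'] / 10**6:,.0f} MB")
    print(f"avg song record: {summary['bytes_per_song'] / 10**3:,.0f} KB")
```

Roughly a million one-line records at a couple of hundred kilobytes each, which is why a single machine is a poor fit.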

An example of offering the reader their choice of implementation detail, on or off a cloud. ;-)

Suspect that is going to become increasingly common.

Distributed GraphLab: …

Wednesday, August 22nd, 2012

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud by Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein.

Abstract:

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

A gem from the first day as a member of the GraphLab and GraphChi group on LinkedIn!

This rocks!

Amazon Glacier: Archival Storage for One Penny Per GB Per Month

Tuesday, August 21st, 2012

Amazon Glacier: Archival Storage for One Penny Per GB Per Month by Jeff Barr.

From the post:

I’m going to bet that you (or your organization) spend a lot of time and a lot of money archiving mission-critical data. No matter whether you’re currently using disk, optical media or tape-based storage, it’s probably a more complicated and expensive process than you’d like which has you spending time maintaining hardware, planning capacity, negotiating with vendors and managing facilities.

True?

If so, then you are going to find our newest service, Amazon Glacier, very interesting. With Glacier, you can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades.

Glacier provides – at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte, per month – extremely low cost archive storage. You can store a little bit, or you can store a lot (Terabytes, Petabytes, and beyond). There’s no upfront fee and you pay only for the storage that you use. You don’t have to worry about capacity planning and you will never run out of storage space. Glacier removes the problems associated with under or over-provisioning archival storage, maintaining geographically distinct facilities and verifying hardware or data integrity, irrespective of the length of your retention periods.

With the caveat that you don’t have immediate access to your data (it is called “Glacier” for a reason), but it is still an impressive price.
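To put the penny in perspective, a minimal sketch of the storage bill at the quoted "as low as" price. Storage only: any retrieval or request charges are outside this sketch, the binary units (1 TB = 1,024 GB) are my assumption, and the function name is invented for illustration.

```python
# Monthly storage cost at the price quoted in the post.
PRICE_PER_GB_MONTH_USD = 0.01  # "as low as" one US penny per GB per month


def monthly_storage_cost(gigabytes, price=PRICE_PER_GB_MONTH_USD):
    """Return the monthly storage cost in US dollars for `gigabytes` of data."""
    return gigabytes * price


if __name__ == "__main__":
    # Assumption: binary units, 1 TB = 1,024 GB and 1 PB = 1,024 TB.
    for label, gb in [("1 TB", 1024), ("100 TB", 100 * 1024), ("1 PB", 1024 * 1024)]:
        print(f"{label:>6}: ${monthly_storage_cost(gb):,.2f} per month")
```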

Unless you are monitoring nuclear missile launch signatures or are a day trader, do you really need arbitrary and random access to all your data?

Or is that a requirement because you read some other department or agency was getting “real time” big data?

Deploying Neo4j Graph Database Server across AWS regions with Puppet

Monday, August 20th, 2012

Deploying Neo4j Graph Database Server across AWS regions with Puppet by Jussi Heinonen.

From the post:

It’s been more than a year now since I rolled out Neo4j Graph Database Server image in Amazon EC2.

In May 2011 the version of Neo4j was 1.3 and just recently guys at Neo Technologies published version 1.7.2 so I thought now is the time to revisit this exercise and make fresh AMIs available.

Last year I created Neo4j AMI manually in one region then copied it across to the remaining AWS regions. Due to the size of the AMI and the latency between regions this process was slow.

If you aren’t already familiar with AWS, perhaps this will be your incentive to take the plunge.

Learning Puppet and Neo4j are just a lagniappe.

Titan Provides Real-Time Big Graph Data

Tuesday, August 7th, 2012

Titan Provides Real-Time Big Graph Data

From the post:

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers which are transacting with a 6 machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short summary, Titan performs well under substantial load with a relatively inexpensive cluster and as such, is capable of backing online services requiring real-time Big Graph Data.

Fuller version of the information you will find at: Titan Stress Poster [Government Comparison Shopping?].

BTW, Titan is reported to emerge as 0.1 (from 0.1 alpha) later this (2012) summer.

Titan Stress Poster [Government Comparison Shopping?]

Friday, July 13th, 2012

Titan Stress Poster from Marko A. Rodriguez.

Notice of a poster at GraphLab 2012 with Matthias Broecheler:

This poster presents an overview of Titan along with some excellent stress testing done by Matthias and Dan LaRoque. The stress test uses a 6 machine Titan cluster with 14 read/write servers slamming Titan with various read/writes. The results are presented in terms of the number of bytes being read/write from disk, the average runtime of the queries, the cost of a transaction on Amazon EC2, and a speculation of the number of concurrent users are concurrently interacting.

Being a poster you will have to pump up the size for legibility but I think you will like the poster.

Impressive numbers. Including the Amazon EC2 cost.

Makes me wonder when governments are going to start requiring cost comparisons for system bids versus use of Amazon EC2?

Asgard for Cloud Management and Deployment

Friday, June 29th, 2012

Asgard for Cloud Management and Deployment

Amazon is tooting the horn of one of its larger customers, Netflix, when it says:

Our friends at Netflix have embraced AWS whole-heartedly. They have shared much of what they have learned about how they use AWS to build, deploy, and host their applications. You can read the Netflix Tech Blog and benefit from what they have learned.

Earlier this week they released Asgard, a web-based cloud management and deployment tool, in open source form on GitHub. According to Norse mythology, Asgard is the home of the god of thunder and lightning, and therefore controls the clouds! This is the same tool that the engineers at Netflix use to control their applications and their deployments.

Asgard layers two additional abstractions on top of AWS — Applications and Clusters.

Even if you are just in the planning (dreaming?) stages of cloud deployment for your topic map application, it would be good to review the Netflix blog. On Asgard and other posts as well.

You know how I hate to complain, ;-) , but the Elder Edda does not report “Asgard” as the “home of the god of thunder and lightning.” All the gods resided at Asgard.

Even the link in the quoted part of Jeff’s post gets that much right.

Most of the time old stories told aright are more moving than modern misconceptions.

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Wednesday, June 27th, 2012

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

Sage Bionetworks and Amazon SWF

Friday, June 22nd, 2012

Sage Bionetworks and Amazon SWF

From the post:

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other bio molecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complementary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing costs and efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and Source Forge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically-relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility in where we run computational jobs which enables Synapse to leverage the right combination of infrastructure for every project.”
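The separation Kellen describes, state transition logic on one side and activities on the other, can be illustrated with a toy pipeline. This is a hypothetical sketch in plain Python: none of the names below come from Amazon SWF or Synapse, and it does not call any AWS API. It only shows why the two halves can be written by different people.

```python
# Hypothetical illustration only: not the Amazon SWF API and not Synapse code.

# Activities: the actual work done in each step (the scientists' side).
def normalize(data):
    """Scale the values so they sum to 1."""
    total = sum(data["values"])
    data["values"] = [v / total for v in data["values"]]
    return data


def score(data):
    """Record the largest normalized value as the score."""
    data["score"] = max(data["values"])
    return data


ACTIVITIES = {"normalize": normalize, "score": score}

# State transition logic: which step follows which (the engineers' side).
# A value of None marks the end of the pipeline.
TRANSITIONS = {"start": "normalize", "normalize": "score", "score": None}


def run_pipeline(data, transitions=TRANSITIONS, activities=ACTIVITIES):
    """Walk the transition table, running one activity per state."""
    history = []
    step = transitions["start"]
    while step is not None:
        data = activities[step](data)
        history.append(step)
        step = transitions[step]
    return data, history


if __name__ == "__main__":
    result, history = run_pipeline({"values": [1, 1, 2]})
    print(history, result["score"])
```

Reordering or adding steps only touches the transition table, while changing what a step computes only touches its activity function.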

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations that consist of reading slides about management advantages to Amazon SWF simply can’t reach. At least not for me.

Take the tip and follow the case studies, then onto the documentation.

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And have < 0 interest in DRM antics on material that, if piped to /dev/null, would raise a user’s IQ.

Graph DB + Bioinformatics: Bio4j,…

Thursday, June 21st, 2012

Graph DB + Bioinformatics: Bio4j, recent applications and future directions by Pablo Pareja.

If you haven’t seen one of Pablo’s slide decks on Bio4j, get ready for a real treat!

Let me quote the numbers from slide 42, which is entitled: “Bio4j + MG7 + 24 Chip-Seq samples”

157 639 502 nodes

742 615 705 relationships

632 832 045 properties

149 relationship types

44 node types

And it works just fine!

Granted, he is not running this on his cellphone, but if you are going to process serious data, you need serious computing power. (OK, he uses Amazon Web Services. Like I said, not his cellphone.)

Did I mention everything done by Oh no sequences! (www.ohnosequences.com) is 100% Open source?

There is much to learn here. Enjoy!

Lessons from Amazon RDS on Bringing Existing Apps to the Cloud

Thursday, June 21st, 2012

Lessons from Amazon RDS on Bringing Existing Apps to the Cloud by Nati Shalom.

From the post:

It’s a common belief that Cloud is good for green field apps. There are many reasons for this, in particular the fact that the cloud forces a different kind of thinking on how to run apps. Native cloud apps were designed to scale elastically, they were designed with complete automation in mind, and so forth. Most of the existing apps (a.k.a brown field apps) were written in a pre-cloud world and therefore don’t support these attributes. Adding support for these attributes could carry a significant investment. In some cases, this investment could be so big that it would make more sense to go through a complete re-write.

In this post I want to challenge this common belief. Over the past few years I have found that many stateful applications running on the cloud don’t support all those attributes, elasticity in particular. One of the better-known examples of this is MySQL and its Amazon cloud offering, RDS, which I’ll use throughout this post to illustrate my point.

Amazon RDS as an example for migrating a brown-field applications

MySQL was written in a pre-cloud world and therefore fits into the definition of a brown-field app. As with many brown-field apps, it wasn’t designed to be elastic or to scale out, and yet it is one of the more common and popular services on the cloud. To me, this means that there are probably other attributes that matter even more when we consider our choice of application in the cloud. Amazon RDS is the cloud-enabled version of MySQL. It can serve as a good example to find what those other attributes could be.

You have to admit that the color imagery is telling. Pre-cloud applications are “brown-field” apps and cloud apps are “green.”

I think the survey numbers about migrating to the cloud are fairly soft and not always consistent. There will be “green” and “brown” field apps created or migrated to the cloud.

But brown field apps will remain, just as relational databases did not displace all the non-relational databases, which persist to this day.

Technology is as often “in addition to” as it is “in place of.”

MapR Now Available as an Option on Amazon Elastic MapReduce

Sunday, June 17th, 2012

MapR Now Available as an Option on Amazon Elastic MapReduce

From the post:

MapR Technologies, Inc., the provider of the open, enterprise-grade distribution for Apache Hadoop, today announced the immediate availability of its MapR Distribution for Hadoop as an option within the Amazon Elastic MapReduce service. Customers can now provision dynamically scalable MapR clusters while taking advantage of the flexibility, agility and massive scalability of Amazon Web Services (AWS). In addition, AWS has made its own Hadoop enhancements available to MapR customers, allowing them to seamlessly use MapR with other AWS offerings such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB and Amazon CloudWatch.

“We’re excited to welcome MapR’s feature-rich distribution as an option for customers running Hadoop in the cloud,” said Peter Sirota, general manager of Amazon Elastic MapReduce, AWS. “MapR’s innovative high availability data protection and performance features combined with Amazon EMR’s managed Hadoop environment and seamless integration with other AWS services provides customers a powerful tool for generating insights from their data.”

Customers can provision MapR clusters on-demand and automatically terminate them after finishing data processing, reducing costs as they only pay for the resources they consume. Customers can augment their existing on-premise deployments with AWS-based clusters to improve disaster recovery and access additional compute resources as required.

“For many customers there is no longer a compelling business case for deploying an on-premise Hadoop cluster given the secure, flexible and highly cost effective platform for running MapR that AWS provides,” said John Schroeder, CEO and co-founder, MapR Technologies. “The combination of AWS infrastructure and MapR’s technology, support and management tools enables organizations to potentially lower their costs while increasing the flexibility of their data intensive applications.”

Are you doing topic maps in the cloud yet?

A rep from one of the “big iron” companies was telling me how much more reliable owning your own hardware with their software is than the cloud.

True, but that has the same answer as the question: Who needs the capacity to process petabytes of data in real time?

If the truth were told, there are a few companies and organizations that could benefit from that capability.

But the rest of us don’t have that much data or the talent to process it if we did.

Over the summer I am going to try the cloud out, both generally and for topic maps.

Suggestions/comments?

One Trillion Stored (and counting) [new uncertainty principle?]

Tuesday, June 12th, 2012

Amazon S3 – The First Trillion Objects

Jeff Barr writes:

Late last week the number of objects stored in Amazon S3 reached one trillion (1,000,000,000,000 or 10^12). That’s 142 objects for every person on Planet Earth or 3.3 objects for every star in our Galaxy. If you could count one object per second it would take you 31,710 years to count them all.

We knew this day was coming! Lately, we’ve seen the object count grow by up to 3.5 billion objects in a single day (that’s over 40,000 new objects per second).
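The figures in the quote are easy to check. A minimal sketch, assuming 365-day years (my assumption; the post does not say how it counted) and using made-up function names:

```python
# Checking the arithmetic behind the quoted S3 numbers.
OBJECTS = 10**12                         # one trillion objects
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day years
PEAK_NEW_OBJECTS_PER_DAY = 3.5e9          # "up to 3.5 billion objects in a single day"


def years_to_count(objects=OBJECTS):
    """Years needed to count `objects` at one object per second."""
    return objects / SECONDS_PER_YEAR


def peak_objects_per_second(per_day=PEAK_NEW_OBJECTS_PER_DAY):
    """Average new objects per second on a peak day."""
    return per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Years to count one trillion objects: {years_to_count():,.0f}")
    print(f"New objects per second at peak: {peak_objects_per_second():,.0f}")
```

Both results agree with the post: about 31,710 years, and a little over 40,000 new objects per second.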

Old news because no doubt the total is greater than one trillion a week later. Or perhaps any time period greater than 1/40,000 of a second?

Is there a new uncertainty principle? Overall counts for S3 are estimates for some time X?

New Mechanical Turk Categorization App

Saturday, May 19th, 2012

New Mechanical Turk Categorization App

Categorization is one of the more popular use cases for the Amazon Mechanical Turk. A categorization HIT (Human Intelligence Task) asks the Worker to select from a list of options. Our customers use HITs of this type to assign product categories, match URLs to business listings, and to discriminate between line art and photographs.

Using our new Categorization App, you can start categorizing your own items or data in minutes, eliminating the learning curve that has traditionally accompanied this type of activity. The app includes everything that you need to be successful including:

  1. Predefined HITs (no HTML editing required).
  2. Pre-qualified Master Workers (see Jinesh’s previous blog post on Mechanical Turk Masters).
  3. Price recommendations based on complexity and comparable HITs.
  4. Analysis tools.

The Categorization App guides you through the four simple steps that are needed to create your categorization project.

I thought the contrast between gamers (the GPU post) and MTurkers would be a nice way to close the day. ;-)

Although, there are efforts to create games where useful activity happens, whether intended or not. (Would that take some of the joy out of a game?)

If you use this particular app, please blog or post a note about your experience.

Thanks!

AWS NYC Summit 2012

Tuesday, May 1st, 2012

AWS NYC Summit 2012

The line that led me to this read:

We posted 25 presentations from the New York 2012 AWS Summit.

Actually, no.

Posted 25 slide decks, not presentations.

Useful yes, presentations, no.

Not to complain too much given the rapid expansion of services and technical guidance but let’s not confuse slides with presentations.

The AWS Report (Episode 2) has one major improvement: The clouds in the background don’t move! (As they did in the first episode. Now there was a shadow that moved over the front of the desk.)

We need to ask Amazon to get Jeff a new laptop without all the stickers on the top. If Paula Abdul or Vanna White were doing the interview, the laptop stickers would not be distracting. Or at least not enough to complain. Jeff isn’t Paula Abdul or Vanna White. Sorry Jeff.

I think the AWS Report has real potential. Several short segments with more “facts” and fewer “general” statements would be great.

Enjoyed the Elastic Beanstalk episode, but hearing that customers are busy and happy, and that requirements were gathered for other language support (besides Java), is like hearing public service announcements on PBS.

Nothing to disagree with but no real content either.

Suggestion: Perhaps a short, say 90 to 120 second, description of a typical issue (off a mailing list?) that ends with: What is your solution? Then feature one or more solutions on the next show. That would get the audience involved and get other people hawking the show.

Not quite the cover of the Rolling Stone but perhaps someday… ;-)

CloudSpokes Coding Challenge Winners – Build a DynamoDB Demo

Saturday, April 14th, 2012

CloudSpokes Coding Challenge Winners – Build a DynamoDB Demo

From the post:

Last November CloudSpokes was invited to participate in the DynamoDB private beta. We spent some time kicking the tires, participating in the forums and developing use cases for their Internet-scale NoSQL database service. We were really excited about the possibilities of DynamoDB and decided to crowdsource some challenge ideas from our 38,000 strong developer community. Needless to say, the release generated quite a bit of buzz.

When Amazon released DynamoDB in January, we launched our CloudSpokes challenge Build an #Awesome Demo with Amazon DynamoDB along with a blog post and a sample “Kiva Loan Browser Demo” application to get people started. The challenge requirements were wide open and all about creating the coolest application using Amazon DynamoDB. We wanted to see what the crowd could come up with.

The feedback we received from numerous developers was extremely positive. The API was very straightforward and easy to work with. The SDKs and docs, as usual, were top-notch. Developers were able to get up to speed fast as DynamoDB’s simple storage and query methods were easy to grasp. These methods allowed developers to store and access data items with a flexible number of attributes using the simple “Put” or “Get” verbs that they are familiar with. No surprise here, but we had a number of comments regarding the speed of both read and write operations.

When our challenge ended a week later we were pleasantly surprised with the applications and chose to highlight the following top five:

I don’t think topic maps has 38,000 developers but challenges do seem to pull people out of the woodwork.

Any thoughts on what would make interesting/attractive challenges? Other than five figure prizes? ;-)

The CloudFormation Circle of Life : Part 1

Thursday, April 12th, 2012

The CloudFormation Circle of Life : Part 1

From the post:

AWS CloudFormation makes it easier for you to create, update, and manage your AWS resources in a predictable way. Today, we are announcing a new feature for AWS CloudFormation that allows you to add or remove resources from your running stack, enabling your stack to evolve as its requirements change over time. With AWS CloudFormation, you can now manage the complete lifecycle of the AWS resources powering your application.

I think there is a name for this sort of thing. Innovation, that’s right! That’s the name for it!

As topic map services move into the clouds, being able to take advantage of resource stacks is likely to be important. Particularly if you have mapping empowered resources that can be placed in a stack of resources.

The “cloud” in general looks like an opportunity to move away from ETL (Extract-Transform-Load) into more of an ET (Extract-Transform) model. Particularly if you take a functional view of data. Will save on storage costs, particularly if the data sets are quite large.

Definitely a service that anyone working with topic maps in the cloud needs to know more about.
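
The ET idea above can be sketched with lazy generators: records are extracted and transformed on demand, and nothing is loaded into a second store. This is only an illustration under my own assumptions. The `key=value` record format, the function names, and the sample data are all hypothetical, and none of it comes from CloudFormation or any AWS API.

```python
from typing import Iterable, Iterator, List


def extract(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily parse hypothetical 'key=value' records, skipping blank lines."""
    for line in lines:
        key, _, value = line.strip().partition("=")
        if key:
            yield {"key": key, "value": value}


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Normalize each record as it streams past; nothing is stored."""
    for record in records:
        yield {"key": record["key"].lower(), "value": record["value"].strip()}


def query(source: Iterable[str], wanted_key: str) -> List[str]:
    """Answer a question straight from the source: Extract-Transform, no Load."""
    return [r["value"] for r in transform(extract(source)) if r["key"] == wanted_key]


# Hypothetical source data standing in for a large remote data set.
source = ["Name=Elephant", "", "name= S3 ", "Type=store"]
print(query(source, "name"))  # ['Elephant', 'S3']
```

The trade-off is that every query re-reads and re-transforms the source, so compute is spent in exchange for not keeping a transformed copy.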

AWS Documentation Now Available on the Kindle

Wednesday, April 11th, 2012

AWS Documentation Now Available on the Kindle

From the post:

AWS documentation is now available on the Kindle – if this is all you need to know, start here and you’ll have access to the new documents in seconds.

I “purchased” (the actual cost is $0.00) the EC2 Getting Started Guide and had it delivered to my trusty Kindle DX, where it looked great:

[graphic omitted]

You can highlight, annotate, and search the content as desired.

We’ve uploaded 43 documents so far; others will follow shortly.

Two observations:

For the “cloud” Kindle (what I use on Linux to read Kindle titles), one should be able to select multiple AWS documentation titles for a single batch download. Yes?

Ahem, at least “Analyzing Big Data with AWS” did not have an index.

Indexing all the AWS titles together (not entirely auto-magically) would make AWS documentation a cut above its competitors. (At least a goal to start with. Later versions can mix in titles from publishers, blogs, etc.)

Amazon DynamoDB Libraries, Mappers, and Mock Implementations Galore!

Friday, April 6th, 2012

Amazon DynamoDB Libraries, Mappers, and Mock Implementations Galore!

From the post:

Today’s guest blogger is Dave Lang, Product Manager of the DynamoDB team, who has a great list of tools and SDKs that will allow you to use DynamoDB from just about any language or environment.

While you are learning AWS, you may as well take a look at DynamoDB.

Comments on any of these resources? I just looked at them briefly but they seemed quite, err, uneven.

I understand wanting to thank everyone who made an effort but on the other hand, I think AWS customers would be well served by a top coder’s type list of products. X% of the top 100 AWS projects use Y. That sort of thing.

The 1000 Genomes Project

Monday, April 2nd, 2012

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” ;-)

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in such texts?

It is our view of the data that makes it look smooth or simple. Or complex.

The Total Cost of (Non) Ownership of a NoSQL Database Service

Monday, April 2nd, 2012

The Total Cost of (Non) Ownership of a NoSQL Database Service

From the post:

We have received tremendous positive feedback from customers and partners since we launched Amazon DynamoDB two months ago. Amazon DynamoDB enables customers to offload the administrative burden of operating and scaling a highly available distributed database cluster while only paying for the actual system resources they consume. We also received a ton of great feedback about how simple it is to get started and how easy it is to scale the database. Since Amazon DynamoDB introduced the new concept of a provisioned throughput pricing model, we also received several questions around how to think about its Total Cost of Ownership (TCO).

We are very excited to publish our new TCO whitepaper: The Total Cost of (Non) Ownership of a NoSQL Database service. Download PDF.

I bet you can guess how the numbers work out without reading the PDF file. ;-)

Makes me wonder though if there would be a market for a different hosted NoSQL database or topic map application? Particularly a topic map application.

Not along the lines of Maiana but more of a topic based data set, which could respond to data by merging it with already stored data. Say for example a firefighter scans the bar code on a railroad car lying alongside the tracks with fire getting closer. The only thing they want is a list of the necessary equipment and whether to leave now, or not.

Most preparedness agencies would be well pleased to simply pay for the usage they get of such a topic map.

Two New AWS Getting Started Guides

Saturday, March 24th, 2012

Two New AWS Getting Started Guides

From the post:

We’ve put together a pair of new Getting Started Guides for Linux and Microsoft Windows. Both guides will show you how to use EC2, Elastic Load Balancing, Auto Scaling, and CloudWatch to host a web application.

The Linux version of the guide (HTML, PDF) is built around the popular Drupal content management system. The Windows version (HTML, PDF) is built around the equally popular DotNetNuke CMS.

These guides are comprehensive. You will learn how to:

  • Sign up for the services
  • Install the command line tools
  • Find an AMI
  • Launch an Instance
  • Deploy your application
  • Connect to the Instance using the MindTerm SSH Client or PuTTY
  • Configure the Instance
  • Create a custom AMI
  • Create an Elastic Load Balancer
  • Update a Security Group
  • Configure and use Auto Scaling
  • Create a CloudWatch Alarm
  • Clean up

Other sections cover pricing, costs, and potential cost savings.

Not quite a transparent computing fabric, yet. ;-)