Archive for the ‘Cloud Computing’ Category

Spending Time Rolling Your Own or Using Google Tools in Anger?

Wednesday, March 30th, 2016

The question of whether to spend time rolling your own or to use Google tools in anger is one faced by many people who have watched computer technology evolve.

You could write your own blogging software or you can use one of the standard distributions.

You could write your own compiler or you can use one of the standard distributions.

You could install and maintain your own machine learning and big data apps, or you can use the tools offered by Google Machine Learning.

Tinkering with your local system until it is “just so” is fun, but it eats into billable time and honestly is a distraction.

I'm not promising to immerse myself in the Google-verse, but an honest assessment of where to spend my time is in order.

Google takes Cloud Machine Learning service mainstream by Fausto Ibarra, Director, Product Management.

From the post:

Hundreds of different big data and analytics products and services fight for your attention as it’s one of the most fertile areas of innovation in our industry. And it’s no wonder; the most amazing consumer experiences are driven by insights derived from information. This is an area where Google Cloud Platform has invested almost two decades of engineering, and today at GCP NEXT we’re announcing some of the latest results of that work. This next round of innovation builds on our portfolio of data management and analytics capabilities by adding new products and services in multiple key areas:

Machine Learning:

We’re on a journey to create applications that can see, hear and understand the world around them. Today we’ve taken a major stride forward with the announcement of a new product family: Cloud Machine Learning. Cloud Machine Learning will take machine learning mainstream, giving data scientists and developers a way to build a new class of intelligent applications. It provides access to the same technologies that power Google Now, Google Photos and voice recognition in Google Search as easy to use REST APIs. It enables you to build powerful Machine Learning models on your data using the open-source TensorFlow machine learning library:

Big Data and Analytics:

Doing big data the cloud way means being more productive when building applications, with faster and better insights, without having to worry about the underlying infrastructure. To further this mission, we recently announced the general availability of Cloud Dataproc, our managed Apache Hadoop and Apache Spark service, and we’re adding new services and capabilities today:

Open Source:

Our Cloud Machine Learning offering leverages Google’s cutting edge machine learning and data processing technologies, some of which we’ve recently open sourced:

What, if anything, do you see as a serious omission in this version of the Google-verse?


DataGraft: Initial Public Release

Monday, September 7th, 2015

DataGraft: Initial Public Release

As a former resident of Louisiana and given my views on the endemic corruption in government contracts, putting “graft” in the title of anything is like waving a red flag at a bull!

From the webpage:

We are pleased to announce the initial public release of DataGraft – a cloud-based service for data transformation and data access. DataGraft is aimed at data workers and data developers interested in simplified and cost-effective solutions for managing their data. This initial release provides capabilities to:

  • Transform tabular data and share transformations: Interactively edit, host, execute, and share data transformations
  • Publish, share, and access RDF data: Data hosting and reliable RDF data access / data querying

Sign up for an account and try DataGraft now!

You may want to check out our FAQ, documentation, and the APIs. We’d be glad to hear from you – don’t hesitate to get in touch with us!

I followed a tweet from Kirk Borne recently to a demo of Pentaho on data integration. I mention that because Pentaho is a good representative of the commercial end of data integration products.

Oh, the demo was impressive: a visual interface for assembling nicely styled icons representing different data sources, integration steps, visualizations, etc.

But the one characteristic it shares with DataGraft is that I would be hard pressed to follow or verify your reasoning for integrating that particular data.

If it happens that both files have a customerID field and, by some chance, both carry the same semantic, then you can glibly talk about integrating data from diverse resources. If not, well, then your mileage will vary a great deal.

The important point dropped by both Pentaho and DataGraft is that data integration isn’t just an issue for today; the same integration must remain robust long after I have moved on to another position.

Like spreadsheets, the next person in my position could just run the process blindly and hope that no one ever asks for a substantive change, but that sounds terribly inefficient.

Why not provide users with the ability to disclose the properties they “see” in the data sources and to indicate why they made the mappings they did?

That is, make the mapping process more transparent.
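As a sketch of what that disclosure might look like (file names, fields, and structure all invented for illustration), each mapping could carry the property the analyst "saw" on each side and the reason the match was made:

```python
# Hypothetical sketch: a mapping record that discloses not just which
# fields were matched, but what property each was judged to carry and
# why. All names here are invented for illustration.

mapping = {
    ("crm.csv", "customerID"): {
        "maps_to": ("billing.csv", "cust_id"),
        "property_seen": "internal customer identifier",
        "rationale": "both columns join cleanly against the 2014 "
                     "account master; verified on a 1,000-row sample",
        "author": "analyst@example.com",
    },
}

def explain(source, field):
    """Render the mapping and its rationale for the next person."""
    entry = mapping[(source, field)]
    tgt_file, tgt_field = entry["maps_to"]
    return (f"{source}:{field} -> {tgt_file}:{tgt_field} "
            f"because {entry['rationale']}")

print(explain("crm.csv", "customerID"))
```

The point is not this particular structure but that the rationale travels with the mapping, so a successor can verify or revise it instead of running the process blindly.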

NOAA weather data – Valuing Open Data – Guessing – History Repeats

Sunday, April 26th, 2015

Tech titans ready their clouds for NOAA weather data by Greg Otto.

From the post:

It’s fitting that the 20 terabytes of data the National Oceanic and Atmospheric Administration produces every day will now live in the cloud.

The Commerce Department took a step Tuesday to make NOAA data more accessible as Commerce Secretary Penny Pritzker announced a collaboration among some of the country’s top tech companies to give the public a range of environmental, weather and climate data to access and explore.

Amazon Web Services, Google, IBM, Microsoft and the Open Cloud Consortium have entered into a cooperative research and development agreement with the Commerce Department that will push NOAA data into the companies’ respective cloud platforms to increase the quantity of and speed at which the data becomes publicly available.

“The Commerce Department’s data collection literally reaches from the depths of the ocean to the surface of the sun,” Pritzker said during a Monday keynote address at the American Meteorological Society’s Washington Forum. “This announcement is another example of our ongoing commitment to providing a broad foundation for economic growth and opportunity to America’s businesses by transforming the department’s data capabilities and supporting a data-enabled economy.”

According to Commerce, the data used could come from a variety of sources: Doppler radar, weather satellites, buoy networks, tide gauges, and ships and aircraft. Commerce expects this data to launch new products and services that could benefit consumer goods, transportation, health care and energy utilities.

The original press release has this cheery note on the likely economic impact of this data:

So what does this mean to the economy? According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide. If more of this data could be efficiently released, organizations will be able to develop new and innovative products and services to help us better understand our planet and keep communities resilient from extreme events.

Ah, yes, that would be the Open data: Unlocking innovation and performance with liquid information, on which the summary page says:

Open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors.

But you need to read the full report (PDF) in order to find footnote 3 on “economic value:”

3. Throughout this report we express value in terms of annual economic surplus in 2013 US dollars, not the discounted value of future cash flows; this valuation represents estimates based on initiatives where open data are necessary but not sufficient for realizing value. Often, value is achieved by combining analysis of open and proprietary information to identify ways to improve business or government practices. Given the interdependence of these factors, we did not attempt to estimate open data’s relative contribution; rather, our estimates represent the total value created.

That is a disclosure that the estimate of $3 to $5 trillion is a guess and/or speculation.

Odd how the guess/speculation disclosure drops out of the Commerce Department press release, and by the time it reaches Greg’s story it reads:

open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide.

From guess/speculation to no mention to fact, all in the short space of three publications.

Does the valuing of open data remind you of:


(Image: a 1609 advertisement promising “Excellent Fruites by Planting.”)

The date of 1609 is important. Wikipedia has an article on Virginia, 1609-1610, titled, Starving Time. That year, only sixty (60) out of five hundred (500) colonists survived.

Does “Excellent Fruites by Planting” sound a lot like “new and innovative products and services?”

It does to me.

I first saw this in a tweet by Kirk Borne.

Seeding the cloud…

Sunday, December 28th, 2014

Seeding the cloud — AWS gives credits with select edX certs by Barb Darrow.

From the post:

Amazon definitely wants enterprises to adopt its cloud, but it’s still wooing little startups too. This week, it said it will issue $1,000 in Amazon Web Services credit to any student who completes qualifying edX certifications in entrepreneurship. EdX is the online education platform backed by MIT, Harvard, and a raft of other universities.

Barb mentions at least six (6) other special cloud offers in a very short article. No doubt more are going to show up in 2015.

Are you going to be in the cloud in 2015?

I searched for “minimum fee” information for the entrepreneurship courses but got caught in a loop of HTML pages, none of which offered an actual answer.

Looking at some of the series courses, I would guess the “minimum fee” would be at or less than $100 per course. Check when you enroll for the actual “minimum fee.” Why the site admins want to be cagey about such a reasonable fee I cannot say.

Hot Cloud Swap: Migrating a Database Cluster with Zero Downtime

Tuesday, December 23rd, 2014

Hot Cloud Swap: Migrating a Database Cluster with Zero Downtime by Jennifer Rullmann.

From the post:

By now, you may have heard about, seen, or even tried your hand against the fault tolerance of our database. The Key-Value Store, and the layers that turn it into a multi-model database, handle a wide variety of disasters with ease. In this real-time demo video, we show off the ability to migrate a cluster to a new set of machines with zero downtime.


We’re calling this feature ‘hot cloud swap’, because although you can use it on your own machines, it’s particularly interesting to those who run their database in the cloud and may want to switch providers. And that’s exactly what I do in the video. Watch me migrate a database cluster from Digital Ocean to Amazon Web Services in under 7 minutes, real-time!

It’s been years, but I can remember as a sysadmin swapping out “hot swappable” drives. Never lost any data, but there was always that moment of doubt during the rebuild.

Personally, I would have more than one complete and tested backup, to the extent that is possible, before trying a “hot cloud swap.” That may be overly cautious, but better cautious than crossing into the “Sony Zone.”

At one point Jennifer says:

“…a little bit of hesitation but it worked it out.”

Difficult to capture but if you look at time marker 06.52.85 on the clock below the left hand window, writes start failing.

It recovers but it is not the case that the application never stops. At least in the sense of writes. Depends on your definition of “stops” I suppose.

I am sure that the fault tolerance built into FoundationDB made this less scary, but the “hot swap” part should be doable with any clustering solution. Yes?

That is, you add “new” machines to the cluster, then exclude the “old” machines, which results in a complete transfer of data to the “new” machines; at that point you move the coordinators to the “new” machines and eventually shut down the “old” ones. Is there something unique about that process to FoundationDB?
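Nothing in those steps seems database-specific. A toy simulation of the generic process (invented node names, a naive least-loaded placement policy, no real networking or FoundationDB specifics):

```python
# Toy simulation of the generic "hot swap" steps: add new nodes,
# exclude the old ones (their data drains onto the rest), retire them.
# The placement policy and node names are invented for illustration.

class Cluster:
    def __init__(self, nodes):
        self.nodes = {n: set() for n in nodes}   # node -> keys it holds

    def write(self, key):
        # place each key on the least-loaded active node
        target = min(self.nodes, key=lambda n: len(self.nodes[n]))
        self.nodes[target].add(key)

    def add(self, node):
        self.nodes[node] = set()

    def exclude(self, node):
        # drain the node's data onto the remaining nodes, then retire it
        for key in self.nodes.pop(node):
            self.write(key)

cluster = Cluster(["old1", "old2"])
for i in range(100):
    cluster.write(f"k{i}")

cluster.add("new1")
cluster.add("new2")
cluster.exclude("old1")
cluster.exclude("old2")

assert set(cluster.nodes) == {"new1", "new2"}
assert sum(len(s) for s in cluster.nodes.values()) == 100   # no data lost
```

The real engineering, of course, is doing the drain while serving live reads and writes; the skeleton of the procedure, though, looks the same for any cluster.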

Don’t get me wrong, I am hoping to learn a great deal more about FoundationDB in the new year but I intensely dislike distinctions between software packages that have no basis in fact.

Orleans Goes Open Source

Wednesday, December 17th, 2014

Orleans Goes Open Source

From the post:

Since the release of the Project “Orleans” Public Preview at //build/ 2014 we have received a lot of positive feedback from the community. We took your suggestions and fixed a number of issues that you reported in the Refresh release in September.

Now we decided to take the next logical step, and do the thing many of you have been asking for – to open-source “Orleans”. The preparation work has already commenced, and we expect to be ready in early 2015. The code will be released by Microsoft Research under an MIT license and published on GitHub. We hope this will enable direct contribution by the community to the project. We thought we would share the decision to open-source “Orleans” ahead of the actual availability of the code, so that you can plan accordingly.

The real excitement for me comes from a post just below this announcement, A Framework for Cloud Computing:

To avoid these complexities, we built the Orleans programming model and runtime, which raises the level of the actor abstraction. Orleans targets developers who are not distributed system experts, although our expert customers have found it attractive too. It is actor-based, but differs from existing actor-based platforms by treating actors as virtual entities, not as physical ones. First, an Orleans actor always exists, virtually. It cannot be explicitly created or destroyed. Its existence transcends the lifetime of any of its in-memory instantiations, and thus transcends the lifetime of any particular server. Second, Orleans actors are automatically instantiated: if there is no in-memory instance of an actor, a message sent to the actor causes a new instance to be created on an available server. An unused actor instance is automatically reclaimed as part of runtime resource management. An actor never fails: if a server S crashes, the next message sent to an actor A that was running on S causes Orleans to automatically re-instantiate A on another server, eliminating the need for applications to supervise and explicitly re-create failed actors. Third, the location of the actor instance is transparent to the application code, which greatly simplifies programming. And fourth, Orleans can automatically create multiple instances of the same stateless actor, seamlessly scaling out hot actors.

Overall, Orleans gives developers a virtual “actor space” that, analogous to virtual memory, allows them to invoke any actor in the system, whether or not it is present in memory. Virtualization relies on indirection that maps from virtual actors to their physical instantiations that are currently running. This level of indirection provides the runtime with the opportunity to solve many hard distributed systems problems that must otherwise be addressed by the developer, such as actor placement and load balancing, deactivation of unused actors, and actor recovery after server failures, which are notoriously difficult for them to get right. Thus, the virtual actor approach significantly simplifies the programming model while allowing the runtime to balance load and recover from failures transparently. (emphasis added)

Not in a distributed computing context, but the “look and it’s there” model is something I recall from HyTime. So nice to see good ideas resurface!

Just imagine doing that with topic maps, including having properties of a topic, should you choose to look for them. If you don’t need a topic, why carry the overhead around? Wait for someone to ask for it.
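A minimal sketch of the virtual actor idea quoted above, in plain Python rather than Orleans’ .Net (all names here are invented): actors are never created or destroyed explicitly; the first message to an actor id brings an instance into memory.

```python
# Sketch of a "virtual actor space": sending a message to an actor id
# instantiates the actor on demand, echoing "an Orleans actor always
# exists, virtually." Class and method names are invented.

class Counter:
    """A toy actor: 'inc' bumps the count; any message returns it."""
    def __init__(self):
        self.value = 0

    def receive(self, msg):
        if msg == "inc":
            self.value += 1
        return self.value

class ActorSpace:
    def __init__(self, actor_cls):
        self.actor_cls = actor_cls
        self.instances = {}          # only actors that have been messaged

    def send(self, actor_id, msg):
        # instantiate lazily: the actor "exists" whether or not it is
        # in memory, just as a virtual address exists before it is paged in
        actor = self.instances.setdefault(actor_id, self.actor_cls())
        return actor.receive(msg)

space = ActorSpace(Counter)
space.send("a", "inc")
space.send("a", "inc")
assert space.send("a", "read") == 2
assert space.send("b", "read") == 0   # "b" springs into existence on demand
```

A real runtime adds the hard parts this sketch omits: placement across servers, reclaiming idle instances, and re-instantiating actors after a server failure. But the indirection from id to instance is the whole trick, and it is the same trick a topic map could use to materialize a topic only when someone asks for it.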

This week alone, Microsoft continued its fight for users and announced an open source project that will make me at least read about .Net ;-). I think Microsoft merits a lot of kudos and good wishes for the holiday season!

I first saw this at: Microsoft open sources cloud framework that powers Halo by Jonathan Vanian.

Open Source Aerospike NoSQL Database Scales To 1M TPS For $1.68 Per Hour…

Tuesday, November 11th, 2014

Open Source Aerospike NoSQL Database Scales To 1M TPS For $1.68 Per Hour On A Single Amazon Web Services Instance at AWS re:Invent 2014

From the post:

Aerospike – the first flash-optimized open source database and the world’s fastest in-memory NoSQL database – will be at Amazon Web Services (AWS) re:Invent 2014 conference in Las Vegas, Nev.

An ultra low latency Key-Value Store, Aerospike can operate in pure RAM backed by Amazon Elastic Block Store (EBS) for persistence as well as in a hybrid mode using RAM and SSDs. Aerospike engineers have documented the performance of different AWS EC2 instances and described the best techniques to achieve 1 Million transactions per second on one instance with sub-millisecond latency.

The Aerospike AMI in the Amazon Marketplace comes with cloud formation scripts for simple, single click deployments. The open source Aerospike Community Edition is free and the Aerospike Enterprise Edition with certified binaries and Cross Data Center Replication (XDR) is also free for startups in the startup special program. Aerospike is priced simply based on the volume of unique data managed, with no charge for replicated data, for Transactions Per Second (TPS) or number of servers in a cluster.

Aerospike is popularly used as a session store, cookie store, user profile store, id-mapping store, for fraud detection, dynamic pricing, real-time product recommendations and personalization of cross channel user experiences on websites, mobile apps, e-commerce portals, travel portals, financial services portals and real-time bidding platforms. To ensure 24x7x365 operations, data in Aerospike is replicated synchronously with immediate consistency within a cluster and asynchronously across clusters in different availability zones using Aerospike Cross Data Center Replication (XDR).

This is not a plug for or against Aerospike. I am mostly posting this as a reminder to me as much as you that cloud data prices can be remarkably sane. Even $1.68 per hour could add up over a week but if you develop locally and test in the cloud, you should be able to meet your budget targets.

For any paying client, you can pass the cloud hosting fees on to them (with an upfront deposit and one month in advance).

Other examples of reasonable cloud pricing?

iCloud: Leak for Less

Wednesday, September 10th, 2014

Apple rolls out iCloud pricing cuts by Jonathan Vanian.

Jonathan details the new Apple pricing schedule for the iCloud.

Now you can leak your photographs for less!

Cheap storage = Cheap security.

Is there anything about that statement that is unclear?

Virtual Workshop and Challenge (NASA)

Tuesday, June 24th, 2014

Open NASA Earth Exchange (NEX) Virtual Workshop and Challenge 2014

From the webpage:

President Obama has announced a series of executive actions to reduce carbon pollution and promote sound science to understand and manage climate impacts for the U.S.

Following the President’s call for developing tools for climate resilience, OpenNEX is hosting a workshop that will feature:

  1. Climate science through lectures by experts
  2. Computational tools through virtual labs, and
  3. A challenge inviting participants to compete for prizes by designing and implementing solutions for climate resilience.

Whether you win any of the $60K in prize money or not, this looks like a great way to learn about climate data, approaches to processing climate data and the Amazon cloud all at one time!

Processing in the virtual labs is on the OpenNEX (Open NASA Earth Exchange) nickel. You can experience cloud computing without fear of the bill for computing services. Gain valuable cloud experience and possibly make a contribution to climate science.


Wolfram Programming Cloud Is Live!

Monday, June 23rd, 2014

Wolfram Programming Cloud Is Live! by Stephen Wolfram.

From the post:

Twenty-six years ago today we launched Mathematica 1.0. And I am excited that today we have what I think is another historic moment: the launch of Wolfram Programming Cloud—the first in a sequence of products based on the new Wolfram Language.

Wolfram Programming Cloud

My goal with the Wolfram Language in general—and Wolfram Programming Cloud in particular—is to redefine the process of programming, and to automate as much as possible, so that once a human can express what they want to do with sufficient clarity, all the details of how it is done should be handled automatically.

I’ve been working toward this for nearly 30 years, gradually building up the technology stack that is needed—at first in Mathematica, later also in Wolfram|Alpha, and now in definitive form in the Wolfram Language. The Wolfram Language, as I have explained elsewhere, is a new type of programming language: a knowledge-based language, whose philosophy is to build in as much knowledge about computation and about the world as possible—so that, among other things, as much as possible can be automated.

The Wolfram Programming Cloud is an application of the Wolfram Language—specifically for programming, and for creating and deploying cloud-based programs.

How does it work? Well, you should try it out! It’s incredibly simple to get started. Just go to the Wolfram Programming Cloud in any web browser, log in, and press New. You’ll get what we call a notebook (yes, we invented those more than 25 years ago, for Mathematica). Then you just start typing code.

I am waiting to validate my email address to access the Wolfram portal.

It will take weeks to evaluate some of the claims made for the portal but I can attest that the Wolfram site in general remains very responsive, despite what must be snowballing load today.

That in and of itself is a good sign.


I first saw this in a tweet by Christophe Lalanne.

Functional Programming in the Cloud:…

Thursday, May 8th, 2014

Functional Programming in the Cloud: How to Run Haskell on OpenShift by Katie Miller.

From the post:

One of the benefits of Platform as a Service (PaaS) is that it makes it trivial to try out alternative technology stacks. The OpenShift PaaS is a polyglot platform at the heart of a thriving open-source community, the contributions of which make it easy to experiment with applications written in a host of different programming languages. This includes the purely functional Haskell language.

Although it is not one of the Red Hat-supported languages for OpenShift, Haskell web applications run on the platform with the aid of the community-created Haskell cartridge project. This is great news for functional programming (FP) enthusiasts such as myself and those who want to learn more about the paradigm; Haskell is a popular choice for learning FP principles. In this blog post, I will discuss how to create a Haskell web application on OpenShift.


If you do not have an OpenShift account yet, sign up for OpenShift Online for free. You’ll receive three gears (containers) in which to run applications. At the time of writing, each of these free gears comes with 512MB of RAM and 1GB of disk space.

To help you communicate with the OpenShift platform, you should install the RHC client tools on your machine. There are instructions on how to do that for a variety of operating systems at the OpenShift Dev Center. Once the RHC tools are installed, run the command rhc setup on the command line to configure RHC ready for use.

Katie’s post is a great way to get started with OpenShift!

However, it also reminds me of why I dislike Daylight Savings Time. It is getting dark later in the Eastern United States but there are still only twenty-four (24) hours in a day! An extra eight (8) hours a day and the stamina to stay awake for them would be better. 😉

Unlikely to happen so enjoy Katie’s post during the usual twenty-four (24) hour day.

No Silver Lining For U.S. Cloud Providers

Monday, April 28th, 2014

Judge’s ruling spells bad news for U.S. cloud providers by Barb Darrow.

From the post:

A court ruling on Friday over search warrants means continued trouble for U.S. cloud providers eager to build their businesses abroad.

In his ruling, U.S. Magistrate Judge James Francis found that big ISPs — including name brands Microsoft and Google — must comply with valid warrants to turn over customer information, including emails, even if that material resides in data centers outside the U.S., according to several reports.

Microsoft challenged such a warrant a few months back and this ruling was the response.

See the post for details but the bottom line is that you can’t rely on the loyalty of a cloud provider to its customers. U.S. cloud providers or non-U.S. cloud providers. When push comes to shove, you are going to find cloud providers siding with their local governments.

My suggestion is that you handle critical data security tasks yourself and off of any cloud provider.

That’s not a foolproof solution but otherwise you may as well cc: the NSA with your data.

BTW, I would not trust privacy or due process assurances from a government that admits it would target its own citizens for execution.


“May I?” on the Google Cloud Platform

Tuesday, January 21st, 2014

Learn about Permissions on Google Cloud Platform by Jeff Peck.

From the post:

Do your co-workers ask you “How should I set up Google Cloud Platform projects for my developers?” Have you wondered about the difference between the Project Id, the Project Number and the App Id? Do you know what a service account is and why you need one? Find the answers to these and many other questions in a newly published guide to understanding permissions, projects and accounts on Google Cloud Platform.

Especially if you are just getting started, and are still sorting out the various concepts and terminology, this is the guide for you. The article includes explanations, definitions, best practices and links to the relevant documentation for more details. It’s a good place to start when learning to use Cloud Platform.

It’s not exciting reading, but it may keep you from looking real dumb when the bill for Google cloud services comes in. Kinda hard to argue that Google configured your permissions incorrectly.

Be safe, read about permissions before your potential successor does.

A better way to explore and learn on GitHub (Google Cloud)

Saturday, January 18th, 2014

A better way to explore and learn on GitHub

From the post:

Almost one year ago, Google Cloud Platform launched our GitHub organization, with repositories ranging from tutorials to samples to utilities. This is where developers could find all resources relating to the platform, and get started developing quickly. We started with 36 repositories, with lofty plans to add more over time in response to requests from you, our developers. Many product releases, feature launches, and one logo redesign later, we are now up to 123 repositories illustrating how to use all parts of our platform!

Despite some clever naming schemes, it was becoming difficult to find exactly the code that you wanted amongst all of our repositories. Idly browsing through over 100 options wasn’t productive. The repository names gave you an idea of what stacks they used, but not what problems they solved.

Today, we are making it easier to browse our repositories and search for sample code with our landing page. Whether you want to find all Compute Engine resources, locate all samples that are available in your particular stack, or find examples that fit your particular area of interest, you can find it with the new GitHub page. We’ll be rotating the repositories in the featured section, so make sure to wander that way from time to time.

Less than a year old and their standard metadata (read navigation) details are changing.

Judging from the comments, their users deeply appreciate the new approach.

Change is something that funders calling for standard metadata just don’t get. Which is why new standard metadata projects are so common. It is the same mistake, repeated over and over again.

To be sure, domains need to take their best shot at today’s standard metadata, but with an eye toward it pointing to tomorrow’s standard metadata. To be truly useful in STEM fields, it needs to point back to yesterday’s standard metadata as well.

Sorry, got distracted.

Check out the new resources and get thee to the cloud!

Sense Preview

Sunday, January 12th, 2014

Sense is in private beta but you can request an invitation.

Even though the presentation is well rehearsed, this is pretty damned impressive!

The bar for cloud based computing continues to rise.

Follow @senseplatform.

I first saw this at Danny Bickson’s Sense: collaborative data science in the cloud

PS: Learn more about Sense at the 3rd GraphLab Conference.

Get savvy on the latest cloud terminology

Sunday, January 12th, 2014

Get savvy on the latest cloud terminology by Nick Hardiman.

From the post:

As with all technology, some lingo stays popular, while other phrases decline in use. Use this list to find out the newest terminology for all things cloud.

Some cloud terms, like cloudstorming, cloudware and external cloud, are declining in popularity. Other terms are up-and-coming, like vertical cloud.

This list gives all the latest lingo to keep you up-to-date on the most popular terms for all things cloud:

Nick has assembled a list of fifty-one (51) cloud terms.

Could be useful in creating a vocabulary a la

As Nick says, the lingo is going to change. Using a microformat and vocabulary can help you maintain access to information.

For example, Nick says:


Cluster: a collection of machines that work together to deliver a customer service. Cloud clusters grow and shrink on-demand. A cloud service provides an API for scaling out a cluster, by adding more machines.

When quoting that, I could say:

<blockquote itemscope itemtype="http://schema.org/Thing">
<link itemprop="sameAs" href="">
a collection of machines that work together to deliver a customer service. Cloud clusters grow and shrink on-demand. A cloud service provides an API for scaling out a cluster, by adding more machines.
</blockquote>

Which would distinguish (when searching) a “cluster” of computers from one of the other 38 uses of “cluster” found at:

Rather than using “Thing” from, I really should find or make an extension to that vocabulary for terms in various areas that are relevant to topic maps.

Secure Cloud Computing – Very Secure

Friday, December 27th, 2013

Daunting Mathematical Puzzle Solved, Enables Unlimited Analysis of Encrypted Data

From the post:

IBM inventors have received a patent for a breakthrough data encryption technique that is expected to further data privacy and strengthen cloud computing security.

The patented breakthrough, called “fully homomorphic encryption,” could enable deep and unrestricted analysis of encrypted information — intentionally scrambled data — without surrendering confidentiality. IBM’s solution has the potential to advance cloud computing privacy and security by enabling vendors to perform computations on client data, such as analyzing sales patterns, without exposing or revealing the original data.

IBM’s homomorphic encryption technique solves a daunting mathematical puzzle that confounded scientists since the invention of public-key encryption over 30 years ago.

Invented by IBM cryptography Researcher Craig Gentry, fully homomorphic encryption uses a mathematical object known as an “ideal lattice” that allows people to interact with encrypted data in ways previously considered impossible. The breakthrough facilitates analysis of confidential encrypted data without allowing the user to see the private data, yet it will reveal the same detailed results as if the original data was completely visible.

IBM received U.S. Patent #8,565,435: Efficient implementation of fully homomorphic encryption for the invention, which is expected to help cloud computing clients to make more informed business decisions, without compromising privacy and security.

If that sounds a bit dull, consider this prose from the IBM Homomorphic Encryption page:

What if you want to query a search engine, but don’t want to tell the search engine what you are looking for? You might consider encrypting your query, but if you use an ordinary encryption scheme, the search engine will not be able to manipulate your ciphertexts to construct a meaningful response. What you would like is a cryptographic equivalent of a photograph developer’s “dark room”, where the search engine can process your query intelligently without ever seeing it.

Or, what if you want to store your data on the internet, so that you can access it at your convenience? You want your data to remain private, even from the server that is storing them; so, you store your data in encrypted form. But you would also like to be able to access your data intelligently — e.g., you would like the server to be able to return exactly those files containing the word `homomorphic’ within five words of `encryption’. Again, you would like the server to be able to “process” your data while it remains encrypted.

A “fully homomorphic” encryption scheme creates exactly this cryptographic dark room. Using it, anyone can manipulate ciphertexts that encrypt data under some public key pk to construct a ciphertext that encrypts *any desired function* of that data under pk. Such a scheme is useful in the settings above (and many others).

The key sentence is:

“Using it, anyone can manipulate ciphertexts that encrypt data under some public key pk to construct a ciphertext that encrypts *any desired function* of that data under pk.”
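For intuition about what "manipulating ciphertexts" means, textbook (unpadded) RSA already exhibits half of the property: it is multiplicatively homomorphic. This is a toy with insecure parameters, and it is not a fully homomorphic scheme like Gentry's, which supports any function, not just multiplication.

```python
# Toy illustration of the homomorphic property: with textbook RSA,
# Enc(a) * Enc(b) mod n decrypts to a * b. Tiny, insecure parameters,
# for intuition only.
p, q = 61, 53              # toy primes (never use in practice)
n = p * q                  # 3233
e = 17                     # public exponent
d = 2753                   # private exponent: e*d ≡ 1 (mod (p-1)*(q-1))

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 12
c = (enc(a) * enc(b)) % n   # the server multiplies ciphertexts only
print(dec(c))                # decrypts to the product: 84
```

The server never sees `a` or `b`, yet the client decrypts the correct product. A fully homomorphic scheme extends this from one operation to arbitrary computations.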

Wikipedia has a number of references under: Homomorphic encryption.

You may also be interested in: A fully homomorphic encryption scheme (Craig Gentry’s PhD thesis).

One of the more obvious use cases of homomorphic encryption with topic maps is the encryption of topic maps as deliverables.

Purchasers could have access to the results of merging but not the grist that was ground to produce the merging.

The antics of the NSA, 2013’s poster boy for the need for better digital security, such as subversion of security standards and software vendors, outright theft, and perversion of governments, will bring other use cases to mind.


Monday, December 9th, 2013

San Francisco startup takes on collaborative Data Science from The R Backpages 2 by Joseph Rickert.

From the post:

Domino, a San Francisco based startup, is inviting users to sign up to beta test its vision of online, Data Science collaboration. The site is really pretty slick, and the vision of cloud computing infrastructure integrated with an easy to use collaboration interface and automatic code revisioning is compelling. Moreover, it is delightfully easy to get started with Domino. After filling out the new account form, a well thought out series of screens walks the new user through downloading the client software, running a script (R, MatLab or Python) and viewing the results online. The domino software creates a quick-start directory on your PC where it looks for scripts to run. After the installation is complete it is just a matter firing up a command window to run scripts in the cloud with:

Great review by Joseph on Domino and its use on a PC.

Persuaded me to do an install on my local box:

Installation on Ubuntu 12.04

  • Get a Domino Account
  • Download/Save the file to a convenient directory. (Just shy of 20MB.)
  • chmod 744
  • ./
  • If you aren’t root, just ignore the symlink question. It’s a bug, but the install continues happily. Tech support promptly reported that it will be fixed.
  • BTW, if you installed from a shell window, you need a new shell window to pick up the modified path that includes the domino executable.
  • Follow the QuickStart, Steps 3, 4, and 5.
  • Step six of the QuickStart seems to be unnecessary. As the owner of the job, I was set to get email notification anyway.
  • Steps seven and eight of the QuickStart require no elaboration.

BTW, tech support was quick and on point in response to my questions about the installation script.

I have only run the demo scripts at this point but Domino looks like an excellent resource for R users and a great model for bringing the cloud to your desktop.

Leveraging software a user already knows to seamlessly deliver greater capabilities has to be a winning combination.

OrientDB becomes distributed…

Friday, November 8th, 2013

OrientDB becomes distributed using Hazelcast, leading open source in-memory data grid

From the post:

Hazelcast and Orient Technologies today announced that OrientDB has gained a multi-master replication feature powered by Hazelcast.

Clustering multiple server nodes is the most significant feature of OrientDB 1.6. Databases can be replicated across heterogeneous server nodes in multi-master mode achieving the best of scalability and performance.

“I think one of the added value of OrientDB against all the NoSQL products is the usage of Hazelcast while most of the others use Yahoo ZooKeeper to manage the cluster (discovery, split brain network, etc) and something else for the transport layer.” said Luca Garulli, CEO of Orient Technologies. “With ZooKeeper configuration is a nightmare, while Hazelcast let you to add OrientDB servers with ZERO configuration. This has been a big advantage for our clients and everything is much more ‘elastic’, specially when deployed on the Cloud. We’ve used Hazelcast not only for the auto-discovery, but also for the transport layer. Thanks to this new architecture all our clients can scale up horizontally by adding new servers without stopping or reconfigure the cluster”.

“We are amazed by the speed with which OrientDB has adopted Hazelcast and we are delighted to see such excellent technologists teaming up with Hazelcast.” said Talip Ozturk, CEO of Hazelcast. “We work hard to make the best open source in-memory data grid on the market and are happy to see it being used in this way.” (emphasis added)

It was just yesterday that I was writing about configuration issues in the Hadoop ecosystem, which includes ZooKeeper. See: Hadoop Ecosystem Configuration Woes?

Where there is smoke, is there fire?

Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud

Sunday, November 3rd, 2013

Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud by Jianlong Zhong and Bingsheng He.


Recently, we have witnessed that cloud providers start to offer heterogeneous computing environments. There have been wide interests in both cluster and cloud of adopting graphics processors (GPUs) as accelerators for various applications. On the other hand, large-scale processing is important for many data-intensive applications in the cloud. In this paper, we propose to leverage GPUs to accelerate large-scale graph processing in the cloud. Specifically, we develop an in-memory graph processing engine G2 with three non-trivial GPU-specific optimizations. Firstly, we adopt fine-grained APIs to take advantage of the massive thread parallelism of the GPU. Secondly, G2 embraces a graph partition based approach for load balancing on heterogeneous CPU/GPU architectures. Thirdly, a runtime system is developed to perform transparent memory management on the GPU, and to perform scheduling for an improved throughput of concurrent kernel executions from graph tasks. We have conducted experiments on a local cluster of three nodes and an Amazon EC2 virtual cluster of eight nodes. Our preliminary results demonstrate that 1) GPU is a viable accelerator for cloud-based graph processing, and 2) the proposed optimizations further improve the performance of GPU-based graph processing engine.
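The paper's second optimization, partition-based load balancing across heterogeneous CPU/GPU nodes, can be caricatured in a few lines. The degree-based heuristic and the `gpu_share` ratio below are my invention for illustration, not the authors' actual partitioner.

```python
# Toy sketch of degree-aware graph partitioning for a heterogeneous
# CPU/GPU setup: route the high-degree vertices (most thread-level
# parallelism to exploit) to the GPU and the long tail to the CPU.
# 'gpu_share' is a made-up stand-in for a real capacity model.
def partition(adjacency, gpu_share=0.5):
    by_degree = sorted(adjacency,
                       key=lambda v: len(adjacency[v]),
                       reverse=True)
    cut = int(len(by_degree) * gpu_share)
    return set(by_degree[:cut]), set(by_degree[cut:])  # (gpu, cpu)

graph = {
    "a": ["b", "c", "d"],   # high-degree hub -> GPU
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
}
gpu, cpu = partition(graph)
print(sorted(gpu), sorted(cpu))
```

A real engine would weigh partitions by actual device throughput and memory, but the shape of the decision is the same.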

GPUs in the cloud anyone?

The future of graph computing isn’t clear but it certainly promises to be interesting!

I first saw this in a tweet by Stefano Bertolo.

…Boring HA And Scalability Problems

Friday, September 27th, 2013

Great Open Source Solution For Boring HA And Scalability Problems by Maarten Ectors and Frank Mueller.

From the post:

High-availability and scalability are exciting in general but there are certain problems that experts see over and over again. The list is long but examples are setting up MySQL clustering, sharding Mongo, adding data nodes to a Hadoop cluster, monitoring with Ganglia, building continuous deployment solutions, integrating Memcached / Varnish / Nginx,… Why are we reinventing the wheel?

At Ubuntu we made it our goal to have the community solve these repetitive and often boring tasks. How often have you had to set-up MySQL replication and scale it? What if the next time you just simply do:

  1. juju deploy mysql
  2. juju deploy mysql mysql-slave
  3. juju add-relation mysql:master mysql-slave:slave
  4. juju add-unit -n 10 mysql-slave

It’s easy to see how these four commands work. After deploying a master and a slave MySQL both are related as master and slave. After this you simply can add more slaves like it is done here with 10 more instances.

Responsible for this easy approach is one of our latest open source solutions, Juju. Juju allows any server software to be packaged inside what is called a Charm. A Charm describes how the software is deployed, integrated and scaled. Once an expert creates the Charm and uploads it to the Charm Store, anybody can use it instantly. Execution environments for Charms are abstracted via providers. The list of supported providers is growing and includes Amazon AWS, HP Cloud, Microsoft Azure, any Openstack private cloud, local LXC containers, bare metal deployments with MaaS, etc. So Juju allows any software to be instantly deployed, integrated and scaled on any cloud.

We all want HA and scalability problems to be “boring.”

When HA and scalability problems are “exciting,” that’s a bad thing!

If you are topic mapping in the cloud, take the time to read about Juju.

Kamala Cloud 2.0!

Friday, June 14th, 2013


Kamala is a knowledge platform for organizations and people to link their data and share their knowledge. Key features of Kamala: smart suggestions, semantic search and efficient filtering. These help you perfect your knowledge model and give you powerful, reusable search results.

Model your domain knowledge in the cloud.



I understand more Kamala videos are coming next week.

An example of how to advertise topic maps. Err, a good example of how to advertise topic maps! 😉

You will see Gabriel Hopmans put in a cameo appearance in the video.

Congratulations to the Kamala Team!

Details, discussion, criticisms, etc., to follow.

Rya: A Scalable RDF Triple Store for the Clouds

Tuesday, June 11th, 2013

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.


Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (an open-source NoSQL database originally developed by the NSA).

Interesting re-thinking of indexing of triples.
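The core idea in the paper is to store each triple under three key permutations (SPO, POS, OSP) so that any query pattern becomes a prefix range scan on one sorted table, which is what Accumulo does well. A minimal in-memory sketch of that layout (sorted lists standing in for Accumulo tables):

```python
# Minimal sketch of Rya-style triple indexing: each (s, p, o) triple is
# written under three key orders so any bound/unbound query pattern maps
# to a prefix scan on one of three sorted tables.
from bisect import bisect_left, insort

class TripleStore:
    def __init__(self):
        self.spo, self.pos, self.osp = [], [], []

    def add(self, s, p, o):
        insort(self.spo, (s, p, o))
        insort(self.pos, (p, o, s))
        insort(self.osp, (o, s, p))

    def scan(self, table, prefix):
        # All keys in a sorted table that start with `prefix`.
        i = bisect_left(table, prefix)
        out = []
        while i < len(table) and table[i][:len(prefix)] == prefix:
            out.append(table[i])
            i += 1
        return out

store = TripleStore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "age", "42")

# Pattern "?s knows ?o" -> prefix scan on the POS table:
print(store.scan(store.pos, ("knows",)))
```

The trade is threefold write amplification for the ability to answer every triple pattern with one contiguous scan.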

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

Cloud Computing as a Low-Cost Commodity

Tuesday, May 21st, 2013

A Revolution in Cloud Pricing: Minute By Minute Cloud Billing for Everyone by Sean Murphy.

From the post:

Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.

The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.

Let’s say that I have a job that will take just a thousand instances each a little bit over an hour to finish (a total of just over a thousand “instance hours”). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let’s also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption) and that Amazon’s and Google’s instances have the same computational horsepower (this is not true, see my benchmark results). As Amazon charges by the hour, Amazon would charge me for two hours per instance or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance) whereas Google would only charge me $508.34 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon’s hourly billing has almost doubled my costs but the impact is far worse.
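The arithmetic in the quoted example is easy to reproduce:

```python
# Reproduce the quoted cost comparison: 1000 instances at $0.50/hour
# running for 61 minutes, under hourly vs. per-minute billing.
import math

instances, rate_per_hour, minutes = 1000, 0.50, 61

hourly_billed = instances * rate_per_hour * math.ceil(minutes / 60)
minute_billed = instances * rate_per_hour * (minutes / 60)

print(round(hourly_billed, 2))   # 1000.0
print(round(minute_billed, 2))   # 508.33
```

That is the post's $1000.00 versus roughly $508 (the post's $508.34 is the same figure with slightly different rounding): one extra minute costs a full instance-hour under hourly billing.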

Sean does a great job covering the impact of minute-by-minute pricing for cloud computing.

Great news for the short run and I suspect even greater news for the long run.

What happens when instances and storage become too cheap to meter?

Like domestic long distance telephone service.

When anything that can be computed is within the reach of everyone, what will be computed?

Semantics Moving into the Clouds (and you?)

Thursday, May 9th, 2013

OpenNebula 4.0 Released – The Finest Open-source Enterprise Cloud Manager!

From the post:

The fourth generation of OpenNebula is the result of seven years of continuous innovation in close collaboration with its users.

The OpenNebula Project is proud to announce the fourth major release of its widely deployed OpenNebula cloud management platform, a fully open-source enterprise-grade solution to build and manage virtualized data centers and enterprise clouds. OpenNebula 4.0 (codename Eagle) brings valuable contributions from many of its thousands of users that include leading research and supercomputing centers like FermiLab, NASA, ESA and SARA; and industry leaders like Blackberry, China Mobile, Dell, Cisco, Akamai and Telefonica O2.

OpenNebula is used by many enterprises as an open, flexible alternative to vCloud on their VMware-based data center. OpenNebula is a drop-in replacement to the VMware’s cloud stack that additionally brings support for multiple hypervisors and broad integration capabilities to leverage existing IT investments and keep existing operational processes. As an enterprise-class product, OpenNebula offers an upgrade path so all existing users can easily migrate their production and experimental environments to the new version.

OpenNebula 4.0 includes new features in most of its subsystems. It shows for the first time a completely redesigned Sunstone, with a fresh and modern look. A whole new set of operations are available for virtual machines like system and disk snapshotting, capacity re-sizing, programmable VM actions, NIC hotplugging and IPv6 among others. The OpenNebula backend has been also improved with the support of new datastores, like Ceph, and new features for the VMware, KVM and Xen hypervisors. The Project continues with its complete support to de-facto and open standards, like Amazon and Open Cloud Computing APIs.

Despite all the buzzwords about “big data” and “cloud computing,” no one has left semantics behind.

Semantics don’t get much press in “big data” or “cloud computing.”

You can take that to mean either that semantic issues, thousands of years old, have been silently solved, or that current vendors lack a semantic solution to offer.

I think it is the latter.

How about you?

Real-Time Data Aggregation [White Paper Registration Warning]

Tuesday, April 30th, 2013

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate costs savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and Linkedin engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying, considering the lack of substantive content about the solution. The use cases are mildly interesting but could be served by any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

Beginner Tips For Elastic MapReduce

Thursday, April 25th, 2013

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils, wait!, that was a replica of the early cyclotron, sorry, wrong experiment. 😉

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic Map Reduce (EMR) is one.

You can learn Hadoop with Hortonworks Sandbox and when you need production power, EMR awaits.

From a cost effectiveness standpoint, that sounds like a good deal to me.


PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the web services for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money off of you.
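The uptime trade-off is worth putting rough numbers on. Everything below, the hourly loss figure and the availability levels, is invented for illustration.

```python
# Back-of-envelope sketch: expected annual cost of downtime vs. the
# premium charged for extra "nines". All numbers are made up.
HOURS_PER_YEAR = 24 * 365

def expected_loss(uptime, cost_per_down_hour):
    return (1 - uptime) * HOURS_PER_YEAR * cost_per_down_hour

# A small NGO losing a notional $20 per hour the site is down:
cheap = expected_loss(0.999, 20)     # ~8.8 hours down per year
fancy = expected_loss(0.9999, 20)    # ~53 minutes down per year

print(round(cheap, 2), round(fancy, 2))
```

If the higher-availability tier costs more per year than the difference between those two expected losses, the extra reliability is a net loss, which is the point above.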

Hadoop, The Perfect App for OpenStack

Tuesday, April 16th, 2013

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none, of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrive, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

…Cloud Integration is Becoming a Bigger Issue

Wednesday, April 10th, 2013

Survey Reports that Cloud Integration is Becoming a Bigger Issue by David Linthicum.

David cites a survey by KPMG that found thirty-three percent of executives complained of higher than expected costs for data integration in cloud projects.

One assumes the brighter thirty-three percent of those surveyed. The remainder apparently did not recognize data integration issues in their cloud projects.

David writes:

Part of the problem is that data integration itself has never been sexy, and thus seems to be an issue that enterprise IT avoids until it can’t be ignored. However, data integration should be the life-force of the enterprise architecture, and there should be a solid strategy and foundational technology in place.

Cloud computing is not the cause of this problem, but it’s shining a much brighter light on the lack of data integration planning. Integrating cloud-based systems is a bit more complex and laborious. However, the data integration technology out there is well proven and supports cloud-based platforms as the source or the target in an integration chain. (emphasis added)

The more diverse data sources become, the larger data integration issues will loom.

Topic maps offer data integration efforts in cloud projects a choice:

1) You can integrate one-off, either with in-house or third-party tools, only to redo all that work with each new data source, or

2) You can integrate using a topic map (for integration or to document integration) and re-use the expertise from prior data integration efforts.
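The reuse argument behind option 2 can be sketched with a mapping table recorded once and applied to every new source. All field names here are hypothetical.

```python
# Sketch of documented, reusable integration: record once which
# source-field names map to each canonical subject, then apply the
# same mapping to every differently-shaped source. Field names are
# invented for illustration.
CANONICAL = {
    "customer_id": {"cust_id", "customerNumber", "client_ref"},
    "email":       {"email", "e_mail", "contactEmail"},
}

def integrate(record):
    out = {}
    for field, value in record.items():
        for canonical, aliases in CANONICAL.items():
            if field in aliases:
                out[canonical] = value
    return out

# Two differently-shaped sources, one mapping:
print(integrate({"cust_id": "42", "e_mail": "a@example.com"}))
print(integrate({"customerNumber": "42", "contactEmail": "a@example.com"}))
```

A new source with yet another spelling costs one added alias, not a rebuilt pipeline, and the table itself documents the integration decisions.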

Suggest pitching topic maps as a value-add proposition.

Amazon S3 clone open-sourced by Riak devs [Cloud of Tomorrow?]

Sunday, March 31st, 2013

Amazon S3 clone open-sourced by Riak devs by Elliot Bentley.

From the post:

The developers of NoSQL database Riak have open-sourced their new project, an Amazon S3 clone called Riak CS.

In development for a year, Riak CS provides highly-available, fault-tolerant storage able to manage files as large as 5GB, with an API and authentication system compatible with Amazon S3. In addition, today’s open-source release introduces multipart upload and a new web-based admin tool.

Riak CS is built on top of Basho’s flagship product Riak, a decentralised key/value store NoSQL database. Riak was also based on an existing Amazon creation – in this case, Dynamo, which also served as the inspiration for Apache Cassandra.

In December’s issue of JAX Magazine, Basho EMEA boss Matt Heitzenroder (who has since left the company) explained that Riak CS was initially conceived as an exercise in “dogfooding” their own database product. “It was a goal of engineers to gain insight into use cases themselves and also to have something we can go out there and sell,” he said.

See also: The Riak CS Fast Track.

You may have noticed that files stored on/in (?) clouds are just like files on your local hard drive.

They can be copied, downloaded, pipelined, subjected to ETL, processed and transferred.

The cloud of your choice provides access to greater computing power and storage than before, but that’s a difference of degree, not of kind.

A difference in kind would be the ability to find and re-use data based upon its semantics and not on happenstance of file or field names.

Riak CS isn’t that cloud today but in the competition to be the cloud of tomorrow, who knows?