Archive for the ‘Cloud Computing’ Category

Kamala Cloud 2.0!

Friday, June 14th, 2013

Description:

Kamala is a knowledge platform for organizations and people to link their data and share their knowledge. Key features of Kamala: smart suggestions, semantic search and efficient filtering. These help you perfect your knowledge model and give you powerful, reusable search results.

Model your domain knowledge in the cloud.

Website: http://kamala-cloud.com
Kamala: http://kamala.mssm.nl
Morpheus: http://en.mssm.nl

Excellent!

I understand more Kamala videos are coming next week.

An example of how to advertise topic maps. Err, a good example of how to advertise topic maps! ;-)

You will see Gabriel Hopmans put in a cameo appearance in the video.

Congratulations to the Kamala Team!

Details, discussion, criticisms, etc., to follow.

Rya: A Scalable RDF Triple Store for the Clouds

Tuesday, June 11th, 2013

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.

Abstract:

Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (open-source NoSQL database by the NSA).

Interesting re-thinking of indexing of triples.
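
As I read the paper, the re-thinking amounts to storing each triple three times, under the SPO, POS and OSP orderings, with the whole triple in the row key, so that any triple pattern becomes a prefix range scan over one sorted table. A minimal sketch of that idea follows, with an in-memory sorted list standing in for an Accumulo table. The class, method names and separator are my own assumptions; this is not Rya's code.

```python
import bisect

SEP = "\x00"  # separator, assumed never to occur inside a term


class TripleIndex:
    """Toy three-permutation triple index (hypothetical, not Rya's API)."""

    # For each table, the positions of (s, p, o) in key order.
    ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

    def __init__(self):
        self.tables = {name: [] for name in self.ORDERS}

    def add(self, s, p, o):
        """Write the triple into all three sorted tables."""
        triple = (s, p, o)
        for name, order in self.ORDERS.items():
            key = SEP.join(triple[i] for i in order)
            bisect.insort(self.tables[name], key)

    def _scan(self, name, prefix_terms):
        """Range-scan one table for keys that start with the given terms."""
        table = self.tables[name]
        prefix = SEP.join(prefix_terms) + SEP if prefix_terms else ""
        start = bisect.bisect_left(table, prefix)
        order = self.ORDERS[name]
        for key in table[start:]:
            if not key.startswith(prefix):
                break
            parts = key.split(SEP)
            triple = [None, None, None]
            for pos, i in enumerate(order):
                triple[i] = parts[pos]  # undo the permutation
            yield tuple(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        if s is not None and p is not None:
            name, prefix = "spo", (s, p)
        elif s is not None and o is not None:
            name, prefix = "osp", (o, s)
        elif p is not None and o is not None:
            name, prefix = "pos", (p, o)
        elif s is not None:
            name, prefix = "spo", (s,)
        elif p is not None:
            name, prefix = "pos", (p,)
        elif o is not None:
            name, prefix = "osp", (o,)
        else:
            name, prefix = "spo", ()
        pattern = (s, p, o)
        return [
            t for t in self._scan(name, prefix)
            if all(want is None or want == got for want, got in zip(pattern, t))
        ]


idx = TripleIndex()
idx.add("alice", "knows", "bob")
idx.add("bob", "knows", "carol")
print(idx.match(o="carol"))
```

Every pattern with at least one bound term is answered from a table whose key starts with a bound term, which is what keeps the scan narrow.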

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

Cloud Computing as a Low-Cost Commodity

Tuesday, May 21st, 2013

A Revolution in Cloud Pricing: Minute By Minute Cloud Billing for Everyone by Sean Murphy.

From the post:

Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.

The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.

Let’s say that I have a job that will take just a thousand instances each a little bit over an hour to finish (a total of just over a thousand “instance hours”). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let’s also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption) and that Amazon’s and Google’s instances have the same computational horsepower (this is not true, see my benchmark results). As Amazon charges by the hour, Amazon would charge me for two hours per instance or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance) whereas Google would only charge me $508.34 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon’s hourly billing has almost doubled my costs but the impact is far worse.
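
The arithmetic in the quoted example is easy to check. The sketch below uses the post's own assumptions (1,000 instances, $0.50 per instance-hour, 61 minutes); the function names are mine. Computed directly, the per-minute figure comes to roughly $508.33, so the post's $508.34 presumably reflects rounding up.

```python
import math


def hourly_billed_cost(instances, rate_per_hour, minutes):
    """Cost when every started hour is billed in full (hypothetical helper)."""
    billed_hours = math.ceil(minutes / 60)
    return instances * rate_per_hour * billed_hours


def per_minute_billed_cost(instances, rate_per_hour, minutes):
    """Cost when usage is billed by the minute (hypothetical helper)."""
    return instances * rate_per_hour * minutes / 60


hourly = hourly_billed_cost(1000, 0.50, 61)
per_minute = per_minute_billed_cost(1000, 0.50, 61)

print(f"hourly billing:     ${hourly:,.2f}")
print(f"per-minute billing: ${per_minute:,.2f}")
print(f"ratio:              {hourly / per_minute:.2f}x")
```

The gap is widest for jobs that run just past an hour boundary and shrinks toward zero as the run length approaches a whole number of hours.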

Sean does a great job covering the impact of minute-by-minute pricing for cloud computing.

Great news for the short run and I suspect even greater news for the long run.

What happens when instances and storage become too cheap to meter?

Like domestic long distance telephone service.

When anything that can be computed is within the reach of everyone, what will be computed?

Semantics Moving into the Clouds (and you?)

Thursday, May 9th, 2013

OpenNebula 4.0 Released – The Finest Open-source Enterprise Cloud Manager!

From the post:

The fourth generation of OpenNebula is the result of seven years of continuous innovation in close collaboration with its users.

The OpenNebula Project is proud to announce the fourth major release of its widely deployed OpenNebula cloud management platform, a fully open-source enterprise-grade solution to build and manage virtualized data centers and enterprise clouds. OpenNebula 4.0 (codename Eagle) brings valuable contributions from many of its thousands of users that include leading research and supercomputing centers like FermiLab, NASA, ESA and SARA; and industry leaders like Blackberry, China Mobile, Dell, Cisco, Akamai and Telefonica O2.

OpenNebula is used by many enterprises as an open, flexible alternative to vCloud on their VMware-based data center. OpenNebula is a drop-in replacement to the VMware’s cloud stack that additionally brings support for multiple hypervisors and broad integration capabilities to leverage existing IT investments and keep existing operational processes. As an enterprise-class product, OpenNebula offers an upgrade path so all existing users can easily migrate their production and experimental environments to the new version.

OpenNebula 4.0 includes new features in most of its subsystems. It shows for the first time a completely redesigned Sunstone, with a fresh and modern look. A whole new set of operations are available for virtual machines like system and disk snapshotting, capacity re-sizing, programmable VM actions, NIC hotplugging and IPv6 among others. The OpenNebula backend has been also improved with the support of new datastores, like Ceph, and new features for the VMware, KVM and Xen hypervisors. The Project continues with its complete support to de-facto and open standards, like Amazon and Open Cloud Computing APIs.

Despite all the buzzwords about “big data” and “cloud computing,” no one has left semantics behind.

Semantics don’t get much press in “big data” or “cloud computing.”

You can take that to mean semantic issues, thousands of years old, have been silently solved, or current vendors lack a semantic solution to offer.

I think it is the latter.

How about you?

Real-Time Data Aggregation [White Paper Registration Warning]

Tuesday, April 30th, 2013

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate costs savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and Linkedin engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying considering the lack of substantive content about the solution. The use cases are mildly interesting but admit to any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

Beginner Tips For Elastic MapReduce

Thursday, April 25th, 2013

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils, wait!, that was a replica of the early cyclotron, sorry, wrong experiment. ;-)

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic MapReduce (EMR) is one.

You can learn Hadoop with Hortonworks Sandbox and when you need production power, EMR awaits.

From a cost effectiveness standpoint, that sounds like a good deal to me.

You?

PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the webservices for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money, off of you.

Hadoop, The Perfect App for OpenStack

Tuesday, April 16th, 2013

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none, of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrive, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

…Cloud Integration is Becoming a Bigger Issue

Wednesday, April 10th, 2013

Survey Reports that Cloud Integration is Becoming a Bigger Issue by David Linthicum.

David cites a survey by KPMG that found thirty-three percent of executives complained of higher than expected costs for data integration in cloud projects.

One assumes the brighter thirty-three percent of those surveyed. The remainder apparently did not recognize data integration issues in their cloud projects.

David writes:

Part of the problem is that data integration itself has never been sexy, and thus seems to be an issue that enterprise IT avoids until it can’t be ignored. However, data integration should be the life-force of the enterprise architecture, and there should be a solid strategy and foundational technology in place.

Cloud computing is not the cause of this problem, but it’s shining a much brighter light on the lack of data integration planning. Integrating cloud-based systems is a bit more complex and laborious. However, the data integration technology out there is well proven and supports cloud-based platforms as the source or the target in an integration chain. (emphasis added)

The more diverse data sources become, the larger data integration issues will loom.

Topic maps offer data integration efforts in cloud projects a choice:

1) You can integrate one-off, either with in-house or third-party tools, only to redo all that work with each new data source, or

2) You can integrate using a topic map (for integration or to document integration) and re-use the expertise from prior data integration efforts.
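
The second option can be sketched in a few lines. This is only an illustration of the re-use argument, not a topic map engine: the source names, field names and subject identifiers below are all hypothetical. The point is that the field-to-subject mapping is recorded once, and a new data source means adding rows to that mapping rather than rewriting the merge.

```python
# Hypothetical subject identifiers; the mapping below is the reusable part.
EMAIL = "http://example.org/subject/email"
NAME = "http://example.org/subject/name"
BALANCE = "http://example.org/subject/balance"

# Recorded integration knowledge: which field in which source means what.
FIELD_SUBJECTS = {
    ("crm", "cust_email"): EMAIL,
    ("crm", "cust_name"): NAME,
    ("billing", "EMAIL_ADDR"): EMAIL,
    ("billing", "amount_due"): BALANCE,
}


def normalize(source, record):
    """Re-key a record by subject identifier, dropping unmapped fields."""
    return {
        FIELD_SUBJECTS[(source, field)]: value
        for field, value in record.items()
        if (source, field) in FIELD_SUBJECTS
    }


def merge_on(subject, records):
    """Merge normalized records that share a value for the given subject."""
    merged = {}
    for record in records:
        key = record.get(subject)
        if key is not None:
            merged.setdefault(key, {}).update(record)
    return merged


crm = normalize("crm", {"cust_email": "ann@example.org", "cust_name": "Ann"})
billing = normalize("billing", {"EMAIL_ADDR": "ann@example.org", "amount_due": 42})
people = merge_on(EMAIL, [crm, billing])
print(people)
```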

Suggest pitching topic maps as a value-add proposition.

Amazon S3 clone open-sourced by Riak devs [Cloud of Tomorrow?]

Sunday, March 31st, 2013

Amazon S3 clone open-sourced by Riak devs by Elliot Bentley.

From the post:

The developers of NoSQL database Riak have open-sourced their new project, an Amazon S3 clone called Riak CS.

In development for a year, Riak CS provides highly-available, fault-tolerant storage able to manage files as large as 5GB, with an API and authentication system compatible with Amazon S3. In addition, today’s open-source release introduces multipart upload and a new web-based admin tool.

Riak CS is built on top of Basho’s flagship product Riak, a decentralised key/value store NoSQL database. Riak was also based on an existing Amazon creation – in this case, Dynamo, which also served as the inspiration for Apache Cassandra.

In December’s issue of JAX Magazine, Basho EMEA boss Matt Heitzenroder (who has since left the company) explained that Riak CS was initially conceived as an exercise in “dogfooding” their own database product. “It was a goal of engineers to gain insight into use cases themselves and also to have something we can go out there and sell,” he said.

See also: The Riak CS Fast Track.

You may have noticed that files stored on/in (?) clouds are just like files on your local hard drive.

They can be copied, downloaded, pipelined, subjected to ETL, processed and transferred.

The cloud of your choice provides access to greater computing power and storage than before, but that’s a difference of degree, not of kind.

A difference in kind would be the ability to find and re-use data based upon its semantics and not on happenstance of file or field names.

Riak CS isn’t that cloud today but in the competition to be the cloud of tomorrow, who knows?

…Cloud Computing is Changing Data Integration

Monday, March 25th, 2013

More Evidence that Cloud Computing is Changing Data Integration by David Linthicum.

From the post:

In a recent Sand Hill article, Jeff Kaplan, the managing director of THINKstrategies, reports on the recent and changing state of data integration with the addition of cloud computing. “One of the ongoing challenges that continues to frustrate businesses of all sizes is data integration, and that issue has only become more complicated with the advent of the cloud. And, in the brave new world of the cloud, data integration must morph into a broader set of data management capabilities to satisfy the escalating needs of today’s business.”

In the article, Jeff reviews a recent survey conducted with several software vendors, concluding:

  • Approximately 90 percent of survey respondents said integration is important in their ability to win new customers.
  • Eighty-four percent of the survey respondents reported that integration has become a difficult task that is getting in the way of business.
  • A quarter of the respondents said they’ve still lost customers because of integration issues.

It’s interesting to note that these issues affect legacy software vendors, as well as Software-as-a-Service (SaaS) vendors. No matter if you sell software in the cloud or deliver it on-demand, the data integration issues are becoming a hindrance.

If cloud computing and/or big data are bringing data integration into the limelight, that sounds like good news for topic maps.

Particularly topic maps of data sources that enable quick and reliable data integration without a round of exploration and testing first.

Using Clouds for MapReduce Measurement Assignments [Grad Class Near You?]

Tuesday, February 19th, 2013

Using Clouds for MapReduce Measurement Assignments by Ariel Rabkin, Charles Reiss, Randy Katz, and David Patterson. (ACM Trans. Comput. Educ. 13, 1, Article 2 (January 2013), 18 pages. DOI = 10.1145/2414446.2414448)

Abstract:

We describe our experiences teaching MapReduce in a large undergraduate lecture course using public cloud services and the standard Hadoop API. Using the standard API, students directly experienced the quality of industrial big-data tools. Using the cloud, every student could carry out scalability benchmarking assignments on realistic hardware, which would have been impossible otherwise. Over two semesters, over 500 students took our course. We believe this is the first large-scale demonstration that it is feasible to use pay-as-you-go billing in the cloud for a large undergraduate course. Modest instructor effort was sufficient to prevent students from overspending. Average per-pupil expenses in the Cloud were under $45. Students were excited by the assignment: 90% said they thought it should be retained in future course offerings.

With properly structured assignments, I can see this technique being used to introduce library graduate students to data mining and similar topics on non-trivial data sets.

Getting “hands on” experience should make them more than a match for the sales types from information vendors.

Not to mention that data mining flourishes when used with an understanding of the underlying semantics of the data set.

I first saw this at: On Teaching MapReduce via Clouds

Amazon Web Services Announces Amazon Redshift

Saturday, February 16th, 2013

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re:Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resource to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well-known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. ;-)

Sigh, semantic confusion continues unabated.

CamelOne 2012 (videos/presentations) Boston, MA

Tuesday, January 29th, 2013

CamelOne 2012 (videos/presentations) Boston, MA

Videos and presentations for your enjoyment from the CamelOne 2012 conference.

As usual I was looking for something else and found more than I bargained for! ;-)

Getting Started with VM Depot

Friday, January 11th, 2013

Getting Started with VM Depot by Doug Mahugh.

From the post:

Do you need to deploy a popular OSS package on a Windows Azure virtual machine, but don’t know where to start? Or do you have a favorite OSS configuration that you’d like to make available for others to deploy easily? If so, the new VM Depot community portal from Microsoft Open Technologies is just what you need. VM Depot is a community-driven catalog of preconfigured operating systems, applications, and development stacks that can easily be deployed on Windows Azure.

You can learn more about VM Depot in the announcement from Gianugo Rabellino over on Port 25 today. In this post, we’re going to cover the basics of how to use VM Depot, so that you can get started right away.

Doug outlines simple steps to get you rolling with the VM Depot.

Sounds a lot easier than trying to walk casual computer users through installation and configuration of software. I assume you could even load data onto the VMs.

Users just need to fire up the VM and they have the interface and data they want.

Sounds like a nice way to distribute topic map based information systems.

The Cooperative Computing Lab

Monday, December 17th, 2012

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

BigMLer in da Cloud: Machine Learning made even easier [Amateur vs. Professional Models]

Sunday, December 9th, 2012

BigMLer in da Cloud: Machine Learning made even easier by Martin Prats.

From the post:

We have open-sourced BigMLer, a command line tool that will let you create predictive models much easier than ever before.

BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.

“…will let you create predictive models much easier than ever before.”

Well…, true, but the amount of effort you invest in a predictive model has a relationship to the usefulness of the model for some given purpose.

It is a great idea to create an easy “on ramp” to introduce machine learning. But it may lead some users to confuse “…easier than ever before” models with professionally crafted models.

An old friend confided their organization was about to write a classification system for a well-known subject. Exciting to think they will put all past errors to rest while adding new capabilities.

But in reality librarians have labored in such areas for centuries. It isn’t a good target for a start-up project. Particularly for those innocent of existing classification systems and the theory/praxis that drove their creation.

Librarians didn’t invent the Internet. If they had, we wouldn’t be searching for ways to curate information on the Internet, in a backwards compatible way.

Abusing Cloud-Based Browsers for Fun and Profit [Passing Messages, Not Data]

Thursday, November 29th, 2012

Abusing Cloud-Based Browsers for Fun and Profit by Vasant Tendulkar, Joe Pletcher, Ashwin Shashidharan, Ryan Snyder, Kevin Butler and William Enck.

Abstract:

Cloud services have become a cheap and popular means of computing. They allow users to synchronize data between devices and relieve low-powered devices from heavy computations. In response to the surge of smartphones and mobile devices, several cloud-based Web browsers have become commercially available. These “cloud browsers” assemble and render Web pages within the cloud, executing JavaScript code for the mobile client. This paper explores how the computational abilities of cloud browsers may be exploited through a Browser MapReduce (BMR) architecture for executing large, parallel tasks. We explore the computation and memory limits of four cloud browsers, and demonstrate the viability of BMR by implementing a client based on a reverse engineering of the Puffin cloud browser. We implement and test three canonical MapReduce applications (word count, distributed grep, and distributed sort). While we perform experiments on relatively small amounts of data (100 MB) for ethical considerations, our results strongly suggest that current cloud browsers are a viable source of arbitrary free computing at large scale.
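
For readers who have not met the first of those canonical applications: word count splits into a map step that emits (word, 1) pairs, a shuffle that groups the pairs by word, and a reduce step that sums each group. A single-process sketch of that shape follows. It illustrates MapReduce in general, not the paper's BMR implementation, and all function names are mine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce over a list of text chunks."""
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the cloud is cheap", "the cloud is free"])
print(counts)
```

In a real deployment each chunk would go to a separate worker (here, a cloud browser), and only the small (word, count) results would travel back.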

Excellent work on extending the use of cloud-based browsers. Whether you intend to use them for good or ill.

The use of messaging as opposed to passage of data is particularly interesting.

Shouldn’t that work for the process of merging as well?

Comments/suggestions?

Rx for Asynchronous Data Streams in the Clouds

Wednesday, November 7th, 2012

Claudio Caldato wrote: MS Open Tech Open Sources Rx (Reactive Extensions) – a Cure for Asynchronous Data Streams in Cloud Programming.

I was tired by the time I got to the end of the title! His is more descriptive than mine but if you know the context, you don’t need the description.

From the post:

If you are a developer that writes asynchronous code for composite applications in the cloud, you know what we are talking about, for everybody else Rx Extensions is a set of libraries that makes asynchronous programming a lot easier. As Dave Sexton describes it, “If asynchronous spaghetti code were a disease, Rx is the cure.”

Reactive Extensions (Rx) is a programming model that allows developers to glue together asynchronous data streams. This is particularly useful in cloud programming because it helps create a common interface for writing applications that come from diverse data sources, e.g., stock quotes, Tweets, computer events, Web service requests.

Today, Microsoft Open Technologies, Inc., is open sourcing Rx. Its source code is now hosted on CodePlex to increase the community of developers seeking a more consistent interface to program against, and one that works across several development languages. The goal is to expand the number of frameworks and applications that use Rx in order to achieve better interoperability across devices and the cloud.

Rx was developed by Microsoft Corp. architect Erik Meijer and his team, and is currently used on products in various divisions at Microsoft. Microsoft decided to transfer the project to MS Open Tech in order to capitalize on MS Open Tech’s best practices with open development.

There are applications that you probably touch every day that are using Rx under the hood. A great example is GitHub for Windows.

According to Paul Betts at GitHub, “GitHub for Windows uses the Reactive Extensions for almost everything it does, including network requests, UI events, managing child processes (git.exe). Using Rx and ReactiveUI, we’ve written a fast, nearly 100% asynchronous, responsive application, while still having 100% deterministic, reliable unit tests. The desktop developers at GitHub loved Rx so much, that the Mac team created their own version of Rx and ReactiveUI, called ReactiveCocoa, and are now using it on the Mac to obtain similar benefits.”
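
The “glue together asynchronous data streams” idea from the quoted post can be illustrated without Rx itself. The sketch below uses Python's asyncio as a stand-in: two independent sources are merged into a single stream that a consumer reads through one interface. The `ticker` and `merge` names and the sample data are hypothetical, and this shows the general pattern only, not the Rx API.

```python
import asyncio


async def ticker(name, values, delay):
    """Hypothetical source: emits (name, value) pairs at a fixed interval."""
    for value in values:
        await asyncio.sleep(delay)
        yield name, value


async def merge(*streams):
    """Interleave items from several async iterators as they arrive."""
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one stream

    async def pump(stream):
        async for item in stream:
            await queue.put(item)
        await queue.put(done)

    # Keep references so the pump tasks are not garbage collected.
    tasks = [asyncio.create_task(pump(s)) for s in streams]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is done:
            remaining -= 1
        else:
            yield item


async def main():
    quotes = ticker("quote", [101.5, 101.7], delay=0.02)
    tweets = ticker("tweet", ["hello", "world", "again"], delay=0.03)
    return [item async for item in merge(quotes, tweets)]


events = asyncio.run(main())
print(events)
```

The interleaving across sources depends on timing, but each source's own order is preserved, and the consumer never needs to know how many sources feed the stream.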

What if the major cloud players started competing on the basis of interoperability? So your app here will work there.

Reducing the impedance for developers enables more competition between developers. Resulting in better services/product for consumers.

Cloud owners get more options to offer their customers.

Topic map applications have an easier time mining, identifying and recombining subjects across diverse sources and even clouds.

Does anyone see a downside here?

The personal cloud series

Sunday, October 21st, 2012

The personal cloud series by Jon Udell.

Excellent source of ideas on the web/cloud as we experience it today and as we may experience it tomorrow.

Going through prior posts now and will call some of them out for further discussion.

Which ones impress you the most?

Axemblr’s Java Client for the Cloudera Manager API

Thursday, October 18th, 2012

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”

Another goodie to ease your way to Hadoop deployment on your favorite cloud.

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” that lights up. More realistic than the data appliance.

Lacking Data Integration, Cloud Computing Suffers

Friday, October 12th, 2012

Lacking Data Integration, Cloud Computing Suffers by David Linthicum.

From the post:

The findings of the Cloud Market Maturity study, a survey conducted jointly by Cloud Security Alliance (CSA) and ISACA, show that government regulations, international data privacy, and integration with internal systems dominate the top 10 areas where trust in the cloud is at its lowest.

The Cloud Market Maturity study examines the maturity of cloud computing and helps identify market changes. In addition, the report provides detailed information on the adoption of cloud services at all levels within global companies, including senior executives.

Study results reveal that cloud users from 50 countries expressed the lowest level of confidence in the following (ranked from most reliable to least reliable):

  • Government regulations keeping pace with the market
  • Exit strategies
  • International data privacy
  • Legal issues
  • Contract lock in
  • Data ownership and custodian responsibilities
  • Longevity of suppliers
  • Integration of cloud with internal systems
  • Credibility of suppliers
  • Testing and assurance

Questions:

As “big data” gets “bigger,” will cloud integration issues get better or worse?

Do you prefer disposable data integration or reusable data integration? (For bonus points, why?)

Cloudera Enterprise in Less Than Two Minutes

Wednesday, September 12th, 2012

Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.

I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!

Good example of selling technique too!

Focused on common use cases and user concerns. Promises a solution without all the troublesome details.

Time enough for that after a prospect is interested. And even then, ease them into the details.

How To Take Big Data to the Cloud [Webinar - 13 Sept 2012 - 10 AM PDT]

Wednesday, September 12th, 2012

How To Take Big Data to the Cloud by Lisa Sensmeier.

From the post:

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One option to consider is using the cloud for a practical and economical way to go. The cloud is also used to provide extra capacity for an existing cluster or for testing your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks.

OK, it is a vendor/partner presentation but most of us work for vendors and use vendor created tools.

Yes?

The real question is whether tool X does what is necessary at a cost project Y can afford?

Whether vendor sponsored tool, service, home grown or otherwise.

Yes?

Looking forward to it!

Big Data on Heroku – Hadoop from Treasure Data

Friday, August 24th, 2012

Big Data on Heroku – Hadoop from Treasure Data by Istvan Szegedi.

From the post:

This time I write about Heroku and Treasure Data Hadoop solution – I found it really to be a ‘gem’ in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. Originally it started with supporting Ruby as its main programming language but it has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of addons including – among others – RDBMS and NoSQL capabilities and Hadoop-based data warehouse developed by Treasure Data.

Not to leave the impression that your only cloud option is AWS.

I don’t know of any comparisons of cloud services/storage plus cost on an apples to apples basis.

Do you?

Using the Cloudant Data Layer for Windows Azure

Thursday, August 16th, 2012

Using the Cloudant Data Layer for Windows Azure by Doug Mahugh.

From the post:

If you need a highly scalable data layer for your cloud service or application running on Windows Azure, the Cloudant Data Layer for Windows Azure may be a great fit. This service, which was announced in preview mode in June and is now in beta, delivers Cloudant’s “database as a service” offering on Windows Azure.

From Cloudant’s data layer you’ll get rich support for data replication and synchronization scenarios such as online/offline data access for mobile device support, a RESTful Apache CouchDB-compatible API, and powerful features including full-text search, geo-location, federated analytics, schema-less document collections, and many others. And perhaps the greatest benefit of all is what you don’t get with Cloudant’s approach: you’ll have no responsibility for provisioning, deploying, or managing your data layer. The experts at Cloudant take care of those details, while you stay focused on building applications and cloud services that use the data layer.

….

For an example of how to use the Cloudant Data Layer, see the tutorial “Using the Cloudant Data Layer for Windows Azure,” which takes you through the steps needed to set up an account, create a database, configure access permissions, and develop a simple PHP-based photo album application that uses the database to store text and images:

Not that you need a Cloudant Data Layer for a photo album but it will help get your feet wet with cloud computing.
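
To give a feel for what a "RESTful Apache CouchDB-compatible API" means in practice, here is a minimal Python sketch (the tutorial itself uses PHP). It relies on two calls from the standard CouchDB HTTP API: PUT /{db} to create a database and PUT /{db}/{docid} to store a JSON document. The base URL, database name and document below are made up for illustration, authentication is left out, and nothing is actually sent over the wire.

```python
import json
import urllib.request


def create_database_request(base_url, db_name):
    """PUT /{db} creates a database in the CouchDB HTTP API."""
    return urllib.request.Request(f"{base_url}/{db_name}", method="PUT")


def save_document_request(base_url, db_name, doc_id, doc):
    """PUT /{db}/{docid} stores a JSON document under the given id."""
    body = json.dumps(doc).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{db_name}/{doc_id}", data=body, method="PUT"
    )
    request.add_header("Content-Type", "application/json")
    return request


# Hypothetical account URL, database and document.
BASE_URL = "https://account.example.com"
db_req = create_database_request(BASE_URL, "photos")
doc_req = save_document_request(
    BASE_URL, "photos", "beach-001", {"caption": "Day at the beach"}
)
print(db_req.get_method(), db_req.full_url)
print(doc_req.get_method(), doc_req.full_url)
```

Schema-less shows up right there: the document is whatever JSON you hand over, no table definition required.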

The Coming Majority: Mainstream Adoption and Entrepreneurship [Cloud Gift Certificates?]

Saturday, July 28th, 2012

The Coming Majority: Mainstream Adoption and Entrepreneurship by James Locus.

From the post:

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential. In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate. Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation. Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility). Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

You really do need a local installation of Hadoop for experimenting.

But at the same time, having a minimal cloud account where you can whistle up some serious computing power isn’t a bad idea either.

That would make an interesting “back to school” or “holiday present for your favorite geek” sort of present. A “gift certificate” for so many hours/cycles a month on a cloud platform.

BTW, what projects would you undertake if barriers of access and capacity were diminished if not removed?

PostgreSQL’s place in the New World Order

Friday, July 27th, 2012

PostgreSQL’s place in the New World Order by Matthew Soldo.

Description:

Mainstream software development is undergoing a radical shift. Driven by the agile development needs of web, social, and mobile apps, developers are increasingly deploying to platforms-as-a-service (PaaS). A key enabling technology of PaaS is cloud-services: software, often open-source, that is consumed as a service and operated by a third-party vendor. This shift has profound implications for the open-source world. It enables new business models, increases emphasis on user-experience, and creates new opportunities.

PostgreSQL is an excellent case study in this shift. The PostgreSQL project has long offered one of the most reliable open source databases, but has received less attention than competing technologies. But in the PaaS and cloud-services world, reliability and open-ness become increasingly important. As such, we are seeing the beginning of a shift in adoption towards PostgreSQL.

The datastore landscape is particularly interesting because of the recent attention given to the so-called NoSQL technologies. Data is suddenly sexy again. This attention is largely governed by the same forces driving developers to PaaS, namely the need for agility and scalability in building modern apps. Far from being a threat to PostgreSQL, these technologies present an amazing opportunity for showing the way towards making PostgreSQL more powerful and more widely adopted.

The presentation sounds great, but alas, the slidedeck is just a slidedeck. :-(

I do recommend it for the next to last slide graphic. Very cool!

(And it may be time to take another look at PostgreSQL as well.)

Cloudera Manager 4.0.3 Released!

Friday, July 20th, 2012

Cloudera Manager 4.0.3 Released! by Bala Venkatrao.

From the post:

We are pleased to announce the availability of Cloudera Manager 4.0.3. This is an enhancement release, with several improvements to configurability and usability. Some key enhancements include:

  • Configurable user/group settings for Oozie, HBase, YARN, MapReduce, and HDFS processes.
  • Support new configuration parameters for MapReduce services.
  • Auto configuration of reserved space for non-DFS use parameter for HDFS service.
  • Improved cluster upgrade process.
  • Support for LDAP users/groups that belong to more than one Organization Unit (OU).
  • Flexibility with distribution of key tabs when using existing Kerberos infrastructure (e.g. Active Directory).

Detailed release notes available at:

https://ccp.cloudera.com/display/ENT4DOC/New+Features+in+Cloudera+Manager+4.0

Cloudera Manager 4.0.3 is available to download from:

https://ccp.cloudera.com/display/SUPPORT/Downloads

Something for the weekend!

Nutch 1.5/1.5.1 [Cloud Setup for Experiments?]

Sunday, July 15th, 2012

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

http://www.apache.org/dist/nutch/CHANGES-1.5.txt

[WRONG URL - Should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (the version, “/1.5/”, is missing from the path; it took me a while to notice the nature of the problem.)]

made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below

http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (eg other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

And 1.5.1:

http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

Not really even “trust,” I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and it is a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? ;-) Trusting fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side by side comparison of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.
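
A side by side comparison does not have to be elaborate. Here is a back-of-the-envelope sketch in Python that finds the first month at which renting costs more than buying. Every number in it is invented for illustration (hardware price, monthly power and space, hourly rate, node count); plug in real bids before drawing any conclusions.

```python
def local_cost(hardware_cost, monthly_power_and_space, months):
    """Up-front hardware plus running costs for a local install."""
    return hardware_cost + monthly_power_and_space * months


def cloud_cost(hourly_rate, nodes, hours_per_month, months):
    """Pay-as-you-go cost for the same number of nodes."""
    return hourly_rate * nodes * hours_per_month * months


def break_even_month(hardware_cost, monthly_power_and_space,
                     hourly_rate, nodes, hours_per_month, max_months=60):
    """First month at which cumulative cloud spend reaches the local spend.

    Returns None if the cloud stays cheaper for the whole period.
    """
    for month in range(1, max_months + 1):
        cloud = cloud_cost(hourly_rate, nodes, hours_per_month, month)
        local = local_cost(hardware_cost, monthly_power_and_space, month)
        if cloud >= local:
            return month
    return None


# Hypothetical figures: four boxes for 4000 total, 100 a month to run,
# versus four cloud nodes at 0.25 an hour.
occasional = break_even_month(4000, 100, 0.25, 4, hours_per_month=40)
always_on = break_even_month(4000, 100, 0.25, 4, hours_per_month=720)
print(occasional, always_on)
```

With these made-up figures, occasional experimenting (40 hours a month) never reaches break-even within five years, while running the nodes around the clock passes it in month 7. Which is rather the point of asking for the comparison.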

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

Google Compute Engine: Computing without limits

Friday, June 29th, 2012

Google Compute Engine: Computing without limits by Craig McLuckie.

From the post:

Over the years, Google has built some of the most high performing, scalable and efficient data centers in the world by constantly refining our hardware and software. Since 2008, we’ve been working to open up our infrastructure to outside developers and businesses so they can take advantage of our cloud as they build applications and websites and store and analyze data. So far this includes products like Google App Engine, Google Cloud Storage, and Google BigQuery.

Today, in response to many requests from developers and businesses, we’re going a step further. We’re introducing Google Compute Engine, an Infrastructure-as-a-Service product that lets you run Linux Virtual Machines (VMs) on the same infrastructure that powers Google. This goes beyond just giving you greater flexibility and control; access to computing resources at this scale can fundamentally change the way you think about tackling a problem.

Google Compute Engine offers:

  • Scale. At Google we tackle huge computing tasks all the time, like indexing the web, or handling billions of search queries a day. Using Google’s data centers, Google Compute Engine reduces the time to scale up for tasks that require large amounts of computing power. You can launch enormous compute clusters – tens of thousands of cores or more.
  • Performance. Many of you have learned to live with erratic performance in the cloud. We have built our systems to offer strong and consistent performance even at massive scale. For example, we have sophisticated network connections that ensure consistency. Even in a shared cloud you don’t see interruptions; you can tune your app and rely on it not degrading.
  • Value. Computing in the cloud is getting even more appealing from a cost perspective. The economy of scale and efficiency of our data centers allows Google Compute Engine to give you 50% more compute for your money than with other leading cloud providers. You can see pricing details here.

The capabilities of Google Compute Engine include:

  • Compute. Launch Linux VMs on-demand. 1, 2, 4 and 8 virtual core VMs are available with 3.75GB RAM per virtual core.
  • Storage. Store data on local disk, on our new persistent block device, or on our Internet-scale object store, Google Cloud Storage.
  • Network. Connect your VMs together using our high-performance network technology to form powerful compute clusters and manage connectivity to the Internet with configurable firewalls.
  • Tooling. Configure and control your VMs via a scriptable command line tool or web UI. Or you can create your own dynamic management system using our API.

Google Compute Engine Preview – Signup
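
The sizing in the Compute bullet is easy to work out. A small Python sketch, using only the figures quoted above (1, 2, 4 and 8 virtual cores, 3.75GB RAM per virtual core); the function names are mine, not part of any Google tool or API:

```python
RAM_PER_CORE_GB = 3.75          # from the announcement
CORE_COUNTS = (1, 2, 4, 8)      # VM sizes listed in the announcement


def ram_for(cores):
    """RAM in GB for a VM with the given number of virtual cores."""
    if cores not in CORE_COUNTS:
        raise ValueError(f"no {cores}-core VM in the announced line-up")
    return cores * RAM_PER_CORE_GB


def smallest_vm_for(ram_needed_gb):
    """Smallest announced core count whose RAM covers the requirement."""
    for cores in CORE_COUNTS:
        if ram_for(cores) >= ram_needed_gb:
            return cores
    return None  # nothing in the line-up is big enough


sizes = {cores: ram_for(cores) for cores in CORE_COUNTS}
print(sizes)
print(smallest_vm_for(10))
```

So the line-up runs from 3.75GB to 30GB per VM, and a job that needs 10GB in memory lands on the 4-core size.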

Wondering how this will impact evaluations of CS papers? And what data sets will be used on a routine basis?

To say nothing of exploration of data/text mining.

Now if we can just get access to the majority of research literature, well, but that’s an issue for another forum.