Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 10, 2013

…Cloud Integration is Becoming a Bigger Issue

Filed under: Cloud Computing,Data Integration,Marketing — Patrick Durusau @ 5:27 am

Survey Reports that Cloud Integration is Becoming a Bigger Issue by David Linthicum.

David cites a survey by KPMG that found thirty-three percent of executives complained of higher than expected costs for data integration in cloud projects.

One assumes those were the brighter thirty-three percent of those surveyed. The remainder apparently did not recognize the data integration issues in their cloud projects.

David writes:

Part of the problem is that data integration itself has never been sexy, and thus seems to be an issue that enterprise IT avoids until it can’t be ignored. However, data integration should be the life-force of the enterprise architecture, and there should be a solid strategy and foundational technology in place.

Cloud computing is not the cause of this problem, but it’s shining a much brighter light on the lack of data integration planning. Integrating cloud-based systems is a bit more complex and laborious. However, the data integration technology out there is well proven and supports cloud-based platforms as the source or the target in an integration chain. (emphasis added)

The more diverse data sources become, the larger data integration issues will loom.

Topic maps offer data integration efforts in cloud projects a choice:

1) You can integrate one-off, either with in-house or third-party tools, only to redo all that work with each new data source, or

2) You can integrate using a topic map (for integration or to document integration) and re-use the expertise from prior data integration efforts.

Suggest pitching topic maps as a value-add proposition.
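A toy sketch of the reuse idea in Python, using nothing but a dictionary keyed by subject identifiers (the identifiers, source names and field names are invented for illustration; a real topic map engine does far more):

```python
# Toy illustration only: a reusable mapping from subject identifiers to the
# field names each source uses for that subject. New sources extend the map;
# existing entries (and the analysis behind them) are reused, not redone.
FIELD_MAP = {
    "http://example.org/subject/customer-id": {
        "crm_dump":   "CUST_NO",
        "billing_db": "customer_id",
    },
    "http://example.org/subject/postal-code": {
        "crm_dump":   "ZIP",
        "billing_db": "postcode",
    },
}

def normalize(record: dict, source: str) -> dict:
    """Re-key a record from any known source by subject identifier."""
    by_field = {fields[source]: subject
                for subject, fields in FIELD_MAP.items() if source in fields}
    return {by_field.get(k, k): v for k, v in record.items()}

print(normalize({"CUST_NO": "42", "ZIP": "30605"}, "crm_dump"))
print(normalize({"customer_id": "42", "postcode": "30605"}, "billing_db"))
```

Adding a third source means adding entries for its field names; the mappings, and the analysis behind them, for the first two sources are reused untouched.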

March 31, 2013

Amazon S3 clone open-sourced by Riak devs [Cloud of Tomorrow?]

Filed under: Cloud Computing,Riak CS — Patrick Durusau @ 9:20 am

Amazon S3 clone open-sourced by Riak devs by Elliot Bentley.

From the post:

The developers of NoSQL database Riak have open-sourced their new project, an Amazon S3 clone called Riak CS.

In development for a year, Riak CS provides highly-available, fault-tolerant storage able to manage files as large as 5GB, with an API and authentication system compatible with Amazon S3. In addition, today’s open-source release introduces multipart upload and a new web-based admin tool.

Riak CS is built on top of Basho’s flagship product Riak, a decentralised key/value store NoSQL database. Riak was also based on an existing Amazon creation – in this case, Dynamo, which also served as the inspiration for Apache Cassandra.

In December’s issue of JAX Magazine, Basho EMEA boss Matt Heitzenroder (who has since left the company) explained that Riak CS was initially conceived as an exercise in “dogfooding” their own database product. “It was a goal of engineers to gain insight into use cases themselves and also to have something we can go out there and sell,” he said.

See also: The Riak CS Fast Track.

You may have noticed that files stored on/in (?) clouds are just like files on your local hard drive.

They can be copied, downloaded, pipelined, subjected to ETL, processed and transferred.
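Because Riak CS exposes an S3-compatible API, generic S3 client code should work against it unchanged. A minimal sketch, using boto3 purely as a generic S3 client; the endpoint address, credentials and bucket name are placeholders:

```python
import boto3

# Placeholder endpoint and credentials for a local Riak CS installation.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8080",
    aws_access_key_id="RIAK-CS-ACCESS-KEY",
    aws_secret_access_key="RIAK-CS-SECRET-KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt",
              Body=b"stored through the S3-compatible API")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```

Point endpoint_url at Amazon S3 instead and the same code runs there, which is the point of the compatibility claim.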

The cloud of your choice provides access to greater computing power and storage than before, but that's a difference of degree, not of kind.

A difference in kind would be the ability to find and re-use data based upon its semantics and not on happenstance of file or field names.

Riak CS isn’t that cloud today but in the competition to be the cloud of tomorrow, who knows?

March 25, 2013

…Cloud Computing is Changing Data Integration

Filed under: Cloud Computing,Data Integration,Topic Maps — Patrick Durusau @ 4:51 am

More Evidence that Cloud Computing is Changing Data Integration by David Linthicum.

From the post:

In a recent Sand Hill article, Jeff Kaplan, the managing director of THINKstrategies, reports on the recent and changing state of data integration with the addition of cloud computing. “One of the ongoing challenges that continues to frustrate businesses of all sizes is data integration, and that issue has only become more complicated with the advent of the cloud. And, in the brave new world of the cloud, data integration must morph into a broader set of data management capabilities to satisfy the escalating needs of today’s business.”

In the article, Jeff reviews a recent survey conducted with several software vendors, concluding:

  • Approximately 90 percent of survey respondents said integration is important in their ability to win new customers.
  • Eighty-four percent of the survey respondents reported that integration has become a difficult task that is getting in the way of business.
  • A quarter of the respondents said they’ve still lost customers because of integration issues.

It’s interesting to note that these issues affect legacy software vendors, as well as Software-as-a-Service (SaaS) vendors. No matter if you sell software in the cloud or deliver it on-demand, the data integration issues are becoming a hindrance.

If cloud computing and/or big data are bringing data integration into the limelight, that sounds like good news for topic maps.

Particularly topic maps of data sources that enable quick and reliable data integration without a round of exploration and testing first.

February 19, 2013

Using Clouds for MapReduce Measurement Assignments [Grad Class Near You?]

Filed under: Cloud Computing,MapReduce — Patrick Durusau @ 11:30 am

Using Clouds for MapReduce Measurement Assignments by Ariel Rabkin, Charles Reiss, Randy Katz, and David Patterson. (ACM Trans. Comput. Educ. 13, 1, Article 2 (January 2013), 18 pages. DOI = 10.1145/2414446.2414448)

Abstract:

We describe our experiences teaching MapReduce in a large undergraduate lecture course using public cloud services and the standard Hadoop API. Using the standard API, students directly experienced the quality of industrial big-data tools. Using the cloud, every student could carry out scalability benchmarking assignments on realistic hardware, which would have been impossible otherwise. Over two semesters, over 500 students took our course. We believe this is the first large-scale demonstration that it is feasible to use pay-as-you-go billing in the cloud for a large undergraduate course. Modest instructor effort was sufficient to prevent students from overspending. Average per-pupil expenses in the Cloud were under $45. Students were excited by the assignment: 90% said they thought it should be retained in future course offerings.
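The assignments used the standard Hadoop API; as a sketch of the kind of exercise involved, here is the canonical word count written for Hadoop Streaming in Python rather than the Java API the course used (the map/reduce dispatch convention is mine):

```python
#!/usr/bin/env python3
"""Canonical word count for Hadoop Streaming: invoke with 'map' or 'reduce'."""
import sys
from itertools import groupby

def mapper():
    # emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal words arrive together
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Scaling the same job from a laptop to a rented cluster is exactly the kind of benchmarking exercise the paper describes.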

With properly structured assignments, I can see this technique being used to introduce library graduate students to data mining and similar topics on non-trivial data sets.

Getting “hands on” experience should make them more than a match for the sales types from information vendors.

Not to mention that data mining flourishes when used with an understanding of the underlying semantics of the data set.

I first saw this at: On Teaching MapReduce via Clouds

February 16, 2013

Amazon Web Services Announces Amazon Redshift

Filed under: Amazon Web Services AWS,Cloud Computing,Data Warehouse — Patrick Durusau @ 4:47 pm

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re:Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resources to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services
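Redshift clusters speak the PostgreSQL wire protocol on port 5439, so standard drivers work. A minimal sketch with psycopg2; the endpoint, credentials and the sales table are placeholders:

```python
import psycopg2

# Placeholder cluster endpoint, database and credentials.
conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="example-password",
)

# Run an ordinary analytic query against a hypothetical sales table.
with conn, conn.cursor() as cur:
    cur.execute("SELECT date_trunc('day', saletime) AS day, SUM(qtysold) "
                "FROM sales GROUP BY 1 ORDER BY 1 LIMIT 7;")
    for day, total in cur.fetchall():
        print(day, total)

conn.close()
```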

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well-known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. 😉

Sigh, semantic confusion continues unabated.

January 29, 2013

CamelOne 2012 (videos/presentations) Boston, MA

Filed under: Apache Camel,Cloud Computing — Patrick Durusau @ 6:49 pm

CamelOne 2012 (videos/presentations) Boston, MA

Videos and presentations for your enjoyment from the CamelOne 2012 conference.

As usual I was looking for something else and found more than I bargained for! 😉

January 11, 2013

Getting Started with VM Depot

Filed under: Azure Marketplace,Cloud Computing,Linux OS,Microsoft,Virtual Machines — Patrick Durusau @ 7:35 pm

Getting Started with VM Depot by Doug Mahugh.

From the post:

Do you need to deploy a popular OSS package on a Windows Azure virtual machine, but don’t know where to start? Or do you have a favorite OSS configuration that you’d like to make available for others to deploy easily? If so, the new VM Depot community portal from Microsoft Open Technologies is just what you need. VM Depot is a community-driven catalog of preconfigured operating systems, applications, and development stacks that can easily be deployed on Windows Azure.

You can learn more about VM Depot in the announcement from Gianugo Rabellino over on Port 25 today. In this post, we’re going to cover the basics of how to use VM Depot, so that you can get started right away.

Doug outlines simple steps to get you rolling with the VM Depot.

Sounds a lot easier than trying to walk casual computer users through installation and configuration of software. I assume you could even load data onto the VMs.

Users just need to fire up the VM and they have the interface and data they want.

Sounds like a nice way to distribute topic map based information systems.

December 17, 2012

The Cooperative Computing Lab

Filed under: Cloud Computing,Clustering (servers),HPC,Parallel Programming,Programming — Patrick Durusau @ 2:39 pm

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

December 9, 2012

BigMLer in da Cloud: Machine Learning made even easier [Amateur vs. Professional Models]

Filed under: Cloud Computing,Machine Learning,WWW — Patrick Durusau @ 5:19 pm

BigMLer in da Cloud: Machine Learning made even easier by Martin Prats.

From the post:

We have open-sourced BigMLer, a command line tool that will let you create predictive models much easier than ever before.

BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.

“…will let you create predictive models much easier than ever before.”

Well…, true, but the amount of effort you invest in a predictive model bears directly on its usefulness for a given purpose.

It is a great idea to create an easy “on ramp” to introduce machine learning. But it may lead some users to confuse “…easier than ever before” models with professionally crafted models.
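For the curious, a minimal sketch of the Python bindings BigMLer wraps, assuming the bigml package and API credentials; the source-to-prediction flow follows the documented create_* methods, but treat the details as illustrative:

```python
from bigml.api import BigML

# Placeholder credentials; the bindings can also read them from the environment.
api = BigML("demo_user", "demo_api_key")

source = api.create_source("./iris.csv")       # upload a local CSV
dataset = api.create_dataset(source)           # derive a dataset from it
model = api.create_model(dataset)              # train a model

prediction = api.create_prediction(
    model, {"petal length": 4.2, "petal width": 1.3})
api.pprint(prediction)                         # show the predicted class
```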

An old friend confided their organization was about to write a classification system for a well-known subject. Exciting to think they will put all past errors to rest while adding new capabilities.

But in reality librarians have labored in such areas for centuries. It isn't a good target for a start-up project. Particularly for those innocent of existing classification systems and the theory/praxis that drove their creation.

Librarians didn't invent the Internet. If they had, we wouldn't be searching for ways to curate information on the Internet in a backwards-compatible way.

November 29, 2012

Abusing Cloud-Based Browsers for Fun and Profit [Passing Messages, Not Data]

Filed under: Cloud Computing,Javascript,MapReduce,Messaging — Patrick Durusau @ 12:58 pm

Abusing Cloud-Based Browsers for Fun and Profit by Vasant Tendulkar, Joe Pletcher, Ashwin Shashidharan, Ryan Snyder, Kevin Butler and William Enck.

Abstract:

Cloud services have become a cheap and popular means of computing. They allow users to synchronize data between devices and relieve low-powered devices from heavy computations. In response to the surge of smartphones and mobile devices, several cloud-based Web browsers have become commercially available. These “cloud browsers” assemble and render Web pages within the cloud, executing JavaScript code for the mobile client. This paper explores how the computational abilities of cloud browsers may be exploited through a Browser MapReduce (BMR) architecture for executing large, parallel tasks. We explore the computation and memory limits of four cloud browsers, and demonstrate the viability of BMR by implementing a client based on a reverse engineering of the Puffin cloud browser. We implement and test three canonical MapReduce applications (word count, distributed grep, and distributed sort). While we perform experiments on relatively small amounts of data (100 MB) for ethical considerations, our results strongly suggest that current cloud browsers are a viable source of arbitrary free computing at large scale.

Excellent work on extending the use of cloud-based browsers. Whether you intend to use them for good or ill.

The use of messaging as opposed to passage of data is particularly interesting.

Shouldn’t that work for the process of merging as well?

Comments/suggestions?

November 7, 2012

Rx for Asychronous Data Streams in the Clouds

Filed under: Cloud Computing,Data Streams,Microsoft,Rx — Patrick Durusau @ 4:29 pm

Claudio Caldato wrote: MS Open Tech Open Sources Rx (Reactive Extensions) – a Cure for Asynchronous Data Streams in Cloud Programming.

I was tired by the time I got to the end of the title! His is more descriptive than mine but if you know the context, you don’t need the description.

From the post:

If you are a developer that writes asynchronous code for composite applications in the cloud, you know what we are talking about, for everybody else Rx Extensions is a set of libraries that makes asynchronous programming a lot easier. As Dave Sexton describes it, “If asynchronous spaghetti code were a disease, Rx is the cure.”

Reactive Extensions (Rx) is a programming model that allows developers to glue together asynchronous data streams. This is particularly useful in cloud programming because it helps create a common interface for writing applications that come from diverse data sources, e.g., stock quotes, Tweets, computer events, Web service requests.

Today, Microsoft Open Technologies, Inc., is open sourcing Rx. Its source code is now hosted on CodePlex to increase the community of developers seeking a more consistent interface to program against, and one that works across several development languages. The goal is to expand the number of frameworks and applications that use Rx in order to achieve better interoperability across devices and the cloud.

Rx was developed by Microsoft Corp. architect Erik Meijer and his team, and is currently used on products in various divisions at Microsoft. Microsoft decided to transfer the project to MS Open Tech in order to capitalize on MS Open Tech’s best practices with open development.

There are applications that you probably touch every day that are using Rx under the hood. A great example is GitHub for Windows.

According to Paul Betts at GitHub, “GitHub for Windows uses the Reactive Extensions for almost everything it does, including network requests, UI events, managing child processes (git.exe). Using Rx and ReactiveUI, we’ve written a fast, nearly 100% asynchronous, responsive application, while still having 100% deterministic, reliable unit tests. The desktop developers at GitHub loved Rx so much, that the Mac team created their own version of Rx and ReactiveUI, called ReactiveCocoa, and are now using it on the Mac to obtain similar benefits.”
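To make “glue together asynchronous data streams” concrete, here is the flavor of the idea in plain Python asyncio rather than Rx itself; the quote and tweet streams are invented stand-ins:

```python
import asyncio

async def quotes(out):
    # stand-in for an asynchronous stock-quote feed
    for price in (101.2, 101.9, 100.7):
        await asyncio.sleep(0.1)
        await out.put(("quote", price))

async def tweets(out):
    # stand-in for an asynchronous tweet stream
    for text in ("earnings beat", "guidance cut"):
        await asyncio.sleep(0.15)
        await out.put(("tweet", text))

async def main():
    merged = asyncio.Queue()
    producers = [asyncio.create_task(quotes(merged)),
                 asyncio.create_task(tweets(merged))]
    # consume the merged stream until both producers finish and the queue drains
    while not (all(p.done() for p in producers) and merged.empty()):
        try:
            kind, payload = await asyncio.wait_for(merged.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue
        print(kind, payload)

asyncio.run(main())
```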

What if the major cloud players started competing on the basis of interoperability? So your app here will work there.

Reducing the impedance for developers enables more competition between developers. Resulting in better services/product for consumers.

Cloud owners get more options to offer their customers.

Topic map applications have an easier time mining, identifying and recombining subjects across diverse sources and even clouds.

Does anyone see a downside here?

October 21, 2012

The personal cloud series

Filed under: Cloud Computing,Users,WWW — Patrick Durusau @ 9:52 am

The personal cloud series by Jon Udell.

Excellent source of ideas on the web/cloud as we experience it today and as we may experience it tomorrow.

Going through prior posts now and will call some of them out for further discussion.

Which ones impress you the most?

October 18, 2012

Axemblr’s Java Client for the Cloudera Manager API

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 10:38 am

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”

Another goodie to ease your way to Hadoop deployment on your favorite cloud.
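The client wraps a plain REST API, so you can also poke at it directly. A minimal sketch with Python and requests; the host, port, API version and admin credentials are assumptions about a typical Cloudera Manager installation:

```python
import requests

# Placeholder Cloudera Manager endpoint and default-style credentials.
CM_API = "http://cm-host.example.com:7180/api/v1"
AUTH = ("admin", "admin")

# List the clusters this Cloudera Manager instance knows about,
# then the services (HDFS, MapReduce, ...) configured on each one.
clusters = requests.get(f"{CM_API}/clusters", auth=AUTH).json()
for cluster in clusters.get("items", []):
    print(cluster["name"])
    services = requests.get(
        f"{CM_API}/clusters/{cluster['name']}/services", auth=AUTH).json()
    for service in services.get("items", []):
        print("  ", service["name"], service["type"])
```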

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” that lights up. More realistic than the data appliance.

October 12, 2012

Lacking Data Integration, Cloud Computing Suffers

Filed under: Cloud Computing,Data Integration — Patrick Durusau @ 2:11 pm

Lacking Data Integration, Cloud Computing Suffers by David Linthicum.

From the post:

The findings of the Cloud Market Maturity study, a survey conducted jointly by Cloud Security Alliance (CSA) and ISACA, show that government regulations, international data privacy, and integration with internal systems dominate the top 10 areas where trust in the cloud is at its lowest.

The Cloud Market Maturity study examines the maturity of cloud computing and helps identify market changes. In addition, the report provides detailed information on the adoption of cloud services at all levels within global companies, including senior executives.

Study results reveal that cloud users from 50 countries expressed the lowest level of confidence in the following (ranked from most reliable to least reliable):

  • Government regulations keeping pace with the market
  • Exit strategies
  • International data privacy
  • Legal issues
  • Contract lock in
  • Data ownership and custodian responsibilities
  • Longevity of suppliers
  • Integration of cloud with internal systems
  • Credibility of suppliers
  • Testing and assurance

Questions:

As “big data” gets “bigger,” will cloud integration issues get better or worse?

Do you prefer disposable data integration or reusable data integration? (For bonus points, why?)

September 12, 2012

Cloudera Enterprise in Less Than Two Minutes

Filed under: Cloud Computing,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:10 pm

Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.

I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!

Good example of selling technique too!

Focused on common use cases and user concerns. Promises a solution without all the troublesome details.

Time enough for that after a prospect is interested. And even then, ease them into the details.

How To Take Big Data to the Cloud [Webinar – 13 Sept 2012 – 10 AM PDT]

Filed under: BigData,Cloud Computing,Hortonworks — Patrick Durusau @ 10:17 am

How To Take Big Data to the Cloud by Lisa Sensmeier.

From the post:

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One option to consider is using the cloud as a practical and economical way to go. The cloud can also be used to provide extra capacity for an existing cluster or to test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks.

OK, it is a vendor/partner presentation, but most of us work for vendors and use vendor-created tools.

Yes?

The real question is whether tool X does what is necessary at a cost project Y can afford.

Whether vendor sponsored tool, service, home grown or otherwise.

Yes?

Looking forward to it!

August 24, 2012

Big Data on Heroku – Hadoop from Treasure Data

Filed under: Cloud Computing,Hadoop,Heroku — Patrick Durusau @ 3:32 pm

Big Data on Heroku – Hadoop from Treasure Data by Istvan Szegedi.

From the post:

This time I write about Heroku and Treasure Data Hadoop solution – I found it really to be a ‘gem’ in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. Originally it started with supporting Ruby as its main programming language but it has been extended to Java, Scala, Node.js, Python and Clojure, too. It also supports a long list of addons including – among others – RDBMS and NoSQL capabilities and Hadoop-based data warehouse developed by Treasure Data.

Not to leave the impression that your only cloud option is AWS.

I don't know of any comparisons of cloud services/storage plus cost on an apples-to-apples basis.

Do you?

August 16, 2012

Using the Cloudant Data Layer for Windows Azure

Filed under: Cloud Computing,Windows Azure — Patrick Durusau @ 4:00 pm

Using the Cloudant Data Layer for Windows Azure by Doug Mahugh.

From the post:

If you need a highly scalable data layer for your cloud service or application running on Windows Azure, the Cloudant Data Layer for Windows Azure may be a great fit. This service, which was announced in preview mode in June and is now in beta, delivers Cloudant’s “database as a service” offering on Windows Azure.

From Cloudant’s data layer you’ll get rich support for data replication and synchronization scenarios such as online/offline data access for mobile device support, a RESTful Apache CouchDB-compatible API, and powerful features including full-text search, geo-location, federated analytics, schema-less document collections, and many others. And perhaps the greatest benefit of all is what you don’t get with Cloudant’s approach: you’ll have no responsibility for provisioning, deploying, or managing your data layer. The experts at Cloudant take care of those details, while you stay focused on building applications and cloud services that use the data layer.

….

For an example of how to use the Cloudant Data Layer, see the tutorial “Using the Cloudant Data Layer for Windows Azure,” which takes you through the steps needed to set up an account, create a database, configure access permissions, and develop a simple PHP-based photo album application that uses the database to store text and images.
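Since the data layer is CouchDB-compatible, the basics need nothing beyond HTTP. A minimal sketch with Python and requests; the account name, credentials and database name are placeholders:

```python
import requests

# Placeholder Cloudant account and credentials.
BASE = "https://myaccount.cloudant.com"
AUTH = ("myaccount", "my_password")

# Create a database, store a JSON document, then read it back,
# all through the standard CouchDB HTTP API.
requests.put(f"{BASE}/photos", auth=AUTH)
resp = requests.post(f"{BASE}/photos", auth=AUTH,
                     json={"caption": "first upload", "album": "vacation"})
doc_id = resp.json()["id"]
print(requests.get(f"{BASE}/photos/{doc_id}", auth=AUTH).json())
```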

Not that you need a Cloudant Data Layer for a photo album but it will help get your feet wet with cloud computing.

July 28, 2012

The Coming Majority: Mainstream Adoption and Entrepreneurship [Cloud Gift Certificates?]

Filed under: Cloud Computing,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 6:22 pm

The Coming Majority: Mainstream Adoption and Entrepreneurship by James Locus.

From the post:

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential. In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate. Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation. Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility). Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

You really do need a local installation of Hadoop for experimenting.

But at the same time, having a minimal cloud account where you can whistle up some serious computing power isn’t a bad idea either.

That would make an interesting “back to school” or “holiday” present for your favorite geek: a “gift certificate” for so many hours/cycles a month on a cloud platform.

BTW, what projects would you undertake if barriers of access and capacity were diminished if not removed?

July 27, 2012

PostgreSQL’s place in the New World Order

Filed under: Cloud Computing,Database,Heroku,PostgreSQL — Patrick Durusau @ 4:22 am

PostgreSQL’s place in the New World Order by Matthew Soldo.

Description:

Mainstream software development is undergoing a radical shift. Driven by the agile development needs of web, social, and mobile apps, developers are increasingly deploying to platforms-as-a-service (PaaS). A key enabling technology of PaaS is cloud-services: software, often open-source, that is consumed as a service and operated by a third-party vendor. This shift has profound implications for the open-source world. It enables new business models, increases emphasis on user-experience, and creates new opportunities.

PostgreSQL is an excellent case study in this shift. The PostgreSQL project has long offered one of the most reliable open source databases, but has received less attention than competing technologies. But in the PaaS and cloud-services world, reliability and open-ness become increasingly important. As such, we are seeing the beginning of a shift in adoption towards PostgreSQL.

The datastore landscape is particularly interesting because of the recent attention given to the so-called NoSQL technologies. Data is suddenly sexy again. This attention is largely governed by the same forces driving developers to PaaS, namely the need for agility and scalability in building modern apps. Far from being a threat to PostgreSQL, these technologies present an amazing opportunity for showing the way towards making PostgreSQL more powerful and more widely adopted.

The presentation sounds great, but alas, the slidedeck is just a slidedeck. 🙁

I do recommend it for the next to last slide graphic. Very cool!

(And it may be time to take a another look at PostgreSQL as well.)

July 20, 2012

Cloudera Manager 4.0.3 Released!

Filed under: Cloud Computing,Cloudera — Patrick Durusau @ 4:39 am

Cloudera Manager 4.0.3 Released! by Bala Venkatrao.

From the post:

We are pleased to announce the availability of Cloudera Manager 4.0.3. This is an enhancement release, with several improvements to configurability and usability. Some key enhancements include:

  • Configurable user/group settings for Oozie, HBase, YARN, MapReduce, and HDFS processes.
  • Support new configuration parameters for MapReduce services.
  • Auto configuration of reserved space for non-DFS use parameter for HDFS service.
  • Improved cluster upgrade process.
  • Support for LDAP users/groups that belong to more than one Organization Unit (OU).
  • Flexibility with distribution of keytabs when using existing Kerberos infrastructure (e.g., Active Directory).

Detailed release notes available at:

https://ccp.cloudera.com/display/ENT4DOC/New+Features+in+Cloudera+Manager+4.0

Cloudera Manager 4.0.3 is available to download from:

https://ccp.cloudera.com/display/SUPPORT/Downloads

Something for the weekend!

July 15, 2012

Nutch 1.5/1.5.1 [Cloud Setup for Experiments?]

Filed under: Cloud Computing,Nutch,Search Engines — Patrick Durusau @ 3:41 pm

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

http://www.apache.org/dist/nutch/CHANGES-1.5.txt

[WRONG URL – Should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (the “/1.5/” segment is missing from the path; took me a while to notice the nature of the problem.)]

made in this version for a full breakdown of the 50-odd improvements the release boasts. A full PMC release statement can be found below

http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (e.g. other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

And 1.5.1:

http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

Not really even “trust,” I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and it is a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? 😉 Trusting fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side-by-side comparisons of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

June 29, 2012

Google Compute Engine: Computing without limits

Filed under: Cloud Computing,Google Compute Engine — Patrick Durusau @ 3:17 pm

Google Compute Engine: Computing without limits by Craig McLuckie.

From the post:

Over the years, Google has built some of the most high performing, scalable and efficient data centers in the world by constantly refining our hardware and software. Since 2008, we’ve been working to open up our infrastructure to outside developers and businesses so they can take advantage of our cloud as they build applications and websites and store and analyze data. So far this includes products like Google App Engine, Google Cloud Storage, and Google BigQuery.

Today, in response to many requests from developers and businesses, we’re going a step further. We’re introducing Google Compute Engine, an Infrastructure-as-a-Service product that lets you run Linux Virtual Machines (VMs) on the same infrastructure that powers Google. This goes beyond just giving you greater flexibility and control; access to computing resources at this scale can fundamentally change the way you think about tackling a problem.

Google Compute Engine offers:

  • Scale. At Google we tackle huge computing tasks all the time, like indexing the web, or handling billions of search queries a day. Using Google’s data centers, Google Compute Engine reduces the time to scale up for tasks that require large amounts of computing power. You can launch enormous compute clusters – tens of thousands of cores or more.
  • Performance. Many of you have learned to live with erratic performance in the cloud. We have built our systems to offer strong and consistent performance even at massive scale. For example, we have sophisticated network connections that ensure consistency. Even in a shared cloud you don’t see interruptions; you can tune your app and rely on it not degrading.
  • Value. Computing in the cloud is getting even more appealing from a cost perspective. The economy of scale and efficiency of our data centers allows Google Compute Engine to give you 50% more compute for your money than with other leading cloud providers. You can see pricing details here.

The capabilities of Google Compute Engine include:

  • Compute. Launch Linux VMs on-demand. 1, 2, 4 and 8 virtual core VMs are available with 3.75GB RAM per virtual core.
  • Storage. Store data on local disk, on our new persistent block device, or on our Internet-scale object store, Google Cloud Storage.
  • Network. Connect your VMs together using our high-performance network technology to form powerful compute clusters and manage connectivity to the Internet with configurable firewalls.
  • Tooling. Configure and control your VMs via a scriptable command line tool or web UI. Or you can create your own dynamic management system using our API.

Google Compute Engine Preview – Signup

Wondering how this will impact evaluations of CS papers? And what data sets will be used on a routine basis?

To say nothing of exploration of data/text mining.

Now if we could just get access to the majority of research literature… but that's an issue for another forum.

Asgard for Cloud Management and Deployment

Filed under: Amazon Web Services AWS,Asgard,Cloud Computing — Patrick Durusau @ 3:16 pm

Asgard for Cloud Management and Deployment

Amazon is tooting the horn of one of its larger customers, Netflix, when it says:

Our friends at Netflix have embraced AWS whole-heartedly. They have shared much of what they have learned about how they use AWS to build, deploy, and host their applications. You can read the Netflix Tech Blog and benefit from what they have learned.

Earlier this week they released Asgard, a web-based cloud management and deployment tool, in open source form on GitHub. According to Norse mythology, Asgard is the home of the god of thunder and lightning, and therefore controls the clouds! This is the same tool that the engineers at Netflix use to control their applications and their deployments.

Asgard layers two additional abstractions on top of AWS — Applications and Clusters.

Even if you are just in the planning (dreaming?) stages of cloud deployment for your topic map application, it would be good to review the Netflix blog. On Asgard and other posts as well.

You know how I hate to complain, ;-), but the Elder Edda does not report “Asgard” as the “home of the god of thunder and lightning.” All the gods resided at Asgard.

Even the link in the quoted part of Jeff’s post gets that much right.

Most of the time old stories told aright are more moving than modern misconceptions.

June 24, 2012

Rise above the Cloud hype with OpenShift

Filed under: Cloud Computing,Red Hat — Patrick Durusau @ 1:29 pm

Rise above the Cloud hype with OpenShift by Eric D. Schabell.

From the post:

Are you tired of requesting a new development machine for your application? Are you sick of having to setup a new test environment for your application? Do you just want to focus on developing your application in peace without ‘dorking with the stack’ all of the time? We hear you. We have been there too. Have no fear, OpenShift is here!

In this article we will walk you through the simple steps it takes to set up not one, not two, not three, but up to five new machines in the Cloud with OpenShift. You will have your applications deployed for development, testing or to present them to the world at large in minutes. No more messing around.

We start with an overview of what OpenShift is, where it comes from and how you can get the client tooling set up on your workstation. You will then be taken on a tour of the client tooling as it applies to the entry level of OpenShift, called Express. In minutes you will be off and back to focusing on your application development, deploying to test it in OpenShift Express. When finished you will just discard your test machine and move on. When you have mastered this, it will be time to ramp up into the next level with OpenShift Flex. This opens up your options a bit so you can do more with complex applications and deployments that might need a bit more fire power. After this you will be fully capable of ascending into the OpenShift Cloud when you choose, where you need it and at a moment's notice. This is how development is supposed to be, development without stack distractions.

Specific to the Red Hat Cloud but that doesn’t trouble me if it doesn’t trouble you.

What is important is that like many cloud providers, the goal is to make software development in the cloud as free from “extra” concerns as possible.

Think of users who rely upon network-based applications for word processing, spreadsheets, etc. Fewer of them would do so if every use of the application required steps that expose the network-based nature of the application. Users just want the application to work. (full stop)

A bit more of the curtain can be drawn back for developers but even there, the goal isn’t to master the intricacies of cloud computing but to produce robust applications that so happen to run on the cloud.

This is one small step towards a computing fabric where developers write and deploy software. (full stop) The details of where it is executed and where data is actually stored are known only by computing fabric specialists. The application serves its users, produces the expected answers, delivers the specified performance; what more do you need to know?

I would like to see topic maps playing a role in developing the transparency for the interconnected systems that grow into that fabric.

(I first saw this at DZone’s replication of the Java Code Geeks reposting at: http://www.dzone.com/links/r/rise_above_the_cloud_hype_with_openshift.html)

June 22, 2012

Sage Bionetworks and Amazon SWF

Sage Bionetworks and Amazon SWF

From the post:

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other bio molecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complimentary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing costs and efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and Source Forge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically-relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility is where we run computational jobs which enables Synapse to leverage the right combination of infrastructure for every project.”

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations consisting of slides read aloud about the management advantages of Amazon SWF simply can't reach. At least not for me.

Take the tip and follow the case studies, then onto the documentation.
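To give a feel for the orchestration surface, here is a minimal sketch using boto3's SWF client; the region, domain, workflow names and input are placeholders, and a real pipeline would also run decider and activity workers that poll for tasks:

```python
import boto3

swf = boto3.client("swf", region_name="us-east-1")

# Register a domain and a workflow type; re-running these calls raises an
# "already exists" fault, which can be caught and ignored.
swf.register_domain(
    name="analysis-pipelines",
    description="demo domain",
    workflowExecutionRetentionPeriodInDays="7",
)
swf.register_workflow_type(
    domain="analysis-pipelines",
    name="genomic-analysis",
    version="1.0",
    defaultTaskList={"name": "default"},
    defaultChildPolicy="TERMINATE",
    defaultTaskStartToCloseTimeout="300",
    defaultExecutionStartToCloseTimeout="3600",
)

# Kick off one workflow execution; deciders and activity workers
# (not shown) would carry out the actual pipeline steps.
swf.start_workflow_execution(
    domain="analysis-pipelines",
    workflowId="run-0001",
    workflowType={"name": "genomic-analysis", "version": "1.0"},
    input='{"dataset": "s3://example-bucket/sample.vcf"}',
)
```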

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And I have < 0 interest in DRM antics over material that, if piped to /dev/null, would raise a user's IQ.

June 21, 2012

Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012)

Filed under: Cloud Computing,Conferences,Distributed Systems,Knowledge Discovery — Patrick Durusau @ 2:49 pm

Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012)

From the website:

Paper Submission August 10, 2012

Acceptance Notice October 01, 2012

Camera-Ready Copy October 15, 2012

Workshop December 10, 2012 Brussels, Belgium

Collocated with the IEEE International Conference on Data Mining, ICDM 2012

From the website:

The 3rd International Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud, 2012) provides an international platform to share and discuss recent research results in adopting cloud and distributed computing resources for data mining and knowledge discovery tasks.

Synopsis: Processing large datasets using dedicated supercomputers alone is not an economical solution. Recent trends show that distributed computing is becoming a more practical and economical solution for many organizations. Cloud computing, which is large-scale distributed computing, has attracted significant attention of both industry and academia in recent years. Cloud computing is fast becoming a cheaper alternative to costly centralized systems. Many recent studies have shown the utility of cloud computing in data mining, machine learning and knowledge discovery. This workshop intends to bring together researchers, developers, and practitioners from academia, government, and industry to discuss new and emerging trends in cloud computing technologies, programming models, and software services and outline the data mining and knowledge discovery approaches that can efficiently exploit these modern computing infrastructures. This workshop also seeks to identify the greatest challenges in embracing cloud computing infrastructure for scaling algorithms to petabyte sized datasets. Thus, we invite all researchers, developers, and users to participate in this event and share, contribute, and discuss the emerging challenges in developing data mining and knowledge discovery solutions and frameworks around cloud and distributed computing platforms.

Topics: The major topics of interest to the workshop include but are not limited to:

  • Programing models and tools needed for data mining, machine learning, and knowledge discovery
  • Scalability and complexity issues
  • Security and privacy issues relevant to KD community
  • Best use cases: is there a class of algorithms that is best suited to cloud and distributed computing platforms?
  • Performance studies comparing clouds, grids, and clusters
  • Performance studies comparing various distributed file systems for data intensive applications
  • Customizations and extensions of existing software infrastructures such as Hadoop for streaming, spatial, and spatiotemporal data mining
  • Applications: Earth science, climate, energy, business, text, web and performance logs, medical, biology, image and video.

It’s December, Belgium and an interesting workshop. Can’t ask for much more than that!

Lessons from Amazon RDS on Bringing Existing Apps to the Cloud

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 5:56 am

Lessons from Amazon RDS on Bringing Existing Apps to the Cloud by Nati Shalom.

From the post:

It's a common belief that Cloud is good for green field apps. There are many reasons for this, in particular the fact that the cloud forces a different kind of thinking on how to run apps. Native cloud apps were designed to scale elastically, they were designed with complete automation in mind, and so forth. Most of the existing apps (a.k.a. brown field apps) were written in a pre-cloud world and therefore don't support these attributes. Adding support for these attributes could carry a significant investment. In some cases, this investment could be so big that it would make more sense to go through a complete re-write.

In this post I want to challenge this common belief. Over the past few years I have found that many stateful applications running on the cloud don’t support all those attributes, elasticity in particular. One of the better-known examples of this is MySQL and its Amazon cloud offering, RDS, which I’ll use throughout this post to illustrate my point.

Amazon RDS as an example for migrating a brown-field applications

MySQL was written in a pre-cloud world and therefore fits into the definition of a brown-field app. As with many brown-field apps, it wasn’t designed to be elastic or to scale out, and yet it is one of the more common and popular services on the cloud. To me, this means that there are probably other attributes that matter even more when we consider our choice of application in the cloud. Amazon RDS is the cloud-enabled version of MySQL. It can serve as a good example to find what those other attributes could be.

You have to admit that the color imagery is telling. Pre-cloud applications are “brown-field” apps and cloud apps are “green.”

I think the survey numbers about migrating to the cloud are fairly soft and not always consistent. There will be “green” and “brown” field apps created or migrated to the cloud.

But brown field apps will remain, just as relational databases did not displace all the non-relational databases, which persist to this day.

Technology is as often “in addition to” as it is “in place of.”

June 13, 2012

Azure Changes Dress Code, Allows Tuxedos

Filed under: Cloud Computing,Linux OS,Marketing — Patrick Durusau @ 4:12 am

Azure Changes Dress Code, Allows Tuxedos by Robert Gelber.

Had it on my list to mention that Azure is now supporting Linux. Robert summarizes as follows:

Microsoft has released previews of upcoming services on their Azure cloud platform. The company seems focused on simplifying the transition of in-house resources to hybrid or external cloud deployments. Most notable is the ability for end users to create virtual machines with Linux images. The announcement will be live streamed later today at 1 p.m. PST.

Azure’s infrastructure will support CentOS 6.2, OpenSUSE 12.1, SUSE Linux Enterprise Server SP2 and Ubuntu 12.04 VM images. Microsoft has already updated their Azure site to reflect the compatibility. Other VM features include:

  • Virtual Hard Disks – Allowing end users to migrate data between on-site and cloud premises.
  • Workload Migration – Moving SQL Server, Sharepoint, Windows Server or Linux images to cloud services.
  • Common Virtualization Format – Microsoft has made the VHD file format freely available under an open specification promise.

Cloud offerings are changing, perhaps evolving would be a better word, at a rapid pace.

Although standardization may be premature, it is certainly a good time to start gathering information on services and vendors in a way that cuts across the verbal jungle that is cloud computing PR.

Topic maps anyone?

May 9, 2012

Converged Cloud Growth…[Ally or Fan Fears on Interoperability]

Filed under: Cloud Computing,Interoperability — Patrick Durusau @ 2:58 pm

Demand For Standards—Interoperability To Fuel Converged Cloud Growth

Terminology is often a mess in stable CS areas, to say nothing of a rapidly developing one such as cloud computing.

Add to that all the marketing hype that creates even more confusion.

Thinking there should be opportunities for standardizing terminology and mappings to vendor terminology in the process.

Topic maps would be a natural for the task.

Interested?
