Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 12, 2011

SDSC’s New Storage Cloud: ‘Flickr for Scientific Data’

Filed under: Cloud Computing — Patrick Durusau @ 4:39 pm

SDSC’s New Storage Cloud: ‘Flickr for Scientific Data’ by Michael Feldman.

From the post:

Last month, the San Diego Supercomputer Center launched what it believes is “the largest academic-based cloud storage system in the U.S.” The infrastructure is designed to serve the country’s research community and will be available to scientists and engineers from essentially any government agency that needs to archive and share super-sized data sets.

Certainly the need for such a service exists. The modern practice of science is a community activity and the way researchers collaborate is by sharing their data. Before the emergence of cloud, the main way to accomplish that was via emails and sending manuscripts back and forth over the internet. But with the coalescence of some old and new technologies, there are now economically viable ways for sharing really large amounts of data with colleagues.

In the press release describing the storage cloud, SDSC director Michael Norman described it thusly: “We believe that the SDSC Cloud may well revolutionize how data is preserved and shared among researchers, especially massive datasets that are becoming more prevalent in this new era of data-intensive research and computing.” Or as he told us more succinctly, “I think of it as Flickr for scientific data.”

The article ends with:

Whether the center’s roll-your-own cloud will be able to compete against commercial clouds on a long-term basis remains to be seen. One of the reasons a relatively small organization like SDSC can even build such a beast today is thanks in large part to the availability of cheap commodity hardware and the native expertise at the center to build high-end storage systems from parts.

There is also OpenStack — an open-source cloud OS that the SDSC is using as the basis of their offering. Besides being essentially free for the taking, the non-proprietary nature of OpenStack also means the center will not be locked into any particular software or hardware vendors down the road.

“With OpenStack going open source, it’s now possible for anybody to set up a little cloud business,” explains Norman “We’re just doing it in an academic environment.”

From a long-term need/employment perspective, having lots of “little cloud” businesses, each with its own semantics, isn’t a bad thing.

It does make me wonder whether having more semantics accessible increases semantic resistance (not the Newcomb word; dissonance, perhaps?) or simply makes it more evident. That is, perhaps the overall level of semantic dissonance is the same, but the WWW and its kin have increased the rate at which we encounter it.

Different companies have always had different database semantics, for example, but the only way to encounter them was to change jobs or to merge the companies, both of which were one-on-one events. Now it is easy to encounter multiple database semantics in a single day or hour.

October 10, 2011

Bio4jExplorer

Filed under: Bio4j,Bioinformatics,Biomedical,Cloud Computing,Graphs — Patrick Durusau @ 6:17 pm

Bio4jExplorer: familiarize yourself with Bio4j nodes and relationships

From the post:

I just uploaded a new tool aimed to be used both as a reference manual and an initial contact with the Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

  • Navigate through all nodes and relationships
  • Access the javadocs of any node or relationship
  • Graphically explore the neighbourhood of a node/relationship
  • Look up the different indexes that may serve as an entry point for a node
  • Check incoming/outgoing relationships of a specific node
  • Check start/end nodes of a specific relationship

And take note:

For those interested in how this was done: on the server side I created an AWS SimpleDB database holding all the information about the model of Bio4j, i.e. everything regarding nodes, relationships, indexes… (here you can check the program used for creating this database using the Java AWS SDK)

Meanwhile, on the client side I used the Flare prefuse AS3 library for the graph visualization.
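
To make the setup concrete, here is a minimal sketch of the same idea in Python, using the boto library rather than the Java SDK the author used (the domain name, item names and attributes are hypothetical):

  import boto

  # Connect to SimpleDB; boto reads AWS credentials from the environment.
  sdb = boto.connect_sdb()

  # Hypothetical domain holding the Bio4j model: one item per node or
  # relationship type, with attributes describing it.
  domain = sdb.create_domain("bio4j_model")
  domain.put_attributes("ProteinNode", {
      "kind": "node",
      "javadoc": "http://example.org/bio4j/api/ProteinNode.html",
  })

  # List every relationship type stored in the model.
  for item in domain.select("select * from `bio4j_model` where kind = 'relationship'"):
      print(item.name, dict(item))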

When people are this productive, and a benefit to the community besides, I am deeply envious but glad for them (and the rest of us) at the same time. I simply must work harder. 😉

October 9, 2011

Apache Software Foundation Public Mail Archives

Filed under: Cloud Computing,Dataset — Patrick Durusau @ 6:40 pm

Apache Software Foundation Public Mail Archives

From the webpage:

Submitted By: Grant Ingersoll
US Snapshot ID (Linux/Unix): snap-17f7f476
Size: 200 GB
License: Public Domain (See http://apache.org/foundation/public-archives.html)
Source: The Apache Software Foundation (http://www.apache.org)
Created On: August 15, 2011 10:00 PM GMT
Last Updated: August 15, 2011 10:00 PM GMT

A collection of all publicly available mail archives from the Apache Software Foundation (ASF), taken on July 11, 2011.

This collection contains all publicly available email archives from the ASF’s 80+ projects (http://mail-archives.apache.org/mod_mbox/), including mailing lists such as Apache HTTPD Server, Apache Tomcat, Apache Lucene and Solr, Apache Hadoop and many more.

Generally speaking, most projects have at least three lists: user, dev and commits, but some have more, some have less. The user lists are where users of the software ask questions on usage, while the dev list usually contains discussions on the development of the project (code, releases, etc.)

The commit lists usually consist of automated notifications sent by the various ASF version control tools, like Subversion or CVS, and contain information about changes made to the project’s source code.

Both tarballs and per project sets are available in the snapshot. The tarballs are organized according to project name. Thus, a-d.tar.gz contains all ASF projects that begin with the letters a, b, c or d, such as abdera.apache.org. Files within the project are usually gzipped mbox files. (I split the first paragraph into several paragraphs for readability reasons.)

Rather meager documentation for a 200 GB data set, don’t you think? A first step would be to create basic documentation: what projects are present, their mailing lists, and some basic statistical counts to serve as reference points.
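
For example, counting messages per project might start from something like the following sketch (the mount point and file layout are my assumptions about the snapshot, not documented facts):

  import glob
  import gzip
  import os
  from collections import Counter

  counts = Counter()

  # Assumed layout: per-project directories of gzipped mbox files under
  # the mounted snapshot volume.
  for path in glob.glob("/mnt/asf-archives/*/*.mbox.gz"):
      project = os.path.basename(os.path.dirname(path))
      with gzip.open(path, "rb") as f:
          # Each message in an mbox file begins with a "From " separator line.
          counts[project] += sum(1 for line in f if line.startswith(b"From "))

  for project, total in counts.most_common():
      print(f"{project}\t{total}")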

You have been waiting for a motivation to “get into” cloud computing, well, now you have the motivation and an interesting dataset!

September 28, 2011

Getting Started with Amazon Elastic MapReduce

Filed under: Cloud Computing,MapReduce — Patrick Durusau @ 7:35 pm

Getting Started with Amazon Elastic MapReduce

I happened across this video yesterday. I recommend that you watch it at least two or three times, if not more.

Not that any of you need to learn how to run a Python word count script with Amazon Elastic MapReduce.

Rather, this is a very effective presentation that does not get sidetracked by rat holes and edge cases. It has an agenda and doesn’t deviate from it.

A lot of lessons can be learned from this video for presentations at conferences or even to clients.
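
As an aside, if you have never seen one, a streaming-style word count for Hadoop/EMR really is only a few lines of Python. A minimal sketch (not the exact script from the video):

  #!/usr/bin/env python
  # Run with "map" or "reduce" as the only argument. Hadoop streaming
  # sorts mapper output by key before the reducer sees it, which is
  # what the reducer below relies on.
  import sys

  def mapper():
      for line in sys.stdin:
          for word in line.split():
              print(f"{word.lower()}\t1")

  def reducer():
      current, total = None, 0
      for line in sys.stdin:
          word, _, count = line.rstrip("\n").rpartition("\t")
          if word != current and current is not None:
              print(f"{current}\t{total}")
              total = 0
          current = word
          total += int(count)
      if current is not None:
          print(f"{current}\t{total}")

  if __name__ == "__main__":
      mapper() if sys.argv[1:] == ["map"] else reducer()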

Dimensions to use to compare NoSQL data stores

Filed under: Cloud Computing,NoSQL — Patrick Durusau @ 7:35 pm

Dimensions to use to compare NoSQL data stores by Huan Liu.

From the post:

You have decided to use a NoSQL data store in favor of a DBMS store, possibly due to scaling reasons. But, there are so many NoSQL stores out there, which one should you choose? Part of the NoSQL movement is the acknowledgment that there are tradeoffs, and the various NoSQL projects have pursued different tradeoff points in the design space. Understanding the tradeoffs they have made, and figuring out which one fits your application better is a major undertaking.

Obviously, choosing the right data store is a much bigger topic, which is not something that can be covered in a single blog. There are also many resources comparing the various NoSQL data stores, e.g., here, so that there is no point repeating them. Instead, in this post, I will highlight the dimensions you should use when you compare the various data stores.

Useful information to have on hand when discussing NoSQL data stores.

September 23, 2011

ParLearning 2012 (silos or maps?)

ParLearning 2012: Workshop on Parallel and Distributed Computing for Machine Learning and Inference Problems

Dates:

When: May 25, 2012
Where: Shanghai, China
Submission Deadline: Dec 19, 2011
Notification Due: Feb 1, 2012
Final Version Due: Feb 21, 2012

From the notice:

HIGHLIGHTS

  • Foster collaboration between HPC community and AI community
  • Applying HPC techniques for learning problems
  • Identifying HPC challenges from learning and inference
  • Explore a critical emerging area with strong industry interest without overlapping with existing IPDPS workshops
  • Great opportunity for researchers worldwide for collaborating with Chinese Academia and Industry

CALL FOR PAPERS

Authors are invited to submit manuscripts of original unpublished research that demonstrate a strong interplay between parallel/distributed computing techniques and learning/inference applications, such as algorithm design and libraries/framework development on multicore/ manycore architectures, GPUs, clusters, supercomputers, cloud computing platforms that target applications including but not limited to:

  • Learning and inference using large scale Bayesian Networks
  • Large scale inference algorithms using parallel topic models, clustering, SVM, etc.
  • Parallel natural language processing (NLP).
  • Semantic inference for disambiguation of content on web or social media
  • Discovering and searching for patterns in audio or video content
  • On-line analytics for streaming text and multimedia content
  • Comparison of various HPC infrastructures for learning
  • Large scale learning applications in search engine and social networks
  • Distributed machine learning tools (e.g., Mahout and IBM parallel tool)
  • Real-time solutions for learning algorithms on parallel platforms

If you are wondering what role topic maps have to play in this arena, ask yourself the following question:

Will the systems and techniques demonstrated at this conference use the same means to identify the same subjects?*

If your answer is no, what would you suggest is the solution for mapping different identifications of the same subjects together?

My answer to that question is to use topic maps.

*Whatever you ascribe as its origin, semantic diversity is part and parcel of the human condition. We can either develop silos or maps across silos. Which do you prefer?

September 20, 2011

Running Mahout in the Cloud using Apache Whirr

Filed under: Cloud Computing,Hadoop,Mahout — Patrick Durusau @ 7:51 pm

Running Mahout in the Cloud using Apache Whirr

From the post:

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, Hbase, Zookeeper and so on. I will show you how to setup a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

Running Mahout in the cloud with Apache Whirr will prepare you for using Whirr or similar tools to run services in the cloud. The recipe sketch below gives a feel for how little configuration is involved.
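
A Whirr cluster recipe is just a small properties file, roughly along these lines (values are illustrative and role names vary between Whirr releases, so check the 0.4 documentation before copying):

  # mahout-cluster.properties -- illustrative Whirr recipe
  whirr.cluster-name=mahout-demo
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

You then launch with bin/whirr launch-cluster --config mahout-cluster.properties and tear down with destroy-cluster when finished.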

September 15, 2011

Enterprise-level Cloud at no charge

Filed under: Cloud Computing,Hadoop — Patrick Durusau @ 7:49 pm

Enterprise-level cloud at no charge

From September 12 to November 11, 2011.

Signup deadline: 28 October 2011

From the webpage:

  • 64-bit Copper and 32-bit Silver machines
  • Virtual machines to run Linux® (Red Hat or Novell SUSE) or Microsoft® Windows® Server 2003/2008
  • Select IBM software images
  • 1 block (256 gigabytes) of persistent storage

For the promotional period, IBM will suppress charges for use of these services. You may terminate the promotion at any time, although we don’t think you’ll want to! At the end of the promotional period, your account will transition to a standard pay-as-you-go account at the rates effective at that time. You may elect to add on more services, including, but not limited to:

  • Reserved virtual machine instances
  • On-boarding support
  • Premium and Advanced Premium support options
  • Virtual Private Network services
  • Additional images from IBM software brands, along with offerings from independent software vendors
  • Access to other IBM SmartCloud data centers
  • Additional services that are regularly being added to the IBM SmartCloud Enterprise offering

With these features and more, don’t miss this opportunity to try the IBM SmartCloud. With our enterprise-level servers, software and services, we offer a cloud computing infrastructure that you can approach with confidence. The IBM SmartCloud is built on the skills, experience and best practices gained from years of managing and operating security-rich data centers for enterprises and public institutions around the world.

If you want to try the cloud computing waters or IBM offerings, this could be your chance.

September 7, 2011

An Open Source Platform for Virtual Supercomputing

Filed under: Cloud Computing,Erlang,GPU,Supercomputing — Patrick Durusau @ 6:55 pm

An Open Source Platform for Virtual Supercomputing, Michael Feldman reports:

Erlang Solutions and Massive Solutions will soon launch a new cloud platform for high performance computing. Last month they announced their intent to bring a virtual supercomputer (VSC) product to market, the idea being to enable customers to share their HPC resources either externally or internally, in a cloud-like manner, all under the banner of open source software.

The platform will be based on Clustrx and Xpandrx, two HPC software operating systems that were the result of several years of work done by Erlang Solutions, based in the UK, and Massive Solutions, based in Gibraltar. Massive Solutions has been the driving force behind the development of these two OS’s, using Erlang language technology developed by its partner.

In a nutshell, Clustrx is an HPC operating system, or more accurately, middleware, which sits atop Linux, providing the management and monitoring functions for supercomputer clusters. It is run on its own small server farm of one or more nodes, which are connected to the compute servers that make up the HPC cluster. The separation between management and compute enables it to support all the major Linux distros as well as Windows HPC Server. There is a distinct Clustrx-based version of Linux for the compute side as well, called Compute Based Linux.

A couple of things to note from within the article:

The only limitation to this model is its dependency on the underlying capabilities of Linux. For example, although Xpandrx is GPU-aware, since GPU virtualization is not yet supported in any Linux distros, the VSC platform can’t support virtualization of those resources. More exotic HPC hardware technology would, likewise, be out of the virtual loop.

The common denominator for VSC is Erlang: not just the company, but the language (http://www.erlang.org/), which is designed for programming massively scalable systems. The Erlang runtime has built-in support for things like concurrency, distribution and fault tolerance. As such, it is particularly suitable for HPC system software and large-scale interprocess communication, which is why both Clustrx and Xpandrx are implemented in the language.

As computing power and access to computing power increases, have you seen an increase in robust (in your view) topic map applications?

SolrCloud

Filed under: Cloud Computing,Solr — Patrick Durusau @ 6:52 pm

SolrCloud

From the webpage:

SolrCloud is the set of Solr features that take Solr’s distributed search to the next level, enabling and simplifying the creation and use of Solr clusters.

  • Central configuration for the entire cluster
  • Automatic load balancing and fail-over for queries

Zookeeper is integrated and used to coordinate and store the configuration of the cluster.

Under the Developer-TODO section I noticed:

optionally allow user to query by multiple collections (assume schemas are compatible)

I assume it would have to be declarative, but shouldn’t there be re-write functions that cause different schemas to be seen as “compatible”? Something along the lines of the sketch below.
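
A toy version in Python of what I have in mind (entirely hypothetical; nothing like this exists in SolrCloud):

  # Declarative map from each collection's field names to a shared
  # target schema: the "compatibility" is stated rather than assumed.
  FIELD_MAP = {
      "articles": {"author": "author", "pub_date": "date"},
      "reports": {"creator": "author", "issued": "date"},
  }

  def normalize(doc, collection):
      """Rewrite one result document into the shared schema."""
      mapping = FIELD_MAP[collection]
      return {target: doc[src] for src, target in mapping.items() if src in doc}

  # normalize({"creator": "Durusau", "issued": "2011"}, "reports")
  # -> {"author": "Durusau", "date": "2011"}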

From a topic map perspective I would like to see the “why” of such a mapping but having the capacity for the mapping is a step in the right direction.

Oh, and the ability to use the mapping or not and perhaps to choose between mappings.

Mappings, even between ontologies are in someone’s view. Pick one, topic maps will be waiting.

August 25, 2011

Micro Cloud Foundry

Filed under: Cloud Computing,Software — Patrick Durusau @ 6:59 pm

Micro Cloud Foundry

Described in NoSQL Weekly as:

VMware has issued a free version of its Cloud Foundry Platform-as-a-Service (PaaS) stack that can run on a single laptop or desktop computer. The idea behind this package, called Micro Cloud Foundry, is to give developers an easy way to build out Cloud Foundry applications and test them before moving them to an actual Cloud Foundry service. The package includes all the components in the full-fledged Cloud Foundry stack, including the Spring framework for Java, Ruby on Rails, the Sinatra Ruby framework, the JavaScript Node.js library, the Grails framework, and the MongoDB, MySQL, and Redis data stores.

Are you building topic map applications for Cloud Foundry services? I am interested in your comments and experiences.

August 19, 2011

High Performance Computing (HPC)

Filed under: Cloud Computing — Patrick Durusau @ 8:31 pm

High Performance Computing (HPC) over at Amazon Web Services.

From the website:

Researchers and businesses alike have complex computational workloads such as tightly coupled parallel processes or demanding network-bound applications, from genome sequencing to financial modeling. Regardless of the application, one major issue affects them both: procuring and provisioning machines. In typical cluster environments, there is a long queue to access machines, and purchasing dedicated, purpose-built hardware takes time and considerable upfront investment.

With Amazon Web Services, businesses and researchers can easily fulfill their high performance computational requirements with the added benefit of ad-hoc provisioning and pay-as-you-go pricing.

I have a pretty full Fall but want to investigate AWS for topic map experiments and possibly even delivery of content.

Yes, AWS has crashed; on that, see Why the AWS Crash is a Good Thing by Chris Hawkins.

Anyone using it presently? War stories you want to share?

July 19, 2011

Excel DataScope

Filed under: Algorithms,Cloud Computing,Excel Datascope,Hadoop — Patrick Durusau @ 7:51 pm

Excel DataScope

From the webpage:

From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.

And:

Excel DataScope is a technology ramp between Excel on the user’s client machine, the resources that are available in the cloud, and a new class of analytics algorithms that are being implemented in the cloud. An Excel user can simply select an analytics algorithm from the Excel DataScope Research Ribbon without concern for how to move their data to the cloud, how to start up virtual machines in the cloud, or how to scale out the execution of their selected algorithm in the cloud. They simply focus on exploring their data by using a familiar client application.

Excel DataScope is an ongoing research and development project. We envision a future in which a model developer can publish their latest data analysis algorithm or machine learning model to the cloud and within minutes Excel users around the world can discover it within their Excel Research Ribbon and begin using it to explore their data collection. (emphasis added)

I added emphasis to the last sentence because that is the sort of convenience that will make cloud computing and collaboration meaningful.

Imagine that sort of sharing across MS and non-MS cloud resources. Well, you would have to have an Excel DataScope interface on non-MS cloud resources, but one hopes that will be a product offering in the near future.

July 16, 2011

TempleScript cloud control

Filed under: Cloud Computing,Topic Map Software — Patrick Durusau @ 5:40 pm

TempleScript cloud control

Robert Barta’s efforts at weather control. No, wait, that’s not right, must mean that other “cloud.” 😉

July 6, 2011

Building a Database-Backed Clojure Web Application

Filed under: Clojure,Cloud Computing — Patrick Durusau @ 2:15 pm

Building a Database-Backed Clojure Web Application

From the webpage:

This article will explore creating a database-backed Clojure web application and deploying it to the Heroku Cedar stack.

The app we’ll be building is called Shouter, a small Twitter clone that lets users enter in “shouts” which are stored in a PostgreSQL database and displayed on the front page of the app. You can see an example of the finished Shouter deployed to Heroku or view the finished source.

See Heroku to sign up for its cloud application platform.

I started to tease the DevCenter about the article Building a Facebook Application since Google is attempting to do the same thing. 😉

Then I found that the article covers, however briefly, the Graph API and Open Graph Protocol, which makes it of more than passing interest for topic map applications.

April 25, 2011

Inside Horizon: interactive analysis at cloud scale

Filed under: Cloud Computing,Data Analysis,Data Mining — Patrick Durusau @ 3:36 pm

Inside Horizon: interactive analysis at cloud scale

From the website:

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as a research project / technology demonstrator built as part of Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center For Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering, under the name Object Explorer.

In this hour-long talk, two of the engineers that built this technology tell the story of how Horizon came to be, how it works, and show a live demo of doing analysis on hundreds of millions of records in interactive time.

From the presentation:

Mission statement: Organize the world’s information and make it universally accessible and useful. -> Google’s statement

Which should say:

Organize the world’s [public] information and make it universally accessible and useful.

Palantir’s misson:

Organize the world’s [private] information and make it universally accessible and useful.

Closes on human-driven analysis.

A couple of points:

The demo was of a pre-beta version even though the product version shipped several months prior to the presentation. What’s with that?

Long on general statements and short on any specifics.

Did mention this is a column-store solution. Appears to work well with very clean data, but then what solution doesn’t?

Good emphasis on user interface and interactive responses to queries.

I wonder if the emphasis on interactive responses creates unrealistic expectations among customers?

Or an emphasis on problems that can be solved or appear to be solvable, interactively?

Consider, for example, my comments the other day about intelligence community bias: you can measure and visualize tweets that originate in Tahrir Square, but if they are mostly from Western media, how meaningful is that?

March 4, 2011

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011, Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

February 18, 2011

Building a free, massively scalable cloud computing platform

Filed under: Algorithms,Cloud Computing,Topic Maps — Patrick Durusau @ 4:59 am

Building a free, massively scalable cloud computing platform

Soren Hansen at FOSDEM 2011:

A developer’s look into Openstack architecture

OpenStack is a very new, very popular cloud computing project, backed by Rackspace and NASA. It’s all free software (Apache Licensed) and is in production use already.

It’s written entirely in Python and uses Twisted, Eventlet, AMQP, SQLAlchemy, WSGI and many other high quality libraries and standards.

We’ll take a detailed look at the architecture and dive into some of the challenges we’ve faced building a platform that is supposed to handle millions of gigabytes of data and millions of virtual machines, and how we’ve dealt with them.

Video of the presentation

This presentation will be of interest to those who think the answer to topic map processing is to turn up the power dial.

That is one answer and it may be the best one under some circumstances.

That said, given the availability of cloud computing resources, the question becomes one of rolling your own or simply buying the necessary cycles.

Unless you are just trying to throw money at your IT department (that happens even in standards organizations), I suspect buying the cycles is the most cost-effective option.

“Free” software really isn’t free, particularly server-class software, unless you have an all-volunteer staff of administrators, donated hardware, backup facilities, etc.

Note that NASA, one of the sponsors of this project, can whistle up, if not the language’s authors, then major contributors to any software package of interest to it. Can your organization say the same?

Just to be fair, my suspicions are in favor of a mix of algorithm development, innovative data structures and high-end computing structures. (Yes, I know, I cheated and made three choices. Author’s choice.)

