Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 9, 2012

GATE Teamware: Collaborative Annotation Factories (HOT!)

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but… I have been a sysadmin before, maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but it certainly is worth a “look see.”

Let me know if you take the plunge!

May 7, 2012

Startups are Creating a New System of the World for IT

Filed under: Cloud Computing — Patrick Durusau @ 7:19 pm

Startups are Creating a New System of the World for IT

Todd Hoff writes:

It remains that, from the same principles, I now demonstrate the frame of the System of the World. — Isaac Newton

The practice of IT reminds me a lot of the practice of science before Isaac Newton. Aristotelianism was dead, but there was nothing to replace it. Then Newton came along, created a scientific revolution with his System of the World. And everything changed. That was New System of the World number one.

New System of the World number two was written about by the incomparable Neal Stephenson in his incredible Baroque Cycle series. It explores the singular creation of a new way of organizing society grounded in new modes of thought in business, religion, politics, and science. Our modern world emerged Enlightened as it could from this roiling cauldron of forces.

In IT we may have had a Leonardo da Vinci or even a Galileo, but we’ve never had our Newton. Maybe we don’t need a towering genius to make everything clear? For years startups, like the frenetically inventive age of the 17th and 18th centuries, have been creating a New System of the World for IT from a mix of ideas that many thought crazy at first, but have turned out to be the founding principles underlying our modern world of IT.

If you haven’t guessed it yet, I’m going to make the case that the New System of the World for IT is that much over hyped word: cloud. I hope to show, using many real examples from real startups, that the cloud is built on a powerful system of ideas and technologies that make it a superior model for delivering IT.

Interesting piece but Todd misses a couple of critical points:

First, Newton was wrong. (full stop) True, his imagining of the world was sufficient and over-determining for centuries, but it wasn’t true. It took until the 20th century for his hegemony to be overturned, but it was.

Newtonian mechanics is still taught, but for how much longer? Our understanding of quantum systems grows, and our designs move closer and closer to realms unimagined by Newton.

Second, every effort you find at Sourceforge or Freshmeat or similar locales is the project of someone, or a small group of someones, all utterly convinced that their project has some unique insight that isn’t contained in the other N projects of the same type. That may well be true, at least for some of them.

But the point remains that the “cloud” enables that fracturing of IT services to a degree not seen up until now. Well, at least not for a long time.

I remember there being 300 or so formats that conversion software offered to handle. How many exist in the cloud today? How many do you think there will be a year from now? (Or perhaps better, how many clouds do you think there will be a year from now?)

With or without the cloud, greater data access is going to drive the need for an understanding and modeling of the subject identities that underlie data and its structures. Brave new world or no.

Enjoy your Newtonian (or is that Napoleonic?) dreams.

May 4, 2012

Snooze

Filed under: Cloud Computing,Snooze — Patrick Durusau @ 3:41 pm

Snooze

From the website:

Snooze is an open-source scalable, autonomic, and energy-efficient virtual machine (VM) management framework for private clouds. It allows users to build compute infrastructures from virtualized resources and manage a large number of VMs. Snooze is one of the core results of Eugen Feller’s PhD thesis under the supervision of Dr. Christine Morin at the INRIA MYRIADS project-team. The prototype is now used within the MYRIADS project-team in various cloud computing research activities.

For scalability Snooze employs a self-organizing hierarchical architecture and performs distributed VM management. Particularly, VM management is achieved by multiple managers, with each manager being in charge of a subset of nodes. In addition, fault tolerance is provided at all levels of the hierarchy. Finally, VM monitoring and live migration is integrated into the framework and a generic VM scheduler exists to facilitate the development of advanced VM placement algorithms. Last but not least once idle, servers are automatically transitioned into the system administrator specified power-state (e.g. suspend) to save energy. To facilitate idle times Snooze integrates dynamic VM relocation and consolidation.

Just in case you need to build a private cloud for your topic map or want to work on application of topic maps to a cloud and its components.

PS: Do note the additional subject identified by the string “snooze.”

May 1, 2012

AWS NYC Summit 2012

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 4:46 pm

AWS NYC Summit 2012

The line that led me to this read:

We posted 25 presentations from the New York 2012 AWS Summit.

Actually, no.

Posted 25 slide decks, not presentations.

Useful, yes; presentations, no.

Not to complain too much, given the rapid expansion of services and technical guidance, but let’s not confuse slides with presentations.

The AWS Report (Episode 2) has one major improvement: the clouds in the background don’t move! (As they did in the first episode. Though this time there was a shadow that moved over the front of the desk.)

We need to ask Amazon to get Jeff a new laptop without all the stickers on the top. If Paula Abdul or Vanna White were doing the interview, the laptop stickers would not be distracting. Or at least not enough to complain about. Jeff isn’t Paula Abdul or Vanna White. Sorry Jeff.

I think the AWS Report has real potential. Several short segments with more “facts” and fewer “general” statements would be great.

Enjoyed the Elastic Beanstalk episode but hearing customers are busy, happy and requirements were gathered for other language support (besides Java) is like hearing public service announcements on PBS.

Nothing to disagree with but no real content either.

Suggestion: Perhaps a short, say 90 to 120 second, description of a typical issue (off a mailing list?) that ends with “What is your solution?”, then feature one or more solutions on the next show. That would get the audience involved and get other people hawking the show.

Not quite the cover of the Rolling Stone but perhaps someday… 😉

April 26, 2012

Graphs in the Cloud: Neo4j and Heroku

Filed under: Cloud Computing,Heroku,Neo4j — Patrick Durusau @ 6:30 pm

Graphs in the Cloud: Neo4j and Heroku

From the registration page:

Thursday May 10 10:00 PDT / 17:00 GMT

With more and more applications in the cloud, developers are looking for a fast solution to deploy their applications. This webinar is intended for developers that are interested in the value of launching your application in the cloud, and the power of using a graph database.

In this session, you will learn:

  • how to build Java applications that connect to the Neo4j graph database.
  • how to instantly deploy and scale those applications on the cloud with Heroku.

Speaker: James Ward, Heroku Developer Evangelist

Sounds interesting. Not as much fun as being in Amsterdam but not every day can be like that! Besides, this way you may remember some of the presentation. 😉

April 24, 2012

Software Review- BigML.com – Machine Learning meets the Cloud

Filed under: Cloud Computing,Machine Learning — Patrick Durusau @ 7:13 pm

Software Review- BigML.com – Machine Learning meets the Cloud.

Ajay Ohri reviews BigML.com, an attempt to lower the learning curve for working with machine learning and large data sets.

Ajay concludes:

Overall a welcome addition to make software in the realm of cloud computing and statistical computation/business analytics both easy to use and easy to deploy with fail safe mechanisms built in.

Check out https://bigml.com/ for yourself to see.

I have to agree they are off to a good start.

Applications that lower the learning curve look like “hot” properties for the near future. Some loss of flexibility, but offset by immediate and possibly useful results. Maybe the push some users need to become real experts.
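If you would rather try it from code than the web interface, here is a minimal sketch of the same workflow using BigML’s Python bindings. The data file and input field are placeholders, and the exact calls may have shifted since this was written.

```python
# A minimal sketch of the upload -> dataset -> model -> prediction workflow,
# using BigML's Python bindings. File name and input field are placeholders.
from bigml.api import BigML

api = BigML()  # reads BIGML_USERNAME / BIGML_API_KEY from the environment

source = api.create_source("iris.csv")   # upload a CSV (placeholder file)
api.ok(source)                           # wait until BigML has processed it
dataset = api.create_dataset(source)
api.ok(dataset)
model = api.create_model(dataset)        # the "one-click" decision tree
api.ok(model)

prediction = api.create_prediction(model, {"petal length": 4.2})
print(prediction)                        # full resource; the predicted class sits under object/output
```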

April 21, 2012

Counterpoint: Why We Should Not Use the Cloud

Filed under: Cloud Computing — Patrick Durusau @ 4:36 pm

Counterpoint: Why We Should Not Use the Cloud by Andrea Di Maio.

Andrea writes:

The IT world has embraced the concept of cloud computing. Vendors, users, consultants, analysts, we all try to figure out how to leverage the increasing commoditization of IT from both an enterprise and a personal perspective.

Discussions on COTS have turned into discussions on SaaS. People running their own data center claim they run (or are developing) a private cloud. Shared service providers rebrand their services as community cloud. IT professionals in user enterprises dream to move up the value chain by leaving the boring I&O stuff to vendors and developing more vertical business analysis and demand management skills. What used to be called outsourcing is now named cloud sourcing, while selective sourcing morphs into hybrid clouds or cloud brokerage. Also personally, we look at our USB stick or disk drive with disdain, waiting for endless, ultracheap personal clouds to host all of our emails, pictures, music.

It looks like none of us is truly reflecting about whether this is good or bad. Of course, many are moving cautiously, they understand they are not ready for prime time for all sorts of security, confidentiality, maturity reasons. However it always looks like they have to justify themselves. “Cloud first”, some say, and you’ll have to tell us why you are not planning to go cloud. So those who want to hold to their own infrastructure (without painting it as a “private cloud”) or want to keep using traditional delivery models from their vendors (such as hosting or colocation) almost feel like children of a lesser God when compared to all those bright and lucky IT executives who can venture into the cloud (and – when moving early enough – still get an interview on a newspaper or a magazine).

Let me be clear. I am intimately convinced that the move to cloud computing is inevitable and necessary, even if it may happen more slowly than many believe or hope for. However I would like to voice some concerns that may give good reasons not to move. There are probably many others, but it is important to ask ourselves – both as users and providers – tougher questions to make sure we have convincing answers as we approach or dive into the cloud.

That’s like saying your firm doesn’t have “big data.” 😉

The biggest caution is one that Andrea misses.

That is thinking that moving to the “cloud” is going to save on IT expenses.

A commonly repeated mantra in Washington by McNamara types. If you don’t remember the “cost saving” reforms in the military in the early 1960’s, now would be a good time to brush up on your history. An elaborate scheme was created to determine equipment requirements based on usage.

So if you were in a warm climate most of the year, you did not need snowplows, for example. Except that if you are an airfield and it does snow, oops, you need a snowplow that day and little else will do.

At a sufficient distance, the plans seemed reasonable. Particularly with people who did not understand the subject under discussion. Like cost saving consolidations in IT now under way in Washington.

April 12, 2012

The CloudFormation Circle of Life : Part 1

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 7:04 pm

The CloudFormation Circle of Life : Part 1

From the post:

AWS CloudFormation makes it easier for you to create, update, and manage your AWS resources in a predictable way. Today, we are announcing a new feature for AWS CloudFormation that allows you to add or remove resources from your running stack, enabling your stack to evolve as its requirements change over time. With AWS CloudFormation, you can now manage the complete lifecycle of the AWS resources powering your application.

I think there is a name for this sort of thing. Innovation, that’s right! That’s the name for it!

As topic map services move into the clouds, being able to take advantage of resource stacks is likely to be important. Particularly if you have mapping empowered resources that can be placed in a stack of resources.

The “cloud” in general looks like an opportunity to move away from ETL (Extract-Transform-Load) into more of an ET (Extract-Transform) model. Particularly if you take a functional view of data. That will save on storage costs, particularly if the data sets are quite large.

Definitely a service that anyone working with topic maps in the cloud needs to know more about.
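If you want to experiment, here is a minimal sketch of an in-place stack update, using today’s boto3 rather than the boto of the time. The stack name, template file, and capability flag are placeholders, not values from the post.

```python
# A minimal sketch of evolving a running CloudFormation stack with boto3.
# Stack name and template file are placeholders.
import boto3

cfn = boto3.client("cloudformation")

with open("template-with-new-queue.json") as f:
    template_body = f.read()

# update_stack applies the revised template to the existing stack;
# CloudFormation works out which resources to add, modify, or remove.
cfn.update_stack(
    StackName="my-app",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # only needed if the template touches IAM
)

# Block until the update settles.
cfn.get_waiter("stack_update_complete").wait(StackName="my-app")
print("stack updated")
```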

April 11, 2012

Tukey: Integrated Cloud Services For Big Data

Filed under: BigData,Cloud Computing — Patrick Durusau @ 6:17 pm

Open Cloud Consortium Announces First Integrated Set of Cloud Services for Researchers Working with Big Data

I just could not bring myself to use the original title for the post.

BTW, I mis-read the service to be named: “Turkey.” Maybe that is how it is pronounced?

Service for researchers, described as:

Today, the Open Cloud Consortium (OCC) announced the availability of Tukey, which is an innovative integrated set of cloud services designed specifically to enable scientific researchers to manage, analyze and make discoveries with big data.

Several public cloud service providers provide resources for individual scientists and small research groups, and large research groups can build their own dedicated infrastructure for big data. However, currently, there is no cloud service provider that is focused on providing services to projects that must work with big data, but are not large enough to build their own dedicated clouds.

Tukey is the first set of integrated cloud services to fill this niche.

Tukey was developed by the Open Cloud Consortium, a not-for-profit multi-organizational partnership. Many scientific projects are more comfortable hosting their data with a not-for-profit organization than with a commercial cloud service provider.

Cloud Service Providers (CSP) that are focused on meeting the needs of the research community are beginning to be called Science Cloud Service Providers or Sci CSPs (pronounced psi-sip). Cloud Service Providers serving the scientific community must support the long term archiving of data, large data flows so that large datasets can be easily imported and exported, parallel processing frameworks for analyzing large datasets, and high end computing.

“The Open Cloud Consortium is one of the first examples of an innovative resource that is being called a Science Cloud Service Provider or Sci CSP,” says Robert Grossman, Director of the Open Cloud Consortium. “Tukey makes it easy for scientific research projects to manage, analyze and share big data, something this is quite difficult to do with the services from commercial Cloud Service Providers.”

The beta version of Tukey is being used by several research projects, including: the Matsu Project, which hosts over two years of data from NASA’s EO-1 satellite; Bionimbus, which is a system for managing, analyzing, and sharing large genomic datasets; and bookworm, which is an application that extracts patterns from large collections of books.

The services include: hosting large public scientific datasets; standard installations of the open source OpenStack and Eucalyptus systems, which provide instant on demand computing infrastructure; standard installations of the open source Hadoop system, which is the most popular platform for processing big data; standard installations of UDT, which is a protocol for transporting large datasets; and a variety of domain specific applications.

It isn’t clear to me what shortcomings of commercial cloud providers are being addressed.

Many researchers can’t build their own clouds, but with commercial cloud providers available, why would you want to?

Or take the claim:

“Tukey makes it easy for scientific research projects to manage, analyze and share big data, something this is quite difficult to do with the services from commercial Cloud Service Providers.”

How so? What prevents this with commercial cloud providers? Being on different clouds? But Tukey “corrects” this by requiring membership on its cloud. How is that any better?

Nothing against Tukey but I think being a non-profit isn’t enough of a justification for yet another cloud. What else makes it different from other clouds?

Clouds are important for topic maps as semantics will collide in clouds and making meaningful semantics rain from clouds is going to require topic maps or something quite similar.

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the cornerstone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in such texts.

It is only in our view of it that data looks smooth or simple. Or complex.
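If you want to poke at the dataset yourself, here is a minimal sketch using boto3. The bucket name is my assumption from the AWS public-datasets listing; check the 1000 Genomes page above for the current location.

```python
# A minimal sketch of browsing the public 1000 Genomes data from Python.
# The bucket name "1000genomes" is an assumption; verify it before relying on it.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access is enough for a public dataset.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```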

The Total Cost of (Non) Ownership of a NoSQL Database Service

Filed under: Amazon DynamoDB,Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 5:47 pm

The Total Cost of (Non) Ownership of a NoSQL Database Service

From the post:

We have received tremendous positive feedback from customers and partners since we launched Amazon DynamoDB two months ago. Amazon DynamoDB enables customers to offload the administrative burden of operating and scaling a highly available distributed database cluster while only paying for the actual system resources they consume. We also received a ton of great feedback about how simple it is to get started and how easy it is to scale the database. Since Amazon DynamoDB introduced the new concept of a provisioned throughput pricing model, we also received several questions around how to think about its Total Cost of Ownership (TCO).

We are very excited to publish our new TCO whitepaper: The Total Cost of (Non) Ownership of a NoSQL Database service. Download PDF.

I bet you can guess how the numbers work out without reading the PDF file. 😉

Makes me wonder, though, whether there would be a market for a different hosted NoSQL database or topic map application. Particularly a topic map application.

Not along the lines of Maiana but more of a topic-based data set, which could respond to data by merging it with already stored data. Say, for example, a firefighter scans the bar code on a railroad car lying alongside the tracks with fire getting closer. The only thing they want is a list of the necessary equipment and whether to leave now, or not.

Most preparedness agencies would be well pleased to simply pay for the usage they get of such a topic map.
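For the curious, here is a toy sketch of that lookup on DynamoDB with boto3. The table name, key, and attributes are hypothetical; only the API calls are real.

```python
# A toy sketch of the "scan a placard, get an answer" idea on DynamoDB.
# Table name, key, and attributes are hypothetical.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("hazmat-responses")  # hypothetical table

def response_for(un_number: str) -> dict:
    """Look up response guidance keyed by the UN number on the placard."""
    item = table.get_item(Key={"un_number": un_number}).get("Item", {})
    return {
        "equipment": item.get("equipment", []),
        "evacuate": item.get("evacuate", True),  # fail safe: leave if unknown
    }

print(response_for("1075"))  # e.g. the placard number for LPG
```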

March 31, 2012

Erlang as a Cloud Citizen

Filed under: Cloud Computing,Erlang — Patrick Durusau @ 4:08 pm

Erlang as a Cloud Citizen by Paolo Negri. (Erlang Factory San Francisco 2012)

From the description:

This talk wants to sum up the experience of designing, deploying and maintaining an Erlang application targeting the cloud and precisely AWS as hosting infrastructure.

As the application now serves a significantly large user base with a sustained throughput of thousands of game actions per second, we’re able to analyse retrospectively our engineering and architectural choices and see how Erlang fits in the cloud environment, also comparing it to previous experiences of cloud deployments of other platforms.

We’ll discuss properties of Erlang as a language and OTP as a framework and how we used them to design a system that is a good cloud citizen. We’ll also discuss topics that are still open for a solution.

Interesting, but you probably want to wait for the video. The slides are interesting, considering the argument for fractal-like engineering for scale, but there is not enough detail for them to be really useful on their own.

Still, responding to 0.25 billion uncacheable reqs/day is a performance number you should not ignore. Depends on your use case.

March 24, 2012

Two New AWS Getting Started Guides

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 7:36 pm

Two New AWS Getting Started Guides

From the post:

We’ve put together a pair of new Getting Started Guides for Linux and Microsoft Windows. Both guides will show you how to use EC2, Elastic Load Balancing, Auto Scaling, and CloudWatch to host a web application.

The Linux version of the guide (HTML, PDF) is built around the popular Drupal content management system. The Windows version (HTML, PDF) is built around the equally popular DotNetNuke CMS.

These guides are comprehensive. You will learn how to:

  • Sign up for the services
  • Install the command line tools
  • Find an AMI
  • Launch an Instance
  • Deploy your application
  • Connect to the Instance using the MindTerm SSH Client or PuTTY
  • Configure the Instance
  • Create a custom AMI
  • Create an Elastic Load Balancer
  • Update a Security Group
  • Configure and use Auto Scaling
  • Create a CloudWatch Alarm
  • Clean up

Other sections cover pricing, costs, and potential cost savings.

Not quite a transparent computing fabric, yet. 😉
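For comparison, the “launch an instance” step boils down to a few lines once you script it. A minimal sketch with boto3 (the AMI, key pair, and security group IDs are placeholders, not values from the guides):

```python
# A minimal sketch of the "find an AMI, launch an instance" steps with boto3.
# The AMI ID, key pair, and security group are placeholders.
import boto3

ec2 = boto3.resource("ec2")

instances = ec2.create_instances(
    ImageId="ami-00000000",            # placeholder: pick a real AMI for your region
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",              # placeholder key pair
    SecurityGroupIds=["sg-00000000"],  # placeholder security group
)

instance = instances[0]
instance.wait_until_running()
instance.reload()
print("running at", instance.public_dns_name)
```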

March 15, 2012

Linguamatics Puts Big Data Mining on the Cloud

Filed under: Cloud Computing,Data Mining,Medical Informatics — Patrick Durusau @ 8:03 pm

Linguamatics Puts Big Data Mining on the Cloud

From the post:

In response to market demand, Linguamatics is pleased to announce the launch of the first NLP-based, scaleable text mining platform on the cloud. Text mining allows users to extract more value from vast amounts of unstructured textual data. The new service builds on the successful launch by Linguamatics last year of I2E OnDemand, the Software-as-a-Service version of Linguamatics’ I2E text mining software. I2E OnDemand proved to be so popular with both small and large organizations, that I2E is now fully available as a managed services offering, with the same flexibility in choice of data resources as with the in-house, Enterprise version of I2E. Customers are thus able to benefit from best-of-breed text mining with minimum setup and maintenance costs. Such is the strength of demand for this new service that Linguamatics believes that by 2015, well over 50% of its revenues could be earned from cloud and mobile-based products and services.

Linguamatics is responding to the established trend in industry to move software applications on to the cloud or to externally managed servers run by service providers. This allows a company to concentrate on its core competencies whilst reducing the overhead of managing an application in-house. The new service, called “I2E Managed Services”, is a hosted and managed cloud-based text mining service which includes: a dedicated, secure I2E server with full-time operational support; the MEDLINE document set, updated and indexed regularly; and access to features to enable the creation and tailoring of proprietary indexes. Upgrades to the latest version of I2E happen automatically, as soon as they become available. (emphasis added)

Interesting but not terribly so, until I saw the MEDLINE document set was part of the service.

I single that out as an example of creating a value-add for a service by including a data set of known interest.

You could do a serious value-add for MEDLINE or find a collection that hasn’t been made available to an interested audience. Perhaps one for which you could obtain an exclusive license for some period of time. State/local governments are hurting for money and they have lots of data. Can’t buy it but exclusive licensing isn’t the same as buying, in most jurisdictions. Check with local counsel to be sure.

March 6, 2012

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 8:10 pm

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video by Jon Zuanich.

From the post:

In this demo video, Philip Zeyliger, a software engineer at Cloudera, discusses the Activity Monitoring and Operational Reports in Cloudera Manager.

Activity Monitoring

The Activity Monitoring feature in Cloudera Manager consolidates all Hadoop cluster activities into a single, real-time view. This capability lets you see who is running what activities on the Hadoop cluster, both at the current time and through historical activity views. Activities are either individual MapReduce jobs or those that are part of larger workflows (via Oozie, Hive or Pig).

Operational Reports

Operational Reports provide a visualization of current and historical disk utilization by user, user groups and directory. In addition, it tracks MapReduce activity on the Hadoop cluster by job, user, group or job ID. These reports are aggregated over selected time periods (hourly, daily, weekly, etc.) and can be exported as XLS or CSV files.

It is a sign of Hadoop’s maturity that professional management interfaces have started to appear.

Hadoop has always been manageable. The question was how to find someone to marry your cluster? And what happened in the case of a divorce?

Professional management tools enable a less intimate relationship between your cluster and its managers. Not to mention the availability of a larger pool of managers for your cluster.

One request: please avoid the default security options on Vimeo videos. They should be embeddable and downloadable in all cases.

February 24, 2012

Having a ChuQL at XML on the Cloud

Filed under: ChuQL,Cloud Computing,XML — Patrick Durusau @ 5:04 pm

Having a ChuQL at XML on the Cloud by Shahan Khatchadourian, Mariano P. Consens, and Jérôme Siméon.

Abstract:

MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduce.

In this paper, we describe ChuQL, a MapReduce extension to XQuery, with its corresponding Hadoop implementation. The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework. The ChuQL implementation distributes computation to multiple XQuery engines, providing developers with an expressive language to describe tasks over big data.

The aggregation and co-grouping were the most interesting examples for me.

The description of ChuQL was a bit thin. Pointers to more resources would be appreciated.
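Public ChuQL examples are scarce, but the key/value contract it wraps is ordinary MapReduce. A minimal Hadoop Streaming pair in Python (my own illustration, not code from the paper) shows the shape of it:

```python
# Word count as two Hadoop Streaming phases, illustrating the key/value
# model ChuQL layers XQuery over. Run the same file as mapper and reducer.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            # key and value separated by a tab, one pair per line
            print(f"{word.lower()}\t1")

def reducer():
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # invoke as: python wc.py map    or    python wc.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```

Point the hadoop-streaming jar’s -mapper and -reducer options at the two modes and Hadoop handles the shuffle between them.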

Entity Matching for Semistructured Data in the Cloud

Filed under: Cloud Computing,Entities,Entity Resolution — Patrick Durusau @ 5:03 pm

Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.

From the slides:

Main Idea

  • Use MapReduce and ChuQL to process semistructured data
  • Use a search-based blocking to generate candidate pairs
  • Apply similarity functions to candidate pairs within a block

Uses two of my favorite sources, CiteSeer and Wikipedia.

Looks like the start of an authoring stage of topic map work flow to me. You?
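For a sense of the pattern, here is a toy sketch of blocking followed by pairwise similarity. It is not the author’s code; the blocking key and threshold are invented.

```python
# A toy sketch of search-based blocking reduced to a simple key, followed by
# Jaccard similarity over candidate pairs within each block.
from itertools import combinations

def tokens(s: str) -> set:
    return set(s.lower().split())

def block_key(record: dict) -> str:
    # crude blocking: first three characters of the title
    return record["title"][:3].lower()

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

records = [
    {"id": 1, "title": "Entity Matching for Semistructured Data"},
    {"id": 2, "title": "Entity matching for semi-structured data"},
    {"id": 3, "title": "MapReduce and XQuery in the Cloud"},
]

# group records into blocks, then compare only within a block
blocks = {}
for r in records:
    blocks.setdefault(block_key(r), []).append(r)

for block in blocks.values():
    for a, b in combinations(block, 2):
        score = jaccard(tokens(a["title"]), tokens(b["title"]))
        if score >= 0.5:
            print(f"candidate match: {a['id']} ~ {b['id']} ({score:.2f})")
```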

February 13, 2012

Deploy Spring Data Neo4j into the Cloud (Feb. 16 – Webinar)

Filed under: Cloud Computing,Neo4j,Spring Data — Patrick Durusau @ 8:18 pm

Deploy Spring Data Neo4j into the Cloud (Feb. 16 – Webinar)

From the webpage:

Webinar
Thursday, February 16
10:00 PST

Join this webinar for a practical guide to deploying your Spring Data Neo4j application into the cloud.

Spring Data Neo4j is part of the Spring Data project which aims to provide convenient support for NOSQL databases.

Michael Hunger will demonstrate, using examples from his upcoming book Good Relationships, how to set up your own Spring Data Neo4j database onto Heroku.

January 27, 2012

Cloud deployments, Heroku, Spring Data Neo4j and other cool stuff (Stockholm, Sweden)

Filed under: Cloud Computing,Heroku,Neo4j,Spring Data — Patrick Durusau @ 4:36 pm

Cloud deployments, Heroku, Spring Data Neo4j and other cool stuff

From the announcement:

We will meet up at The Hub (no, not the github unfortunately, though that’s cool too). This time it will be a visit by Peter Neubauer, VP Community at Neo Technology (and maybe some other Neo4j hackers), who will talk about Cloud deployments, Heroku, Spring Data Neo4j etc. This will be a very interesting meetup as we will touch subjects connected to Python, Ruby, Java and what not. Laptops are optional but hey, we won’t stop you from hacking later :).

We also plan on doing a community brain storm session where we can talk about

  • what are the things that we would like to see Neo4j do, things that are missing, things that can be improved
  • how can we help spread the adoption of Neo4j. how to improve your learning

As usual we would love to see people contribute, so if you have something to show or share please let us know and we can modify the agenda. We will start 1 hour earlier on the Friday (from the usual 6:30, so we don’t come between you and your well-deserved Friday weekend).

This meetup invite will remain open till 31st of January 2012. So bring your friends, have some beer and discuss graphy things with us.

The RSVP closes 31 January 2012.

Notes, posts, and pointers to the same greatly appreciated!

January 20, 2012

ISO 25964-1 Thesauri for information retrieval

Filed under: Cloud Computing,Information Retrieval,ISO/IEC,JTC1,Standards,Thesaurus — Patrick Durusau @ 9:18 pm

Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval

Actually that is the homepage for Networked Knowledge Organization Systems/Services – NKOS, but the lead announcement item is for ISO 25964-1, etc.

From that webpage:

New international thesaurus standard published

ISO 25964-1 is the new international standard for thesauri, replacing ISO 2788 and ISO 5964. The full title is Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. As well as covering monolingual and multilingual thesauri, it addresses 21st century needs for data sharing, networking and interoperability.

Content includes:

  • construction of mono- and multi-lingual thesauri;
  • clarification of the distinction between terms and concepts, and their inter-relationships;
  • guidance on facet analysis and layout;
  • guidance on the use of thesauri in computerized and networked systems;
  • best practice for the management and maintenance of thesaurus development;
  • guidelines for thesaurus management software;
  • a data model for monolingual and multilingual thesauri;
  • brief recommendations for exchange formats and protocols.

An XML schema for data exchange has been derived from the data model, and is available free of charge at http://www.niso.org/schemas/iso25964/ . Coming next: ISO 25964-1 is the first of two publications. Part 2: Interoperability with other vocabularies is in the public review stage and will be available by the end of 2012.

Find out how you can obtain a copy from the news release.

Let me help you there: the correct number is ISO 25964-1:2011, and the list price for a PDF copy is CHF 238,00, or in US currency (today), $257.66 (for 152 pages).

Shows what I know about semantic interoperability.

If you want semantic interoperability, you charge people $1.69 per page (152 pages) for access to the principles of thesauri to be used for information retrieval.

ISO/IEC and JTC 1 are all parts of a system of viable international (read non-vendor dominated) organizations for information/data standards. They are the natural homes for the management of data integration standards that transcend temporal, organizational, governmental and even national boundaries.

But those roles will not fall to them by default. They must seize the initiative and those roles. Clinging to old-style publishing models for support makes them appear timid in the face of current challenges.

Even vendors recognize their inability to create level playing fields for technology/information standards. And the benefits that come to vendors from de jure as well as non-de jure standards organizations.

ISO/IEC/JTC1, provided they take the initiative, can provide an international, de jure home for standards that form the basis for information retrieval and integration.

The first step to take is to make ISO/IEC/JTC1 information standards publicly available by default.

The second step is to call upon all members and beneficiaries, both direct and indirect, of ISO/IEC/JTC 1 work, to assist in the creation of mechanisms to support the vital roles played by ISO/IEC/JTC 1 as de jure standards bodies.

We can all learn something from ISO 25964-1, but how many of us will, with that sticker price?
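For those who won’t see the PDF, here is a rough, unofficial sketch of the terms-versus-concepts distinction at the heart of the model. The field names are mine, not the ISO 25964 data model.

```python
# A rough, unofficial sketch of concepts vs. terms: many labels (terms), in
# several languages, resolve to one concept. Field names are invented.
from dataclasses import dataclass, field

@dataclass
class Concept:
    identifier: str
    pref_labels: dict                                 # language -> preferred term
    alt_labels: dict = field(default_factory=dict)    # language -> [non-preferred terms]
    broader: list = field(default_factory=list)       # identifiers of broader concepts

cloud = Concept(
    identifier="C042",
    pref_labels={"en": "cloud computing", "fr": "informatique en nuage"},
    alt_labels={"en": ["utility computing", "on-demand computing"]},
)

iaas = Concept(
    identifier="C043",
    pref_labels={"en": "infrastructure as a service"},
    broader=["C042"],
)

# Terms are just labels; retrieval should resolve any of them to the one concept.
index = {}
for c in (cloud, iaas):
    for label in c.pref_labels.values():
        index[label] = c.identifier
    for labels in c.alt_labels.values():
        for label in labels:
            index[label] = c.identifier

print(index["utility computing"])   # -> C042
```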

January 17, 2012

NIST CC Business Use Cases Working Group

Filed under: Cloud Computing,Marketing — Patrick Durusau @ 8:19 pm

NIST CC Business Use Cases Working Group

From the description:

NIST will lead interested USG agencies and industry to define target USG Cloud Computing business use cases (set of candidate deployments to be used as examples) for Cloud Computing model options, to identify specific risks, concerns and constraints.

Not about topic maps per se but certainly about opportunities to apply topic maps! USG agencies, to say nothing of industry, are a hot-bed of semantic diversity.

The more agencies move towards “cloud” computing, the more likely they are to encounter “foreign” or “rogue” data.

Someone is going to have to assist with their assimilation or understanding of that data. May as well be you!

January 7, 2012

Caching in HBase: SlabCache

Filed under: Cloud Computing,Hadoop,HBase — Patrick Durusau @ 4:06 pm

Caching in HBase: SlabCache by Li Pi.

From the post:

The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in memory caching to boost read performance.

However, despite the availability of high memory servers, the garbage collection algorithms available on production quality JDK’s have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency sensitive user applications.

Introduces cache memory managed outside the Java heap (and its garbage collector) for those with the loads and memory to justify and enable it.

Quite interesting work, particularly if you are ignoring the nay-sayers about the adoption of Hadoop and the Cloud in the coming year.

What the nay-sayers are missing is that yes, unimaginative mid-level managers and admins have no interest in Hadoop or the Cloud. What Hadoop and the Cloud present are opportunities that imaginative re-packagers and re-processing startups are going to use to provide new data streams and services.

Can’t ask startups that don’t exist yet why they have chosen to go with Hadoop and the Cloud.

That goes unnoticed by unimaginative commentators who reflect the opinions of uninformed managers, whose opinions are confirmed by the publication of the columns by unimaginative commentators. One of those feedback loops I mentioned earlier today.

December 28, 2011

Apache Whirr 0.7.0 has been released

Filed under: Cloud Computing,Clustering (servers),Mahout,Whirr — Patrick Durusau @ 9:30 pm

Apache Whirr 0.7.0 has been released

From Patrick Hunt at Cloudera:

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.

  • WHIRR-384 – Add Mahout as a service
  • WHIRR-49 – Allow Whirr to use Chef for configuration management
  • WHIRR-258 – Add Ganglia as a service
  • WHIRR-385 – Implement support for using nodeless, masterless Puppet to provision and run scripts

Whirr 0.7.0 will be included in a scheduled update to CDH4.

Getting Involved

The Apache Whirr project is working on a number of new features. The How To Contribute page is a great place to start if you’re interested in getting involved as a developer.

Cluster management or even the “cloud” in your topic map future?

You could do worse than learning one of the most recent Apache top level projects to prepare for a future that may arrive sooner than you think!

December 15, 2011

Ambiguity in the Cloud

Filed under: Cloud Computing,Marketing,Topic Maps,Uncategorized — Patrick Durusau @ 7:43 pm

If you are interested at all in cloud computing and its adoption, you need to read US Government Cloud Computing Technology Roadmap Volume I Release 1.0 (Draft). I know, a title like that is hardly inviting. But read it anyway. Part of a three volume set, for the other volumes see: NIST Cloud Computing Program.

Would you care to wager how many of its ten (10) requirements cited a need for interoperability that is presently lacking due to different understandings and terminology, in other words, ambiguity?

Good decision.

The answer? 8 out of 10 requirements cited by NIST have interoperability as a component.

The plan from NIST is to develop a common model, which will be a useful exercise, but how do we discuss differing terminologies until we can arrive at a common one?

Or allow for discussion of previous SLAs, for example, after we have all moved onto a new terminology?

If you are looking for a “hot” topic that could benefit from the application of topic maps (as opposed to choir programs at your local church during the Great Depression) this could be the one. One of those is a demonstration of a commercial grade technology, the other is at best a local access channel offering. You pick which is which.
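To make the ambiguity point concrete, here is a toy sketch of mapping two providers’ SLA terms onto shared subject identifiers, the basic topic map move. All of the terms and identifiers are invented.

```python
# A toy sketch: two SLAs use different words for the same subjects. Mapping each
# term to a shared subject identifier lets both be read against one vocabulary.
SUBJECTS = {
    "uptime": "http://example.org/subject/availability",
    "availability": "http://example.org/subject/availability",
    "service credit": "http://example.org/subject/remedy",
    "remedy": "http://example.org/subject/remedy",
}

def normalize(clause_terms):
    """Map one provider's terms onto shared subject identifiers."""
    return {SUBJECTS.get(t.lower(), f"unmapped:{t}") for t in clause_terms}

provider_a = ["Uptime", "Service Credit"]
provider_b = ["Availability", "Remedy"]

print(normalize(provider_a) == normalize(provider_b))   # True: same subjects, different words
```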

December 11, 2011

tokenising the visible english text of common crawl

Filed under: Cloud Computing,Dataset,Natural Language Processing — Patrick Durusau @ 10:20 pm

tokenising the visible english text of common crawl by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

Well, 30TB of data, that certainly sounds like a small project. 😉

What small amount of data are you using for your next project?
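Mat’s code is on GitHub; for a sense of the core step, here is a minimal “visible text to tokens” sketch for a single page. It is mine, using BeautifulSoup, not his pipeline.

```python
# A minimal sketch of extracting and tokenising the visible text of one page.
import re
from bs4 import BeautifulSoup

def visible_tokens(html: str):
    soup = BeautifulSoup(html, "html.parser")
    # drop script/style content, which is in the markup but never "visible"
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # crude tokenisation: lowercase alphabetic runs
    return re.findall(r"[a-z]+", text.lower())

print(visible_tokens("<html><body><p>Common Crawl, 30TB of web pages.</p></body></html>"))
# -> ['common', 'crawl', 'tb', 'of', 'web', 'pages']
```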

November 23, 2011

Google Plugin for Eclipse (GPE) is Now Open Source

Filed under: Cloud Computing,Eclipse,Interface Research/Design,Java — Patrick Durusau @ 7:41 pm

Google Plugin for Eclipse (GPE) is Now Open Source by Eric Clayberg.

From the post:

Today is quite a milestone for the Google Plugin for Eclipse (GPE). Our team is very happy to announce that all of GPE (including GWT Designer) is open source under the Eclipse Public License (EPL) v1.0. GPE is a set of software development tools that enables Java developers to quickly design, build, optimize, and deploy cloud-based applications using the Google Web Toolkit (GWT), Speed Tracer, App Engine, and other Google Cloud services.

….

As of today, all of the code is available directly from the new GPE project and GWT Designer project on Google Code. Note that GWT Designer itself is based upon the WindowBuilder open source project at Eclipse.org (contributed by Google last year). We will be adopting the same guidelines for contributing code used by the GWT project.

Important for the reasons given but also one possible model for topic map services. What if your topic map services were hosted in the cloud and developers could write against it? That is, they would not have to concern themselves with the niceties of topic maps but simply request the information of interest to them, using tools you have provided to make that easier for them.

Take for example the Statement of Disbursements that I covered recently. If that were hosted as a topic map in the cloud, a developer, say one working for a restaurant promoter, might want to query the topic map for who frequents eateries in a particular area. They are not concerned with the merging that has to take place between various budgets and the alignment of those merges with individuals, etc. They are looking for a list of places with House members alphabetically sorted after it.
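Purely as illustration, here is a sketch of what such a developer-facing query might look like. The endpoint, parameters, and response shape are all invented.

```python
# A hypothetical developer-facing query against a hosted topic map service.
# Endpoint, parameters, and response shape are invented for illustration.
import requests

BASE = "https://topicmaps.example.org/api"   # hypothetical service

resp = requests.get(
    f"{BASE}/occurrences",
    params={"type": "meal-expense", "near": "Capitol Hill", "group-by": "member"},
    timeout=10,
)
resp.raise_for_status()

for member, places in resp.json().items():
    print(member, "->", ", ".join(sorted(places)))
```

The developer never sees the merging behind the answer; that is the point.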

November 17, 2011

Next Generation Cluster Computing on Amazon EC2 – The CC2 Instance Type

Filed under: Cloud Computing,Topic Map Software,Topic Maps — Patrick Durusau @ 8:37 pm

Next Generation Cluster Computing on Amazon EC2 – The CC2 Instance Type

From the post:

Today we are introducing a new member of the Cluster Compute Family, the Cluster Compute Eight Extra Large. The API name of this instance is cc2.8xlarge so we’ve taken to calling it the CC2 for short. This instance features some incredible specifications at a remarkably low price. Let’s take a look at the specs:

Processing – The CC2 instance type includes 2 Intel Xeon processors, each with 8 hardware cores. We’ve enabled Hyper-Threading, allowing each core to process a pair of instruction streams in parallel. Net-net, there are 32 hardware execution threads and you can expect 88 EC2 Compute Units (ECU’s) from this 64-bit instance type. That’s nearly 90x the rating of the original EC2 small instance, and almost 3x the rating of the first-generation Cluster Compute instance.

Storage – On the storage front, the CC2 instance type is packed with 60.5 GB of RAM and 3.37 TB of instance storage.

Networking – As a member of our Cluster Compute family, this instance is connected to a 10 Gigabit network and offers low latency connectivity with full bisection bandwidth to other CC2 instances within a Placement Group. You can create a Placement Group using the AWS Management Console:

Pricing – You can launch an On-Demand CC2 instance for just $2.40 per hour. You can buy Reserved Instances, and you can also bid for CC2 time on the EC2 Spot Market. We have also lowered the price of the existing CC1 instances to $1.30 per hour.

You have the flexibility to choose the pricing model that works for you based on your application, your budget, your deadlines, and your ability to utilize the instances. We believe that the price-performance of this new instance type, combined with the number of ways that you can choose to acquire it, will result in a compelling value for scientists, engineers, and researchers.

Seems like it was only yesterday that I posted a note that NuvolaBase.com was running a free cloud beta. Hey! That was only yesterday!

Still a ways off from unmetered computing resources but moving in that direction.

If you have some experience with one of the cloud services, consider writing up a pricing example for experimenting with topic maps. I suspect that would help a lot of people (including me) get their feet wet with topic maps and cloud computing.
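To get things started, here is a back-of-the-envelope version using the on-demand rates quoted above. It covers compute only; storage, EBS, and data transfer are extra, so treat the numbers as lower bounds.

```python
# A weekend of experimenting, priced at the on-demand rates quoted in the post.
HOURS_PER_WEEKEND = 48

rates = {
    "cc2.8xlarge": 2.40,   # new CC2 on-demand rate, $/hour
    "cc1.4xlarge": 1.30,   # reduced first-generation CC1 rate, $/hour
}

for name, rate in rates.items():
    total = rate * HOURS_PER_WEEKEND
    print(f"{name}: {HOURS_PER_WEEKEND} hours at ${rate:.2f}/hr = ${total:.2f}")

# cc2.8xlarge: 48 hours at $2.40/hr = $115.20
# cc1.4xlarge: 48 hours at $1.30/hr = $62.40
```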

November 16, 2011

NuvolaBase.com

Filed under: Cloud Computing,Graphs,OrientDB — Patrick Durusau @ 8:19 pm

NuvolaBase.com

I was surprised to see this at the end of the OrientDB slides on the multi-master architecture, “the first graph database on the Cloud,” but I am used to odd things in slide decks. 😉

From the FAQ:

What is the technology behind NuvolaBase?

NuvolaBase is a cloud of several OrientDB servers deployed in multiple data centers around the globe.

What is the architecture of your cloud?

The cloud is based on multiple servers in different server farms around the globe. This guarantees low latency and high availability. Today we have three server farms, two in Europe and one in the USA. We have future plans to expand the cloud to China and South America.

Oh, did I mention that during the beta test it is free?

November 8, 2011

Toad for Cloud Databases (Quest Software)

Filed under: BigData,Cloud Computing,Hadoop,HBase,Hive,MySQL,Oracle,SQL Server — Patrick Durusau @ 7:45 pm

Toad for Cloud Databases (Quest Software)

From the news release:

The data management industry is experiencing more disruption than at any other time in more than 20 years. Technologies around cloud, Hadoop and NoSQL are changing the way people manage and analyze data, but the general lack of skill sets required to manage these new technologies continues to be a significant barrier to mainstream adoption. IT departments are left without a clear understanding of whether development and DBA teams, whose expertise lies with traditional technology platforms, can effectively support these new systems. Toad® for Cloud Databases addresses the skill-set shortage head-on, empowering database professionals to directly apply their existing skills to emerging Big Data systems through an easy-to-use and familiar SQL-based interface for managing non-relational data. 

News Facts:

  • Toad for Cloud Databases is now available as a fully functional, commercial-grade product, for free, at www.quest.com/toad-for-cloud-databases.  Toad for Cloud Databases enables users to generate queries, migrate, browse, and edit data, as well as create reports and tables in a familiar SQL view. By simplifying these tasks, Toad for Cloud Databases opens the door to a wider audience of developers, allowing more IT teams to experience the productivity gains and cost benefits of NoSQL and Big Data.
  • Quest first released Toad for Cloud Databases into beta in June 2010, making the company one of the first to provide a SQL-based database management tool to support emerging, non-relational platforms. Over the past 18 months, Quest has continued to drive innovation for the product, growing its list of supported platforms and integrating a UI for its bi-directional data connector between Oracle and Hadoop.
  • Quest’s connector between Oracle and Hadoop, available within Toad for Cloud Databases, delivers a fast and scalable method for data transfer between Oracle and Hadoop in both directions. The bidirectional characteristic of the utility enables organizations to take advantage of Hadoop’s lower cost of storage and analytical capabilities. Quest also contributed the connector to the Apache Hadoop project as an extension to the existing SQOOP framework, and is also available as part of Cloudera’s Distribution Including Apache Hadoop. 
  • Toad for Cloud Databases today supports:
    • Apache Hive
    • Apache HBase
    • Apache Cassandra
    • MongoDB
    • Amazon SimpleDB
    • Microsoft Azure Table Services
    • Microsoft SQL Azure, and
    • All Open Database Connectivity (ODBC)-enabled relational databases (Oracle, SQL Server, MySQL, DB2, etc)

 

Anything that eases the transition to cloud computing is going to be welcome. Toad being free will increase the ranks of DBAs who will at least experiment on their own.

October 27, 2011

OpenStack

Filed under: Cloud Computing — Patrick Durusau @ 4:46 pm

OpenStack

From the OpenStack wiki:

The OpenStack Open Source Cloud Mission: to produce the ubiquitous Open Source Cloud Computing platform that will meet the needs of public and private clouds regardless of size, by being simple to implement and massively scalable.

There are three (3) core projects:

OPENSTACK COMPUTE: open source software and standards for large-scale deployments of automatically provisioned virtual compute instances.

OPENSTACK OBJECT STORAGE: open source software and standards for large-scale, redundant storage of static objects.

OPENSTACK IMAGE SERVICE: provides discovery, registration, and delivery services for virtual disk images.

Two (2) new projects that will be promoted to core on the next release:

OpenStack Identity: Code-named Keystone, The OpenStack Identity Service provides unified authentication across all OpenStack projects and integrates with existing authentication systems.

OpenStack Dashboard: Dashboard enables administrators and users to access and provision cloud-based resources through a self-service portal.

And a host of unofficial projects, related to one or more OpenStack components. (OpenStack Projects)

So far as I could tell, no projects to deal with mapping between data sets in any re-usable way.

Do you think cloud computing will make semantic impedance more or less obvious?

More obvious because of the clash of the unknown semantics of data sets.

Less obvious because the larger the data sets, the greater the tendency to assume the answer(s), however curious, must be correct.

Which do you think it will be?

