A reversion release of Cassandra. Details: Cassandra changes.
Looks like the holidays are going to be filled with upgrades and new releases!
Cassandra Conference, December 6, 2011, New York City
From the call for speakers:
BURLINGAME, Calif. – November 9, 2011 – DataStax, the commercial leader in Apache Cassandra™, along with the NYC Cassandra User Group, NoSQL NYC, and Big Data NYC are joining together to present the first Cassandra New York City conference on December 6. This all-day, two-track event will focus on enterprise use cases as well as the latest developments in Cassandra. Early bird registration is now open here.
Coming on the heels of a sold-out DataStax Cassandra SF earlier this year, the event will feature some of the most interesting Cassandra use-cases from up and down the Eastern Seaboard. Cassandra NYC will be keynoted by Jonathan Ellis, chairman of the Apache Cassandra project, who will highlight what’s new in Cassandra 1.0, and what’s in store for the future. Additional confirmed speakers include Nathan Marz, lead engineer for the Storm project at Twitter and Jim Ancona, systems architect at Constant Contact.
“With the recent 1.0 release, we are seeing users doing amazing new things with Cassandra that are going beyond even our expectations and imagination,” said Ellis. “We look forward to sharing these stories with the broader community, to further hasten the adoption and usage of Cassandra to meet their real-time, big data challenges.”
Call for Speakers and Press Registration
The call for speakers is now also open for the event. Submissions can be made to email@example.com.
Press interested in attending the event may contact Zenobia@intersectcom.com for a complimentary press pass.
The event will be held at the Lighthouse International Conference Center on 59th St.
I am not sure about “early bird” registration for an event less than a month away but this sounds quite interesting. I hope the presentations will be recorded and posted for asynchronous access.
From the announcement:
BURLINGAME, Calif. – Nov. 1, 2011 – DataStax, the commercial leader in Apache Cassandra™, today announced that DataStax Enterprise, the industry’s first distributed, scalable, and highly available database platform powered by Apache Cassandra™ 1.0, is now available.
“The ability to manage both real-time and analytic data in a simple, massively scalable, integrated solution is at the heart of challenges faced by most businesses with legacy databases,” said Billy Bosworth, CEO, DataStax. “Our goal is to ensure businesses can conquer these challenges with a modern application solution that provides operational simplicity, optimal performance and incredible cost savings.”
“Apache Cassandra is the scalable, high-impact, comprehensive data platform that is well-suited to the rapidly-growing real-time data needs of our social media platform,” said Christian Carollo, Senior Manager, Mobile for GameFly. “We leveraged the expertise of DataStax to deploy our new social media platform, and were able to complete the project without worrying about scale or distribution – we simply built a great application and Apache Cassandra took care of the rest.”
BTW, DataStax just added its 100th customer. You might recognize some of them: Netflix, Cisco, etc.
From Andreas Harth and Günter Ladwig:
[W]e are happy to announce the first public release of CumulusRDF, a Linked Data server that uses Apache Cassandra as a cloud-based storage backend. CumulusRDF provides a simple HTTP interface to manage RDF data stored in an Apache Cassandra cluster.
* By way of Apache Cassandra, CumulusRDF provides distributed, fault-tolerant and elastic RDF storage
* Supports Linked Data and triple pattern lookups
* Proxy mode: CumulusRDF can act as a proxy server for other Linked Data applications, allowing any RDF dataset to be deployed as Linked Data
This is a first beta release that is still somewhat rough around the edges, but the basic functionality works well. The HTTP interface is work-in-progress. Eventually, we plan to extend the storage model to support quads.
CumulusRDF is available from http://code.google.com/p/cumulusrdf/
See http://code.google.com/p/cumulusrdf/wiki/GettingStarted to get started using CumulusRDF.
There is also a paper on CumulusRDF that I presented at the Scalable Semantic Knowledge Base Systems (SSWS) workshop at ISWC last week.
Andreas Harth and Günter Ladwig
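The triple pattern lookups are the interesting part. The usual trick, which the CumulusRDF paper maps onto Cassandra's row/column layout, is to store each triple under several index permutations (SPO, POS, OSP) so that any pattern with at least one bound position becomes a prefix lookup. A plain-Python sketch of the idea, with dicts standing in for Cassandra rows (class and method names are mine, not CumulusRDF's API):

```python
class TripleStore:
    """Toy triple store with three index permutations, so any
    triple pattern can be answered by a prefix lookup."""

    def __init__(self):
        self.spo = {}  # subject -> predicate -> set(objects)
        self.pos = {}  # predicate -> object -> set(subjects)
        self.osp = {}  # object -> subject -> set(predicates)

    def add(self, s, p, o):
        # write the triple into all three permutations
        self.spo.setdefault(s, {}).setdefault(p, set()).add(o)
        self.pos.setdefault(p, {}).setdefault(o, set()).add(s)
        self.osp.setdefault(o, {}).setdefault(s, set()).add(p)

    def query(self, s=None, p=None, o=None):
        """Answer any of the 8 triple patterns; None is a wildcard.
        Pick the index whose leading position is bound, so the lookup
        is a prefix scan rather than a full scan."""
        if s is not None:
            for pred, objs in self.spo.get(s, {}).items():
                if p is not None and pred != p:
                    continue
                for obj in objs:
                    if o is None or obj == o:
                        yield (s, pred, obj)
        elif p is not None:
            for obj, subjs in self.pos.get(p, {}).items():
                if o is not None and obj != o:
                    continue
                for subj in subjs:
                    yield (subj, p, obj)
        elif o is not None:
            for subj, preds in self.osp.get(o, {}).items():
                for pred in preds:
                    yield (subj, pred, o)
        else:  # fully unbound (?, ?, ?): any index works
            for subj, preds in self.spo.items():
                for pred, objs in preds.items():
                    for obj in objs:
                        yield (subj, pred, obj)


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
print(list(store.query(p="knows")))
```

The price, of course, is writing every triple three times; the payoff is that no pattern ever requires a join against an unindexed position.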
Everybody knows I hate to be picky but the abstract of the paper promises:
Results on a cluster of up to 8 machines indicate that CumulusRDF is competitive to state-of-the-art distributed RDF stores.
But I didn’t see any comparison to “state-of-the-art” RDF stores, distributed or not. Did I just overlook something?
I ask because I think this approach has promise, at least as an exploration of indexing strategies for RDF and how usage scenarios may influence those strategies. But that will be difficult to evaluate in the absence of comparison to less imaginative approaches to RDF indexing.
It doesn’t get much better or fresher (for non-attendees) than this!
And yes, I made a separate blog post on Neo4j and Dr. Who. What can I say? I am a fan of both.
From the webpage:
We’re announcing today the first source code release of Usergrid, a comprehensive platform stack for mobile and rich client applications. The entire codebase is now available on GitHub at https://github.com/usergrid/stack. Usergrid is built in Java and runs on top of Cassandra. Although we built Usergrid as a highly scalable cloud service, we’ve also taken a few steps to make it easy to run “small”, including providing a double-clickable desktop app that lets you run your own personal installation on your desktop, so you can get started right away.
I thought I read about “rich clients” with HTML5.
But the W3C web design team buried the HTML5 draft five clicks deep from their homepage. Good thing I knew to keep looking. That’s not just poor marketing, that’s also poor design.
A future of incompatibility awaits.
From the post:
I was looking at Cassandra, one of the major NoSQL solutions, and I was immediately impressed with its write speed even on my notebook. But I also noticed that it was very volatile in its response time, so I took a deeper look at it.
Michael Kopp uses dynaTrace to look inside Cassandra. There is a lot of information along the way, and hopefully his conclusion will make you read this post and those he promises to follow.
NoSQL or BigData Solutions are very very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and most importantly how it is used! Although Cassandra is lightning fast and mostly I/O bound it’s still Java and you have the usual problems – e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn’t explain here, but seeing the flow end-to-end really helps to understand whether the time is spent on the client, network or server and makes the runtime dynamics of Cassandra much clearer.
Understanding is really the key for effective usage of NoSQL solutions as we shall see in my next blogs. New problem patterns emerge and they cannot be solved by simply adding an index here or there. It really requires you to understand the usage pattern from the application point of view. The good news is that these new solutions allow us a really deep look into their inner workings, at least if you have the right tools at hand.
What tools are you using to “look inside” your topic map engine?
From the post:
Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.
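As a rough intuition for why this helps (this is not Cassandra's actual implementation, which uses block-level compression of SSTables): rows in a column family repeat the same column names and similar values over and over, which is exactly what a general-purpose compressor eats up. A quick sketch with zlib standing in for the real compressor:

```python
# Illustration only: serialize 1000 repetitive "rows" the way an
# on-disk table might, and see how much a compressor recovers.
import zlib

rows = b"".join(
    b"user:%05d|name=user%d|status=active|country=US\n" % (i, i)
    for i in range(1000)
)
compressed = zlib.compress(rows)
ratio = len(compressed) / len(rows)
print(f"raw {len(rows)} bytes, compressed {len(compressed)} bytes "
      f"(ratio {ratio:.2f})")
```

The repeated column names compress to almost nothing, which is why read-dominated workloads also see less disk I/O: fewer bytes on disk means fewer bytes to read.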
OK, maybe someone can help me here.
As a standards editor I understand being optimistic about what is “…going to appear…” in a future release, but isn’t version 0.8.6 a little early to be treating features for 1.0 as available? (I don’t find “compression” mentioned in the cumulative release notes as of 0.8.6.)
May just be me.
Aggregation of feeds on Cassandra. If you need to follow Cassandra closely, this would be among your first stops.
From the post:
Was it just two or three years ago when choosing a database was easy? Those with a Cadillac budget bought Oracle, those in a Microsoft shop installed SQL Server, those with no budget chose MySQL. Everyone in between tried to figure out where they belonged.
Those days are gone forever. Everyone and his brother are coming out with their own open source project for storing information. In most cases, these projects are tossing aside many of the belts-and-suspenders protections that people expect from the classic databases. There are enough of them now that some joker started calling them NoSQL and claiming, perhaps tongue-in-cheek, that the acronym stood for Not Only SQL.
I remember reading somewhere that the #1 reason for firing sysadmins was failure to maintain proper backups. An RDBMS isn’t a magic answer to data security, and anyone who thinks so is probably a former sysadmin at one or more locations.
You need to read Jim Gray’s Transaction Processing: Concepts and Techniques if you want to design reliable systems. Or that is at least one of the works you need to read.
Do use the “print” option so you can read the article while avoiding most of the annoying distractions typical for this type of site.
Not detailed enough to be particularly useful. Actually, I haven’t seen a comparison yet that was detailed enough to be really useful. I suppose that is partly because the approaches are so different that it would be hard to compare apples with apples.
What might be useful would be to compare the use cases where each system claims to excel. Now that might be a continuum of interest to readers.
What do you think?
Cassandra: Introduction for System Administrators by Nathan Milford.
Introductory slide deck for administrators interested in Cassandra (or being asked to participate in its use).
Suggestions for slide 6 that reads in part:
Figure in Greek Mythology, sounds like Pig
True enough but in terms of a control language, the play Pygmalion by Shaw would have been the better reference.
I presume the reader/listener would get the sound similarity without prompting.
Sorry, read the slide deck and see the source code at: https://github.com/jeromatron/pygmalion/.
Indexing in Cassandra by Ed Anuff.
As if you haven’t noticed by now, I have a real weakness for indexing and indexing related material.
Interesting coverage of composite indexes.
NoSQL @ Netflix, Part 2 by Sid Anand.
OSCON 2011 presentation.
I think the RDBMS Concepts to Key-Value Store Concepts was the best part of the slide deck.
What do you think?
Slides with videos to follow!
From the website:
- Jonathan Ellis (DataStax) – State of Cassandra, 2011 (Slides)
- Ed Anuff – Indexing in Cassandra (Slides)
- Gary Dusbabek (Rackspace) – Cassandra Internals (Slides)
- Sylvain Lebresne (DataStax) – Counters in Cassandra (Slides)
High-Level Cassandra Development
- Eric Evans (Rackspace) – CQL – Not just NoSQL, It’s MoSQL (Slides)
- Jake Luciani (DataStax) – Scaling Solr with Cassandra (Slides)
- Ben Coverston (DataStax) – Redesigned Compaction: LevelDB (Slides)
- Joaquin Casares (DataStax) – The Auto-Clustering Brisk AMI (Slides)
- Matt Dennis (DataStax) – Cassandra Anti-Patterns (Slides)
- Mike Bulman (DataStax) – OpsCenter: Cluster Management Doesn’t Have To Be Hard (Slides)
- Stu Hood (Twitter) – Prometheus’ Patch: #674 and You (Slides)
- Jeremy Hanna (Dachis) – Using Pig alongside Cassandra (Slides)
- Matt Dennis (DataStax) – Data Modeling Workshop (Slides)
- Nate McCall (DataStax) – Cassandra for Java Developers (Slides)
- Yewei Zhang (DataStax) – Hive Over Brisk (Slides)
- Jake Luciani (DataStax) – Introduction to Brisk (Slides)
- Kyle Roche (Isidorey) – Cloudsandra: Multi-tenant Platform Built on Brisk (Slides)
- Adrian Cockcroft (Netflix) – Migrating Netflix from DataCenter Oracle to Global Cassandra (Slides)
- Chris Goffinet (Twitter) – Cassandra at Twitter (Slides)
- David Strauss (Pantheon) – Highly Available DNS and Request Routing Using Apache Cassandra (Slides)
- Edward Capriolo (media6degrees) – Real World Capacity Planning: Cassandra on Blades and Big Iron (Slides)
- Eric Onnen (Urban Airship) – From 100s to 100s of Millions (Slides)
From the post:
I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.
Good article on indexes in Cassandra.
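The "roll your own index column family" pattern the article discusses is worth seeing concretely: alongside the data column family you maintain a second column family whose row key is the indexed value and whose columns are the keys of the matching rows. A sketch with plain dicts standing in for column families (names are illustrative, not Hector or Cassandra API):

```python
# Data CF: user_id -> {column_name: value}
users = {}
# Index CF: state -> {user_id: None} (column names are the "hits")
users_by_state = {}

def put_user(user_id, columns):
    """Write a user and keep the index CF in sync. Note the
    read-before-write: we must find and remove the stale index
    entry ourselves, which is the main cost of manual indexes."""
    old = users.get(user_id, {})
    if "state" in old:
        users_by_state.get(old["state"], {}).pop(user_id, None)
    users[user_id] = columns
    if "state" in columns:
        users_by_state.setdefault(columns["state"], {})[user_id] = None

def users_in_state(state):
    """An index lookup is just a slice of one index row."""
    return sorted(users_by_state.get(state, {}))
```

The read-before-write in `put_user` is exactly the bookkeeping that native secondary indexes do for you, and it is also why manual indexes can drift if an update path forgets it.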
About the post:
This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.
From the post:
The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it a go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including Cassandra, MongoDB and Neo4J.
Promises to be an interesting series of posts that focus on a common data set and problem!
From the webpage:
Our view is that the new data intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Storage Platform is the result of a radical re-think of those assumptions; the result is high performance from low cost commodity hardware.
It includes the Acunu Storage Core which runs in the Linux kernel. On top of this core, we provide a modified version of Apache Cassandra. This is essentially the same as “vanilla” Cassandra but uses the Acunu Storage Core to store data instead of the Linux file system and is therefore able to take advantage of the performance benefits of our platform. In addition to Cassandra, there is also an object store similar to Amazon’s S3; we have a number of other more experimental projects in the pipeline which we’ll talk about in future posts.
Perhaps the start of something very interesting.
It took NoSQL a couple of years to flower into the range of current offerings.
I wonder if working in the kernel will follow a similar path.
Will we see a graph engine as part of the kernel?
Good thumbnail comparison of the major features of all six (6) NoSQL databases by Kristóf Kovács.
Sorry to see that Neo4J didn’t make the comparison.
From the post:
Today, DataStax, the commercial leader in Apache Cassandra™, released DataStax’ Brisk – a second-generation open-source Hadoop distribution that eliminates the key operational complexities with deploying and running Hadoop and Hive in production. Brisk is powered by Cassandra and offers a single platform containing a low-latency database for extremely high-volume web and real-time applications, while providing tightly coupled Hadoop and Hive analytics.
Download Brisk -> Here.
This series starts here and goes for five (5) parts for Cassandra 0.6.4.
From the introduction:
I’m going to write a few postings on how to use the Cassandra database with Java. Although I am in no way an expert on how to use Cassandra, I am very intrigued by the database because of its small installation, high performance and scalability. During the writing of these posts I am also learning the Cassandra database, and I’m sharing my experiences with it through my posts on this blog.
Like I said before, Cassandra is a very high performing and scalable database; it doesn’t follow the normal SQL database principles like schemas, tables / columns, datatypes and a query language like SQL. Instead it’s a non-relational database similar to Google’s BigTable. Cassandra was initially developed by Facebook, which has contributed it to the open source community. Currently it is used by websites like Facebook, Twitter, Digg, Rackspace and many others. So even though it is still only version 0.6 at the time of writing, it has already proven itself in production environments.
It isn’t possible to say which (if any) of the NoSQL databases will prove to be the best fits for topic maps in particular or general situations.
What is clear is that a lot of experimentation and development is underway and hopefully the results will be interesting.
Podcasts from the London Cassandra User Group.
Cassandra – Thrift Application, Jools Enticknap: 21 February 2011
Cassandra in TweetMeme, Nick Telford: 21 February 2011
Cassandra Meetup: 17 January 2011
Cassandra London Meetup, Jake Luciani: 8 December 2010
In Cassandra 0.7, there are expiring columns.
From the blog:
Sometimes, data comes with an expiration date, either by its nature or because it’s simply intractable to keep all of a rapidly growing dataset indefinitely.
In most databases, the only way to deal with such expiring data is to write a job running periodically to delete what is expired. Unfortunately, this is usually both error-prone and inefficient: not only do you have to issue a high volume of deletions, but you often also have to scan through lots of data to find what is expired.
Fortunately, Cassandra 0.7 has a better solution: expiring columns. Whenever you insert a column, you can specify an optional TTL (time to live) for that column. When you do, the column will expire after the requested amount of time and be deleted auto-magically (though asynchronously — see below). Importantly, this was designed to be as low-overhead as possible.
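The quoted mechanism is easy to model. In the sketch below (illustrative only, not Cassandra code), each column carries an optional expiry time; reads filter out expired columns immediately, while physical deletion happens in a separate sweep, mimicking Cassandra's asynchronous removal during compaction. An explicit `now` parameter makes the behavior testable:

```python
import time

class Row:
    """Toy row of columns, each with an optional TTL."""

    def __init__(self):
        self._cols = {}  # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self._cols[name] = (value, expires)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        value, expires = self._cols.get(name, (None, None))
        if expires is not None and now >= expires:
            return None  # expired columns are invisible to reads
        return value

    def sweep(self, now=None):
        """Physically delete expired columns; in Cassandra this
        happens lazily, during compaction."""
        now = time.time() if now is None else now
        self._cols = {n: (v, e) for n, (v, e) in self._cols.items()
                      if e is None or now < e}
```

The split between "invisible on read" and "deleted later" is the key design point: expiry costs nothing at read time beyond a timestamp comparison, and no periodic delete job ever has to scan for expired data.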
Now there is an interesting idea!
Goes along with the idea that a topic map does not (should not?) present a timeless view of information. That is, a topic map should maintain state so that we can determine what was known at any particular time.
Take a simple example, a call for papers for a conference. It could be that a group of conferences all share the same call for papers, the form, submission guidelines, etc. And that call for papers is associated with each conference by an association.
Shouldn’t we be able to set an expiration date on that association so that at some point in time, all those facilities are no longer available for that conference? Perhaps it switches over to another set of properties in the same association to note that the submission dates have passed? That would remove the necessity for the association expiring.
But there are cases where associations do expire, or at least end. Divorce is an unhappy example; being hired is a happier one.
Something to think about.
From the website:
Agamemnon is a thin library built on top of pycassa. It allows you to use the Cassandra database (http://cassandra.apache.org) as a graph database. Much of the api was inspired by the excellent neo4j.py project (http://components.neo4j.org/neo4j.py/snapshot/)
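The idea of layering a graph over a column store is straightforward: vertices become rows, and each vertex row's columns name its outgoing edges. A minimal sketch of that layering, with method names loosely echoing neo4j.py-style APIs but entirely hypothetical, not Agamemnon's actual interface:

```python
class Graph:
    """Toy property graph over two dicts standing in for
    column families (one for vertices, one for edges)."""

    def __init__(self):
        self.vertices = {}   # node_id -> property dict
        self.out_edges = {}  # node_id -> {(rel_type, target_id): props}

    def create_node(self, node_id, **props):
        self.vertices[node_id] = props
        self.out_edges.setdefault(node_id, {})
        return node_id

    def relate(self, src, rel, dst, **props):
        # an edge is just a column in the source vertex's edge row
        self.out_edges.setdefault(src, {})[(rel, dst)] = props

    def neighbors(self, node_id, rel=None):
        return [dst for (r, dst) in self.out_edges.get(node_id, {})
                if rel is None or r == rel]
```

Traversal then reduces to repeated row lookups, which is why a fast key-value store like Cassandra is a plausible backend for this kind of library.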
Thanks to Jack Park for pointing this out!
A bit dated now but I thought some readers might find it useful.
From the posting:
If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; It’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.
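For what it is worth, the pre-1.0 data model is often easiest to grasp as a nested map: a keyspace holds column families, a column family holds rows by key, and each row holds columns by name. A plain-Python picture (illustrative only, with made-up keyspace and row names):

```python
keyspace = {
    "Users": {                        # column family
        "alice": {                    # row key
            "email": "a@example.com", # column name -> value
            "city": "NYC",
        },
        "bob": {
            "email": "b@example.com",
        },
    },
    "Tweets": {                       # another column family
        "tweet42": {"author": "alice", "body": "hello"},
    },
}

# the equivalent of "get Users['alice']['city']" in cassandra-cli:
city = keyspace["Users"]["alice"]["city"]
```

Two details the dict analogy hides: columns within a row are kept sorted by name, and rows need not share the same set of columns (note bob has no "city"), which is where the "schema-free" confusion usually starts.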
Seems like I have heard about grouping sets of key/value pairs before but I will have to look for it.
More seriously, the current wave of data sets only aggravates the known semantic impedance problem.
A wave of data sets that promises to only increase.
So semantic impedance is going to increase.
Semantic impedance can be:
Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.
7-11 November 2011 Vancouver
From the website:
This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:
- … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
- … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
- … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
- … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
- … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
- … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
- … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)
What would be even cooler, would be to have real-time associations with subjects that have information from outside the data set.
Or better yet, real-time on-demand associations with subjects that have information from outside the data set.
I suppose the classic use case would be running stats on all the sports events on a Saturday or Sunday, including individuals stats and merging in the latest doping, paternity and similar tests.
NoSQL Databases: Why, what and when by Lorenzo Alberton.
When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.
It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).
I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.
From the post:
I have to admit I’ve never really been happy with Cassandra’s data model, or to be more precise, I’ve never really been happy with my understanding of the model. However, I’ve realized that if we think of two use cases for column families then things may become a bit clearer. For me, column families can be used in one of two ways: either as a record or an ordered list.
I thought it was helpful, what do you think?
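The two usages are easy to see side by side in a toy model (illustrative Python, not Cassandra code). The same structure, a row of columns sorted by name, serves as a record when column names are field names, and as an ordered list when the sort order of the column names is the list order, e.g. timestamps in a timeline:

```python
def sorted_row(columns):
    """Cassandra keeps a row's columns sorted by column name;
    emulate that with a dict built in sorted order."""
    return dict(sorted(columns.items()))

# Usage 1, record: one row per user, column names are field names.
user = sorted_row({"name": "Ann", "email": "a@example.com"})

# Usage 2, ordered list: one row per timeline, column names are
# timestamps, values are the listed items. Sorting by column name
# *is* the chronological order.
timeline = sorted_row({
    "2011-02-23T09:15": "tweet42",
    "2011-02-22T18:01": "tweet17",
    "2011-02-24T07:30": "tweet99",
})
newest_first = list(reversed(list(timeline.values())))
```

Once you see the second usage, range slices over column names stop looking like an odd API and start looking like "give me this span of the list."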