Cassandra « Another Word For It

December 8, 2015

Apache Cassandra 3.1!

Filed under: Cassandra — Patrick Durusau @ 7:32 pm

Apache Cassandra 3.1 hit the streets today!

If you don’t know Apache Cassandra, from the home page:

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

The full set of changes for release 3.1.

Enjoy!


October 29, 2015

Cassandra Summit 2015 (videos)

Filed under: Cassandra,Conferences — Patrick Durusau @ 7:18 pm

Cassandra Summit 2015

Courtesy of DataStax, thirty-six (36) presentations from Cassandra Summit 2015 are now online!


May 19, 2015

Apache Cassandra 2.2.0-beta1 released

Filed under: Cassandra — Patrick Durusau @ 12:42 pm

Apache Cassandra 2.2.0-beta1 released

From the post:

The Cassandra team is pleased to announce the release of Apache Cassandra version 2.2.0-beta1.

This release is *not* production ready. We are looking for testing of existing and new features. If you encounter any problem please let us know [1].

Cassandra 2.2 features major enhancements such as:

* Resume-able Bootstrapping
* JSON Support [4]
* User Defined Functions [5]
* Server-side Aggregation [6]
* Role based access control

Read [2] and [3] to learn about all the new features.

Downloads of source and binary distributions are listed in our download section:

http://cassandra.apache.org/download/

Enjoy!

-The Cassandra Team

[1]: https://issues.apache.org/jira/browse/CASSANDRA
[2]: http://goo.gl/MyOEib (NEWS.txt)
[3]: http://goo.gl/MBJd1S (CHANGES.txt)
[4]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#json
[5]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udfs
[6]: http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udas

I was wondering what I would be reading this week!

Enjoy!
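
As a rough illustration of the JSON support listed in the announcement, the sketch below builds a CQL `INSERT ... JSON` statement from a Python dict. The `users` table and its columns are hypothetical, and the helper only formats a string; with a real cluster you would normally pass the JSON document as a bound parameter through a driver instead.

```python
import json


def insert_json_statement(table, row):
    """Build a CQL 'INSERT ... JSON' statement for a dict (illustration only)."""
    payload = json.dumps(row, sort_keys=True)
    # CQL string literals escape a single quote by doubling it.
    escaped = payload.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}';"


# Hypothetical table and columns, assumed only for this example.
statement = insert_json_statement("users", {"id": 1, "name": "O'Brien"})
print(statement)
```

Sorting the keys is not required by CQL; it just keeps the generated text stable from run to run.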


May 2, 2015

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

Filed under: Cassandra,Data Frames,Python — Patrick Durusau @ 8:17 pm

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and was covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time looking for libraries that would normally take me less than a second to recall the proper Python library (JSON being an example) since I don’t know Scala very well.
…

If you need help deciding whether to read this post, take a look at Spark SQL and DataFrame Guide to see what you stand to gain.

Enjoy!


March 18, 2015

Should Topic Maps Gossip?

Filed under: Cassandra,Topic Map Software,Topic Map Systems — Patrick Durusau @ 7:12 pm

Efficient Reconciliation and Flow Control for Anti-Entropy Protocols by Robbert van Renesse, Dan Dumitriu, Valient Gough and Chris Thomas.

Abstract:

The paper shows that anti-entropy protocols can process only a limited rate of updates, and proposes and evaluates a new state reconciliation mechanism as well as a flow control scheme for anti-entropy protocols.

Excuse the title, I needed a catchier line than the title of the original paper!

This is the Scuttlebutt paper that underlies Cassandra.
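
To give a feel for the reconciliation idea, here is a much simplified sketch of one push-pull gossip round between two nodes. It assumes each node stores, per originating node, a map of key to (value, version), and that a digest is just the highest version seen per origin. The names are mine, not the paper's, and the flow control and bounded-message ordering the paper evaluates are left out entirely.

```python
def digest(state):
    """Summarize a node's state as origin -> highest version seen."""
    return {
        origin: max((ver for _, ver in entries.values()), default=0)
        for origin, entries in state.items()
    }


def deltas_for(state, peer_digest):
    """Select the entries the peer has not seen yet, judged by its digest."""
    out = {}
    for origin, entries in state.items():
        seen = peer_digest.get(origin, 0)
        newer = {k: (val, ver) for k, (val, ver) in entries.items() if ver > seen}
        if newer:
            out[origin] = newer
    return out


def apply_deltas(state, deltas):
    """Merge received entries, keeping the higher version for each key."""
    for origin, entries in deltas.items():
        local = state.setdefault(origin, {})
        for key, (val, ver) in entries.items():
            if key not in local or local[key][1] < ver:
                local[key] = (val, ver)


def gossip(a, b):
    """One push-pull round: swap digests, then send only what is missing."""
    digest_a, digest_b = digest(a), digest(b)
    to_b = deltas_for(a, digest_b)
    to_a = deltas_for(b, digest_a)
    apply_deltas(b, to_b)
    apply_deltas(a, to_a)


# Hypothetical node states: origin -> {key: (value, version)}.
node_a = {"a": {"load": (0.7, 3), "status": ("up", 1)}}
node_b = {"a": {"status": ("up", 1)}, "b": {"load": (0.2, 5)}}
gossip(node_a, node_b)
print(node_a == node_b)
```

Only entries newer than the peer's digest cross the wire, which is the point of exchanging digests first.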

Rather than relying on an undefined notion of consistency, ask yourself how much consistency an application actually requires.
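
One way to make that question concrete for a replicated store like Cassandra is the familiar overlap rule: with N replicas, a read that contacts R of them is guaranteed to overlap the latest acknowledged write to W of them when R + W > N. The sketch below only does that arithmetic; the function names are illustrative and nothing here talks to a cluster.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_latest_write(n, r, w):
    """True when every read set of size r intersects every write set of size w."""
    return r + w > n


n = 3
q = quorum(n)
print(q, read_overlaps_latest_write(n, q, q), read_overlaps_latest_write(n, 1, 1))
```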

I first saw this in a tweet by Jason Brown.


March 3, 2015

KillrWeather

Filed under: Akka,Cassandra,Spark,Time Series — Patrick Durusau @ 2:31 pm

KillrWeather

From the post:

KillrWeather is a reference application (which we are constantly improving) showing how to easily leverage and integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations in asynchronous Akka event-driven environments. This application focuses on the use case of time series data.

…

The site doesn’t give enough emphasis to the importance of time series data. Yes, weather is an easy example of time series data, but consider another incomplete listing of the uses of time series data:

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

(Time Series)

Mastering KillrWeather will put you on the road to many other uses of time series data.
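
For readers new to time series, a minimal sketch of the basic move: bucket readings by source and day, keep each bucket in time order, then aggregate. The station name and readings are invented for the example, and this is plain Python rather than anything taken from KillrWeather itself.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_day(readings):
    """Group (station, timestamp, value) readings into per-station, per-day buckets."""
    buckets = defaultdict(list)
    for station, ts, value in readings:
        buckets[(station, ts.date())].append((ts, value))
    for rows in buckets.values():
        rows.sort()  # keep each bucket in time order
    return dict(buckets)


# Hypothetical readings, deliberately out of order.
readings = [
    ("station-1", datetime(2015, 3, 3, 14, 0), 11.5),
    ("station-1", datetime(2015, 3, 3, 9, 0), 7.0),
    ("station-1", datetime(2015, 3, 4, 9, 0), 6.5),
]
buckets = bucket_by_day(readings)
daily_high = {key: max(value for _, value in rows) for key, rows in buckets.items()}
print(daily_high)
```

Keying each bucket by (source, day) keeps any one bucket bounded in size, which is a common reason for bucketing time series this way.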

Enjoy!

I first saw this in a tweet by Chandra Gundlapalli.


February 25, 2015

DataStax – New for 2015 – Free Online Instructor Led Training

Filed under: Cassandra,DataStax — Patrick Durusau @ 5:11 pm

DataStax – New for 2015 – Free Online Instructor Led Training

I count six (6) online free courses in March 2015:

March 2-5 DS201: Cassandra Core Concepts
March 16-18 DS220: Data Modeling with DataStax Enterprise
March 16-19 DS201: Cassandra Core Concepts
March 16-19 DS210: DataStax Enterprise Operations and Performance Tuning
March 23-25 DS310: DataStax Enterprise Search with Solr
March 23-26 DS320: DataStax Enterprise Analytics with Spark

As of today, both:

March 2-5 DS201: Cassandra Core Concepts
March 16-18 DS220: Data Modeling with DataStax Enterprise

report being “sold out” and you can join a waiting list.

If you take one or more of these courses, don’t keep your attendance a secret. Provide feedback to DataStax and post your comments about the experience online.

High quality online training isn’t cheap and positive feedback will strengthen the hand of those responsible for these free training classes.


January 22, 2015

Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

Filed under: Akka,Cassandra,Kafka,Spark,Streams — Patrick Durusau @ 3:47 pm

Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka by Helena Edelson.

From the post:

On Tuesday, January 13 I gave a webinar on Apache Spark, Spark Streaming and Cassandra. Over 1700 registrants from around the world signed up. This is a follow-up post to that webinar, answering everyone’s questions. In the talk I introduced Spark, Spark Streaming and Cassandra with Kafka and Akka and discussed why these particular technologies are a great fit for lambda architecture due to some key features and strategies they all have in common, and their elegant integration together. We walked through an introduction to implementing each, then showed how to integrate them into one clean streaming data platform for real-time delivery of meaning at high velocity. All this in a highly distributed, asynchronous, parallel, fault-tolerant system.

Video | Slides | Code | Diagram

…

About The Presenter: Helena Edelson is a committer on several open source projects including the Spark Cassandra Connector, Akka and previously Spring Integration and Spring AMQP. She is a Senior Software Engineer on the Analytics team at DataStax, a Scala and Big Data conference speaker, and has presented at various Scala, Spark and Machine Learning Meetups.

I have long contended that it is possible to have a webinar that has little if any marketing fluff and maximum technical content. Helena’s presentation is an example of that type of webinar.

Very much worth the time to watch.

BTW, the webinar was so full of content that the questions were answered as part of this blog post. Technical webinars just don’t get any better organized than this one.

Perhaps technical webinars should be marked with TW and others with CW (for C-suite webinars), to prevent disorientation in the first case and disappointment in the second.


December 24, 2014

Cassandra Summit Europe 2014 (December 3-4, 2014) Videos!

Filed under: Cassandra,Conferences — Patrick Durusau @ 2:46 pm

Cassandra Summit Europe 2014 (December 3-4, 2014) Videos!

As usual, I sorted the presentations by the first author’s last name.

Good thing too because I noticed that Ben Laplanche was attributed with two presentations that differed only in having “Apache” in one title and not in the other.

On inspection I discovered an incorrectly labeled presentation by David Borsos and Tareq Abedrabbo, of OpenCredo. I corrected the listing but retained the current URL.

I am curious why the original webpage offers filtering by company. That’s an unlikely category for a developer to use in searching for Cassandra related content.

Consider annotating future presentations with the versions of software covered. It would make searching presentations much more robust.

Enjoy!

Ok.ru: Add a Bit of ACID to Apache Cassandra by Oleg Anastasyev, Lead Platform Developer, Company: Ok.ru
UBS Securities: A Journey with Cassandra at UBS Securities by Roy Bailey, Director of Neo Platform Services, Company: UBS Securities
Sky: Cassandra Is Great, but How Do I Test It? by Christopher Batey, Senior Software Engineer, Company: Sky
BigStep: Cassandra’s Scaling Economics by Alex Bordei, Product Manager, Company: BigStep
Building a Scalable Event Service with Cassandra Design to Code by David Borsos and Tareq Abedrabbo, Company: OpenCredo
Cassandra Summit Europe 2014 Keynote by Billy Bosworth, CEO & Jonathan Ellis, CTO, Company: DataStax
Instaclustr: Streaming from Backups, Reducing Cluster Load When Adding Nodes by Ben Bromhead, Co-Founder, Company: Instaclustr
Cisco: Implementing Apache Cassandra at Cisco Commerce by Hatinder Chawla, Senior Manager of Cisco Commerce Architecture and Engineering, Company: Cisco
Chronopost International: How to Bring Cassandra to Your “Serious” IT Department by Alexander Dejanovski, Expert Software Engineer, Company: Chronopost International
Stratio: Advanced Search and Top K Queries in Cassandra by Andres De La Peña, Big Data Architect & Daniel Higuero, Big Data Architect, Company: Stratio
DataStax & Arkey: Billions of Records from SQL to Cassandra, Lessons Learned by Brice Dutheil, Freelance Senior Developer at Arkey & DuyHai Doan, Cassandra Evangelist, Company: DataStax
Google: The Challenge of Tuning Apache Cassandra on Cloud Environments by Ivan Filho, Performance Engineer, Company: Google
DataStax: Diagnosing Performance Problems in Production by Jon Haddad, Apache Cassandra Evangelist, Company: DataStax
Noble Group: Normalised, Non Tick Time Series Wearing a DSL Cap by David Haines, Head of Front Office Development & Aleksa Vukotic, Head of Platform Development, Company: Noble Group
Burt: Using Apache Cassandra For All The Things by Theo Hultberg, Chief Architect at Burt, Company: Burt
DataStax: Architecting a Big Data Platform with DataStax Enterprise: Cassandra, Spark & In-Memory by Piotr Kołaczkowski, Lead Software Engineer (Analytics), Company: DataStax
Pivotal: Building a Multi-Tenant Cassandra for Cloud Foundry by Ben Laplanche, Product Manager, Company: Pivotal
ABC Arbitrage: Building Your Own Distributed System the Easy Way by Kévin Lovato, Software Engineer, Company: ABC Arbitrage
DataStax: Hit the Turbo Button on Your Cassandra Application Performance by Patrick McFadin, Chief Evangelist for Apache Cassandra, Company: DataStax
Spotify: How to Use Apache Cassandra by Jimmy Mårdell, Tech Product Owner, Company: Spotify
Credit Suisse: An Overview of the Hippo Project at Credit Suisse by Phillip Meredith, Application Developer & Jay Modha, Vice President, Company: Credit Suisse
The Last Pickle: Repeatable, Scalable, Reliable, Observable Cassandra by Aaron Morton, Co-Founder & Principal Consultant, Company: The Last Pickle
ING: Apache Cassandra at ING — Testing the Waters: Consistency Required! by Christopher Reedijk, Advisory IT Specialist & Christopher Reedijk, Advisory IT Specialist, Company: ING
RedHat: Scalable Geospatial Indexing with Cassandra by Rebecca Simmonds, Research Associate at Newcastle University & Jonathan Halliday, Software Engineer, Company: RedHat
The Weather Channel: The New Analytics Toolbox, Going Beyond Hadoop by Robbie Strickland, Director of Software Development, Company: The Weather Channel
User Defined Functions in Apache Cassandra 3.0 by Robert Stupp, Consultant, Company: None
Demonware (Activision Blizzard): Deploying Apache Cassandra for Call of Duty by Seán O Sullivan, Service Reliability Engineer & Tim Czerniak, Software Engineer, Company: Demonware (Activision Blizzard)
Postcode Anywhere: Optimize eCommerce with Machine Learning — Cassandra, Elasticsearch and Spark by Jamie Turner, CTO & Joe Chittenden-Veal, Big Data Consultant, Company: Postcode Anywhere
Finn.no: Apache Cassandra (and Hadoop) Case Studies from FINN.no by Mick Semb Wever, Programmer, Company: Finn.no
i2O Water: How Cassandra Helps i2OWater Save Over 235 Million Litres of Water Everyday by Mike Williams, Software and IT Director, Company: i2O Water


October 17, 2014

DevCenter 1.2 delivers support for Cassandra 2.1 and query tracing

Filed under: Cassandra,DataStax — Patrick Durusau @ 6:17 pm

DevCenter 1.2 delivers support for Cassandra 2.1 and query tracing by Alex Popescu.

From the post:

We’re very pleased to announce the availability of DataStax DevCenter 1.2, which you can download now. We’re excited to see how DevCenter has already become the de facto query and development tool for those of you working with Cassandra and DataStax Enterprise, and now with version 1.2, we’ve added additional support and options to make your development work even easier.

Version 1.2 of DevCenter delivers full support for the many new features in Apache Cassandra 2.1, including user defined types and tuples. DevCenter’s built-in validations, quick fix suggestions, the updated code assistance engine and the new snippets can greatly simplify your work with all the new features of Cassandra 2.1.

The download page offers the DataStax Sandbox if you are interested in a VM version.

Enjoy!


September 14, 2014

Cassandra Performance Testing with cstar_perf

Filed under: Cassandra,Performance — Patrick Durusau @ 6:32 am

Cassandra Performance Testing with cstar_perf by Ryan McGuire.

From the post:

It’s frequently been reiterated on this blog that performance testing of Cassandra is often done incorrectly. In my role as a Cassandra test engineer at DataStax, I’ve certainly done it incorrectly myself, numerous times. I’m convinced that the only way to do it right, consistently, is through automation – there’s simply too many variables to keep track of when doing things by hand.

cstar_perf is an easy to use tool to run performance tests on Cassandra clusters. A brief outline of what it does for you:

Downloads and builds Cassandra source code.

Configures your cassandra.yaml and environment settings.

Bootstraps nodes on a real cluster.

Runs a series of test operations on multiple versions or configs.

Collects and aggregates cluster performance metrics.

Creates easy to read performance charts comparing multiple test configurations in one view.

Runs a web frontend for convenient test scheduling, monitoring and reporting.

A great tool for Cassandra developers and a reminder of the first requirement for performance testing, automation. How’s your performance testing?

I first saw this in a tweet by Jason Brown.


September 13, 2014

CQL Under the Hood

Filed under: Cassandra,CQL - Cassandra Query Language — Patrick Durusau @ 10:26 am

CQL Under the Hood by Robbie Strickland.

Description:

As a reformed CQL critic, I’d like to help dispel the myths around CQL and extol its awesomeness. Most criticism comes from people like me who were early Cassandra adopters and are concerned about the SQL-like syntax, the apparent lack of control, and the reliance on a defined schema. I’ll pop open the hood, showing just how the various CQL constructs translate to the underlying storage layer–and in the process I hope to give novices and old-timers alike a reason to love CQL.

Slides from CassandraSummit 2014

Best viewed with a running instance of Cassandra.


September 3, 2014

Apache Cassandra 2.1.0-rc7

Filed under: Cassandra — Patrick Durusau @ 1:17 pm

Apache Cassandra 2.1.0-rc7 (Changes)

A new Apache Cassandra release candidate!

Downloads: http://cassandra.apache.org/download/

I like the generated list of changes, but as dead text, it is of limited usefulness. This works better for me:

2.1.0-rc7

Add frozen keyword and require UDT to be frozen (CASSANDRA-7857)
Track added sstable size correctly (CASSANDRA-7239)
(cqlsh) Fix case insensitivity (CASSANDRA-7834)
Fix failure to stream ranges when moving (CASSANDRA-7836)
Correctly remove tmplink files (CASSANDRA-7803)
(cqlsh) Fix column name formatting for functions, CAS operations, and UDT field selections (CASSANDRA-7806)
(cqlsh) Fix COPY FROM handling of null/empty primary key values (CASSANDRA-7792)
Fix ordering of static cells (CASSANDRA-7763)

Merged from 2.0:

Forbid re-adding dropped counter columns (CASSANDRA-7831)
Fix CFMetaData#isThriftCompatible() for PK-only tables (CASSANDRA-7832)
Always reject inequality on the partition key without token (CASSANDRA-7722)
Always send Paxos commit to all replicas (CASSANDRA-7479)
Don’t send schema change responses and events for no-op DDL statements (CASSANDRA-7600)
(Hadoop) fix cluster initialisation for a split fetching (CASSANDRA-7774)
Configure system.paxos with LeveledCompactionStrategy (CASSANDRA-7753)
Fix ALTER clustering column type from DateType to TimestampType when using DESC clustering order (CASSANDRA-7797)
Throw EOFException if we run out of chunks in compressed datafile (CASSANDRA-7664)
Fix PRSI handling of CQL3 row markers for row cleanup (CASSANDRA-7787)
Fix dropping collection when it’s the last regular column (CASSANDRA-7744)
Properly reject operations on list index with conditions (CASSANDRA-7499)
Make StreamReceiveTask thread safe and gc friendly (CASSANDRA-7795)
Validate empty cell names from counter updates (CASSANDRA-7798)

Merged from 1.2:

Don’t allow compacted sstables to be marked as compacting (CASSANDRA-7145)
Track expired tombstones (CASSANDRA-7810)

Being “on the web” should require more than access via the web. Whenever available, links to other web resources should be present as well.


June 2, 2014

Cassandra 2.1 (1st RC)

Filed under: Cassandra — Patrick Durusau @ 7:09 pm

Since we were just talking about Cassandra in connection with Titan, thought you would be interested in the newest release candidate for Cassandra 2.1.

Download here (Under Development Cassandra Server Releases (not production ready)).

Changes.

Test at your own risk but I am sure useful bug reports will be deeply appreciated.

Follow: How to File a Good Bug Report or similar documents.


December 21, 2013

…Titan Cluster on Cassandra and ElasticSearch on AWS EC2

Filed under: Cassandra,ElasticSearch,Graphs,Titan — Patrick Durusau @ 8:10 pm

Setting up a Titan Cluster on Cassandra and ElasticSearch on AWS EC2 by Jenny Kim.

From the post:

The purpose of this post is to provide a walkthrough of a Titan cluster setup and highlight some key gotchas I’ve learned along the way. This walkthrough will utilize the following versions of each software package:

Versions

Datastax Cassandra Auto-Clustering Community AMI Version 2.4

Oracle Java 1.7 (should be automatically included in the Datastax AMI)

Titan 0.4.1 Full Distribution

ElasticSearch 0.90.7

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the Datastax Community AMI requires at minimum, M1.Large instances, the exact instance-type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.

…

Great post!

You will be gaining experience with cloud computing along with very high end graph software (Titan).


November 27, 2013

Boutique Graph Data with Titan

Filed under: Cassandra,Faunus,Graphs,Gremlin,Titan — Patrick Durusau @ 5:11 pm

Boutique Graph Data with Titan by Marko A. Rodriguez.

From the post:

Titan is a distributed graph database capable of supporting graphs on the order of 100 billion edges and sustaining on the order of 1 billion transactions a day (see Educating the Planet with Pearson). Software architectures that leverage such Big Graph Data typically have 100s of application servers traversing a distributed graph represented across a multi-machine cluster. These architectures are not common in that perhaps only 1% of applications written today require that level of software/machine power to function. The other 99% of applications may only require a single machine to store and query their data (with a few extra nodes for high availability). Such boutique graph applications, which typically maintain on the order of 100 million edges, are more elegantly served by Titan 0.4.1+. In Titan 0.4.1, the in-memory caches have been advanced to support faster traversals which makes Titan’s single-machine performance comparable to other single machine-oriented graph databases. Moreover, as the application scales beyond the confines of a single machine, simply adding more nodes to the Titan cluster allows boutique graph applications to seamlessly grow to become Big Graph Data applications (see Single Server to Highly Available Cluster).

A short walk on the technical side of Titan.

I would replace “boutique” with “big data” and say Titan allows customers to seamlessly transition from “big data” to “bigger data.”

Having “big data” is like having a large budget under your control.

What matters to the user is the status of claiming to possess it.

Let’s not disillusion them.


November 16, 2013

Cassandra and Naive Bayes

Filed under: Bayesian Data Analysis,Cassandra — Patrick Durusau @ 7:14 pm

Using Cassandra to Build a Naive Bayes Classifier of Users Based Upon Behavior by John Berryman.

From the post:

In our last post, we found out how simple it is to use Cassandra to estimate ad conversion. It’s easy, because effectively all you have to do is accumulate counts – and Cassandra is quite good at counting. As we demonstrated in that post, Cassandra can be used as a giant, distributed, redundant, “infinitely” scalable counting framework. During this post we will take the online ad company example just a bit further by creating a Cassandra-backed Naive Bayes Classifier. Again, we see that the “secret sauce” is simply keeping track of the appropriate counts.

In the previous post, we helped equip your online ad company with the ability to track ad conversion rates. But competition is steep and we’ll need to do a little better than ad conversion rates if your company is to stay on top. Recently, suspicions have arisen that ads are often being shown to unlikely customers. A quick look at the logs confirms this concern. For instance, there was a case of one internet user that clicked almost every single ad that he was shown – so long as it related to the camping gear. Several times, he went on to make purchases: a tent, a lantern, and a sleeping bag. But despite this user’s obvious interest in outdoor sporting goods, your logs indicated that fully 90% of the ads he was shown were for women’s apparel. Of these ads, this user clicked none of them.

Let’s attack this problem by creating a classifier. Fortunately for us, your company specializes in two main genres, fashion, and outdoors sporting goods. If we can determine which type of user we’re dealing with, then we can improve our conversion rates considerably by simply showing users the appropriate ads.

So long as you remember the unlikely assumption of feature independence of Naive Bayes, you should be ok.

That is, whatever features you are measuring are assumed to be independent of each other.

Naive Bayes has been “successfully” used in a number of contexts, but the descriptions I have read don’t specify what they meant by “successful.”
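
To see why the post says the “secret sauce” is just counting, here is a small in-memory sketch of a Naive Bayes classifier built from nothing but counters. It is not the author's code: the class, the feature names, and the add-one smoothing are my assumptions, and plain dictionaries stand in for the counts the post keeps in Cassandra.

```python
import math
from collections import defaultdict


class CountingNaiveBayes:
    """Naive Bayes over present/absent features, stored purely as counts."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # label -> users seen
        self.feature_counts = defaultdict(int)  # (label, feature) -> users seen

    def observe(self, label, features):
        """Record one labeled user by incrementing counters."""
        self.class_counts[label] += 1
        for feature in set(features):
            self.feature_counts[(label, feature)] += 1

    def classify(self, features):
        """Return the label with the highest log-probability score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feature in features:
                seen = self.feature_counts.get((label, feature), 0)
                # Add-one smoothing so an unseen feature does not zero the score.
                score += math.log((seen + 1) / (n + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical behavior features for the two genres in the post.
model = CountingNaiveBayes()
model.observe("outdoors", ["clicked_tent_ad", "bought_lantern"])
model.observe("outdoors", ["clicked_tent_ad"])
model.observe("fashion", ["clicked_dress_ad"])
model.observe("fashion", ["clicked_dress_ad", "clicked_shoe_ad"])
print(model.classify(["clicked_tent_ad"]))
```

The per-feature terms are simply added in log space, which is exactly where the independence assumption enters.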


October 27, 2013

Big Data Modeling with Cassandra

Filed under: BigData,Cassandra,Modeling — Patrick Durusau @ 7:34 pm

Big Data Modeling with Cassandra by Mat Brown.

Description:

When choosing the right data store for an application, developers face a trade-off between scalability and programmer-friendliness. With the release of version 3 of the Cassandra Query Language, Cassandra provides a uniquely attractive combination of both, exposing robust and intuitive data modeling capabilities while retaining the scalability and availability of a distributed, masterless data store.

This talk will focus on practical data modeling and access in Cassandra using CQL3. We’ll cover nested data structures; different types of primary keys; and the many shapes your tables can take. There will be a particular focus on understanding the way Cassandra stores and accesses data under the hood, to better reason about designing schemas for performant queries. We’ll also cover the most important (and often unexpected) differences between ACID databases and distributed data stores like Cassandra.

Mat Brown (twitter.com/0utoftime) is a software engineer at Rap Genius, a platform for annotating and explaining the world’s text. Mat is the author of Cequel, a Ruby object/row mapper for Cassandra, as well as Elastictastic, an object/document mapper for ElasticSearch, and Sunspot, a Ruby model integration layer for Solr.

Mat covers limitations of Cassandra without being pressed. Not unknown but not common either.

Migration from relational schema to Cassandra is a bad idea. (paraphrase)

Mat examines the internal data structures that influence how you should model data in Cassandra.

At 17:40, Mat shows how the data structure is represented internally.

The internal representation drives schema design.
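
A toy model may help here. The sketch below treats a table as a map from partition key to rows kept sorted by clustering key, which is roughly the shape the talk describes. It is my simplification, not Cassandra's storage engine, and the class and data are hypothetical.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        """Upsert a row, keeping the partition ordered by clustering key."""
        rows = self.partitions[partition_key]
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same key: overwrite
        else:
            rows.insert(i, (clustering_key, value))

    def slice(self, partition_key, start, end):
        """Return rows with start <= clustering key <= end from one partition."""
        rows = self.partitions.get(partition_key, [])
        keys = [k for k, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return rows[lo:hi]


# Hypothetical data: posts partitioned by author, clustered by post id.
table = ToyTable()
table.insert("mat", 3, "third post")
table.insert("mat", 1, "first post")
table.insert("mat", 2, "draft")
table.insert("mat", 2, "second post")
recent = table.slice("mat", 2, 3)
print(recent)
```

A range read inside one partition is a cheap slice of an ordered list, while a lookup by anything other than the partition key would have to visit every partition. That asymmetry is why the layout ends up dictating the schema.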

You may also like Cequel by the presenter.

PS: I suspect that if considered carefully, the internal representation of data in most databases drives the advice given by tech support.


September 16, 2013

Cassandra – A Decentralized Structured Storage System [Annotated]

Filed under: Cassandra,CQL - Cassandra Query Language,NoSQL — Patrick Durusau @ 4:11 pm

Cassandra – A Decentralized Structured Storage System by Avinash Lakshman, Facebook and Prashant Malik, Facebook.

Abstract:

Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.

Annotated version of the original 2009 Cassandra paper.

Not a guide to future technology but a very interesting read about how Cassandra arrived at the present.


September 10, 2013

Building the Perfect Cassandra Test Environment

Filed under: Cassandra — Patrick Durusau @ 3:42 am

Building the Perfect Cassandra Test Environment by John Berryman.

John outlines the qualities of a Cassandra test framework as follows:

Light-weight and available — A good test framework will take up as little resources as possible and be accessible right when you want it.

Parity with Production — The test environment should perfectly simulate the production environment. This is a no-brainer. After all what good does it do you to pass a test only to wonder whether or not an error lurks in the differences between the test and production environments?

Stateless — Between running tests, there’s no reason to keep any information around. So why not just throw it all away?

Isolated — Most often there will be several developers on a team, and there’s a good chance they’ll be testing things at the same time. It’s important to keep each developer quarantined from the others.

Fault Resistant — Remember, we’re a little concerned here that Cassandra is going to be a resource hog or otherwise just not work. Being “fault resistant” means striking the right balance so that Cassandra takes up as little resources as possible without actually failing.
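
One possible way to get the “Stateless” and “Isolated” qualities, sketched under my own assumptions rather than taken from John's post: give every test run its own throwaway keyspace and drop it afterwards. The helper names are hypothetical, and the functions only build the CQL text; running it against a cluster is left to whatever driver you use.

```python
import uuid


def isolated_keyspace(developer):
    """Return a keyspace name that is unique per developer and per run."""
    return f"test_{developer}_{uuid.uuid4().hex[:8]}"


def setup_statement(keyspace):
    """CQL to create a single-replica keyspace for a local test node."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1};"
    )


def teardown_statement(keyspace):
    """CQL to throw the whole keyspace away after the test."""
    return f"DROP KEYSPACE IF EXISTS {keyspace};"


keyspace = isolated_keyspace("alice")
print(setup_statement(keyspace))
print(teardown_statement(keyspace))
```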

Projects without test environments are like sky diving without a main chute, only the reserve.

If it works, ok. If not, very much not ok.

With John’s notes, you too can have a Cassandra test environment!


September 4, 2013

What’s under the hood in Cassandra 2.0

Filed under: Cassandra — Patrick Durusau @ 6:48 pm

What’s under the hood in Cassandra 2.0 by Jonathan Ellis.

If you haven’t already downloaded Cassandra 2.0, Jonathan has twenty-three (23) reasons why you should.


September 3, 2013

Cassandra [2.0]

Filed under: Cassandra — Patrick Durusau @ 6:17 pm

Cassandra [2.0]

Cassandra 2.0 dropped today from the Apache Software Foundation.

If you don’t know Cassandra, check out the Getting Started guide.

Or visit Planet Cassandra, which describes Cassandra this way:

Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers linear scalability and performance across many commodity servers with no single point of failure, and provides a powerful dynamic data model designed for maximum flexibility and fast response times.

Enjoy!


July 25, 2013

Cassandra 2.0.0 Beta Release

Filed under: Cassandra — Patrick Durusau @ 12:56 pm

Cassandra 2.0.0 (2nd beta) was released today.

See the release notes.

Sounds like a release of 2.0.0 is coming soon.

Make it a solid one with your participation in testing the current beta release.


June 25, 2013

Getting Started with Cassandra: Overview

Filed under: Cassandra — Patrick Durusau @ 3:56 pm

Getting Started with Cassandra: Overview by Patricia Gorla.

The start of a four-part introduction to Cassandra.

From the post:

Instead, Cassandra column families (tables) are modeled around the queries you intend to ask.

Not for every use case but no technology meets every possible use case.
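
A small sketch of what modeling around queries tends to mean in practice: the same fact is written once per question you plan to ask, so each read is a single keyed lookup. The two “tables” here are plain dictionaries with hypothetical names, chosen only for illustration.

```python
from collections import defaultdict

# One structure per intended query, each keyed by what that query filters on.
videos_by_user = defaultdict(list)
videos_by_tag = defaultdict(list)


def add_video(user, title, tags):
    """Write the video once for every query that must be able to find it."""
    videos_by_user[user].append(title)
    for tag in tags:
        videos_by_tag[tag].append(title)


add_video("patricia", "Intro to Cassandra", ["cassandra", "nosql"])
add_video("patricia", "Data Modeling", ["cassandra"])
print(videos_by_user["patricia"])
print(videos_by_tag["nosql"])
```

The price is duplicated writes, paid up front so that reads never need a join.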

A start to a promising series.


June 16, 2013

Cassandra project chair: We’re taking on Oracle (Cassandra 2.0)

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:35 pm

Cassandra project chair: We’re taking on Oracle by Paul Krill.

From the post:

Apache Cassandra is an open source, NoSQL database accommodating large-scale workloads and attracting a lot of attention, having been deployed in such organizations as Netflix, eBay, and Twitter. It was developed at Facebook, which open-sourced it in 2008, and its database can be deployed across multiple data centers and in cloud environments.

Jonathan Ellis is the chair of the project at Apache, and he serves as chief technical officer at DataStax, which has built a business around Cassandra. InfoWorld Editor-at-Large Paul Krill spoke with Ellis at the company’s recent Cassandra Summit 2013 conference in San Francisco, where Ellis discussed efforts to make the database easier to use and how it has become a viable competitor to Oracle’s relational database technology.

InfoWorld: What is the biggest value-add for Cassandra?

Ellis: It’s driving the Web applications. We’re the ones who power Netflix, Spotify. Cassandra is actually powering the applications directly. It lets you scale to millions of operations per second and software-as-a-service, machine-generated data, Web applications. Those are all really hot spots for Cassandra.

Cassandra 2.0 is targeted for the end of July, 2013. Lightweight transactions and triggers are on the menu.

Comments Off

May 9, 2013

Become a Super Modeler

Filed under: Cassandra,CQL - Cassandra Query Language,Modeling — Patrick Durusau @ 2:00 pm

Become a Super Modeler (Webinar)

Thursday, May 16th
11am PDT / 2pm EDT / 7pm BST / 8pm CEST

Sure you can do some time series modeling. Maybe some user profiles. What’s going to make you a super modeler? Let’s take a look at some great techniques taken from real world applications where we exploit the Cassandra big table model to its fullest advantage. We’ll cover some of the new features in CQL 3 as well as some tried and true methods. In particular, we will look at fast indexing techniques to get data faster at scale. You’ll be jet setting through your data like a true super modeler in no time.

Speaker: Patrick McFadin, Principal Solutions Architect at DataStax

Looks interesting and I have neglected to look closely at CQL 3.

Could be some incentive to read up before the webinar.

Comments Off

April 19, 2013

How to Compare NoSQL Databases

Filed under: Aerospike,Benchmarks,Cassandra,Couchbase,Database,MongoDB,NoSQL — Patrick Durusau @ 12:45 pm

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases. The talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.

Comments (1)

January 16, 2013

Tombstones in Topic Map Future?

Filed under: Cassandra,Distributed Consistency,Distributed Systems,Functional Programming,Merging — Patrick Durusau @ 7:55 pm

While watching the What’s New in Cassandra 1.2 (Notes) webcast, I encountered an unfamiliar term: “tombstones.”

If you are already familiar with the concept, skip to another post.

If you’re not, the concept is used in distributed systems that maintain “eventual” consistency by having the nodes replicate their content. That works if all nodes are available, but what if you delete data while a node is unavailable? When that node comes back, it still holds the deleted data, so the other nodes appear to be “missing” data that needs to be replicated.

Judging from the description at the Cassandra wiki, DistributedDeletes, this is not an easy problem to solve.

So, Cassandra turns it into a solvable problem.

Deletes are implemented with a special value known as a tombstone. The tombstone is propagated to nodes that missed the initial delete.

Since you will eventually want to delete the tombstones as well, a grace period can be set, which is slightly longer than the period needed to replace a non-responding node.
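A minimal sketch of the idea in Python may help. It is not Cassandra's implementation: the `Cell` and `Replica` classes, the last-write-wins rule by timestamp, and the `GC_GRACE_SECONDS` value are all assumptions made for illustration. A delete writes a tombstone (a cell with no value), replicas exchange cells and keep the newest one, and tombstones are purged only after the grace period.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed grace period for this sketch (ten days, in seconds).
GC_GRACE_SECONDS = 10 * 24 * 3600


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None marks a tombstone
    timestamp: float      # write time in seconds

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


class Replica:
    """Hypothetical replica holding the newest cell per key."""

    def __init__(self) -> None:
        self.cells: Dict[str, Cell] = {}

    def _apply(self, key: str, cell: Cell) -> None:
        # Last write wins: keep whichever cell has the newer timestamp.
        current = self.cells.get(key)
        if current is None or cell.timestamp > current.timestamp:
            self.cells[key] = cell

    def write(self, key: str, value: str, ts: float) -> None:
        self._apply(key, Cell(value, ts))

    def delete(self, key: str, ts: float) -> None:
        # A delete is a write of a tombstone, not a removal.
        self._apply(key, Cell(None, ts))

    def read(self, key: str) -> Optional[str]:
        cell = self.cells.get(key)
        return None if cell is None or cell.is_tombstone else cell.value

    def sync_from(self, other: "Replica") -> None:
        # Pull every cell, tombstones included, from another replica.
        for key, cell in other.cells.items():
            self._apply(key, cell)

    def purge_tombstones(self, now: float) -> None:
        # Drop tombstones only once they are older than the grace period.
        expired = [k for k, c in self.cells.items()
                   if c.is_tombstone and now - c.timestamp > GC_GRACE_SECONDS]
        for key in expired:
            del self.cells[key]
```

If replica `b` misses a delete that replica `a` recorded, syncing in both directions leaves the tombstone on both, so the stale value on `b` does not resurrect the data. Purging the tombstone on `a` before `b` has synced would lose that protection, which is why the grace period matters.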

Distributed topic maps will face the same issue.

The issue is complicated by imperative programming models of merging, which make it difficult to manage changes in properties that alter merging.

Perhaps functional models of merging, as with other forms of distributed processing, will carry the day.

Comments Off

January 12, 2013

What’s New in Cassandra 1.2 (Notes)

Filed under: Cassandra,Clustering (servers),CQL - Cassandra Query Language — Patrick Durusau @ 6:59 pm

What’s New in Cassandra 1.2

From the description:

Apache Cassandra Project Chair, Jonathan Ellis, looks at all the great improvements in Cassandra 1.2, including Vnodes, Parallel Leveled Compaction, Collections, Atomic Batches and CQL3.

There is only so much you can cover in an hour but Jonathan did a good job of hitting the high points of virtual nodes (rebuild failed drives/nodes faster), atomic batches (fewer requirements on clients, new default btw), CQL improvements, and tracing.

Enough to make you interested in running (not watching) the examples plus your own.
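As a starting point for your own examples, here is a toy token ring in Python that hints at why virtual nodes speed up rebuilds. It is a sketch under assumptions, not Cassandra's partitioner: the `Ring` class, the `donors_when_adding` helper, and the use of MD5 over a "node#i" label to derive tokens are all hypothetical choices for illustration. With one token per node, a joining node takes data from a single neighbor. With many tokens per node, it takes small slices from many nodes at once.

```python
import bisect
import hashlib
from typing import Iterable, List, Set, Tuple


def token(label: str) -> int:
    # Assumed token function for the sketch: first 8 bytes of an MD5 digest.
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical token ring where each node owns several tokens."""

    def __init__(self, vnodes_per_node: int) -> None:
        self.vnodes_per_node = vnodes_per_node
        self.tokens: List[Tuple[int, str]] = []  # sorted (token, node) pairs

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            bisect.insort(self.tokens, (token(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        if not self.tokens:
            raise ValueError("ring has no nodes")
        # A key belongs to the first token at or after its hash, wrapping around.
        idx = bisect.bisect_left(self.tokens, (token(key), ""))
        return self.tokens[idx % len(self.tokens)][1]


def donors_when_adding(ring: Ring, new_node: str, keys: Iterable[str]) -> Set[str]:
    """Add new_node and return the nodes that had to hand keys over to it."""
    keys = list(keys)
    before = {k: ring.owner(k) for k in keys}
    ring.add_node(new_node)
    return {before[k] for k in keys if ring.owner(k) != before[k]}
```

Building two four-node rings, one with `vnodes_per_node=1` and one with `vnodes_per_node=64`, and calling `donors_when_adding` for a fifth node shows the difference: at most one donor in the first case, several in the second.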

The slides: http://www.slideshare.net/DataStax/college-credit-whats-new-in-apache-cassandra-12

Cassandra homepage.

CQL 3 Language Reference.

Comments (1)

January 2, 2013

Cassandra 1.2.0 released

Filed under: Cassandra — Patrick Durusau @ 2:33 pm

Cassandra 1.2.0 released by Jonathan Ellis.

From the post:

The new year is here, and so is Cassandra 1.2.0!

Key improvements include:

* Virtual nodes, which improve the granularity of capacity increases and dramatically improve repair and rebuild times in larger clusters. See also this post on upgrading an existing cluster to vnodes.
* CQL3 improvements, notably the addition of collection types, queryable system information, and a CQL-native protocol.
* Request tracing is available to both CQL and classic Thrift requests, and can also be managed programmatically.
* Atomic batches address the possibility of mid-batch coordinator failure.
* Configurable policies for disk failure
* Last but not least, many performance improvements to memory usage, column indexes, compaction, streaming, startup time, and more.

2013 is going to be another good year to be a Cassandra user!

Reminder: What’s New in Apache Cassandra 1.2 [Webinar], Wednesday, January 9, 2013, Time: 11AM PT / 2 PM ET.

Comments Off


Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity
