A reversion release of Cassandra. Details: Cassandra changes.
Looks like the holidays are going to be filled with upgrades and new releases!
Cassandra Conference, December 6, 2011, New York City
From the call for speakers:
BURLINGAME, Calif. – November 9, 2011 – DataStax, the commercial leader in Apache Cassandra™, along with the NYC Cassandra User Group, NoSQL NYC, and Big Data NYC are joining together to present the first Cassandra New York City conference on December 6. This all-day, two-track event will focus on enterprise use cases as well as the latest developments in Cassandra. Early bird registration is now open here.
Coming on the heels of a sold-out DataStax Cassandra SF earlier this year, the event will feature some of the most interesting Cassandra use-cases from up and down the Eastern Seaboard. Cassandra NYC will be keynoted by Jonathan Ellis, chairman of the Apache Cassandra project, who will highlight what’s new in Cassandra 1.0, and what’s in store for the future. Additional confirmed speakers include Nathan Marz, lead engineer for the Storm project at Twitter and Jim Ancona, systems architect at Constant Contact.
“With the recent 1.0 release, we are seeing users doing amazing new things with Cassandra that are going beyond even our expectations and imagination,” said Ellis. “We look forward to sharing these stories with the broader community, to further hasten the adoption and usage of Cassandra to meet their real-time, big data challenges.”
Call for Speakers and Press Registration
The call for speakers is now also open for the event. Submissions can be made to email@example.com.
Press interested in attending the event may contact Zenobia@intersectcom.com for a complimentary press pass.
The event will be held at the Lighthouse International Conference Center on 59th St.
I am not sure about “early bird” registration for an event less than a month away but this sounds quite interesting. I hope the presentations will be recorded and posted for asynchronous access.
From the announcement:
BURLINGAME, Calif. – Nov. 1, 2011 – DataStax, the commercial leader in Apache Cassandra™, today announced that DataStax Enterprise, the industry’s first distributed, scalable, and highly available database platform powered by Apache Cassandra™ 1.0, is now available.
“The ability to manage both real-time and analytic data in a simple, massively scalable, integrated solution is at the heart of challenges faced by most businesses with legacy databases,” said Billy Bosworth, CEO, DataStax. “Our goal is to ensure businesses can conquer these challenges with a modern application solution that provides operational simplicity, optimal performance and incredible cost savings.”
“Apache Cassandra is the scalable, high-impact, comprehensive data platform that is well-suited to the rapidly-growing real-time data needs of our social media platform,” said Christian Carollo, Senior Manager, Mobile for GameFly. “We leveraged the expertise of DataStax to deploy our new social media platform, and were able to complete the project without worrying about scale or distribution – we simply built a great application and Apache Cassandra took care of the rest.”
BTW, DataStax just added its 100th customer. You might recognize some of them: Netflix, Cisco, etc.
From Andreas Harth and Günter Ladwig:
[W]e are happy to announce the first public release of CumulusRDF, a Linked Data server that uses Apache Cassandra as a cloud-based storage backend. CumulusRDF provides a simple HTTP interface to manage RDF data stored in an Apache Cassandra cluster.
* By way of Apache Cassandra, CumulusRDF provides distributed, fault-tolerant and elastic RDF storage
* Supports Linked Data and triple pattern lookups
* Proxy mode: CumulusRDF can act as a proxy server for other Linked Data applications, allowing any RDF dataset to be deployed as Linked Data
This is a first beta release that is still somewhat rough around the edges, but the basic functionality works well. The HTTP interface is work-in-progress. Eventually, we plan to extend the storage model to support quads.
CumulusRDF is available from http://code.google.com/p/cumulusrdf/
See http://code.google.com/p/cumulusrdf/wiki/GettingStarted to get started using CumulusRDF.
There is also a paper on CumulusRDF that I presented at the Scalable Semantic Knowledge Base Systems (SSWS) workshop at ISWC last week.
Andreas Harth and Günter Ladwig
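The triple pattern lookups are the interesting part. The usual trick, which the CumulusRDF paper maps onto Cassandra's row/column layout, is to store each triple under several index permutations (SPO, POS, OSP) so that any pattern with at least one bound position becomes a prefix lookup. A plain-Python sketch of the idea, with dicts standing in for Cassandra rows (class and method names are mine, not CumulusRDF's API):

```python
class TripleStore:
    """Toy triple store with three index permutations, so any
    triple pattern can be answered by a prefix lookup."""

    def __init__(self):
        self.spo = {}  # subject -> predicate -> set(objects)
        self.pos = {}  # predicate -> object -> set(subjects)
        self.osp = {}  # object -> subject -> set(predicates)

    def add(self, s, p, o):
        # write the triple into all three permutations
        self.spo.setdefault(s, {}).setdefault(p, set()).add(o)
        self.pos.setdefault(p, {}).setdefault(o, set()).add(s)
        self.osp.setdefault(o, {}).setdefault(s, set()).add(p)

    def query(self, s=None, p=None, o=None):
        """Answer any of the 8 triple patterns; None is a wildcard.
        Pick the index whose leading position is bound, so the lookup
        is a prefix scan rather than a full scan."""
        if s is not None:
            for pred, objs in self.spo.get(s, {}).items():
                if p is not None and pred != p:
                    continue
                for obj in objs:
                    if o is None or obj == o:
                        yield (s, pred, obj)
        elif p is not None:
            for obj, subjs in self.pos.get(p, {}).items():
                if o is not None and obj != o:
                    continue
                for subj in subjs:
                    yield (subj, p, obj)
        elif o is not None:
            for subj, preds in self.osp.get(o, {}).items():
                for pred in preds:
                    yield (subj, pred, o)
        else:  # fully unbound (?, ?, ?): any index works
            for subj, preds in self.spo.items():
                for pred, objs in preds.items():
                    for obj in objs:
                        yield (subj, pred, obj)


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
print(list(store.query(p="knows")))
```

The price, of course, is writing every triple three times; the payoff is that no pattern ever requires a join against an unindexed position.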
Everybody knows I hate to be picky but the abstract of the paper promises:
Results on a cluster of up to 8 machines indicate that CumulusRDF is competitive to state-of-the-art distributed RDF stores.
But I didn’t see any comparison to “state-of-the-art” RDF stores, distributed or not. Did I just overlook something?
I ask because I think this approach has promise, at least as an exploration of indexing strategies for RDF and how usage scenarios may influence those strategies. But that will be difficult to evaluate in the absence of comparison to less imaginative approaches to RDF indexing.
It doesn’t get much better or fresher (for non-attendees) than this!
And yes, I made a separate blog post on Neo4j and Dr. Who. What can I say? I am a fan of both.
From the webpage:
We’re announcing today the first source code release of Usergrid, a comprehensive platform stack for mobile and rich client applications. The entire codebase is now available on GitHub at https://github.com/usergrid/stack. Usergrid is built in Java and runs on top of Cassandra. Although we built Usergrid as a highly scalable cloud service, we’ve also taken a few steps to make it easy to run “small”, including providing a double-clickable desktop app that lets you run your own personal installation on your desktop, so you can get started right away.
I thought I read about “rich clients” with HTML5.
But the W3C web design team buried the HTML5 draft five clicks deep from their homepage. Good thing I knew to keep looking. That’s not just poor marketing, that’s also poor design.
A future of incompatibility awaits.
From the post:
I was looking at Cassandra, one of the major NoSQL solutions, and I was immediately impressed with its write speed even on my notebook. But I also noticed that it was very volatile in its response time, so I took a deeper look at it.
Michael Kopp uses dynaTrace to look inside Cassandra. There is a lot of information along the way, and hopefully his conclusion will make you read this post and those he promises to follow.
NoSQL or BigData Solutions are very very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and most importantly how it is used! Although Cassandra is lightning fast and mostly I/O bound it’s still Java and you have the usual problems – e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn’t explain here, but seeing the flow end-to-end really helps to understand whether the time is spent on the client, network or server and makes the runtime dynamics of Cassandra much clearer.
Understanding is really the key for effective usage of NoSQL solutions as we shall see in my next blogs. New problem patterns emerge and they cannot be solved by simply adding an index here or there. It really requires you to understand the usage pattern from the application point of view. The good news is that these new solutions allow us a really deep look into their inner workings, at least if you have the right tools at hand.
What tools are you using to “look inside” your topic map engine?
From the post:
Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.
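As a rough intuition for why this helps (this is not Cassandra's actual implementation, which uses block-level compression of SSTables): rows in a column family repeat the same column names and similar values over and over, which is exactly what a general-purpose compressor eats up. A quick sketch with zlib standing in for the real compressor:

```python
# Illustration only: serialize 1000 repetitive "rows" the way an
# on-disk table might, and see how much a compressor recovers.
import zlib

rows = b"".join(
    b"user:%05d|name=user%d|status=active|country=US\n" % (i, i)
    for i in range(1000)
)
compressed = zlib.compress(rows)
ratio = len(compressed) / len(rows)
print(f"raw {len(rows)} bytes, compressed {len(compressed)} bytes "
      f"(ratio {ratio:.2f})")
```

The repeated column names compress to almost nothing, which is why read-dominated workloads also see less disk I/O: fewer bytes on disk means fewer bytes to read.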
OK, maybe someone can help me here.
As a standards editor I understand being optimistic about what is “…going to appear…” in a future release, but isn’t version 0.8.6 a little early to be treating features for 1.0 as available? (I don’t find “compression” mentioned in the cumulative release notes as of 0.8.6.)
May just be me.
Aggregation of feeds on Cassandra. If you need to follow Cassandra closely, this would be among your first stops.
From the post:
Was it just two or three years ago when choosing a database was easy? Those with a Cadillac budget bought Oracle, those in a Microsoft shop installed SQL Server, those with no budget chose MySQL. Everyone in between tried to figure out where they belonged.
Those days are gone forever. Everyone and his brother are coming out with their own open source project for storing information. In most cases, these projects are tossing aside many of the belts-and-suspenders protections that people expect from the classic databases. There are enough of them now that some joker started calling them NoSQL and claiming, perhaps tongue-in-cheek, that the acronym stood for Not Only SQL.
I remember reading somewhere that the #1 reason for firing sysadmins was failure to maintain proper backups. An RDBMS isn’t a magic answer to data security, and anyone who thinks so is probably a former sysadmin at one or more locations.
You need to read Jim Gray’s Transaction Processing: Concepts and Techniques if you want to design reliable systems. Or that is at least one of the works you need to read.
Do use the “print” option so you can read the article while avoiding most of the annoying distractions typical for this type of site.
Not detailed enough to be particularly useful. Actually, I haven’t seen a comparison yet that was detailed enough to be really useful. I suppose that is partly because the approaches are so different that it would be hard to compare apples with apples.
What might be useful would be to compare the use cases where each system claims to excel. Now that might be a continuum of interest to readers.
What do you think?
Cassandra: Introduction for System Administrators by Nathan Milford.
Introductory slide deck for administrators interested in Cassandra (or being asked to participate in its use).
Suggestions for slide 6 that reads in part:
Figure in Greek Mythology, sounds like Pig
True enough but in terms of a control language, the play Pygmalion by Shaw would have been the better reference.
I presume the reader/listener would get the sound similarity without prompting.
Sorry, read the slide deck and see the source code at: https://github.com/jeromatron/pygmalion/.
Indexing in Cassandra by Ed Anuff.
As if you haven’t noticed by now, I have a real weakness for indexing and indexing related material.
Interesting coverage of composite indexes.
NoSQL @ Netflix, Part 2 by Sid Anand.
OSCON 2011 presentation.
I think the RDBMS Concepts to Key-Value Store Concepts was the best part of the slide deck.
What do you think?
Slides with videos to follow!
From the website:
- Jonathan Ellis (DataStax) – State of Cassandra, 2011 (Slides)
- Ed Anuff – Indexing in Cassandra (Slides)
- Gary Dusbabek (Rackspace) – Cassandra Internals (Slides)
- Sylvain Lebresne (DataStax) – Counters in Cassandra (Slides)
High-Level Cassandra Development
- Eric Evans (Rackspace) – CQL – Not just NoSQL, It’s MoSQL (Slides)
- Jake Luciani (DataStax) – Scaling Solr with Cassandra (Slides)
- Ben Coverston (DataStax) – Redesigned Compaction: LevelDB (Slides)
- Joaquin Casares (DataStax) – The Auto-Clustering Brisk AMI (Slides)
- Matt Dennis (DataStax) – Cassandra Anti-Patterns (Slides)
- Mike Bulman (DataStax) – OpsCenter: Cluster Management Doesn’t Have To Be Hard (Slides)
- Stu Hood (Twitter) – Prometheus’ Patch: #674 and You (Slides)
- Jeremy Hanna (Dachis) – Using Pig alongside Cassandra (Slides)
- Matt Dennis (DataStax) – Data Modeling Workshop (Slides)
- Nate McCall (DataStax) – Cassandra for Java Developers (Slides)
- Yewei Zhang (DataStax) – Hive Over Brisk (Slides)
- Jake Luciani (DataStax) – Introduction to Brisk (Slides)
- Kyle Roche (Isidorey) – Cloudsandra: Multi-tenant Platform Built on Brisk (Slides)
- Adrian Cockcroft (Netflix) – Migrating Netflix from DataCenter Oracle to Global Cassandra (Slides)
- Chris Goffinet (Twitter) – Cassandra at Twitter (Slides)
- David Strauss (Pantheon) – Highly Available DNS and Request Routing Using Apache Cassandra (Slides)
- Edward Capriolo (media6degrees) – Real World Capacity Planning: Cassandra on Blades and Big Iron (Slides)
- Eric Onnen (Urban Airship) – From 100s to 100s of Millions (Slides)
From the post:
I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.
Good article on indexes in Cassandra.
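The "roll your own index column family" pattern the article discusses is worth seeing concretely: alongside the data column family you maintain a second column family whose row key is the indexed value and whose columns are the keys of the matching rows. A sketch with plain dicts standing in for column families (names are illustrative, not Hector or Cassandra API):

```python
# Data CF: user_id -> {column_name: value}
users = {}
# Index CF: state -> {user_id: None} (column names are the "hits")
users_by_state = {}

def put_user(user_id, columns):
    """Write a user and keep the index CF in sync. Note the
    read-before-write: we must find and remove the stale index
    entry ourselves, which is the main cost of manual indexes."""
    old = users.get(user_id, {})
    if "state" in old:
        users_by_state.get(old["state"], {}).pop(user_id, None)
    users[user_id] = columns
    if "state" in columns:
        users_by_state.setdefault(columns["state"], {})[user_id] = None

def users_in_state(state):
    """An index lookup is just a slice of one index row."""
    return sorted(users_by_state.get(state, {}))
```

The read-before-write in `put_user` is exactly the bookkeeping that native secondary indexes do for you, and it is also why manual indexes can drift if an update path forgets it.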
About the post:
This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.
From the post:
The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it a go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including Cassandra, MongoDB and Neo4J.
Promises to be an interesting series of posts that focus on a common data set and problem!
From the webpage:
Our view is that the new data intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Storage Platform is the result of a radical re-think of those assumptions; the result is high performance from low cost commodity hardware.
It includes the Acunu Storage Core which runs in the Linux kernel. On top of this core, we provide a modified version of Apache Cassandra. This is essentially the same as “vanilla” Cassandra but uses the Acunu Storage Core to store data instead of the Linux file system and is therefore able to take advantage of the performance benefits of our platform. In addition to Cassandra, there is also an object store similar to Amazon’s S3; we have a number of other more experimental projects in the pipeline which we’ll talk about in future posts.
Perhaps the start of something very interesting.
It took NoSQL a couple of years to flower into the range of current offerings.
I wonder if working in the kernel will follow a similar path.
Will we see a graph engine as part of the kernel?
Good thumbnail comparison of the major features of all six (6) NoSQL databases by Kristóf Kovács.
Sorry to see that Neo4J didn’t make the comparison.
From the post:
Today, DataStax, the commercial leader in Apache Cassandra™, released DataStax’ Brisk – a second-generation open-source Hadoop distribution that eliminates the key operational complexities with deploying and running Hadoop and Hive in production. Brisk is powered by Cassandra and offers a single platform containing a low-latency database for extremely high-volume web and real-time applications, while providing tightly coupled Hadoop and Hive analytics.
Download Brisk -> Here.
This series starts here and goes for five (5) parts for Cassandra 0.6.4.
From the introduction:
I’m going to write a few postings on how to use the Cassandra database with Java. Although I am in no way an expert on how to use Cassandra, I am very intrigued by the database because of its small installation, high performance and scalability. During the writing of these posts I am also learning the Cassandra database, and I’m sharing my experiences with it through my posts on this blog.
Like I said before, Cassandra is a very high performing and scalable database; it doesn’t follow the normal SQL database principles like schemas, tables / columns, datatypes and a query language like SQL. Instead it’s a non-relational database similar to Google’s BigTable. Cassandra was initially developed by Facebook, which has contributed it to the open source community. Currently it is used by websites like Facebook, Twitter, Digg, Rackspace and many others. So even though it is still only version 0.6 at the time of writing, it has already proven itself in production environments.
It isn’t possible to say which (if any) of the NoSQL databases will prove to be the best fits for topic maps in particular or general situations.
What is clear is that a lot of experimentation and development is underway and hopefully the results will be interesting.
Podcasts from the London Cassandra User Group.
Cassandra – Thrift Application, Jools Enticknap: 21 February 2011
Cassandra in TweetMeme, Nick Telford: 21 February 2011
Cassandra Meetup: 17 January 2011
Cassandra London Meetup, Jake Luciani: 8 December 2010
In Cassandra 0.7, there are expiring columns.
From the blog:
Sometimes, data comes with an expiration date, either by its nature or because it’s simply intractable to keep all of a rapidly growing dataset indefinitely.
In most databases, the only way to deal with such expiring data is to write a job running periodically to delete what is expired. Unfortunately, this is usually both error-prone and inefficient: not only do you have to issue a high volume of deletions, but you often also have to scan through lots of data to find what is expired.
Fortunately, Cassandra 0.7 has a better solution: expiring columns. Whenever you insert a column, you can specify an optional TTL (time to live) for that column. When you do, the column will expire after the requested amount of time and be deleted auto-magically (though asynchronously — see below). Importantly, this was designed to be as low-overhead as possible.
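The quoted mechanism is easy to model. In the sketch below (illustrative only, not Cassandra code), each column carries an optional expiry time; reads filter out expired columns immediately, while physical deletion happens in a separate sweep, mimicking Cassandra's asynchronous removal during compaction. An explicit `now` parameter makes the behavior testable:

```python
import time

class Row:
    """Toy row of columns, each with an optional TTL."""

    def __init__(self):
        self._cols = {}  # name -> (value, expires_at or None)

    def insert(self, name, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        self._cols[name] = (value, expires)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        value, expires = self._cols.get(name, (None, None))
        if expires is not None and now >= expires:
            return None  # expired columns are invisible to reads
        return value

    def sweep(self, now=None):
        """Physically delete expired columns; in Cassandra this
        happens lazily, during compaction."""
        now = time.time() if now is None else now
        self._cols = {n: (v, e) for n, (v, e) in self._cols.items()
                      if e is None or now < e}
```

The split between "invisible on read" and "deleted later" is the key design point: expiry costs nothing at read time beyond a timestamp comparison, and no periodic delete job ever has to scan for expired data.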
Now there is an interesting idea!
Goes along with the idea that a topic map does not (should not?) present a timeless view of information. That is, a topic map should maintain state so that we can determine what was known at any particular time.
Take a simple example, a call for papers for a conference. It could be that a group of conferences all share the same call for papers, the form, submission guidelines, etc. And that call for papers is associated with each conference by an association.
Shouldn’t we be able to set an expiration date on that association so that at some point in time, all those facilities are no longer available for that conference? Perhaps it switches over to another set of properties in the same association to note that the submission dates have passed? That would remove the necessity for the association expiring.
But there are cases where associations do expire, or at least end. Divorce is an unhappy example; being hired is a happier one.
Something to think about.
From the website:
Agamemnon is a thin library built on top of pycassa. It allows you to use the Cassandra database (http://cassandra.apache.org) as a graph database. Much of the api was inspired by the excellent neo4j.py project (http://components.neo4j.org/neo4j.py/snapshot/)
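The idea of layering a graph over a column store is straightforward: vertices become rows, and each vertex row's columns name its outgoing edges. A minimal sketch of that layering, with method names loosely echoing neo4j.py-style APIs but entirely hypothetical, not Agamemnon's actual interface:

```python
class Graph:
    """Toy property graph over two dicts standing in for
    column families (one for vertices, one for edges)."""

    def __init__(self):
        self.vertices = {}   # node_id -> property dict
        self.out_edges = {}  # node_id -> {(rel_type, target_id): props}

    def create_node(self, node_id, **props):
        self.vertices[node_id] = props
        self.out_edges.setdefault(node_id, {})
        return node_id

    def relate(self, src, rel, dst, **props):
        # an edge is just a column in the source vertex's edge row
        self.out_edges.setdefault(src, {})[(rel, dst)] = props

    def neighbors(self, node_id, rel=None):
        return [dst for (r, dst) in self.out_edges.get(node_id, {})
                if rel is None or r == rel]
```

Traversal then reduces to repeated row lookups, which is why a fast key-value store like Cassandra is a plausible backend for this kind of library.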
Thanks to Jack Park for pointing this out!
A bit dated now but I thought some readers might find it useful.
From the posting:
If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; It’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.
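For what it is worth, the pre-1.0 data model is often easiest to grasp as a nested map: a keyspace holds column families, a column family holds rows by key, and each row holds columns by name. A plain-Python picture (illustrative only, with made-up keyspace and row names):

```python
keyspace = {
    "Users": {                        # column family
        "alice": {                    # row key
            "email": "a@example.com", # column name -> value
            "city": "NYC",
        },
        "bob": {
            "email": "b@example.com",
        },
    },
    "Tweets": {                       # another column family
        "tweet42": {"author": "alice", "body": "hello"},
    },
}

# the equivalent of "get Users['alice']['city']" in cassandra-cli:
city = keyspace["Users"]["alice"]["city"]
```

Two details the dict analogy hides: columns within a row are kept sorted by name, and rows need not share the same set of columns (note bob has no "city"), which is where the "schema-free" confusion usually starts.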
Seems like I have heard about grouping sets of key/value pairs before but I will have to look for it.
More seriously, the current wave of data sets only aggravates the known semantic impedance problem.
A wave of data sets that promises to only increase.
So semantic impedance is going to increase.
Semantic impedance can be:
Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.
7-11 November 2011 Vancouver
From the website:
This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:
- … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
- … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
- … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
- … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
- … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
- … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
- … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)
What would be even cooler, would be to have real-time associations with subjects that have information from outside the data set.
Or better yet, real-time on-demand associations with subjects that have information from outside the data set.
I suppose the classic use case would be running stats on all the sports events on a Saturday or Sunday, including individuals stats and merging in the latest doping, paternity and similar tests.
NoSQL Databases: Why, what and when by Lorenzo Alberton.
When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.
It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).
I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.
From the post:
I have to admit I’ve never really been happy with Cassandra’s data model, or to be more precise, I’ve never really been happy with my understanding of the model. However, I’ve realized that if we think of two use cases for column families then things may become a bit clearer. For me, column families can be used in one of two ways: either as a record or an ordered list.
I thought it was helpful, what do you think?
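The two usages are easy to see side by side in a toy model (illustrative Python, not Cassandra code). The same structure, a row of columns sorted by name, serves as a record when column names are field names, and as an ordered list when the sort order of the column names is the list order, e.g. timestamps in a timeline:

```python
def sorted_row(columns):
    """Cassandra keeps a row's columns sorted by column name;
    emulate that with a dict built in sorted order."""
    return dict(sorted(columns.items()))

# Usage 1, record: one row per user, column names are field names.
user = sorted_row({"name": "Ann", "email": "a@example.com"})

# Usage 2, ordered list: one row per timeline, column names are
# timestamps, values are the listed items. Sorting by column name
# *is* the chronological order.
timeline = sorted_row({
    "2011-02-23T09:15": "tweet42",
    "2011-02-22T18:01": "tweet17",
    "2011-02-24T07:30": "tweet99",
})
newest_first = list(reversed(list(timeline.values())))
```

Once you see the second usage, range slices over column names stop looking like an odd API and start looking like "give me this span of the list."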