Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 21, 2011

Cassandra Write Performance – A quick look inside

Filed under: Cassandra,NoSQL,Software — Patrick Durusau @ 7:09 pm

Cassandra Write Performance – A quick look inside

From the post:

I was looking at Cassandra, one of the major NoSQL solutions, and I was immediately impressed with its write speed even on my notebook. But I also noticed that it was very volatile in its response time, so I took a deeper look at it.

Michael Kopp uses dynaTrace to look inside Cassandra. There is lots of information in between, and hopefully his conclusion will make you read this post and those he promises will follow.

Conclusion

NoSQL or BigData Solutions are very very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and most importantly how it is used! Although Cassandra is lightning fast and mostly I/O bound it’s still Java and you have the usual problems – e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn’t explain here, but seeing the flow end-to-end really helps to understand whether the time is spent on the client, network or server and makes the runtime dynamics of Cassandra much clearer.

Understanding is really the key for effective usage of NoSQL solutions as we shall see in my next blogs. New problem patterns emerge and they cannot be solved by simply adding an index here or there. It really requires you to understand the usage pattern from the application point of view. The good news is that these new solutions allow us a really deep look into their inner workings, at least if you have the right tools at hand.

What tools are you using to “look inside” your topic map engine?

What’s new in Cassandra 1.0: Compression

Filed under: Cassandra,NoSQL — Patrick Durusau @ 7:08 pm

What’s new in Cassandra 1.0: Compression

From the post:

Cassandra 1.0 introduces support for data compression on a per-ColumnFamily basis, one of the most-requested features since the project started. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk. In addition to the space-saving benefits, compression also reduces disk I/O, particularly for read-dominated workloads.
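To get a feel for the space-saving claim, here is a minimal sketch that compresses some repetitive, made-up column data. It uses zlib from the Python standard library purely as a stand-in; it says nothing about which codec Cassandra uses or how it lays out data on disk, and the row format is invented for illustration.

```python
import zlib

# Hypothetical column values: repetitive, as much row-oriented data tends to be.
rows = [
    f"user:{i:06d}|status=active|country=US".encode("utf-8")
    for i in range(1000)
]
raw = b"\n".join(rows)

# Compress the whole block, roughly as a store might compress a chunk on disk.
compressed = zlib.compress(raw)


def compression_ratio(original: bytes, packed: bytes) -> float:
    """Return the compressed size as a fraction of the original size."""
    return len(packed) / len(original)


ratio = compression_ratio(raw, compressed)

# Decompression must give back exactly what was stored.
restored = zlib.decompress(compressed)
```

Fewer bytes on disk also means fewer bytes to read back, which is where the I/O saving for read-heavy workloads would come from.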

OK, maybe someone can help me here.

Cassandra, an Apache project, just released version 0.8.6. Here are the release notes for 0.8.6.

As a standards editor I understand being optimistic about what is “…going to appear…” in a future release, but isn’t version 0.8.6 a little early to be describing features for 1.0? (I don’t find “compression” mentioned in the cumulative release notes as of 0.8.6.)

May just be me.

August 14, 2011

Planet Cassandra

Filed under: Cassandra,NoSQL — Patrick Durusau @ 7:13 pm

Planet Cassandra

Aggregation of feeds on Cassandra. If you need to follow Cassandra closely, this would be among your first stops.

August 12, 2011

NoSQL standouts: New databases for new applications

Filed under: Cassandra,CouchDB,FlockDB,Neo4j — Patrick Durusau @ 7:21 pm

NoSQL standouts: New databases for new applications: Cassandra, CouchDB, MongoDB, Redis, Riak, Neo4J, and FlockDB reinvent the data store.

From the post:

Was it just two or three years ago when choosing a database was easy? Those with a Cadillac budget bought Oracle, those in a Microsoft shop installed SQL Server, those with no budget chose MySQL. Everyone in between tried to figure out where they belonged.

Those days are gone forever. Everyone and his brother are coming out with their own open source project for storing information. In most cases, these projects are tossing aside many of the belts-and-suspenders protections that people expect from the classic databases. There are enough of them now that some joker started calling them NoSQL and claiming, perhaps tongue-in-cheek, that the acronym stood for Not Only SQL.

I remember reading somewhere that the #1 reason for firing sysadmins was failure to maintain proper backups. An RDBMS isn’t a magic answer to data security, and anyone who thinks so is probably a former sysadmin at one or more locations. 😉

You need to read Jim Gray’s Transaction Processing: Concepts and Techniques if you want to design reliable systems. Or that is at least one of the works you need to read.

Do use the “print” option so you can read the article while avoiding most of the annoying distractions typical for this type of site.

Not detailed enough to be particularly useful. Actually, I haven’t seen a comparison yet that was detailed enough to be really useful. I suppose that is in part because the approaches are so different that it would be hard to compare apples with apples.

What might be useful would be to compare the use cases where each system claims to excel. Now that might be a continuum of interest to readers.

What do you think?

August 11, 2011

Cassandra: Introduction for System Administrators

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:32 pm

Cassandra: Introduction for System Administrators by Nathan Milford.

Introductory slide deck for administrators interested in Cassandra (or being asked to participate in its use).

August 1, 2011

Pig with Cassandra: Adventures in Analytics

Filed under: Cassandra,Pig,Pygmalion — Patrick Durusau @ 3:54 pm

Pig with Cassandra: Adventures in Analytics

Suggestions for slide 6 that reads in part:

Pygmalion

Figure in Greek Mythology, sounds like Pig

True enough but in terms of a control language, the play Pygmalion by Shaw would have been the better reference.

I presume the reader/listener would get the sound similarity without prompting.

Sorry, read the slide deck and see the source code at: https://github.com/jeromatron/pygmalion/.

July 28, 2011

Indexing in Cassandra

Filed under: Cassandra,Indexing — Patrick Durusau @ 6:55 pm

Indexing in Cassandra by Ed Anuff.

As if you haven’t noticed by now, I have a real weakness for indexing and indexing related material.

Interesting coverage of composite indexes.

July 27, 2011

NoSQL @ Netflix, Part 2

Filed under: Cassandra,NoSQL,SQL — Patrick Durusau @ 2:17 pm

NoSQL @ Netflix, Part 2 by Sid Anand.

OSCON 2011 presentation.

I think the RDBMS Concepts to Key-Value Store Concepts was the best part of the slide deck.

What do you think?

July 20, 2011

Cassandra SF 2011

Filed under: Cassandra,Conferences — Patrick Durusau @ 12:55 pm

Cassandra SF 2011

Slides with videos to follow!

From the website:

Keynote Presentation

  • Jonathan Ellis (DataStax): State of Cassandra, 2011 (Slides)

Cassandra Internals

  • Ed Anuff: Indexing in Cassandra (Slides)
  • Gary Dusbabek (Rackspace): Cassandra Internals (Slides)
  • Sylvain Lebresne (DataStax): Counters in Cassandra (Slides)

High-Level Cassandra Development

  • Eric Evans (Rackspace): CQL – Not just NoSQL, It’s MoSQL (Slides)
  • Jake Luciani (DataStax): Scaling Solr with Cassandra (Slides)

Lightning Talks

  • Ben Coverston (DataStax): Redesigned Compaction LevelDB (Slides)
  • Joaquin Casares (DataStax): The Auto-Clustering Brisk AMI (Slides)
  • Matt Dennis (DataStax): Cassandra Anti-Patterns (Slides)
  • Mike Bulman (DataStax): OpsCenter: Cluster Management Doesn’t Have To Be Hard (Slides)
  • Stu Hood (Twitter): Prometheus’ Patch: #674 and You (Slides)

Practical Development

  • Jeremy Hanna (Dachis): Using Pig alongside Cassandra (Slides)
  • Matt Dennis (DataStax): Data Modeling Workshop (Slides)
  • Nate McCall (DataStax): Cassandra for Java Developers (Slides)
  • Yewei Zhang (DataStax): Hive Over Brisk (Slides)

Products

  • Jake Luciani (DataStax): Introduction to Brisk (Slides)
  • Kyle Roche (Isidorey): Cloudsandra: Multi-tenant Platform Build on Brisk (Slides)

Use Cases

  • Adrian Cockcroft (Netflix): Migrating Netflix from DataCenter Oracle to Global Cassandra (Slides)
  • Chris Goffinet (Twitter): Cassandra at Twitter (Slides)
  • David Strauss (Pantheon): Highly Available DNS and Request Routing Using Apache Cassandra (Slides)
  • Edward Capriolo (media6degrees): Real World Capacity Planning: Cassandra on Blades and Big Iron (Slides)
  • Eric Onnen (Urban Airship): From 100s to 100′s of Millions (Slides)

July 9, 2011

Indexing in Cassandra

Filed under: Cassandra,Indexing — Patrick Durusau @ 7:00 pm

Indexing in Cassandra

From the post:

I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.

Good article on indexes in Cassandra.

June 28, 2011

Big Data Genomics – How to efficiently store and retrieve mutation data

Filed under: Bioinformatics,Biomedical,Cassandra — Patrick Durusau @ 9:49 am

Big Data Genomics – How to efficiently store and retrieve mutation data by David Suvee.

About the post:

This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.

From the post:

The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including Cassandra, MongoDB and Neo4J.

Promises to be an interesting series of posts that focus on a common data set and problem!

May 25, 2011

Near Bare Metal – Acunu

Filed under: Acunu,Cassandra,NoSQL — Patrick Durusau @ 1:27 pm

Acunu Storage Platform

From the webpage:

The Acunu Storage Platform is a powerful storage solution that brings simpler, faster and more predictable performance to NOSQL stores like Apache Cassandra.

Our view is that the new data intensive workloads that are increasingly common are a poor match for the legacy storage systems they tend to run on. These systems are built on a set of assumptions about the capacity and performance of hardware that are simply no longer true. The Acunu Storage Platform is the result of a radical re-think of those assumptions; the result is high performance from low cost commodity hardware.

It includes the Acunu Storage Core which runs in the Linux kernel. On top of this core, we provide a modified version of Apache Cassandra. This is essentially the same as “vanilla” Cassandra but uses the Acunu Storage Core to store data instead of the Linux file system and is therefore able to take advantage of the performance benefits of our platform. In addition to Cassandra, there is also an object store similar to Amazon’s S3; we have a number of other more experimental projects in the pipeline which we’ll talk about in future posts.

Perhaps the start of something very interesting.

It took NoSQL a couple of years to flower into the range of current offerings.

I wonder if working in the kernel will have a similar path?

Will we see a graph engine as part of the kernel?

May 12, 2011

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Filed under: Cassandra,CouchDB,HBase,MongoDB,NoSQL,Redis,Riak — Patrick Durusau @ 7:56 am

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Good thumbnail comparison of the major features of all six (6) NoSQL databases by Kristóf Kovács.

Sorry to see that Neo4J didn’t make the comparison.

May 10, 2011

Brisk: Simpler, More Reliable, High-Performance Hadoop Solution

Filed under: Brisk,Cassandra,Hadoop — Patrick Durusau @ 3:30 pm

DataStax Releases Dramatically Simpler, More Reliable, High-Performance Hadoop Solution

From NoSQLDatabases, coverage of Brisk, a second-generation Hadoop solution from DataStax.

From the post:

Today, DataStax, the commercial leader in Apache Cassandra™, released DataStax’ Brisk – a second-generation open-source Hadoop distribution that eliminates the key operational complexities with deploying and running Hadoop and Hive in production. Brisk is powered by Cassandra and offers a single platform containing a low-latency database for extremely high-volume web and real-time applications, while providing tightly coupled Hadoop and Hive analytics.

Download Brisk -> Here.

May 7, 2011

Cassandra – New Beta

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:50 pm

Cassandra – New Beta

Version 0.8.0 beta2 has been posted!

Changes.

May 1, 2011

Installing and using Apache Cassandra With Java Part 1 (Installation)

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:25 pm

Installing and using Apache Cassandra With Java Part 1 (Installation)

This series starts here and goes for five (5) parts for Cassandra 0.6.4.

From the introduction:

I’m going to write a few postings on how to use the Cassandra database with Java, although i am in no way an expert on how to use Cassandra i am very intrigued about the database because of it’s small installation, high performance and scalability. During the writing of these posts i am also learning the Cassandra database and i’m sharing my experiences with it through my posts on this blog.

Like i said before, Cassandra is a very high performing and scalable database, it doesn’t follow the normal SQL database principles like schema’s, tables / columns, datatypes and a query language like SQL. Instead it’s a non-relational database similar to Google’s BigTable. Cassandra was initially developed by Facebook which has contributed it to the open source community. Currently it is used by websites like Facebook, Twitter, Digg, Rackspace and many others. So even though it is still only version 0.6 at the time of writing this it has already proven itself in production environments.

It isn’t possible to say which (if any) of the NoSQL databases will prove to be the best fits for topic maps in particular or general situations.

What is clear is that a lot of experimentation and development is underway and hopefully the results will be interesting.

March 17, 2011

Cassandra – London Podcasts

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:48 pm

Cassandra – London Podcasts

Podcasts from the London Cassandra User Group.

Cassandra – Thrift Application Jools Enticknap: 21 February 2011

Cassandra in TWEETMEME Nick Telford: 21 February 2011

Cassandra Meetup January 17th Jan 2011

Cassandra London Meetup Jake Luciani : 8th Dec 2010

March 15, 2011

Expiring columns

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:08 am

Expiring columns

In Cassandra 0.7, there are expiring columns.

From the blog:

Sometimes, data comes with an expiration date, either by its nature or because it’s simply intractable to keep all of a rapidly growing dataset indefinitely.

In most databases, the only way to deal with such expiring data is to write a job running periodically to delete what is expired. Unfortunately, this is usually both error-prone and inefficient: not only do you have to issue a high volume of deletions, but you often also have to scan through lots of data to find what is expired.

Fortunately, Cassandra 0.7 has a better solution: expiring columns. Whenever you insert a column, you can specify an optional TTL (time to live) for that column. When you do, the column will expire after the requested amount of time and be deleted auto-magically (though asynchronously — see below). Importantly, this was designed to be as low-overhead as possible.
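The behavior described in the quote can be modeled in a few lines. What follows is a toy in-memory sketch, not Cassandra's implementation or client API: the class name `ExpiringColumns`, its methods, and the injectable clock are all hypothetical and exist only to illustrate the idea that an expired column disappears from reads immediately while physical removal happens later.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringColumns:
    """Toy in-memory row whose columns may carry a TTL in seconds."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # column name -> (value, expiry timestamp or None for "never expires")
        self._columns: Dict[str, Tuple[str, Optional[float]]] = {}

    def insert(self, name: str, value: str, ttl: Optional[float] = None) -> None:
        """Store a column; with a TTL it expires ttl seconds from now."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._columns[name] = (value, expires_at)

    def get(self, name: str) -> Optional[str]:
        """Return the value, or None if the column is missing or expired."""
        entry = self._columns.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            return None  # expired: hidden from reads even if not yet purged
        return value

    def purge(self) -> int:
        """Physically drop expired columns; return how many were removed."""
        now = self._clock()
        dead = [
            name
            for name, (_, expires_at) in self._columns.items()
            if expires_at is not None and now >= expires_at
        ]
        for name in dead:
            del self._columns[name]
        return len(dead)


# A fake clock makes the example deterministic.
now = [1000.0]
row = ExpiringColumns(clock=lambda: now[0])
row.insert("session", "abc123", ttl=60)  # expires after 60 seconds
row.insert("name", "alice")              # no TTL, never expires

now[0] += 61  # advance the fake clock past the TTL
session_after = row.get("session")
name_after = row.get("name")
purged = row.purge()
```

No periodic scan-and-delete job is needed to keep reads correct; `purge` stands in for whatever asynchronous cleanup the real system performs.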

Now there is an interesting idea!

Goes along with the idea that a topic map does not (should not?) present a timeless view of information. That is, a topic map should maintain state so that we can determine what was known at any particular time.

Take a simple example, a call for papers for a conference. It could be that a group of conferences all share the same call for papers, the form, submission guidelines, etc. And that call for papers is associated with each conference by an association.

Shouldn’t we be able to set an expiration date on that association so that at some point in time, all those facilities are no longer available for that conference? Perhaps it switches over to another set of properties in the same association to note that the submission dates have passed? That would remove the necessity for the association expiring.

But there are cases where associations do expire or at least end. Divorce is an unhappy example. Being hired is a happier one.
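One way to picture associations that begin and end is to give each one a validity interval and ask which associations hold on a given day. This is only a sketch under my own assumptions: the `Association` class, its field names, and the sample data are hypothetical and are not drawn from any topic map standard or engine.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Association:
    """An association between subjects that holds for a span of time."""

    kind: str
    members: Tuple[str, ...]
    valid_from: date
    valid_until: Optional[date] = None  # None means still in effect

    def holds_on(self, day: date) -> bool:
        """True if the association is in effect on the given day."""
        if day < self.valid_from:
            return False
        return self.valid_until is None or day < self.valid_until


def associations_as_of(assocs: List[Association], day: date) -> List[Association]:
    """Return the associations that were in effect on the given day."""
    return [a for a in assocs if a.holds_on(day)]


assocs = [
    # A call for papers that stops applying once the deadline passes.
    Association(
        "call-for-papers",
        ("cfp-2011", "conference-a"),
        date(2011, 1, 1),
        date(2011, 4, 30),
    ),
    # Being hired: an association with a start but (so far) no end.
    Association("employment", ("alice", "acme"), date(2011, 3, 1)),
]

in_march = associations_as_of(assocs, date(2011, 3, 15))
in_june = associations_as_of(assocs, date(2011, 6, 1))
```

Nothing is deleted here; the ended association stays in the list, so the question "what was known in March?" can still be answered in June.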

Something to think about.

March 11, 2011

agamemnon

Filed under: Cassandra,Graphs,NoSQL — Patrick Durusau @ 6:59 pm

agamemnon

From the website:

Agamemnon is a thin library built on top of pycassa. It allows you to use the Cassandra database (http://cassandra.apache.org) as a graph database. Much of the api was inspired by the excellent neo4j.py project (http://components.neo4j.org/neo4j.py/snapshot/)

Thanks to Jack Park for pointing this out!

March 5, 2011

Cassandra Data Model – Semantic Impedance

Filed under: Cassandra,NoSQL — Patrick Durusau @ 3:13 pm

WTF is a SuperColumn? An Intro to the Cassandra Data Model

A bit dated now but I thought some readers might find it useful.

From the posting:

If you’re coming from an RDBMS background (which is almost everyone) you’ll probably trip over some of the naming conventions while learning about Cassandra’s data model. It took me and my team members at Digg a couple days of talking things out before we “got it”. In recent weeks a bikeshed went down in the dev mailing list proposing a completely new naming scheme to alleviate some of the confusion. Throughout this discussion I kept thinking: “maybe if there were some decent examples out there people wouldn’t get so confused by the naming.” So, this is my stab at explaining Cassandra’s data model; It’s intended to help you get your feet wet & doesn’t go into every single detail but, hopefully, it helps clarify a few things.

Seems like I have heard about grouping sets of key/value pairs before but I will have to look for it. 😉
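For readers who think better in data structures, the grouping can be pictured as nested maps. This is a simplified sketch using plain Python dictionaries: it leaves out timestamps, column ordering, and everything else a real store tracks, and the row keys, column names, and helper functions are made up for illustration.

```python
from typing import Dict

# Standard column family: row key -> {column name -> value}
users: Dict[str, Dict[str, str]] = {
    "alice": {"email": "alice@example.com", "city": "Atlanta"},
}

# Super column family: row key -> {super column name -> {column name -> value}}
address_book: Dict[str, Dict[str, Dict[str, str]]] = {
    "alice": {
        "home": {"street": "1 Main St", "zip": "30301"},
        "work": {"street": "2 Office Rd", "zip": "30302"},
    },
}


def get_column(cf: Dict[str, Dict[str, str]], row_key: str, column: str) -> str:
    """Look up one column in a standard column family."""
    return cf[row_key][column]


def get_subcolumn(
    scf: Dict[str, Dict[str, Dict[str, str]]],
    row_key: str,
    super_column: str,
    column: str,
) -> str:
    """Look up one column inside a super column: one extra level of nesting."""
    return scf[row_key][super_column][column]


city = get_column(users, "alice", "city")
work_zip = get_subcolumn(address_book, "alice", "work", "zip")
```

Seen this way, a SuperColumn is just a named group of key/value pairs inside a row.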

More seriously, the current wave of data sets only aggravates the known semantic impedance problem.

A wave of data sets that promises to only increase.

So semantic impedance is going to increase.

Semantic impedance can be:

  • ignored – most current stove-piped information systems
  • save-the-world semantic solutions – poor adoption rates
  • broken by self-interested mapping that is reusable – the topic maps solution

March 4, 2011

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011 Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

March 3, 2011

Real-Time Log Processing System based on Flume and Cassandra – Post

Filed under: Cassandra,Flume,NoSQL — Patrick Durusau @ 10:01 am

Real-Time Log Processing System based on Flume and Cassandra

Very cool!

What would be even cooler would be to have real-time associations with subjects that have information from outside the data set.

Or better yet, real-time on-demand associations with subjects that have information from outside the data set.

I suppose the classic use case would be running stats on all the sports events on a Saturday or Sunday, including individual stats and merging in the latest doping, paternity and similar tests.

Other applications?

March 1, 2011

NoSQL Databases: Why, what and when

NoSQL Databases: Why, what and when by Lorenzo Alberton.

When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.

It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).

I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.

February 24, 2011

Cassandra’s data model as records and lists – Post

Filed under: Cassandra,NoSQL — Patrick Durusau @ 3:23 pm

Cassandra’s data model as records and lists

From the post:

I have to admit I’ve never really been happy with Cassandra’s data model, or to be more precise, I’ve never really been happy with my understanding of the model. However I’ve realized that if we think of two use cases for column families then things may become a bit clearer. For me, Column families can be used in one of two ways, either as a record or an ordered list.

I thought it was helpful. What do you think?
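The two uses named in the quote can be contrasted in a short sketch. This is my own simplified model rather than anything taken from the post or from Cassandra's API: the `OrderedColumns` class, its `slice` method, and the sample data are hypothetical. The point is only that a record is read by known column names, while an ordered list is read by ranges over sorted column names.

```python
import bisect
from typing import Dict, List, Tuple

# Use 1: a record. A small, known set of named columns, read by name.
user_record: Dict[str, str] = {"name": "Alice", "email": "alice@example.com"}


class OrderedColumns:
    """Use 2: an ordered list. Columns kept sorted by name, read by range."""

    def __init__(self) -> None:
        self._names: List[str] = []       # column names, always sorted
        self._values: Dict[str, str] = {}

    def insert(self, name: str, value: str) -> None:
        """Add or overwrite a column, keeping names in sorted order."""
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start: str, finish: str) -> List[Tuple[str, str]]:
        """Return columns whose names fall within [start, finish], in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Timestamps as column names give a time-ordered list, whatever the insert order.
timeline = OrderedColumns()
timeline.insert("2011-02-24T15:00", "third")
timeline.insert("2011-02-22T09:00", "first")
timeline.insert("2011-02-23T12:00", "second")

recent = timeline.slice("2011-02-23T00:00", "2011-02-24T23:59")
```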

February 17, 2011

Solandra

Filed under: Cassandra,Solr — Patrick Durusau @ 6:46 am

Solandra

From the website:

Solandra is a real-time distributed search engine built on Apache Solr and Apache Cassandra.

At its core Solandra is a tight integration of Solr and Cassandra, meaning within a single JVM both Solr and Cassandra are running, and documents are stored and distributed using Cassandra’s data model.

Solandra makes managing and dynamically growing Solr simple(r).

See the Solandra wiki for more details.

The more searching that occurs across diverse data sets, the more evident the use case(s) for topic maps will become.

Will you be there to answer the call?

February 15, 2011

Cassandra 0.7.1 Release

Filed under: Cassandra,NoSQL — Patrick Durusau @ 11:06 am

Cassandra 0.7.1

Largest production cluster reported to be 100 TB spread over 150 machines.

It occurs to me that most topic map engines support SQL backends.

I will be checking in on the SQL world for recent developments that are relevant to topic maps.

January 24, 2011

Cassandra – New Release

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:28 am

Cassandra – 0.7.0 released 2011-01-09.

Homepage reports that the largest production cluster has 100 terabytes of data in over 150 machines.

Sounds like a candidate for topic maps. Yes? 😉

January 10, 2011

NoSQL Tapes

Filed under: Cassandra,CouchDB,Graphs,MongoDB,Neo4j,Networks,NoSQL,OrientDB,Social Networks — Patrick Durusau @ 1:33 pm

NoSQL Tapes: A filmed compilation of interviews, explanations & case studies

From the email announcement by Tim Anglade:

Late last year, as the NOSQL Summer drew to a close, I got the itch to start another NOSQL community project. So, with the help of vendors Scality and InfiniteGraph, I toured around the world for 77 days to meet and record video interviews with 40+ NOSQL vendors, users and dudes-you-can-trust.

….

My original goals were to attempt to map a comprehensive view of the NOSQL world, its origins, its current trends and potential future. NOSQL knowledge seemed to me to be heavily fragmented and hard to reconcile across projects, vendors & opinions. I wanted to try to foster more sharing in our community and figure out what people thought ‘NOSQL’ meant. As it happens, I ended up learning quite a lot in the process (as I’m sure even seasoned NOSQLers on this list will too).

I’d like to take this opportunity to thank everybody who agreed to participate in this series: 10gen, Basho, Cloudant, CouchOne, FourSquare, Ben Black, RethinkDB, MarkLogic, Cloudera, SimpleGeo, LinkedIn, Membase, Ryan Rawson, Cliff Moon, Gemini Mobile, Furuhashi-san, Luca Garulli, Sergio Bossa, Mathias Meyer, Wooga, Neo4J, Acunu (and a few other special guests I’m keeping under wraps for now); I couldn’t have done it without them and learned by leaps & bounds for every hour I spent with each of them.

I’d also like to thank my two sponsors, Scality & InfiniteGraph, from the bottom of my heart. They were supportive in a way I didn’t think companies could be and let me have total control of the shape & content of the project. I’d encourage you to check them out if you haven’t done so already.

As always, I’ll be glad to take any comments or suggestions you may have either by email (tim@nosqltapes.com) or on Twitter (@timanglade).

Simply awesome!

December 31, 2010

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison – Post

Filed under: Cassandra,CouchDB,HBase,NoSQL,Redis,Riak — Patrick Durusau @ 11:01 am

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Not enough detail for decision making but a useful overview nonetheless.

October 9, 2010

BigTable Model with Cassandra and HBase – Post

Filed under: Cassandra,HBase,NoSQL — Patrick Durusau @ 6:29 am

BigTable Model with Cassandra and HBase: a non-hand-waving explanation of Cassandra and HBase.

Has anyone tried the column-of-values approach, where subjectIdentifier or subjectLocator is a set of values?
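To make the question concrete, here is a rough sketch of what I have in mind, with the caveat that it is my own guess at a layout and not something taken from the linked post or from any topic map engine. Each row is a topic, its subject identifiers are modeled as a set (think of them as column names in the row), and rows that share an identifier are merged. The topic ids, URLs, and function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# row key (topic id) -> set of subject identifiers, modeled as column names
topics: Dict[str, Set[str]] = {
    "t1": {
        "http://example.org/id/cassandra",
        "http://example.org/id/apache-cassandra",
    },
    "t2": {"http://example.org/id/apache-cassandra"},
    "t3": {"http://example.org/id/hbase"},
}


def merge_by_identifier(rows: Dict[str, Set[str]]) -> List[Tuple[Set[str], Set[str]]]:
    """Merge rows that share at least one identifier.

    Returns a list of (topic ids, identifiers) pairs, one per merged group.
    """
    merged: List[Tuple[Set[str], Set[str]]] = []
    for topic_id, identifiers in rows.items():
        ids = {topic_id}
        idents = set(identifiers)
        remaining = []
        for group_ids, group_idents in merged:
            if group_idents & idents:
                # Shared identifier: absorb the existing group into this one.
                ids |= group_ids
                idents |= group_idents
            else:
                remaining.append((group_ids, group_idents))
        remaining.append((ids, idents))
        merged = remaining
    return merged


groups = merge_by_identifier(topics)
```

Here `t1` and `t2` end up in one group because they share an identifier, while `t3` stays on its own.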
