Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 7, 2012

MongoDB Index Shootout: Covered Indexes vs. Clustered Fractal Tree Indexes

Filed under: Clustering,Fractal Trees,Fractals,MongoDB — Patrick Durusau @ 1:05 pm

MongoDB Index Shootout: Covered Indexes vs. Clustered Fractal Tree Indexes by Tim Callaghan.

From the post:

In my two previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase and a 268x query performance increase. MongoDB’s covered indexes can provide some performance benefits over a regular MongoDB index, as they reduce the amount of IO required to satisfy certain queries. In essence, when all of the fields you are requesting are present in the index key, then MongoDB does not have to go back to the main storage heap to retrieve anything. My benchmark results are further down in this write-up, but first I’d like to compare MongoDB’s Covered Indexes with Tokutek’s Clustered Fractal Tree Indexes.

Query Efficiency
  • MongoDB Covered Indexes: Improved when all requested fields are part of the index key
  • Tokutek Clustered Fractal Tree Indexes: Always improved; all non-keyed fields are stored in the index

Index Size
  • MongoDB Covered Indexes: Data is not compressed
  • Tokutek Clustered Fractal Tree Indexes: Generally 10x to 20x compression; user selects zlib, quicklz, or lzma. Note that non-clustered indexes are compressed as well.

Planning/Maintenance
  • MongoDB Covered Indexes: The index “covers” a fixed set of fields; adding a new field to an existing covered index requires a drop and recreate of the index
  • Tokutek Clustered Fractal Tree Indexes: None; all fields in the document are always available in the index

When putting my ideas together for the above table, it struck me that covered indexes are really about a well-defined schema, yet NoSQL is often thought of as “schema-less”. If you have a very large MongoDB collection and add a new field that you want covered by an existing index, the drop and recreate process will take a long time. On the other hand, a clustered Fractal Tree Index will automatically include this new field so there is no need to drop/recreate unless you need the field to be part of a .find() operation itself.
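For a concrete sense of what “covered” means on the MongoDB side, here is a minimal pymongo sketch; the collection and field names are invented for illustration, so treat it as the shape of the technique rather than anything from Tim’s benchmark.

from pymongo import MongoClient

client = MongoClient()
db = client.test

# The compound index is the "covering" index.
db.people.create_index([("last_name", 1), ("first_name", 1)])

# Filter and projection use only indexed fields, and _id is excluded,
# so MongoDB can answer the query from the index without touching documents.
cursor = db.people.find(
    {"last_name": "Callaghan"},
    {"_id": 0, "last_name": 1, "first_name": 1},
)

# 2.x-era servers report "indexOnly": true in the explain output; newer
# servers expose the same information in a different explain format.
print(cursor.explain())

Adding a field to the result set means the index no longer covers the query, which is exactly the drop-and-recreate cost described above.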

If you have some time to experiment this weekend, more MongoDB benchmarks/improvements to consider.

September 2, 2012

268x Query Performance Bump for MongoDB

Filed under: Fractal Trees,MongoDB,Tokutek — Patrick Durusau @ 6:24 pm

268x Query Performance Increase for MongoDB with Fractal Tree Indexes, SAY WHAT? by Tim Callaghan.

From the post:

Last week I wrote about our 10x insertion performance increase with MongoDB. We’ve continued our experimental integration of Fractal Tree® Indexes into MongoDB, adding support for clustered indexes. A clustered index stores all non-index fields as the “value” portion of the index, as opposed to a standard MongoDB index that stores a pointer to the document data. The benefit is that indexed lookups can immediately return any requested values instead of needing to do an additional lookup (and potential disk IOs) for the requested fields.

I’m trying to recover from learning about scalable subgraph matching, Efficient Subgraph Matching on Billion Node Graphs [Parallel Graph Processing], and now the nice folks at Tokutek post a 26,816% query performance increase for MongoDB.

They claim not to be MongoDB experts. I guess that’s right; otherwise the increase in performance would have been even higher. 😉

Serious question: How long will it take this sort of performance increase to impact the modeling and design of information systems?

And in what way?

With high enough performance, can subject identity be modeled interactively?

August 30, 2012

MongoDB 2.2 Released [Aggregation News – Expiring Data From Merges?]

Filed under: Aggregation,MongoDB — Patrick Durusau @ 10:03 am

MongoDB 2.2 Released

From the post:

We are pleased to announce the release of MongoDB version 2.2. This release includes over 1,000 new features, bug fixes, and performance enhancements, with a focus on improved flexibility and performance. For additional details on the release:

Of particular interest to topic map fans:

Aggregation Framework

The Aggregation Framework is available in its first production-ready release as of 2.2. The aggregation framework makes it easier to manipulate and process documents inside of MongoDB, without needing to use Map Reduce, or separate application processes for data manipulation.

See the aggregation documentation for more information.

The H Open also mentions TTL (time to live), which can remove documents from collections.

MongoDB documentation: Expire Data from Collections by Setting TTL.
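Setting up a TTL collection is a one-line index build. A minimal sketch with a current pymongo, using invented collection and field names; documents are removed in the background once their timestamp is older than expireAfterSeconds.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
db = client.test

# Expire documents roughly an hour after their "createdAt" time.
db.events.create_index("createdAt", expireAfterSeconds=3600)

db.events.insert_one({
    "createdAt": datetime.now(timezone.utc),
    "payload": "this document removes itself",
})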

Have you considered “expiring” data from merges?

August 26, 2012

MongoDB: Pumping Fractal Iron

Filed under: Fractal Trees,MongoDB — Patrick Durusau @ 5:46 pm

10x Insertion Performance Increase for MongoDB with Fractal Tree Indexes by Tim Callaghan.

From the post:

The challenge of handling massive data processing workloads has spawned many new innovations and techniques in the database world, from indexing innovations like our Fractal Tree® technology to a myriad of “NoSQL” solutions (here is our Chief Scientist’s perspective). Among the most popular and widely adopted NoSQL solutions is MongoDB and we became curious if our Fractal Tree indexing could offer some advantage when combined with it. The answer seems to be a strong “yes”.

Earlier in the summer we kicked off a small side project and here’s what we did: we implemented a “version 2” IndexInterface as a Fractal Tree index and ran some benchmarks. Note that our integration only affects MongoDB’s secondary indexes; primary indexes continue to rely on MongoDB’s indexing code. All the changes we made to the MongoDB source are available here. Caveat: this was a quick and dirty project – the code is experimental grade so none of it is supported or went through any careful design analysis.

For our initial benchmark we measured the performance of a single threaded insertion workload. The inserted documents contained the following: URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp). We created a total of four secondary indexes: URI, name, origin, and creation date. The point of the benchmark is to insert enough documents such that the indexes are larger than main memory and show the insertion performance from an empty database to one that is largely dependent on disk IO. We ran the benchmark with journaling disabled, then again with journaling enabled.
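If you want to reproduce the stock-MongoDB side of that workload, a rough sketch follows. The document shape matches the description in the quote; the batch size, value generation, and document count are my own assumptions, not Tokutek’s benchmark code.

import time
import uuid
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient()
coll = client.benchmark.docs

# Four secondary indexes, matching the fields listed in the post.
for field in ("uri", "name", "origin", "createdAt"):
    coll.create_index([(field, ASCENDING)])

def make_doc(i):
    now = datetime.now(timezone.utc)
    return {
        "uri": "http://example.com/%s" % uuid.uuid4(),
        "name": "name-%d" % i,
        "origin": "origin-%d" % (i % 1000),
        "createdAt": now,
        "expiresAt": now + timedelta(days=30),
    }

# Insert until the indexes outgrow memory on your test machine.
start = time.time()
batch = []
for i in range(1000000):
    batch.append(make_doc(i))
    if len(batch) == 1000:
        coll.insert_many(batch)
        batch = []
if batch:
    coll.insert_many(batch)
print("inserted in %.1f seconds" % (time.time() - start))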

Not for production use but the performance numbers should give you pause.

A long pause.

August 16, 2012

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Filed under: Hadoop,MongoDB,node-js,Pig — Patrick Durusau @ 6:53 pm

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.

From the post:

Series Introduction

Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Introduction

In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.

I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉

Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.

July 26, 2012

MongoDB 2.2.0-rc0

Filed under: MongoDB — Patrick Durusau @ 2:38 pm

MongoDB 2.2.0-rc0

The latest unstable release of MongoDB.

Release notes for 2.2.0-rc0.

Among the changes you will find:

  • Aggregation Framework
  • TTL Collections
  • Concurrency Improvements
  • Query Optimizer Improvements
  • Tag Aware Sharding

among others.

July 23, 2012

MongoDB-as-a-service for private rolled out by ScaleGrid, in MongoDirector

Filed under: MongoDB,NoSQL — Patrick Durusau @ 3:16 pm

MongoDB-as-a-service for private rolled out by ScaleGrid, in MongoDirector by Chris Mayer.

From the post:

Of all the NoSQL databases emerging at the moment, there appears to be one constant discussion taking place – are you using MongoDB?

It appears to be the open source, document-oriented NoSQL database solution of choice, mainly due to its high performance nature, its dynamism and its similarities to the JSON data structure (in BSON). Despite being written in C++, it is attracting attention from developers of different creeds. Its enterprise level features have helped a fair bit in its charge up the rankings to leading NoSQL database, with it being the ideal datastore for highly scalable environments. Just a look at the latest in-demand skills on Indeed.com shows you that 10gen’s flagship product has infiltrated the enterprise well and truly.

Quite often, an enterprise can find the switch from SQL to NoSQL daunting and needs a helping hand. Due to this, many MongoDB-related products are arriving just as quickly as MongoDB converts. The latest of which to launch as a public beta is MongoDirector from Seattle start-up ScaleGrid. MongoDirector offers an end-to-end lifecycle manager for MongoDB to guide newcomers along.

I don’t have anything negative to say about MongoDB but I’m not sure the discussion of NoSQL solutions is quite as one-sided as Chris seems to think.

The Indeed.com site is a fun one to play around with but I would not take the numbers all that seriously. For one thing, it doesn’t appear to control for duplicate job ads posted in different sources, for example. But that’s a nitpicking objection.

A more serious one is when you start to explore the site and discover the top three job titles for IT.

Care to guess what they are? Would you believe they don’t have anything to do with databases or MongoDB?

At least as of today, and I am sure it changes over time, Graphic Designer, Technical Writer, and Project Manager all rank higher than Data Analyst, where you would hope to find some MongoDB jobs. (Information Technology Industry – 23 July 2012)

BTW, for your amusement, when I was looking for information on database employment, I encountered Database Administrators, from the Bureau of Labor Statistics in the United States. The data is available for download as XLS files.

The site says blanks on the maps are from lack of data. I suspect the truth is there are no database administrators in Wyoming. 😉 Or at least I could point to the graphic as some evidence for my claim.

I think you need to consider the range of database options, from very traditional SQL vendors to bleeding edge No/New/Maybe/SQL solutions, including MongoDB. The question is which one meets your requirements, whether flavor of the month or no.

July 12, 2012

Real-time Twitter heat map with MongoDB

Filed under: Mapping,Maps,MongoDB,Tweets — Patrick Durusau @ 1:54 pm

Real-time Twitter heat map with MongoDB

From the post:

Over the last few weeks I got in touch with the fascinating field of data visualisation which offers great ways to play around with the perception of information.

In a more formal approach, data visualisation denotes “The representation and presentation of data that exploits our visual perception abilities in order to amplify cognition”.

Nowadays there is a huge flood of information that hits us every day. Enormous amounts of data collected from various sources are freely available on the internet. One of these data gargoyles is Twitter producing around 400 million (400 000 000!) tweets per day!

Tweets basically offer two “layers” of information: the obvious, direct information within the text of the Tweet itself, and a second layer that is not directly perceived, which is the Tweets’ metadata. In this case Twitter offers a large amount of additional information like user data, retweet count, hashtags, etc. This metadata can be leveraged to experience data from Twitter in a lot of exciting new ways!

So as a little weekend project I have decided to build a small piece of software that generates real-time heat maps of certain keywords from Twitter data.

Yes, “…in a lot of exciting new ways!” +1!

What about maintenance issues on such a heat map? The capture of terms to the map is fairly obvious, but a subsequent user may be left in the dark as to why this term was chosen and not some other term, or some then-current synonym for the term being captured.

Or imposing semantics on tweets or terms that are unexpected or non-obvious to a casual or not so casual observer.

You and I can agree red means go and green means stop in a tweet. That’s difficult to maintain as the number of participants and terms goes up.

A great starting place to experiment with topic maps to address such issues.
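If you do experiment, the mechanics are small. Here is a hypothetical sketch of the binning step in Python, assuming tweets have already been captured into MongoDB with a keyword field and a [longitude, latitude] pair (both names are mine, not the post’s): group coordinates into grid cells and count, which is all a heat map layer really needs.

from collections import Counter
from pymongo import MongoClient

client = MongoClient()
db = client.twitter

def heat_cells(keyword, cell_size=0.5):
    """Count tweets per lat/lng grid cell for one keyword."""
    counts = Counter()
    for tweet in db.tweets.find({"keyword": keyword}, {"coords": 1}):
        lng, lat = tweet["coords"]
        cell = (round(lat / cell_size) * cell_size,
                round(lng / cell_size) * cell_size)
        counts[cell] += 1
    return counts

for cell, count in heat_cells("mongodb").most_common(10):
    print(cell, count)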

I first saw this in the NoSQL Weekly Newsletter.

July 10, 2012

MongoDB Installer for Windows Azure

Filed under: Azure Marketplace,Microsoft,MongoDB — Patrick Durusau @ 7:17 am

MongoDB Installer for Windows Azure by Doug Mahugh.

From the post:

Do you need to build a high-availability web application or service? One that can scale out quickly in response to fluctuating demand? Need to do complex queries against schema-free collections of rich objects? If you answer yes to any of those questions, MongoDB on Windows Azure is an approach you’ll want to look at closely.

People have been using MongoDB on Windows Azure for some time (for example), but recently the setup, deployment, and development experience has been streamlined by the release of the MongoDB Installer for Windows Azure. It’s now easier than ever to get started with MongoDB on Windows Azure!

If you are developing or considering developing with MongoDB, this is definitely worth a look. In part because it frees you to concentrate on software development and not running (or trying to run) a server farm. Different skill sets.

Another reason is that it levels the playing field with big IT firms that have server farms. You get the advantages of a server farm without the capital investment in one.

And as Microsoft becomes a bigger and bigger tent for diverse platforms and technologies, you have more choices. Choices for the changing requirements of your clients.

Not that I expect to see an Apple hanging from the Microsoft tree anytime soon but you can’t ever tell. Enough consumer demand and it could happen.

In the meantime, while we wait for better games and commercials, consider how you would power semantic integration in the cloud?

June 26, 2012

Implementing Aggregation Functions in MongoDB

Filed under: Aggregation,MapReduce,MongoDB,NoSQL — Patrick Durusau @ 1:51 pm

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases such as document stores, key-value stores, graph database, object database, etc.

Typical use cases for NoSQL database includes archiving old logs, event logging, ecommerce application log, gaming data, social data, etc. due to its fast read-write capability. The stored data would then require to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB which is an open source document oriented NoSQL database system written in C++. It provides a high performance document oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. Map Reduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.

Not terribly advanced but enough to get you started with creating aggregation functions.

Includes “testing” of the aggregation functions that are written in the article.

If Python is more your cup of tea, see: Aggregation in MongoDB (part1) and Aggregation in MongoDB (part 2).
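And for the shape of the map/reduce/finalize pattern the article relies on, here is a rough pymongo sketch of a per-group sum and average; the collection and field names ("transactions", "region", "sales") are hypothetical, and the same skeleton extends to max, min, variance, and standard deviation.

from bson.code import Code
from pymongo import MongoClient

client = MongoClient()
db = client.test

# Emit one {count, total} pair per document, keyed by a grouping field.
map_fn = Code("""
function () {
    emit(this.region, {count: 1, total: this.sales});
}""")

# Fold emitted values together; must return the same shape it receives.
reduce_fn = Code("""
function (key, values) {
    var out = {count: 0, total: 0};
    values.forEach(function (v) { out.count += v.count; out.total += v.total; });
    return out;
}""")

# Derive the average once per key, after reduction is complete.
finalize_fn = Code("""
function (key, value) {
    value.average = value.total / value.count;
    return value;
}""")

result = db.command(
    "mapReduce", "transactions",
    map=map_fn, reduce=reduce_fn, finalize=finalize_fn,
    out={"inline": 1},
)
for row in result["results"]:
    print(row["_id"], row["value"])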

June 16, 2012

Deep Dive with MongoDB [Virtual Conference]

Filed under: Conferences,MongoDB — Patrick Durusau @ 3:46 pm

Deep Dive with MongoDB (online conference)

Wednesday July 11th 11:00 AM EDT / 8:00 AM PDT

From the webpage:

This four hour online conference will introduce you to some MongoDB basics and get you up to speed with why and how you should choose MongoDB for your next project. The conference will begin at 8:00am PST with a brief introduction and last until 12:00pm PST covering four topics with plenty of time for Q&A.

The program:

  • Introduction 8:00-8:10am PST
  • 8:10am PST Building Your First App – Asya Kamsky, Senior Solutions Architect, 10gen
  • 9:00am PST Schema Design with MongoDB: Principles and Practices – Antoine Girbal, Solutions Architect, 10gen
  • 9:50am PST Replication and Replica Sets – Asya Kamsky, Senior Solutions Architect, 10gen
  • 11:05am PST – Introducing MongoDB into Your Organization – Edouard Servan-Schreiber, Director for Solution Architecture, 10gen

What I wonder about is which startup or startup conference is going to put out a call for papers, do peer review, and then, on the days of the meeting, conference the speakers in with inexpensive software and tweet the presentations right before they start.

Imagine having 1,000 people listening to your presentation instead of < 50. Could increase the impact of your ideas and the reach of your startup. (Jack Park forwarded this to my attention.)

June 9, 2012

Hadoop Streaming Support for MongoDB

Filed under: Hadoop,Javascript,MapReduce,MongoDB,Python,Ruby — Patrick Durusau @ 7:13 pm

Hadoop Streaming Support for MongoDB

From the post:

MongoDB has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

[graphic omitted]

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

[graphic omitted]

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers who are familiar with MongoDB and popular dynamic programming languages to leverage the power of Hadoop.

I like that, “…popular dynamic programming languages…” 😉

Any improvement to increase usability without religious conversion (using a programming language not your favorite) is a good move.
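To give a flavor of what that looks like, here is a sketch of a Python mapper and reducer in the style of the examples that ship with the streaming assembly. It assumes the pymongo_hadoop helper module from the connector is on the path, and the field names are invented, so treat it as the shape of a streaming job rather than a tested one.

#!/usr/bin/env python
# mapper.py: emit one {_id, count} pair per input document.
from pymongo_hadoop import BSONMapper

def mapper(documents):
    # documents is an iterator over BSON documents read from MongoDB.
    for doc in documents:
        yield {"_id": doc["author"], "count": 1}

BSONMapper(mapper)

#!/usr/bin/env python
# reducer.py: fold the mapper output back into one document per key.
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    total = sum(v["count"] for v in values)
    return {"_id": key, "count": total}

BSONReducer(reducer)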

June 4, 2012

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2)

Filed under: Aggregation,MongoDB,NoSQL,Python — Patrick Durusau @ 4:32 pm

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2) by Rick Copeland.

From the post:

Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you’re all caught up, let’s jump right in….

Why a new framework?

If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let’s go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.
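By way of comparison, here is the mean-of-a-property example done with the new framework instead of mapreduce(), as a minimal pymongo sketch; the collection and field names are placeholders of my own, not Rick’s.

from pymongo import MongoClient

client = MongoClient()
db = client.test

pipeline = [
    {"$match": {"type": "page_view"}},   # optional pre-filter
    {"$group": {"_id": "$page", "mean_duration": {"$avg": "$duration"}}},
    {"$sort": {"mean_duration": -1}},
]

# Recent pymongo versions return a cursor here; early 2.x returned a
# result document with the rows under a "result" key.
for doc in db.events.aggregate(pipeline):
    print(doc["_id"], doc["mean_duration"])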

Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: you should use only as much aggregation, or in topic map terminology “merging,” as you need.

It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given.

In part because new semantics are emerging every day and there are too many previous semantics that are poorly documented or unknown.

What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.

Aggregation in MongoDB (Part 1)

Filed under: Aggregation,MongoDB,NoSQL,Python — Patrick Durusau @ 4:31 pm

Aggregation in MongoDB (Part 1) by Rick Copeland.

From the post:

In some previous posts on mongodb and python, pymongo, and gridfs, I introduced the NoSQL database MongoDB, how to use it from Python, and how to use it to store large (more than 16 MB) files in it. Here, I’ll be showing you a few of the features that the current (2.0) version of MongoDB includes for performing aggregation. In a future post, I’ll give you a peek into the new aggregation framework included in MongoDB version 2.1.

An index “aggregates” information about a subject (called an ‘entry’), where the information is traditionally found between the covers of a book.

MongoDB offers predefined as well as custom “aggregations,” where the information field can be larger than a single book.

Good introduction to aggregation in MongoDB, although you (and I) really should get around to reading the MongoDB documentation.

May 26, 2012

Doug Mahugh Live! was: MongoDB Replica Sets

Filed under: Microsoft,MongoDB,Replica Sets — Patrick Durusau @ 4:14 pm

Doug Mahugh spotted on MongoDB Replica Sets.

The video also teaches you about MongoDB replica sets on Windows. Replica sets are the means MongoDB uses for high reliability and read performance. An expert from 10gen, Sridhar Nanjundeswaran, covers the MongoDB stuff.

PS: Kudos to Doug on his new role at MS on reaching out to open source projects!

May 22, 2012

MongoSF Highlights

Filed under: Conferences,MongoDB — Patrick Durusau @ 2:57 pm

I am on the Mongo mailing list and so got the monthly news about MongoDB, which included a list of highlights from MongoSF:

Except in the email the links had all the tracking trash that marketing types seem to think is important.

I visited the 10gen site and harvested the direct links for your convenience. I didn’t insert tracking trash for my blog.

Enjoy!

PS: It would really be nice to get emails that have the tracking trash if you insist but also clean links that can be forwarded to others, used in blog posts, real information type activities. Not to single out 10gen, I see it every day. From people who should know better.

PPS: There are more presentations to view at: Featured Presentations.

May 16, 2012

Progressive NoSQL Tutorials

Filed under: Cassandra,Couchbase,CouchDB,MongoDB,Neo4j,NoSQL,RavenDB,Riak — Patrick Durusau @ 10:20 am

Have you ever gotten an advertising email with clean links in it? I mean a link without all the marketing crap appended to the end. The stuff you have to clean off before using it in a post or sending it to a friend?

Got my first one today. From Skills Matter on the free videos for their Progressive NoSQL Tutorials that just concluded.

High quality presentations, videos freely available after presentation, friendly links in email, just a few of the reasons to support Skills Matter.

The tutorials:

May 1, 2012

Masstree – Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

Filed under: Masstree,Memcached,MongoDB,Redis,VoltDB — Patrick Durusau @ 4:45 pm

Masstree – Much Faster than MongoDB, VoltDB, Redis, and Competitive with Memcached

From the post:

The EuroSys 2012 system conference has an excellent live blog summary of their talks for: Day 1, Day 2, Day 3 (thanks Henry at the Paper Trail blog). Summaries for each of the accepted papers are here.

One of the more interesting papers from a NoSQL perspective was Cache Craftiness for Fast Multicore Key-Value Storage, a wonderfully detailed description of the low level techniques used to implement Masstree:

A storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. On a 16-core machine Masstree achieves six to ten million operations per second on parts A–C of the Yahoo! Cloud Serving Benchmark, more than 30x as fast as VoltDB [5] or MongoDB [2].

An inspiration for anyone pursuing pure performance in the key-value space.

As the authors note when comparing Masstree to other systems:

Many of these systems support features that Masstree does not, some of which may bottleneck their performance. We disable other systems’ expensive features when possible.

The lesson here is to not buy expensive features unless you need them.

April 21, 2012

On multi-form data

Filed under: MongoDB,NoSQL — Patrick Durusau @ 4:35 pm

On multi-form data

From the post:

I read an excellent debrief on a startup’s experience with MongoDB, called “A Year with MongoDB”.

It was excellent due to its level of detail. Some of its points are important — particularly global write lock and uncompressed field names, both issues that needlessly afflict large MongoDB clusters and will likely be fixed eventually.

However, it’s also pretty clear from this post that they were not using MongoDB in the best way.

An interesting take on when and just as importantly, when not to use MongoDB.

As NoSQL offerings mature, are we going to see more of this sort of treatment or will more treatments like this drive the maturity of NoSQL offerings?

Pointers to “more like this?” (not just on MongoDB but other NoSQL offerings as well)

April 18, 2012

The Little MongoDB Book

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:08 pm

The Little MongoDB Book

Karl Seguin has written a short (thirty-two page) guide to MongoDB.

It won’t make you a hairy-chested terror at big data conferences but it will get you started with MongoDB.

I would bookmark http://mongly.com/, also by Karl, to consult along with the Little MongoDB book.

Finally, as you learn MongoDB, contribute to these and other resources with examples, tutorials, data sets.

Particularly tutorials on analysis of data sets. It is one thing to know schema X works in general with data sets of type Y. It is quite another to understand why.

April 15, 2012

MongoDB Hadoop Connector Announced

Filed under: Hadoop,MongoDB — Patrick Durusau @ 7:13 pm

MongoDB Hadoop Connector Announced

From the post:

10gen is pleased to announce the availability of our first GA release of the MongoDB Hadoop Connector, version 1.0. This release was a long-term goal, and represents the culmination of over a year of work to bring our users a solid integration layer between their MongoDB deployments and Hadoop clusters for data processing. Available immediately, this connector supports many of the major Hadoop versions and distributions from 0.20.x and onwards.

The core feature of the Connector is to provide the ability to read MongoDB data into Hadoop MapReduce jobs, as well as writing the results of MapReduce jobs out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components in the Hadoop ecosystem which our users find useful, based on feedback and requests.

For this initial release, we have also provided support for:

  • writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)
  • writing to MongoDB from the Flume distributed logging system
  • using Python to MapReduce to and from MongoDB via Hadoop Streaming.

Hadoop Streaming was one of the toughest features for the 10gen team to build. To that end, look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to utilize this feature effectively.

Question: Is anyone working on a matrix of Hadoop connectors and their capabilities? A summary resource on Hadoop connectors might be of value.

April 12, 2012

Red Hat and 10gen: Deeper collaboration around MongoDB

Filed under: MongoDB,Red Hat — Patrick Durusau @ 8:49 am

Red Hat and 10gen: Deeper collaboration around MongoDB

From the post:

Today [April 9, 2012], Red Hat and 10gen jointly announced a deeper collaboration around MongoDB. By combining Red Hat’s traditional strengths in operating systems and middleware with 10gen’s expertise in database technology, we’re developing a robust open source platform on which to develop and deploy your next generation of applications either in your own data centers or in the cloud.

Over the next several months, we’ll be working closely with Red Hat to optimize and integrate MongoDB with a number of Red Hat products. You can look at this effort resulting in a set of reference designs, solutions, packages and documentation for deploying high-performance, scalable and secure applications with MongoDB and Red Hat software. Our first collaboration is around a blueprint for deploying MongoDB on Red Hat Enterprise Linux, which we will release shortly. We’ll follow that up with a number of additional projects around RHEL, JBoss, Red Hat Enterprise Virtualization (RHEV), Cloud Forms, Red Hat Storage (GlusterFS), and of course continue the work we have started with OpenShift. We hope to get much involvement from the Red Hat and MongoDB communities, and any enhancements to MongoDB resulting from this work will, of course, be open sourced.

Have you noticed that open source projects are trending towards bundling themselves with each other?

A healthy recognition that users want solutions over sporting with versions and configuration files.

April 6, 2012

MongoDB Architecture

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:51 pm

MongoDB Architecture by Ricky Ho.

From the post:

NOSQL has become a very heated topic for large web-scale deployments, where scalability and semi-structured data drive the DB requirement towards NOSQL. There have been many NOSQL products evolving over the last couple of years. In my past blogs, I have been covering the underlying distributed system theory of NOSQL, as well as some specific products such as CouchDB and Cassandra/HBase.

Last Friday I was very lucky to meet with Jared Rosoff from 10gen at a technical conference and have a discussion about the technical architecture of MongoDb. I found the information very useful and want to share it with more people.

One thing that impresses me about MongoDb is that it is extremely easy to use and the underlying architecture is also very easy to understand.

Very nice walk through the architecture of MongoDB! Certainly a model for posts exploring other NoSQL solutions.

March 22, 2012

Spring MVC 3.1 – Implement CRUD with Spring Data Neo4j

Filed under: CRUD,MongoDB,Neo4j,Spring Data — Patrick Durusau @ 7:42 pm

Spring MVC 3.1 – Implement CRUD with Spring Data Neo4j

The title of the post includes “(Part-1)” but all five parts have been posted.

From the post:

In this tutorial, we will create a simple CRUD application using Spring 3.1 and Neo4j. We will base this tutorial on a previous guide for MongoDB. This means we will re-use our existing design and implement only the data layer to use Neo4j as our data store.

I would start with the original MongoDB post, Spring MVC 3.1 – Implement CRUD with Spring Data MongoDB. (It won’t hurt you to learn some MongoDB as well.)

Quite definitely will repay the time you spend.

March 3, 2012

Twitter Streaming with EventMachine and DynamoDB

Filed under: Dynamo,EventMachine,MongoDB,Tweets — Patrick Durusau @ 7:28 pm

Twitter Streaming with EventMachine and DynamoDB

From the post:

This week Amazon Web Services launched their latest database offering ‘DynamoDB’ – a highly-scalable NoSQL database service.

We’ve been using a couple of NoSQL database engines at work for a while now: Redis and MongoDB. Mongo allowed us to simplify many of our data models and represent more faithfully the underlying entities we were trying to represent in our applications, and Redis is used for those projects where we need to make sure that a person only classifies an object once.

Whether you’re using MongoDB or MySQL, scaling the performance and size of a database is non-trivial and is a skillset in itself. DynamoDB is a fully managed database service aiming to offer high-performance data storage and retrieval at any scale, regardless of request traffic or storage requirements. Unusually for Amazon Web Services, they’ve made a lot of noise about some of the underlying technologies behind DynamoDB, in particular they’ve utilised SSD hard drives for storage. I guess telling us this is designed to give us a hint at the performance characteristics we might expect from the service.

» A worked example

As with all AWS products there are a number of articles outlining how to get started with DynamoDB. This article is designed to provide an example use case where DynamoDB really shines – parsing a continual stream of data from the Twitter API. We’re going to use the Twitter streaming API to capture tweets and index them by user_id and creation time.

Wanted to include something a little different after all the graph database and modeling questions. 😉

I need to work on something like this to more effectively use Twitter as an information stream. Passing all mentions of graphs and related terms along for further processing, perhaps by a map between Twitter userIDs and known authors. Could be interesting.

February 9, 2012

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Filed under: Aggregation,Cheminformatics,MongoDB — Patrick Durusau @ 4:30 pm

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB’s built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a more simple solution to calculating aggregate values instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline for analysis of chess games. Rather than summing the number of games in which the move “e4” is the first move for White, links to all 696 games could be treated as occurrences of that subject, which would support discovery of the player of White as well as Black.

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)

February 7, 2012

NoSQL: The Joy is in the Details

Filed under: MongoDB,NoSQL — Patrick Durusau @ 4:28 pm

NoSQL: The Joy is in the Details by James Downey.

From the post:

Whenever my wife returns excitedly from the mall having bought something new, I respond on reflex: Why do we need that? To which my wife retorts that if it were up to me, humans would still live in caves. Maybe not caves, but we’d still program in C and all applications would run on relational databases. Fortunately, there are geeks out there with greater imagination.

When I first began reading about NoSQL, I ran into the CAP Theorem, according to which a database system can provide only two of three key characteristics: consistency, availability, or partition tolerance. Relational databases offer consistency and availability, but not partition tolerance, namely, the capability of a database system to survive network partitions. This notion of partition tolerance ties into the ability of a system to scale horizontally across many servers, achieving on commodity hardware the massive scalability necessary for Internet giants. In certain scenarios, the gain in scalability makes worthwhile the abandonment of consistency. (For a simplified explanation, see this visual guide. For a heavy computer science treatment, see this proof.)

And so I plan to spend time this year exploring and posting about some of the many NoSQL options out there. I’ve already started a post on MongoDB. Stay tuned for more. And if you have any suggestions for which database I should look into next, please make a comment.

Definitely a series of posts I will be following this year. Suggest that you do the same.

February 5, 2012

The Comments Conundrum

Filed under: Aggregation,MongoDB — Patrick Durusau @ 7:57 pm

The Comments Conundrum by Kristina Chodorow.

From the post:

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.
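Here is a minimal pymongo sketch of that unwind-then-query pattern for the “all comments by a given author, most recent first” case; the field names are invented for illustration.

from pymongo import MongoClient

client = MongoClient()
db = client.blog

pipeline = [
    {"$unwind": "$comments"},                      # one document per embedded comment
    {"$match": {"comments.author": "kristina"}},   # all comments by a given author
    {"$sort": {"comments.date": -1}},              # most recent first
    {"$project": {"_id": 0, "post": "$title", "comment": "$comments"}},
]

for doc in db.posts.aggregate(pipeline):
    print(doc["post"], doc["comment"]["date"])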

Kristina continues her coverage of the aggregation pipeline in MongoDB.

Question: What is the result of an aggregation? (In a topic map sense?)

February 4, 2012

Hacking Chess: Data Munging

Filed under: Data Mining,MongoDB — Patrick Durusau @ 3:36 pm

Hacking Chess: Data Munging

Kristina Chodorow specifies a conversion from portable game notation (PGN) to JSON. For loading the chess games into MongoDB.

Useful for her Hacking Chess with the MongoDB Pipeline post.

Addressing data in situ would be more robust but conversion is far more common.

When I get around to outlining a topic map book, I will have to include a chapter on data conversion techniques.

You Only Wish MongoDB Wasn’t Relational

Filed under: MongoDB — Patrick Durusau @ 3:32 pm

You Only Wish MongoDB Wasn’t Relational.

From the post:

When choosing the stack for our TV guide service, we became interested in NoSQL dbs because we anticipated needing to scale horizontally. We evaluated several and settled on MongoDB. The main reason was that MongoDB got out of the way and let us get work done. You can read a little more about our production setup here.

So when you read that MongoDB is a document store, you might get the wonderful idea to store your relationships in a big document. Since mongo lets you reach into objects, you can query against them, right?

Several times, we’ve excitedly begun a schema this way, only to be forced to pull the nested documents out into their own collection. I’ll show you why, and why it’s not a big deal.

Perhaps a better title would have been: MongoDB: Relationships Optional. 😉

That is, you can specify relationships, but only to the extent necessary.

Worth your time to read.
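To make the two shapes concrete, a small sketch; the TV-guide flavored names are my invention, not the post’s.

from pymongo import MongoClient

client = MongoClient()
db = client.guide

# Embedded: relationships nested inside one large document.
db.shows.insert_one({
    "title": "Example Show",
    "airings": [
        {"channel": "One", "start": "2012-02-04T20:00:00Z"},
        {"channel": "Two", "start": "2012-02-05T21:00:00Z"},
    ],
})

# Referenced: the nested documents pulled out into their own collection,
# linked back by the parent _id, which is the shape the post keeps arriving at.
show_id = db.shows.insert_one({"title": "Another Show"}).inserted_id
db.airings.insert_many([
    {"show_id": show_id, "channel": "One", "start": "2012-02-06T20:00:00Z"},
    {"show_id": show_id, "channel": "Two", "start": "2012-02-07T21:00:00Z"},
])

# Queries against airings are now ordinary collection queries.
for airing in db.airings.find({"show_id": show_id}).sort("start", 1):
    print(airing["channel"], airing["start"])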
