Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2) by Rick Copeland.
From the post:
Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:
- Getting Started with MongoDB and Python
- Moving Along With PyMongo
- GridFS: The MongoDB Filesystem
- Aggregation in MongoDB (Part 1)
And now that you’re all caught up, let’s jump right in….
Why a new framework?
If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?
Mapreduce is hard; let’s go shopping
The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.
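To make the mean-value case concrete, here is a minimal sketch of how it might look as an aggregation pipeline rather than as map, reduce, and finalize steps. It is not taken from the quoted post: the helper names build_mean_pipeline and mean_value are hypothetical, the collection argument is assumed to be a PyMongo Collection, and the code assumes PyMongo 3.0 or later, where aggregate() returns an iterable cursor.

```python
def build_mean_pipeline(field):
    """Return a pipeline that averages `field` over every document."""
    return [
        # Grouping on _id: None places all documents in a single group,
        # so $avg is computed across the whole collection.
        {"$group": {"_id": None, "mean": {"$avg": "$" + field}}},
    ]


def mean_value(collection, field):
    """Run the pipeline and return the mean, or None for an empty result.

    `collection` is assumed to be a pymongo Collection (hypothetical here).
    """
    # Assumes PyMongo 3.0+, where aggregate() returns an iterable cursor.
    results = list(collection.aggregate(build_mean_pipeline(field)))
    return results[0]["mean"] if results else None
```

With a live connection this might be called as mean_value(db.orders, "price"), where db.orders and price are placeholder names. The whole calculation is a single $group stage, which is the kind of reduction in overhead the post is describing.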
Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: you should use only as much aggregation (or, in topic map terminology, “merging”) as you need.
It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given. That is in part because new semantics are emerging every day, and in part because too many previous semantics are poorly documented or unknown.
What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.