Work-Stealing & Recursive Partitioning with Fork/Join by Ilya Grigorik.
From the post:
Implementing an efficient parallel algorithm is, unfortunately, still a non-trivial task in most languages: we need to determine how to partition the problem, determine the optimal level of parallelism, and finally build an implementation with minimal synchronization. This last bit is especially critical since as Amdahl’s law tells us: “the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program”.
The Fork/Join framework (JSR 166) in JDK7 implements a clever work-stealing technique for parallel execution that is worth learning about – even if you are not a JDK user. Optimized for parallelizing divide-and-conquer (and map-reduce) algorithms it abstracts all the CPU scheduling and work balancing behind a simple to use API.
I appreciated the observation later in the post that map-reduce is a special case of the pattern described in this post. A better understanding of the special cases can lead to a deeper understanding of the general one.