Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc.
Vendor content, so the usual disclaimers apply, but this may signal an important, if subtle, shift in computing environments.
From the post:
In today’s competitive world, businesses need to make fast decisions to respond to changing market conditions and to maintain a competitive edge. The explosion of data that must be analyzed to find trends or hidden insights intensifies this challenge. Both the private and public sectors are turning to parallel computing techniques, such as “map/reduce” to quickly sift through large data volumes.
In some cases, it is practical to analyze huge sets of historical, disk-based data over the course of minutes or hours using batch processing platforms such as Hadoop. For example, risk modeling to optimize the handling of insurance claims potentially needs to analyze billions of records and tens of terabytes of data. However, many applications need to continuously analyze relatively small but fast-changing data sets measured in the hundreds of gigabytes and reaching into terabytes. Examples include clickstream data to optimize online promotions, stock trading data to implement trading strategies, machine log data to tune manufacturing processes, smart grid data, and many more.
Over the last several years, in-memory data grids (IMDGs) have proven their value in storing fast-changing application data and scaling application performance. More recently, IMDGs have integrated map/reduce analytics into the grid to achieve powerful, easy-to-use analysis and enable near real-time decision making. For example, the following diagram illustrates an IMDG used to store and analyze incoming streams of market and news data to help generate alerts and strategies for optimizing financial operations. This article explains how using an IMDG with integrated map/reduce capabilities can simplify data analysis and provide important competitive advantages.
Lowering the complexity of map/reduce, increasing operation speed (no file I/O), and enabling easier parallelism are all good things.
But they are differences in degree, not in kind.
I find IMDGs interesting because of the potential to increase the complexity of relationships between data, including data that is the output of operations.
From the post:
For example, an e-commerce Web site may need to monitor online shopping carts to see which products are selling.
That is probably a serious technical/data issue for Walmart or Home Depot, but it is a difference in degree. You could do the same operations with a shoebox and paper receipts, although that would take a while.
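To make the "difference in degree" point concrete, here is a minimal sketch of the shopping cart count written as a map step and a reduce step over data held in memory. It is plain Python, not the API of any IMDG product, and the cart contents and the names map_cart, reduce_counts, and count_products are made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical in-memory carts: cart id -> list of (product, quantity).
carts = {
    "cart-1": [("hammer", 1), ("nails", 2)],
    "cart-2": [("nails", 1)],
    "cart-3": [("hammer", 2), ("paint", 1)],
}

def map_cart(items):
    """Map step: turn one cart into per-product counts."""
    counts = Counter()
    for product, quantity in items:
        counts[product] += quantity
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

def count_products(all_carts):
    """Run the map step on every cart, then fold the partial results."""
    return reduce(reduce_counts, map(map_cart, all_carts.values()), Counter())

totals = count_products(carts)
print(totals.most_common())
```

A grid would run the map step in parallel on whichever node holds each cart, but the operation itself is the same tally you could keep on paper.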
Consider the beginning of something a bit more imaginative: What if sales at stores were treated differently than online shopping carts (due to delivery factors) and models were built using weather forecasts three to five days out, time of year, local holidays and festivals? Multiple relationships between different data nodes.
That is just a back-of-the-envelope sketch, and I am sure successful retailers do even more than what I have suggested.
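As a rough illustration of what "multiple relationships between different data nodes" could look like in memory, the sketch below links each sale to the forecast and holiday that apply to it, with online sales shifted by a delivery lag. Everything here is an assumption for the sake of the example: the Sale and Forecast records, the related_context function, the lookup keys, and the three-day default lag. It is not how any retailer or IMDG vendor models this.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Forecast:
    region: str
    day: date
    rain_expected: bool

@dataclass(frozen=True)
class Sale:
    product: str
    region: str
    channel: str  # "store" or "online"
    day: date

def related_context(sale, forecasts, holidays, delivery_days=3):
    """Link a sale to the forecast and holiday for the day that matters.

    Store sales matter on the day of purchase; online sales matter on the
    assumed delivery day, delivery_days after the order.
    """
    if sale.channel == "online":
        relevant_day = sale.day + timedelta(days=delivery_days)
    else:
        relevant_day = sale.day
    key = (sale.region, relevant_day)
    return {
        "sale": sale,
        "relevant_day": relevant_day,
        "forecast": forecasts.get(key),  # None if no forecast is stored
        "holiday": holidays.get(key),    # None if the day is not a holiday
    }

# Hypothetical data, keyed by (region, day).
forecasts = {
    ("north", date(2024, 7, 4)): Forecast("north", date(2024, 7, 4), True),
}
holidays = {("north", date(2024, 7, 4)): "Independence Day"}

online = Sale("umbrella", "north", "online", date(2024, 7, 1))
store = Sale("umbrella", "north", "store", date(2024, 7, 1))

online_context = related_context(online, forecasts, holidays)
store_context = related_context(store, forecasts, holidays)
print(online_context["holiday"], store_context["holiday"])
```

The same product sold on the same day ends up related to different weather and holiday data depending on the channel, which is a relationship a simple count never captures.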
Complex relationships between data elements are almost at our fingertips.
Are you still counting shopping cart items?