Archive for the ‘Aggregation’ Category

Real-Time Data Aggregation [White Paper Registration Warning]

Tuesday, April 30th, 2013

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate costs savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and Linkedin engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying, considering the lack of substantive content about the solution. The use cases are mildly interesting but admit of any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations

Tuesday, January 29th, 2013

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)

Abstract:

The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.

The authors illustrate a “skyline” query with a search for a hotel that has a good price and is close to the beach. A “skyline” set of hotels excludes every hotel that some other hotel matches or beats on both points and beats on at least one. They then observe:

In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.

The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.
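The flight scenario in the quote can be sketched in a few lines of Python. This is only the naive approach (join first, aggregate, then compute the skyline) that the abstract calls impractical at scale, not any of the paper's three algorithms. The flight records, the function names and the choice to ignore ratings and amenities are all assumptions made for illustration.

```python
from itertools import product

def dominates(a, b):
    """True if a is no worse than b on every attribute and strictly
    better on at least one (smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(records, key):
    """Keep the records whose key tuple no other record dominates."""
    return [r for r in records
            if not any(dominates(key(o), key(r)) for o in records)]

# Hypothetical data: (flight id, stopover city, cost, hours).
first_legs = [("A1", "C", 200, 3), ("A2", "C", 150, 5), ("A3", "D", 300, 2)]
second_legs = [("B1", "C", 100, 2), ("B2", "D", 120, 4), ("B3", "D", 90, 6)]

# Join on the stopover city and aggregate cost and time with sum.
plans = [
    (f[0], s[0], f[2] + s[2], f[3] + s[3])
    for f, s in product(first_legs, second_legs)
    if f[1] == s[1]
]

# Skyline over the aggregated attributes (total cost, total hours).
best = skyline(plans, key=lambda p: (p[2], p[3]))
```

With this data, `best` keeps the A1+B1 plan (300, 5 hours) and the A2+B1 plan (250, 7 hours). Both plans through D are dominated by A1+B1, which is cheaper and faster.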

No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If it has, it did so after I stopped traveling several years ago.

Even simple tasks such as coordination of air and train schedules were unnecessarily difficult.

I suspect that is still the case, and so I mention “skyline” queries as a topic to be aware of and, if necessary, to include in a topic map application that brings sanity to travel arrangements.

True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.

MongoDB 2.2 Released [Aggregation News - Expiring Data From Merges?]

Thursday, August 30th, 2012

MongoDB 2.2 Released

From the post:

We are pleased to announce the release of MongoDB version 2.2. This release includes over 1,000 new features, bug fixes, and performance enhancements, with a focus on improved flexibility and performance. For additional details on the release:

Of particular interest to topic map fans:

Aggregation Framework

The Aggregation Framework is available in its first production-ready release as of 2.2. The aggregation framework makes it easier to manipulate and process documents inside of MongoDB, without needing to use Map Reduce, or separate application processes for data manipulation.

See the aggregation documentation for more information.

The H Open also mentions TTL (time to live) which can remove documents from collections.

MongoDB documentation: Expire Data from Collections by Setting TTL.

Have you considered “expiring” data from merges?

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processing each day. We discuss the origins of this systems, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis

Wednesday, July 25th, 2012

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis by Susan Nevelow Mart and Jeffrey Luftig.

Abstract:

Humans and machines are both involved in the creation of legal research resources. For legal information retrieval systems, the human-curated finding aid is being overtaken by the computer algorithm. But human-curated finding aids still exist. One of them is the West Key Number system. The Key Number system’s headnote classification of case law, started back in the nineteenth century, was and is the creation of humans. The retrospective headnote classification of the cases in Lexis’s case databases, started in 1999, was created primarily although not exclusively with computer algorithms. So how do these two very different systems deal with a similar headnote from the same case, when they link the headnote to the digesting and citator functions in their respective databases? This paper continues an investigation into this question, looking at the relevance of results from digest and citator search run on matching headnotes in ninety important federal and state cases, to see how each performs. For digests, where the results are curated – where a human has made a judgment about the meaning of a case and placed it in a classification system – humans still have an advantage. For citators, where algorithm is battling algorithm to find relevant results, it is a matter of the better algorithm winning. But no one algorithm is doing a very good job of finding all the relevant results; the overlap between the two citator systems is not that large. The lesson for researchers: know how your legal research system was created, what involvement, if any, humans had in the curation of the system, and what a researcher can and cannot expect from the system you are using.

A must read for library students and legal researchers.

For legal research, the authors conclude:

The intervention of humans as curators in online environments is being recognized as a way to add value to an algorithm’s results, in legal research tools as well as web-based applications in other areas. Humans still have an edge in predicting which cases are relevant. And the intersection of human curation and algorithmically-generated data sets is already well underway. More curation will improve the quality of results in legal research tools, and most particularly can be used to address the algorithmic deficit that still seems to exist where analogic reasoning is needed. So for legal research, there is a case for curation. [footnotes omitted]

The distinction between curation (human gathering of relevant material) and aggregation (machine gathering of potentially relevant material) looks quite useful.

Curation anyone?

I first saw this at Legal Informatics.

Implementing Aggregation Functions in MongoDB

Tuesday, June 26th, 2012

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases such as document stores, key-value stores, graph database, object database, etc.

Typical use cases for NoSQL database includes archiving old logs, event logging, ecommerce application log, gaming data, social data, etc. due to its fast read-write capability. The stored data would then require to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB which is an open source document oriented NoSQL database system written in C++. It provides a high performance document oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. Map Reduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.
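The idea in the quote, computing sum, average, max, min, variance and standard deviation through map, reduce and finalize steps, can be imitated in plain Python without a MongoDB server. This is a rough sketch and not the article's JavaScript code. The `region` and `amount` fields and all function names are hypothetical.

```python
import math
from collections import defaultdict
from functools import reduce

def map_fn(doc):
    """Emit (key, partial statistics) for one document."""
    x = doc["amount"]
    return doc["region"], {"count": 1, "sum": x, "sumsq": x * x, "min": x, "max": x}

def reduce_fn(a, b):
    """Combine two partial results; the order of combination does not matter."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "sumsq": a["sumsq"] + b["sumsq"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

def finalize_fn(v):
    """Derive average, population variance and standard deviation."""
    mean = v["sum"] / v["count"]
    variance = v["sumsq"] / v["count"] - mean * mean
    return {**v, "avg": mean, "variance": variance,
            "stddev": math.sqrt(max(variance, 0.0))}

def map_reduce(docs):
    """Group mapped values by key, then reduce and finalize each group."""
    groups = defaultdict(list)
    for doc in docs:
        key, value = map_fn(doc)
        groups[key].append(value)
    return {k: finalize_fn(reduce(reduce_fn, vs)) for k, vs in groups.items()}

# Hypothetical sales documents.
sales = [
    {"region": "east", "amount": 10.0},
    {"region": "east", "amount": 20.0},
    {"region": "west", "amount": 5.0},
]
stats = map_reduce(sales)
```

The reducer carries count, sum and sum of squares because those partials can be combined in any order, which is what a distributed reduce needs. The sum-of-squares formula for variance is simple but can lose precision on large values.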

Not terribly advanced but enough to get you started with creating aggregation functions.

Includes “testing” of the aggregation functions that are written in the article.

If Python is more your cup of tea, see: Aggregation in MongoDB (Part 1) and Aggregation in MongoDB (Part 2).

OData Extensions for Data Aggregation

Friday, June 15th, 2012

OData Extensions for Data Aggregation by Chris Webb.

Chris writes:

I was just reading the following blog post on the OASIS OData Technical Committee Call for Participation: http://www.odata.org/blog/2012/6/11/oasis-odata-technical-committee-call-for-participation

…when I saw this:

In addition to the core OData version 3.0 protocol found here, the Technical Committee will be defining some key extensions in the first version of the OASIS Standard:

OData Extensions for Data Aggregation – Business Intelligence provides the ability to get the right set of aggregated results from large data warehouses. OData Extensions for Analytics enable OData to support Business Intelligence by allowing services to model data analytic “cubes” (dimensions, hierarchies, measures) and consumers to query aggregated data

Follow the link in the quoted text – it’s very interesting reading! Here’s just one juicy quote:

You have to go to Chris’ post to see the “juicy quote.” ;-)

With more data becoming available, at higher speeds, data aggregation is going to be the norm.

Some people will do it well. Some people will do it not so well.

Which one will describe you?

Participation in the OData TC at OASIS may help shape that answer: http://www.odata.org/blog/2012/6/11/oasis-odata-technical-committee-call-for-participation

First meeting details:

The first meeting of the Technical Committee will be a face-to-face meeting to be held in Redmond, Washington on July 26-27, 2012 from 9 AM PT to 5 PM PT. This meeting will be sponsored by Microsoft. Dial-in conference calling bridge numbers will be available for those unable to attend in person.

At least the meeting is on a Thursday/Friday slot! Any comments on the weather to expect in late July?

Real-time Analytics with HBase [Aggregation is a form of merging.]

Monday, June 11th, 2012

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)
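To make the question of what is lost concrete, here is a minimal Python sketch of pre-aggregation. It is not Sematext's HBaseHUT approach, and the event fields are hypothetical. Raw events are rolled up into per-hour page counts, and the user, a subject we might have wanted to preserve, disappears from the aggregate.

```python
from collections import Counter

# Hypothetical raw events: (hour, user, page).
events = [
    (9, "alice", "/home"),
    (9, "bob", "/home"),
    (9, "alice", "/pricing"),
    (10, "carol", "/home"),
]

# Pre-aggregate: page views per (hour, page). The user is dropped here.
views = Counter((hour, page) for hour, _user, page in events)

# A question the aggregate can still answer.
home_at_9 = views[(9, "/home")]

# A question only the raw events can answer: who viewed /home at 9?
users_home_at_9 = {u for h, u, p in events if (h, p) == (9, "/home")}
```

Knowing the aggregation process (here, grouping by hour and page) is what lets us say the user identity is the information lost.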

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2)

Monday, June 4th, 2012

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2) by Rick Copeland.

From the post:

Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you’re all caught up, let’s jump right in….

Why a new framework?

If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let’s go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.

Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: You should use only as much aggregation, or in topic map terminology, “merging,” as you need.

It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given.

In part because new semantics are emerging every day and there are too many previous semantics that are poorly documented or unknown.

What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.

Aggregation in MongoDB (Part 1)

Monday, June 4th, 2012

Aggregation in MongoDB (Part 1) by Rick Copeland.

From the post:

In some previous posts on mongodb and python, pymongo, and gridfs, I introduced the NoSQL database MongoDB how to use it from Python, and how to use it to store large (more than 16 MB) files in it. Here, I’ll be showing you a few of the features that the current (2.0) version of MongoDB includes for performing aggregation. In a future post, I’ll give you a peek into the new aggregation framework included in MongoDB version 2.1.

An index “aggregates” information about a subject (called an ‘entry’), where the information is traditionally found between the covers of a book.

MongoDB offers predefined as well as custom “aggregations,” where the information field can be larger than a single book.

Good introduction to aggregation in MongoDB, although you (and I) really should get around to reading the MongoDB documentation.

City Dashboard: Aggregating All Spatial Data for Cities in the UK

Saturday, April 28th, 2012

City Dashboard: Aggregating All Spatial Data for Cities in the UK

You need to try this out for yourself before reading the rest of this post.

Go ahead, I’ll wait…, …, …, ok.

To some extent this “aggregation” may reflect on the sort of questions we ask users about topic maps.

It’s possible to aggregate data about any number of things. But even if you could, would you want to?

Take the “aggregation” for Birmingham, UK, this evening. One of the components informed me a choir director was arrested for rape. That concerns the choir director a good bit, but why would it interest me?

Isn’t that the problem of aggregation? The definition of “useful” aggregation varies from person to person, even task to task.

Try London while you are at the site. There is a Slightly Unhappier/Significantly Unhappier “Mood” indicator. It has what turns out to be a “count down” timer for the next reset on the indicator.

I thought the changing count reflected people becoming more and more unhappy.

Looked like London was going to “flatline” while I was watching. ;-)

Fortunately turned out to not be the case.

There are dangers to personalization but aggregation without relevance just pumps up the noise.

Not sure that helps either.

Suggestions?

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Thursday, February 9th, 2012

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s build-in map-reduce functionality to improve overall performance. Part 3 finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB’s build-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a more simple solution to calculating aggregate values instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.
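For readers unfamiliar with the Tanimoto coefficient mentioned in the quote: for two fingerprints treated as sets of features, it is the size of their intersection divided by the size of their union. The Python sketch below is not the Datablend code. The fingerprints, the compound names and the `similar` helper are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprint sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical compound fingerprints (sets of feature ids).
compounds = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3, 5},
    "c3": {7, 8},
}
target = {1, 2, 3, 4}

def similar(target, compounds, threshold):
    """Names of compounds whose similarity to target meets the threshold."""
    return sorted(name for name, fp in compounds.items()
                  if tanimoto(target, fp) >= threshold)
```

Lowering the threshold returns more compounds: `similar(target, compounds, 0.9)` keeps only `c1`, while a threshold of 0.5 also admits `c2`. That growth is the data transfer problem the quoted post describes.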

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline for analysis of chess games. Rather than summing the number of games in which the move “e4” is the first move for White, links to all 696 games could be treated as occurrences of that subject. Which would support discovery of the player of White as well as Black.
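The difference between summing and keeping occurrences is easy to see in a small Python sketch. The game records below are invented and far smaller than the data set in Kristina's post; the field names are assumptions.

```python
from collections import defaultdict

# Hypothetical game records.
games = [
    {"id": "g1", "white": "Adams", "black": "Baker", "moves": ["e4", "e5"]},
    {"id": "g2", "white": "Clark", "black": "Adams", "moves": ["d4", "d5"]},
    {"id": "g3", "white": "Baker", "black": "Clark", "moves": ["e4", "c5"]},
]

count_by_first_move = defaultdict(int)         # the aggregate as a bare number
occurrences_by_first_move = defaultdict(list)  # the aggregate as links to games

for game in games:
    first = game["moves"][0]
    count_by_first_move[first] += 1
    occurrences_by_first_move[first].append(game["id"])

# With occurrences kept, the players can be recovered from the aggregate.
by_id = {game["id"]: game for game in games}
e4_players = [(by_id[i]["white"], by_id[i]["black"])
              for i in occurrences_by_first_move["e4"]]
```

The count can always be derived from the occurrences (`len(occurrences_by_first_move["e4"])`), but not the other way around.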

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)

The Comments Conundrum

Sunday, February 5th, 2012

The Comments Conundrum by Kristina Chodorow.

From the post:

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.
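The “unwind” step can be imitated in plain Python to show why it answers the questions in the quote. This sketch only mimics the idea on in-memory dictionaries; it does not call MongoDB, and the post data and the `unwind` helper are hypothetical.

```python
from collections import Counter

def unwind(docs, field):
    """Yield one copy of each document per element of its array field.
    Documents whose array is empty or missing produce nothing."""
    for doc in docs:
        for item in doc.get(field, []):
            yield {**doc, field: item}

# Hypothetical blog posts with embedded comments.
posts = [
    {"title": "First", "comments": [{"author": "ann", "text": "hi"},
                                    {"author": "bob", "text": "yo"}]},
    {"title": "Second", "comments": [{"author": "ann", "text": "again"}]},
    {"title": "Third", "comments": []},
]

# After unwinding, each comment is a document of its own.
flat = list(unwind(posts, "comments"))

# All comments by a given author.
by_ann = [d["comments"]["text"] for d in flat if d["comments"]["author"] == "ann"]

# The most popular commenters.
top_commenters = Counter(d["comments"]["author"] for d in flat).most_common()
```

Once the comments are separate documents, filtering and grouping them are ordinary operations, which is the ability the quote says embedded documents previously lacked.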

Kristina continues her coverage of the aggregation pipeline in MongoDB.

Question: What is the result of an aggregation? (In a topic map sense?)

Hacking Chess with the MongoDB Pipeline

Thursday, February 2nd, 2012

Hacking Chess with the MongoDB Pipeline

Kristina Chodorow* writes:

MongoDB’s new aggegation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

Make sure you have a the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

You can use mongoimport to import games.json into MongoDB:

If you think of this example of “aggregation” as merging where the subjects have a uniform identifier (chess piece/move), you will understand why I find this interesting.

Aggregation, as is shown by Kristina’s post, can form the basis for analysis of data.

Analysis that isn’t possible in the absence of aggregation (read merging).

I am looking forward to additional posts on the aggregation framework and need to drop by the MongoDB project to see what the future holds on aggregation/merging.

*Kristina is the author of two O’Reilly titles, MongoDB: the definitive guide and Scaling MongoDB.

Aggregation and Restructuring data (from “R in Action”)

Tuesday, January 10th, 2012

Aggregation and Restructuring data (from “R in Action”) by Dr. Robert I. Kabacoff.

From the post:

R provides a number of powerful methods for aggregating and reshaping data. When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This article describes a variety of methods for accomplishing these tasks.

We’ll use the mtcars data frame that’s included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 34 automobiles. To learn more about the dataset, see help(mtcars).
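The quoted article works in R. As a language-neutral illustration of “replace groups of observations with summary statistics,” here is a small Python sketch. The rows only borrow column names from mtcars (cyl, mpg, hp); the values are invented, and the `aggregate` helper is hypothetical rather than a port of R's function.

```python
from collections import defaultdict
from statistics import mean

# Invented rows using mtcars-style column names; not the real data set.
rows = [
    {"cyl": 4, "mpg": 30.0, "hp": 70},
    {"cyl": 4, "mpg": 26.0, "hp": 90},
    {"cyl": 6, "mpg": 20.0, "hp": 110},
    {"cyl": 8, "mpg": 15.0, "hp": 200},
    {"cyl": 8, "mpg": 13.0, "hp": 240},
]

def aggregate(rows, by, columns, fun=mean):
    """Group rows on one column and summarize the others with fun."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return {
        key: {col: fun([r[col] for r in members]) for col in columns}
        for key, members in sorted(groups.items())
    }

summary = aggregate(rows, by="cyl", columns=["mpg", "hp"])
```

Five observations become three summary rows, one per cylinder count. Passing `fun=max` or `fun=min` changes the statistic without changing the grouping.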

How do you recognize what data you want to aggregate or transpose?

Or communicate that knowledge to future users?

The data set for Motor Trend magazine is an easy one.

If you have access to the electronic text for Motor Trend magazine (one or more issues) for 1974, drop me a line. I am thinking of a way to illustrate the “semantic” problem.

SQL to MongoDB: An Updated Mapping

Saturday, December 17th, 2011

SQL to MongoDB: An Updated Mapping from Kristina Chodorow.

From the post:

The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.

In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.

If you are interested in MongoDB-based solutions, this will be very interesting.

QL.IO

Thursday, December 8th, 2011

QL.IO – A declarative, data-retrieval and aggregation gateway for quickly consuming HTTP APIs.

From the about page:

A SQL and JSON inspired DSL

SQL is quite a powerful DSL to retrieve, filter, project, and join data — see efforts like A co-Relational Model of Data for Large Shared Data Banks, LINQ, YQL, or unQL for examples.

ql.io combines SQL, JSON, and a few procedural style constructs into a compact language. Scripts written in this language can make HTTP requests to retrieve data, perform joins between API responses, project responses, or even make requests in a loop. But note that ql.io’s scripting language is not SQL – it is SQL inspired.

Orchestration

Most real-world client apps need to mashup data from multiple APIs in one go. Data mashup is often complicated as client apps need to worry about order of requests, inter-dependencies, error handling, and parallelization to reduce overall latency.

ql.io’s scripts are procedural in appearance but are executed out of order based on dependencies. Some statements may be scheduled in parallel and some in series based on a dependency analysis done at script compile time. The compilation is an on-the-fly process.

Consumer Centric Interfaces

APIs are designed for reuse, and hence they cater to the common denominator. Getting new fields added, optimizing responses, or combining multiple requests into one involve drawn out negotiations between API producing teams and API consuming teams.

ql.io lets API consuming teams move fast by creating consumer-centric interfaces that are optimized for the client – such optimized interfaces can reduce bandwidth usage and number of HTTP requests.
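The orchestration idea in the quote, statements written in sequence but executed according to their dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not ql.io's compiler. The statement names and the `schedule` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical script: statement name -> statements it depends on.
script = {
    "user": set(),
    "orders": {"user"},
    "reviews": {"user"},
    "page": {"orders", "reviews"},
}

def schedule(deps):
    """Group statements into waves. Statements in the same wave do not
    depend on each other, so they could be issued in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = schedule(script)
```

Here `orders` and `reviews` land in the same wave because each needs only `user`, while `page` must wait for both.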

I can believe the “SQL inspired” part since it looks like keys/column headers are opaque. That is, you can specify a key/column header but you can’t specify the identity of the subject it represents.

So, if you don’t know the correct term, you are SOL. Which isn’t the state of being inspired.

Still, it looks like an interesting effort that could develop to be non-opaque with regard to keys and possibly values. (The next stage is how do you learn what properties a subject representative has for the purpose of subject recognition.)

SEISA: set expansion by iterative similarity aggregation

Friday, April 1st, 2011

SEISA: set expansion by iterative similarity aggregation by Yeye He, University of Wisconsin-Madison, Madison, WI, USA, and Dong Xin, Microsoft Research, Redmond, WA, USA.

In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.

To the uses of set expansion mentioned by the authors:

Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13]

I would add:

  • augmented authoring of navigation tools for text corpora
  • discovery of related entities (for associations)

While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could be just as easily applied to commercial texts or used to generate pay-for-view results.

It would have to really be a step up to get people to pay a premium for navigation of free content, but given the noisy nature of most information sites, that is certainly possible.

Unified analysis of streaming news

Thursday, March 31st, 2011

Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and Eric Xing of Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

From the article:

Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.

I single out that part of the paper to remark that the authors first say the vocabulary for a story may vary and then, in the next breath, say that the vocabulary will agree on the key issues.

Given the success of their results, could it be that news reporting is more homogeneous in its vocabulary than other forms of writing?

Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.
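The vocabulary question can be probed with a much cruder tool than the paper's model. The sketch below is a greedy single-pass clustering on word overlap, not the authors' hybrid clustering and topic model or their sequential Monte Carlo inference. The function names, the 0.3 threshold, and the four headlines are assumptions invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_stories(articles, threshold=0.3):
    """Assign each incoming article to its most similar story so far.

    An article whose best similarity falls below `threshold` starts a
    new story. Returns one story index per article.
    """
    stories = []  # one aggregated word-count vector per story
    labels = []
    for text in articles:
        words = Counter(text.lower().split())
        best, best_sim = None, threshold
        for index, story in enumerate(stories):
            sim = cosine(words, story)
            if sim >= best_sim:
                best, best_sim = index, sim
        if best is None:
            stories.append(words)  # start a new story
            best = len(stories) - 1
        else:
            stories[best].update(words)  # fold the article into the story
        labels.append(best)
    return labels


# Hypothetical headlines, invented for this sketch.
articles = [
    "earthquake strikes coastal city overnight",
    "rescue teams reach coastal city after earthquake",
    "central bank raises interest rates",
    "markets react as central bank raises rates",
]

labels = assign_stories(articles)
print(labels)
```

If shared vocabulary alone is enough to group articles in a stream, that supports the idea that reporters converge on the same key terms for a story.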

Provenance for Aggregate Queries

Friday, January 7th, 2011

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, and Val Tannen.

Abstract:

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

Not reading for the faint of heart.

But, provenance for merging is one obvious application of this paper.

For that matter, provenance should also be a consideration for TMQL.
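For readers new to the tuple-annotation approach that the abstract starts from, here is a minimal sketch of how annotations propagate through a join and a projection. It uses a simple "which source tuples were used together" annotation, where a join combines annotations and a projection merges the alternatives. It does not attempt the paper's value-level construction for aggregates, and the relation layout, function names, and tuple ids (`e1`, `d1`, ...) are assumptions invented for illustration.

```python
from itertools import product

# An annotation is a set of witnesses; each witness is the set of source
# tuple ids that together justify an output tuple.


def tag(tuple_id):
    """Annotation for a source tuple: one witness holding its own id."""
    return frozenset({frozenset({tuple_id})})


def plus(a, b):
    """Alternative derivations (union, projection): merge the witnesses."""
    return a | b


def times(a, b):
    """Joint derivation (join): pair up witnesses from both sides."""
    return frozenset(x | y for x, y in product(a, b))


def join(left, right, left_col, right_col):
    """Equi-join two annotated relations (dicts from row tuple to annotation)."""
    out = {}
    for (lrow, lann), (rrow, rann) in product(left.items(), right.items()):
        if lrow[left_col] == rrow[right_col]:
            row = lrow + rrow
            out[row] = plus(out.get(row, frozenset()), times(lann, rann))
    return out


def project(relation, cols):
    """Project onto `cols`, adding the annotations of rows that collapse."""
    out = {}
    for row, ann in relation.items():
        key = tuple(row[c] for c in cols)
        out[key] = plus(out.get(key, frozenset()), ann)
    return out


# Hypothetical source relations: (name, dept) and (dept, city).
employees = {
    ("ann", "sales"): tag("e1"),
    ("bob", "sales"): tag("e2"),
    ("cy", "ops"): tag("e3"),
}
departments = {
    ("sales", "paris"): tag("d1"),
    ("ops", "paris"): tag("d2"),
}

cities = project(join(employees, departments, 1, 0), [3])
for row, annotation in cities.items():
    print(row, sorted(sorted(w) for w in annotation))
```

The single output row for the city carries three witnesses, one per employee and department pair that produced it. Merged topics in a topic map could carry the same kind of record of what was merged from where.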