Archive for the ‘Summarization’ Category

Document Summarization via Markov Chains

Saturday, October 17th, 2015

Document Summarization via Markov Chains by Atabey Kaygun.

From the post:

Description of the problem

Today’s question is this: we have a long text and we want a machine generated summary of the text. Below, I will describe a statistical (hence language agnostic) method to do just that.

Sentences, overlaps and Markov chains.

In my previous post I described a method to measure the overlap between two sentences in terms of common words. Today, we will use the same measure, or a variation, to develop a discrete Markov chain whose nodes are labeled by individual sentences appearing in our text. This is essentially page rank applied to sentences.
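A minimal sketch of that idea — my own simplification, not Kaygun's supplied code: treat each sentence as a node, weight transitions by common-word overlap, and rank sentences by the chain's stationary distribution (PageRank by power iteration).

```python
# Sentence-level PageRank summarizer: a hedged sketch of the approach,
# not Kaygun's implementation. Overlap measure and damping are assumptions.

import re

def sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def words(sentence):
    return set(re.findall(r'[a-z]+', sentence.lower()))

def overlap(a, b):
    # Common-word overlap, normalized by the two sentence lengths.
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (len(wa) + len(wb))

def summarize(text, n=2, damping=0.85, iters=50):
    sents = sentences(text)
    k = len(sents)
    if k <= n:
        return sents
    # Row-stochastic transition matrix built from pairwise overlaps.
    rows = []
    for i in range(k):
        row = [overlap(sents[i], sents[j]) if i != j else 0.0 for j in range(k)]
        total = sum(row) or 1.0
        rows.append([w / total for w in row])
    # Power iteration for the stationary distribution (PageRank).
    rank = [1.0 / k] * k
    for _ in range(iters):
        rank = [(1 - damping) / k + damping *
                sum(rank[i] * rows[i][j] for i in range(k))
                for j in range(k)]
    top = sorted(range(k), key=lambda j: -rank[j])[:n]
    return [sents[j] for j in sorted(top)]  # preserve original order
```

Sentences that share vocabulary with many other sentences accumulate rank, which is why the method favors "central" sentences — and why it does well on news prose and poorly on long, internally cross-referencing texts like court decisions.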

Atabey says the algorithm (code supplied) works well on:

news articles, opinion pieces and blog posts.

Not so hot on Supreme Court decisions.

In commenting on a summary of the New York Times story Obama Won’t Seek Access to Encrypted User Data, Atabey notes that the summary gives us no referent for “what frustrated him.”

If you consider the relevant paragraph from the New York Times story:

Mr. Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. What frustrated him was that Apple had designed the system to ensure that the company never held on to the keys, putting them entirely in the hands of users through the codes or fingerprints they use to get into their phones. As a result, if Apple is handed a court order for data — until recently, it received hundreds every year — it could not open the coded information.

The reference is clear. Several other people are mentioned in the New York Times article but none rank high enough to appear in the summary.

Not a sure bet, but with some testing, try resolving such references to the people who rank high enough to appear in the summary.

Condensing News

Thursday, June 12th, 2014

Information Overload: Can algorithms help us navigate the untamed landscape of online news? by Jason Cohn.

From the post:

Digital journalism has evolved to a point of paradox: we now have access to such an overwhelming amount of news that it’s actually become more difficult to understand current events. IDEO New York developer Francis Tseng is—in his spare time—searching for a solution to the problem by exploring its root: the relationship between content and code. Tseng received a grant from the Knight Foundation to develop Argos*, an online news aggregation app that intelligently collects, summarizes and provides contextual information for news stories. Having recently finished version 0.1.0, which he calls the first “complete-ish” release of Argos, Tseng spoke with veteran journalist and documentary filmmaker Jason Cohn about the role technology can play in our consumption—and comprehension—of the news.

Great story and very interesting software. And as Alyona notes in her tweet, it’s open source!

Any number of applications, particularly for bloggers who are scanning lots of source material every day.

Intended for online news but a similar application would be useful for TV news as well. In the Atlanta, Georgia area a broadcast could be prefaced by:

  • Accidents (grisly ones) 25%
  • Crimes (various) 30%
  • News previously reported but it’s a slow day today 15%
  • News to be reported on a later broadcast 10%
  • Politics (non-contextualized posturing) 10%
  • Sports (excluding molesting stories reported under crimes) 5%
  • Weather 5%

I haven’t timed the news and some channels are worse than others but take that as a recurrent, public domain summary of Atlanta news. 😉

For digital news feeds, check out the Argos software!

I first saw this in a tweet by Alyona Medelyan.

Wavii: New Kind Of News Gatherer – (Donii?)

Wednesday, April 11th, 2012

Wavii: New Kind Of News Gatherer by Thomas Claburn.

Wavii, a new breed of aggregator, gives you news feeds culled from across the Web, from sources far beyond Google News. It also understands your interests and summarizes results.

From the post:

Imagine being able to follow topics rather than people on social networks. Imagine a Google Alert that arrived because Google actually had some understanding of your interests beyond what can be gleaned from the keywords you provided. That’s basically what Wavii, entering open beta testing on Wednesday, makes possible: It offers a way to follow topics or concepts and to receive updates in an automatically generated summary format.

Founded in 2009 by Adrian Aoun, an entrepreneur and former employee of Microsoft and Fox Media Interactive, Wavii provides users with news feeds culled from across the Web that can be accessed via Wavii’s website or mobile app. Unlike Google Alerts, these feeds are composed from content beyond Google News. Wavii gathers its information from all over the Web–news, videos, tweets, and beyond–and then attempts to make sense of what it has found using machine learning techniques.

Wavii is not just a pattern-matching system. It recognizes linguistic concepts and that understanding makes its assistance more valuable: Not only is Wavii good at finding information that matches a user’s expressed interests but it also concisely summarizes that information. The company has succeeded at a task that other companies haven’t managed to do quite as well.

Sounds interesting. After the initial rush I will sign up for a test drive.

The story did not report what economic model Wavii will be following. I assume the server space, CPU cycles, and staff time aren’t being donated. Yes? Wonder why that wasn’t worth mentioning. You?

BTW, let’s not be like television, where if one housewife-hooker show succeeds this season, next season will bring higher- and lower-end housewives doing the same thing, and the year after that, well, let’s just say one of the partners will be non-human.

Here’s my alternative: Donii – Donii reports donations to you from within 2 degrees of separation of the person in front of you. Custom level settings: Hug; Nod Encouragingly; Glad Hand; Look For Someone Else, Anyone Else.

Summify’s Technology Examined

Wednesday, November 2nd, 2011

Summify’s Technology Examined by Phil Whelan.

From the post:

Following on from examining Quora’s technology, I thought I would look at a tech company closer to home. Home being Vancouver, BC. While the tech scene is much smaller here than in the valley, it is here. In fact, Vancouver boasts the largest number of entrepreneurs per capita. Summify is a website that strives to make our lives easier and helps us deal with the information overload we all experience every time we sit down at our computers. The founders of this start-up, Cristian Strat and Mircea Paşoi, seem to have all the right ingredients for success. This is their biggest venture so far, but not their first. They have previously built two other sites, both focused on their home country of Romania.

“We’re a team of two Romanian hackers and entrepreneurs, passionate about technology and Internet startups. We’ve interned at Google and Microsoft and we’ve kicked ass in programming contests like the International Olympiad in Informatics and TopCoder.”
– Summify Team. “Our Story”

In this post I will look at the technology infrastructure they have built for Summify, the details of which they were kind enough to share with me.

This is from last spring, so it may be old news, but I thought it was an interesting look “behind the scenes” at an “information overload solution” application.

Curious that the two challenges for Summify were seen as:

  • Crawling a large volume of feeds and web pages
  • Live streaming updates to the website

May just be me, but I would think the semantics of the feeds would rank pretty high, both in terms of recognizing items of interest in terminology familiar to the user and in new terminology. For example, if I say I want feeds on P2P systems, an information overload reducing application would also give me distributed network entries.

That’s an easy example but you get the idea. And the system should do that across different interests of users and update its recognition of relevant items to include new terminology as it emerges.
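A hedged sketch of that kind of interest expansion. The related-term table below is hand-built purely for illustration; a real system would learn it from co-occurrence statistics and update it as new terminology emerges.

```python
# Interest expansion for feed matching: a toy sketch, not Summify's method.
# RELATED is a hypothetical hand-curated seed table.

RELATED = {
    "p2p": {"peer-to-peer", "distributed network", "overlay network"},
    "summarization": {"text summarization", "abstract generation"},
}

def expand_interest(interest):
    """Expand a user's stated interest into a set of related terms."""
    terms = {interest.lower()}
    for key, synonyms in RELATED.items():
        if key in interest.lower():
            terms |= synonyms
    return terms

def matches(item_title, interest):
    """True if a feed item's title mentions the interest or a related term."""
    expanded = expand_interest(interest)
    title = item_title.lower()
    return any(term in title for term in expanded)
```

So a user who asks for “P2P systems” would still be shown an item titled “New distributed network protocols announced,” which plain keyword matching would miss.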

BTW, you might want to check out the Summify FAQ on how they determine your interests.

News: Summarization and Visualization

Tuesday, September 20th, 2011

News: Summarization and Visualization (CITRIS* i4Science Lecture Series) by Laurent El Ghaoui.

I don’t know that I agree with the point: “Yet we can’t do without news!”

As much noise as is in the news, I think I could read about how it comes out a year later and not have missed much. 😉

Do watch this lecture: it is very interesting, counting and visualizing words in ways you might not expect. A great way to explore text resources.

There is an argument against normalization for search purposes.

For extra credit: How would you test a search engine to see how normalization was affecting its results?

*CITRIS – Center for Information Technology Research in the Interest of Society

New Challenges in Distributed Information Filtering and Retrieval

Sunday, September 11th, 2011

New Challenges in Distributed Information Filtering and Retrieval

Proceedings of the 5th International Workshop on New Challenges in Distributed Information Filtering and Retrieval
Palermo, Italy, September 17, 2011.

Edited by:

Cristian Lai – CRS4, Loc. Piscina Manna, Building 1 – 09010 Pula (CA), Italy

Giovanni Semeraro – Dept. of Computer Science, University of Bari, Aldo Moro, Via E. Orabona, 4, 70125 Bari, Italy

Eloisa Vargiu – Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

Table of Contents:

  1. Experimenting Text Summarization on Multimodal Aggregation
    Giuliano Armano, Alessandro Giuliani, Alberto Messina, Maurizio Montagnuolo, Eloisa Vargiu
  2. From Tags to Emotions: Ontology-driven Sentimental Analysis in the Social Semantic Web
    Matteo Baldoni, Cristina Baroglio, Viviana Patti, Paolo Rena
  3. A Multi-Agent Decision Support System for Dynamic Supply Chain Organization
    Luca Greco, Liliana Lo Presti, Agnese Augello, Giuseppe Lo Re, Marco La Cascia, Salvatore Gaglio
  4. A Formalism for Temporal Annotation and Reasoning of Complex Events in Natural Language
    Francesco Mele, Antonio Sorgente
  5. Interaction Mining: the new Frontier of Call Center Analytics
    Vincenzo Pallotta, Rodolfo Delmonte, Lammert Vrieling, David Walker
  6. Context-Aware Recommender Systems: A Comparison Of Three Approaches
    Umberto Panniello, Michele Gorgoglione
  7. A Multi-Agent System for Information Semantic Sharing
    Agostino Poggi, Michele Tomaiuolo
  8. Temporal characterization of the requests to Wikipedia
    Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Rocio Muñoz-Mansilla, Israel Herraiz
  9. From Logical Forms to SPARQL Query with GETARUN
    Rocco Tripodi, Rodolfo Delmonte
  10. ImageHunter: a Novel Tool for Relevance Feedback in Content Based Image Retrieval
    Roberto Tronci, Gabriele Murgia, Maurizio Pili, Luca Piras, Giorgio Giacinto

Discovering, Summarizing and Using Multiple Clusterings

Friday, September 2nd, 2011

Discovering, Summarizing and Using Multiple Clusterings

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings

Athens, Greece, September 5, 2011.

Where you will find:

Invited Talks

1. Combinatorial Approaches to Clustering and Feature Selection, Michael E. Houle

2. Cartification: Turning Similarities into Itemset Frequencies, Bart Goethals

Research Papers

3. When Pattern Met Subspace Cluster, Jilles Vreeken, Arthur Zimek

4. Fast Multidimensional Clustering of Categorical Data, Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu

5. Factorial Clustering with an Application to Plant Distribution Data, Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth

6. Subjectively Interesting Alternative Clusters,Tijl De Bie

7. Evaluation of Multiple Clustering Solutions, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek

8. Browsing Robust Clustering-Alternatives, Martin Hahmann, Dirk Habich, Wolfgang Lehner

9. Generating a Diverse Set of High-Quality Clusterings, Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

A Term Association Inference Model for Single Documents:….

Monday, November 22nd, 2010

A Term Association Inference Model for Single Documents: A Stepping Stone for Investigation through Information Extraction

Author(s): Sukanya Manna and Tom Gedeon

Keywords: Information retrieval, investigation, Gain of Words, Gain of Sentences, term significance, summarization


In this paper, we propose a term association model which extracts significant terms as well as the important regions from a single document. This model is a basis for a systematic form of subjective data analysis which captures the notion of relatedness of different discourse structures considered in the document, without having a predefined knowledge-base. This is a paving stone for investigation or security purposes, where possible patterns need to be figured out from a witness statement or a few witness statements. This is unlikely to be possible in predictive data mining where the system can not work efficiently in the absence of existing patterns or large amount of data. This model overcomes the basic drawback of existing language models for choosing significant terms in single documents. We used a text summarization method to validate a part of this work and compare our term significance with a modified version of Salton’s [1].

Excellent work that illustrates how re-thinking of fundamental assumptions of data mining can lead to useful results.
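A hedged sketch of the general idea — scoring terms in a single document by how widely and how often they occur across sentences. This is my own simplification for illustration, not the authors’ actual Gain of Words / Gain of Sentences measures.

```python
# Single-document term significance via sentence co-occurrence.
# A simplified illustration, not the Manna & Gedeon model.

import re
from collections import Counter

def significant_terms(text, top=5):
    sents = [re.findall(r'[a-z]+', s.lower())
             for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Score each term by the number of distinct sentences it appears in,
    # weighted by its total frequency in the document.
    sentence_count = Counter()
    frequency = Counter()
    for sent in sents:
        for term in set(sent):
            sentence_count[term] += 1
        frequency.update(sent)
    scores = {t: sentence_count[t] * frequency[t] for t in frequency}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]
```

Note that nothing here requires a predefined knowledge base or a large corpus, which is the point the paper makes against standard predictive data mining for witness-statement analysis.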


  1. Create an annotated bibliography of citations to this article.
  2. Citations of items in the bibliography since this paper (2008)? List and annotate.
  3. How would you use this approach with a document archive project? (3-5 pages, no citations)

Text Analysis Conference (TAC)

Sunday, November 21st, 2010

Text Analysis Conference (TAC)

From the website:

The Text Analysis Conference (TAC) is a series of evaluation workshops organized to encourage research in Natural Language Processing and related applications, by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. TAC comprises sets of tasks known as “tracks,” each of which focuses on a particular subproblem of NLP. TAC tracks focus on end-user tasks, but also include component evaluations situated within the context of end-user tasks.

  • Knowledge Base Population

    The goal of the Knowledge Base Population track is to develop systems that can augment an existing knowledge representation (based on Wikipedia infoboxes) with information about entities that is discovered from a collection of documents.

  • Recognizing Textual Entailment

    The goal of the RTE Track is to develop systems that recognize when one piece of text entails another.

  • Summarization

    The goal of the Summarization Track is to develop systems that produce short, coherent summaries of text.

Sponsored by the U.S. Department of Defense.

Rumor has it that one intelligence analysis group won a DoD contract without hiring an ex-general. If you get noticed by a prime contractor here, perhaps you won’t have to either. The primes have lots of ex-generals/colonels, etc.


  1. Select a paper from one of the TAC conferences. Update on the status of that research. (3-5 pages, citations)
  2. For the authors of #1, annotated bibliography of publications since the paper.
  3. How would you use the technique from #1 in the construction of a topic map? Inform your understanding, selection, data for that map, etc.? (3-5 pages, no citations)

(Yes, I stole the questions from my DUC conference posting. ;-))

DUC: Document Understanding Conferences

Sunday, November 21st, 2010

DUC: Document Understanding Conferences

From the website:

There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA’s TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA’s Advanced Question & Answering Program and NIST’s TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.

Within TIDES and among other researchers interested in document understanding, a group grew up which has been focusing on summarization and the evaluation of summarization systems. Part of the initial evaluation for TIDES called for a workshop to be held in the fall of 2000 to explore different ways of summarizing a common set of documents. Additionally a road mapping effort was started in March of 2000 to lay plans for a long-term evaluation effort in summarization.

Data sets, papers, etc., on text summarization.

Yes, DUC has moved to the Text Analysis Conference (TAC) but what they don’t say is that the DUC data and papers for 2001 to 2007 are listed at this site only.

Something to remember when you are looking for text summarization data sets and research.


  1. Select a paper from the 2007 DUC conference. Update on the status of that research. (3-5 pages, citations)
  2. For the authors of #1, annotated bibliography of publications since the paper in 2007.
  3. How would you use the technique from #1 in the construction of a topic map? Inform your understanding, selection, data for that map, etc.? (3-5 pages, no citations)

Subjective Logic = Effective Logic?

Saturday, November 20th, 2010

Capture of Evidence for Summarization: An Application of Enhanced Subjective Logic

Author(s): Sukanya Manna, B. Sumudu U. Mendis, Tom Gedeon

Keywords: subjective logic, opinions, evidence, events, summarization, information extraction


In this paper, we present a method to generate an extractive summary from a single document using subjective logic. The idea behind our approach is to consider words and their co-occurrences between sentences in a document as evidence of their relatedness to the contextual meaning of the document. Our aim is to formulate a measure to find out ‘opinion’ about a proposition (which is a sentence in this case) using subjective logic in a closed environment (as in a document). Stronger opinion about a sentence represents its importance and are hence considered to summarize a document. Summaries generated by our method when evaluated with human generated summaries, show that they are more similar than baseline summaries.

The authors justify their use of “subjective” logic by saying:

[The authors] pointed out that a given piece of text is interpreted by different persons in different fashions, especially in how they understand and interpret the context. Thus we see that human understanding and reasoning is subjective in nature unlike propositional logic which deals with either truth or falsity of a statement. So, to deal with this kind of situation we used subjective logic to find out sentences which are significant in the context and can be used to summarize a document.

“Subjective” logic means we are more likely to reach the same result as a person reading the text.

Search results as used and evaluated by people.

That sounds like effective logic to me.
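A minimal sketch of the subjective-logic machinery involved: a binomial "opinion" (belief, disbelief, uncertainty) derived from evidence counts, using Jøsang's standard mapping b = r/(r+s+2), d = s/(r+s+2), u = 2/(r+s+2). Treating word co-occurrence counts as positive evidence for a sentence is my simplification of the paper's method.

```python
# Subjective-logic opinions for ranking sentences: a hedged sketch.
# The evidence mapping is Josang's standard binomial-opinion mapping;
# the sentence-ranking application is a simplification of the paper.

def opinion(r, s, base_rate=0.5):
    """Binomial opinion (belief, disbelief, uncertainty, expectation)
    from r positive and s negative pieces of evidence."""
    denom = r + s + 2.0
    b, d, u = r / denom, s / denom, 2.0 / denom
    expected = b + base_rate * u  # probability expectation
    return b, d, u, expected

def rank_sentences(evidence):
    """evidence: {sentence: (supporting_count, opposing_count)}.
    Sentences with stronger opinions (higher expectation) rank first."""
    scored = {sent: opinion(r, s)[3] for sent, (r, s) in evidence.items()}
    return sorted(scored, key=scored.get, reverse=True)
```

The useful property here is the explicit uncertainty component: a sentence with little evidence either way gets a wide, uncertain opinion rather than a falsely confident score, which matches the “closed environment of a single document” setting the paper describes.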


  1. Read Audun Jøsang’s article Artificial Reasoning with Subjective Logic.
  2. Summarize three (3) applications (besides the article above) of “subjective” logic. (3-5 pages, citations)
  3. How do you think “subjective” logic should be modeled in topic maps? (3-5 pages, citations optional)