Document Summarization via Markov Chains by Atabey Kaygun.
From the post:
Description of the problem
Today’s question is this: we have a long text and we want a machine generated summary of the text. Below, I will describe a statistical (hence language agnostic) method to do just that.
Sentences, overlaps and Markov chains.
In my previous post I described a method to measure the overlap between two sentences in terms of common words. Today, we will use the same measure, or a variation, to develop a discrete Markov chain whose nodes are labeled by individual sentences appearing in our text. This is essentially page rank applied to sentences.
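To make the idea concrete, here is a minimal sketch in Python. It is not Atabey's supplied code. It assumes the overlap measure is the Jaccard similarity of the two sentences' word sets (the post only says "in terms of common words"), borrows the usual PageRank damping factor of 0.85, and uses a deliberately naive sentence splitter. All function names (split_sentences, overlap, rank_sentences, summarize) are hypothetical, chosen for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased set of word tokens in the sentence.
    return set(re.findall(r"\w+", sentence.lower()))


def overlap(a, b):
    # Assumed overlap measure: Jaccard similarity of the two word sets.
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=100):
    # Build a row-stochastic transition matrix from pairwise overlaps,
    # then run PageRank-style power iteration over it.
    n = len(sentences)
    if n == 0:
        return []
    rows = []
    for i in range(n):
        weights = [overlap(sentences[i], sentences[j]) if i != j else 0.0
                   for j in range(n)]
        total = sum(weights)
        if total == 0:
            # A sentence sharing no words with any other jumps uniformly.
            rows.append([1.0 / n] * n)
        else:
            rows.append([w / total for w in weights])
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [(1 - damping) / n
                + damping * sum(rank[i] * rows[i][j] for i in range(n))
                for j in range(n)]
    return rank


def summarize(text, k=3):
    # Keep the k highest-ranked sentences, returned in their original order.
    sentences = split_sentences(text)
    rank = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: rank[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling summarize(text, k=3) returns the three sentences the chain visits most often, in the order they appear in the text. A sentence that shares no words with the rest of the document receives only the baseline share of rank, so it is unlikely to reach the summary.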
Atabey says the algorithm (code supplied) works well on news articles, opinion pieces and blog posts, but not so well on Supreme Court decisions.
Commenting on the summary of a New York Times story (Obama Won’t Seek Access to Encrypted User Data, I suspect), Atabey says that the summary gives no referent for “what frustrated him.”
If you consider the relevant paragraph from the New York Times story:
Mr. Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. What frustrated him was that Apple had designed the system to ensure that the company never held on to the keys, putting them entirely in the hands of users through the codes or fingerprints they use to get into their phones. As a result, if Apple is handed a court order for data — until recently, it received hundreds every year — it could not open the coded information.
The reference is clear. Several other people are mentioned in the New York Times article but none rank high enough to appear in the summary.
It is not a sure bet, but it seems worth testing: try attributing such references to the people who rank high enough to appear in the summary.