Archive for the ‘Text Analytics’ Category

Stanford CoreNLP – a suite of core NLP tools (3.7.0)

Thursday, January 12th, 2017

Stanford CoreNLP – a suite of core NLP tools

The beta is over and Stanford CoreNLP 3.7.0 is on the street!

From the webpage:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Available interfaces for most major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
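CoreNLP itself is Java, but the annotator-pipeline design described above (each tool adds a layer of analysis, and a single option controls which annotators run) is easy to see in miniature. A toy Python sketch of the pattern follows; this is an illustration of the design, not the CoreNLP API, and the annotator names and behavior here are invented:

```python
# Sketch of an annotator pipeline: each annotator adds a layer of
# analysis to a shared document dict, and a single option (the
# annotator list) controls which tools run and in what order.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def pos_tag(doc):
    # Toy tagger: mark capitalized tokens as proper nouns, the rest as words.
    doc["pos"] = [("NNP" if t[0].isupper() else "WORD") for t in doc["tokens"]]
    return doc

ANNOTATORS = {"tokenize": tokenize, "pos": pos_tag}

def pipeline(text, annotators=("tokenize", "pos")):
    doc = {"text": text}
    for name in annotators:   # order matters, as in a real pipeline
        doc = ANNOTATORS[name](doc)
    return doc

doc = pipeline("Stanford CoreNLP analyzes text", annotators=("tokenize", "pos"))
```

Disabling an annotator is just a matter of leaving it out of the list, which is the flexibility the quoted description is pointing at.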

What stream of noise, sorry, news are you going to pipe into the Stanford CoreNLP framework?


Imagine a web service that offers levels of analysis alongside news text.

Or does the same with leaked emails and/or documents?

Identifying Speech/News Writers

Friday, December 2nd, 2016

David Smith’s post, Stylometry: Identifying authors of texts using R, details the use of R to distinguish tweets by president-elect Donald Trump from those of his campaign staff. (Hmmm, sharing a Twitter account password; there’s bad security for you.)

The same techniques may distinguish texts delivered “live” versus those “inserted” into Congressional Record.

What other texts are ripe for distinguishing authors?

From the post:

Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.

Recently, David Robinson established a way of figuring out which tweets from Donald Trump's Twitter account came from him personally, as opposed to from campaign staff, which he verified by comparing the sentiment of tweets from Android vs. iPhone devices. Now, Ali Arsalan Kazmi has used stylometric analysis to investigate the provenance of speeches by the Prime Minister of Pakistan.
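Neither post's exact features are reproduced here, but the core move in stylometry is simple: represent each text by the relative frequencies of common function words, then compare the vectors. A stdlib Python sketch, with a deliberately tiny function-word list (real analyses use dozens to hundreds):

```python
import math
from collections import Counter

# Tiny illustrative function-word list; real stylometry uses far more.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "i", "very"]

def profile(text):
    """Relative frequencies of the function words in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def attribute(unknown, candidates):
    """Candidate whose function-word profile is closest to the unknown text."""
    p = profile(unknown)
    return max(candidates, key=lambda name: cosine(p, profile(candidates[name])))

# Invented example texts, one per candidate author.
candidates = {
    "staffer": "the campaign of the senator and the state of the race",
    "candidate": "i am very very happy i won i did",
}
author = attribute("i am so very happy and i know it", candidates)
```

With enough text per author, this crude approach already separates writing styles surprisingly well, which is why function words are the classic stylometric signal.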

A small amount of transparency can go a long way.

Email archives anyone?

Nuremberg Trial Verdicts [70th Anniversary]

Saturday, October 1st, 2016

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.

Imagine that instead of “text” similarity, these volumes were mined for “deed” similarity: similarity to deeds being performed now, by present-day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

Sunday, April 17th, 2016

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling by Greg Brown.

From the post:

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin‘s NLP class, StackOverflow, web searches), but haven’t seen it much in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.
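For comparison, the same n-gram counting the UNIX recipes produce can be done in a few lines of Python. This is a generic sketch (Greg's actual commands are in the post itself), assuming the same one-response-per-line layout:

```python
from collections import Counter

def ngrams(tokens, n):
    # Sliding window over the token list: ("a","b"), ("b","c"), ...
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(lines, n=2, k=3):
    """Most common n-grams across all lines, like sort | uniq -c | sort -rn."""
    counts = Counter()
    for line in lines:   # one response per line, as in data.txt
        counts.update(ngrams(line.lower().split(), n))
    return counts.most_common(k)

responses = ["the food was great", "the food was cold", "great service"]
top2 = top_ngrams(responses, n=2, k=2)
```

Changing `n=2` to `n=3` (or 6, for the hexagram-minded reader) is the whole extension story.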

My favorite comment on this post was a reader who extended the tri-gram generator to build a hexagram!

If that sounds unreasonable, you haven’t read very many government reports. 😉

While you are at Greg’s blog, notice a number of useful posts on Elasticsearch.

bAbI – Facebook Datasets For Automatic Text Understanding And Reasoning

Sunday, February 21st, 2016

The bAbI project

Four papers and datasets on text understanding and reasoning from Facebook.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin and Tomas Mikolov. Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698.

Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations. arXiv:1511.02301.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, Jason Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv:1511.06931.

Antoine Bordes, Nicolas Usunier, Sumit Chopra and Jason Weston. Simple Question answering with Memory Networks. arXiv:1506.02075.


Bible vs. Quran – Who’s More Violent?

Friday, January 22nd, 2016

Bible vs. Quran – Text analysis answers: Is the Quran really more violent than the Bible? by Tom H. C. Anderson.

Tom’s series appears in three parts, sharing a common title:

Part I: The Project

From part 1:

With the proliferation of terrorism connected to Islamic fundamentalism in the late-20th and early 21st centuries, the question of whether or not there is something inherently violent about Islam has become the subject of intense and widespread debate.

Even before 9/11—notably with the publication of Samuel P Huntington’s “Clash of Civilizations” in 1996—pundits have argued that Islam incites followers to violence on a level that sets it apart from the world’s other major religions.

The November 2015 Paris attacks and the politicking of a U.S. presidential election year—particularly candidate Donald Trump’s call for a ban on Muslims entering the country and President Obama’s response in the State of the Union address last week—have reanimated the dispute in the mainstream media, and proponents and detractors, alike, have marshalled “experts” to validate their positions.

To understand a religion, it’s only logical to begin by examining its literature. And indeed, extensive studies in a variety of academic disciplines are routinely conducted to scrutinize and compare the texts of the world’s great religions.

We thought it would be interesting to bring to bear the sophisticated data mining technology available today through natural language processing and unstructured text analytics to objectively assess the content of these books at the surface level.

So, we’ve conducted a shallow but wide comparative analysis using OdinText to determine with as little bias as possible whether the Quran is really more violent than its Judeo-Christian counterparts.

Part II: Emotional Analysis Reveals Bible is “Angriest”

From part 2:

In my previous post, I discussed our potentially hazardous plan to perform a comparative analysis using an advanced data mining platform—OdinText—across three of the most important texts in human history: The Old Testament, The New Testament and the Quran.

Author’s note: For more details about the data sources and methodology, please see Part I of this series.

The project was inspired by the ongoing public debate around whether or not terrorism connected with Islamic fundamentalism reflects something inherently and distinctly violent about Islam compared to other major religions.

Before sharing the first set of results with you here today, due to the sensitive nature of this topic, I feel obliged to reiterate that this analysis represents only a cursory, superficial view of just the texts, themselves. It is in no way intended to advance any agenda or to conclusively prove anyone’s point.

Part III – Violence, Mercy and Non-Believers – to appear soon.

A comparison that may induce some to learn text/sentiment analysis, but I would view its results with a great deal of caution.

Two of the comments to the first post read:

(comment) If you’re not completing the analysis in the native language, you’re just analyzing the translators’ understanding and interpretation of the texts; this is very different than the actual texts.

(to which a computational linguist replies) Technically, that is certainly true. However, if you are looking at broad categories of sentiment or topic, as this analysis does, there should be little variation in the results between translations, or by using the original. As well, it could be argued that what is most of interest is the viewpoint of the interpreters of the text, hence the translations may be *more* of interest, to some extent. But I would not expect that this analysis would be very sensitive at all to variations in translation or even language.

I find the position taken by the computational linguist almost incomprehensible.

Not only do we lack anything approaching a full social context for any of these texts in their original languages, but terms that occur only once (hapaxes) number approximately 1,300 in the Hebrew Bible and over 3,500 in the New Testament. For a discussion of the Qur’ān, see: Hapaxes in the Qur’ān: identifying and cataloguing lone words (and loanwords) by Shawkat M. Toorawa. Toorawa includes a list of hapaxes for the Qur’ān, a discussion of why they are important, and a comparison to other texts.
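Counting hapaxes is mechanically trivial; their interpretive weight is the hard part. A stdlib sketch on an invented English fragment (real work on Hebrew, Greek, or Arabic requires proper lemmatization, not whitespace tokens):

```python
from collections import Counter

def hapaxes(text):
    """Words occurring exactly once in the text (crude whitespace tokens)."""
    counts = Counter(text.lower().split())
    return sorted(w for w, c in counts.items() if c == 1)

words = hapaxes("in the beginning was the word and the word was light")
```

A word that appears once gives a sentiment or topic model exactly one data point, which is why corpora this hapax-rich resist confident surface-level analysis.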

Here is a quick example of where social context can change how you read a text:

23 The priest is to write these curses on a scroll and then wash them off into the bitter water. 24 He shall have the woman drink the bitter water that brings a curse, and this water will enter her and cause bitter suffering. 25 The priest is to take from her hands the grain offering for jealousy, wave it before the LORD and bring it to the altar. 26 The priest is then to take a handful of the grain offering as a memorial offering and burn it on the altar; after that, he is to have the woman drink the water. 27 If she has defiled herself and been unfaithful to her husband, then when she is made to drink the water that brings a curse, it will go into her and cause bitter suffering; her abdomen will swell and her thigh waste away, and she will become accursed among her people. (Numbers 5:23-27)

Does that sound sexist to you?

Interesting, because a Hebrew Bible professor of mine argued that it is one of the earliest pro-women passages in the text.

Think about the social context. There are no police and no domestic courts; short of retribution from the wife’s family members, there are no constraints on what a husband can do to his wife. Even killing her wasn’t beyond the pale.

Given that context, setting up a test that no one can fail, administered in the presence of a priest (which also deters resorting to a violent remedy), sounds like a way to get the wife out of a dangerous situation: the priest can say, “See, you were jealous for no reason.”

There’s no guarantee that this is the correct interpretation either, but it does accord with present understandings of law and custom at the time. The preservation of order in the community, no mean feat in the absence of an organized police force, mattered a great deal.

The English words used in translations also have their own context, which may be resolved differently from those in the original languages.

As I said, interesting but consider with a great deal of caution.

Text Analysis Without Programming

Sunday, October 18th, 2015

Text Analysis Without Programming by Lynn Cherny.

My favorite line in the slideshow reads:

PDFs are a sad text data reality

The slides give a good overview of a number of simple tools for text analysis.

And Cherny doesn’t skimp on pointing out issues with tools such as word clouds, where she says:

People don’t know what they indicate (and at the bottom of the slide: “But geez do people love them.”)

I suspect her observation on the uncertainty of what word clouds indicate is partially responsible for their popularity.

No matter what conclusion you draw about a word cloud, how could anyone offer a contrary argument?

A coding talk is promised and I am looking forward to it.


„To See or Not to See“…

Saturday, October 17th, 2015

„To See or Not to See“ – an Interactive Tool for the Visualization and Analysis of Shakespeare Plays by Thomas Wilhelm, Manuel Burghardt, and Christian Wolff.


In this article we present a web-based tool for the visualization and analysis of quantitative characteristics of Shakespeare plays. We use resources from the Folger Digital Texts Library as input data for our tool. The Folger Shakespeare texts are annotated with structural markup from the Text Encoding Initiative (TEI). Our tool interactively visualizes which character says what and how much at a particular point in time, allowing customized interpretations of Shakespeare plays on the basis of quantitative aspects, without having to care about technical hurdles such as markup or programming languages.
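The "who says what and how much" measurement falls out of the TEI structure almost directly. A rough stdlib sketch over an invented TEI-like fragment (real Folger markup is namespaced and richer, and the authors used XSLT rather than Python):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy fragment loosely modeled on TEI drama markup: <sp who="..."> per speech.
TEI = """<body>
  <sp who="Hamlet"><l>To be, or not to be, that is the question.</l></sp>
  <sp who="Ophelia"><l>Good my lord.</l></sp>
  <sp who="Hamlet"><l>Words, words, words.</l></sp>
</body>"""

def words_per_speaker(tei_xml):
    """Total word count per speaker, the raw material of the visualization."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for sp in root.iter("sp"):
        text = " ".join(line_el.text for line_el in sp.iter("l"))
        counts[sp.get("who")] += len(text.split())
    return counts

counts = words_per_speaker(TEI)
```

Once you have per-speech counts like these, plotting them over the sequence of speeches gives exactly the kind of "how much, when" picture the tool draws.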

I found the remarkable web tool described in this paper at:

You can easily change plays (menu, top left) but note that “download source” refers to the processed plays themselves, not the XSL/T code that transformed the TEI markup. I think all the display code is JavaScript/CSS so you can scrape that from the webpage. I am more interested in the XSL/T applied to the original markup.

In the paper the authors say that plays may have over “5000 lines of code” for their transformation with XSL/T.

I am very curious whether translating the XSL/T code into XQuery would reduce the amount of code required.

I recently re-wrote the XSLT code for the W3C Bibliography Generator, limited to Recommendations, and the XQuery code was far shorter than the XSLT used by the W3C.

Look for a post on the XQuery I wrote for the W3C bibliography on Monday, 19 October 2015.

If you decide to cite this article:

Wilhelm, T., Burghardt, M. & Wolff, C. (2013). “To See or Not to See” – An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In Franken-Wendelstorf, R., Lindinger, E. & Sieck J. (eds): Kultur und Informatik – Visual Worlds & Interactive Spaces, Berlin (pp. 175-185). Glückstadt: Verlag Werner Hülsbusch.

Two of the resources mentioned in the article:

Folger Digital Texts Library

Text Encoding Initiative (TEI)

Civil War Navies Bookworm

Tuesday, May 19th, 2015

Civil War Navies Bookworm by Abby Mullen.

From the post:

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.
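A Bookworm-style query, a term's frequency over time across the whole corpus, reduces to grouping word counts by date. A toy Python sketch over invented records (this is the idea only, not the Bookworm software or Mullen's pipeline):

```python
from collections import Counter

def term_by_year(records, term):
    """records: (ISO date string, text) pairs; returns {year: count of term}."""
    by_year = Counter()
    for date, text in records:
        year = date[:4]
        by_year[year] += text.lower().split().count(term)
    return dict(by_year)

# Invented sample records standing in for dated Official Records documents.
records = [
    ("1861-07-04", "the blockade of the port continues"),
    ("1862-01-15", "blockade runners sighted off the coast"),
    ("1862-03-02", "no blockade activity reported"),
]
counts = term_by_year(records, "blockade")
```

The sequential-by-date view is what lets you read the whole navy at once instead of volume by volume.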

Before you ask:

The earlier post: Text Analysis on the Documents of the Barbary Wars

More details on Bookworm.

As with all ngram viewers, exercise caution in assuming a text string has uniform semantics across historical, ethnic, or cultural fault lines.

Expand Your Big Data Capabilities With Unstructured Text Analytics

Wednesday, May 6th, 2015

Expand Your Big Data Capabilities With Unstructured Text Analytics by Boris Evelson.

From the post:

Beware of insights! Real danger lurks behind the promise of big data to bring more data to more people faster, better and cheaper. Insights are only as good as how people interpret the information presented to them.

When looking at a stock chart, you can’t even answer the simplest question — “Is the latest stock price move good or bad for my portfolio?” — without understanding the context: Where you are in your investment journey and whether you’re looking to buy or sell.

While structured data can provide some context — like checkboxes indicating your income range, investment experience, investment objectives, and risk tolerance levels — unstructured data sources contain several orders of magnitude more context.

An email exchange with a financial advisor indicating your experience with a particular investment vehicle, news articles about the market segment heavily represented in your portfolio, and social media posts about companies in which you’ve invested or plan to invest can all generate much broader and deeper context to better inform your decision to buy or sell.

A thumbnail sketch of the complexity of extracting value from unstructured data sources. As such a sketch, there isn’t much detail but perhaps enough to avoid paying $2495 for the full report.

Detecting Text Reuse in Nineteenth-Century Legal Documents:…

Thursday, March 12th, 2015

Detecting Text Reuse in Nineteenth-Century Legal Documents: Methods and Preliminary Results by Lincoln Mullen.

From the post:

How can you track changes in the law of nearly every state in the United States over the course of half a century? How can you figure out which states borrowed laws from one another, and how can you visualize the connections among the legal system as a whole?

Kellen Funk, a historian of American law, is writing a dissertation on how codes of civil procedure spread across the United States in the second half of the nineteenth century. He and I have been collaborating on the digital part of this project, which involves identifying and visualizing the borrowings between these codes. The problem of text reuse is a common one in digital history/humanities projects. In this post I want to describe our methods and lay out some of our preliminary results. To get a fuller picture of this project, you should read the four posts that Kellen has written about his project:
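Funk and Mullen describe their actual methods in the linked posts; a common first pass at text reuse is n-gram "shingling" with Jaccard similarity over the shingle sets. A generic sketch of that technique (not necessarily their pipeline), with invented statute-like sentences:

```python
def shingles(text, n=3):
    """Set of overlapping word n-grams ('shingles') in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Shingle-set overlap: 1.0 for identical texts, 0.0 for no shared n-grams."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Invented code-of-procedure sentences differing in one word.
ny = "every action shall be prosecuted in the name of the real party in interest"
ca = "every action must be prosecuted in the name of the real party in interest"
score = jaccard(ny, ca)
```

A high score between sections of two states' codes is exactly the kind of borrowing signal a project like this needs to surface before a human reads the pair closely.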

Quite a remarkable project with many aspects that will be relevant to other projects.

Lincoln doesn’t use the term but this would be called textual criticism, if it were being applied to the New Testament. Of course here, Lincoln and Kellen have the original source document and the date of its adoption. New Testament scholars have copies of copies in no particular order and no undisputed evidence of the original text.

Did I mention that all the source code for this project is on Github?


termsql

Wednesday, March 11th, 2015


From the webpage:

Convert text from a file or from stdin into SQL table and query it instantly. Uses sqlite as backend. The idea is to make SQL into a tool on the command line or in scripts.

Online manual:

So what can it do?

  • convert text/CSV files into sqlite database/table
  • work on stdin data on-the-fly
  • it can be used as swiss army knife kind of tool for extracting information from other processes that send their information to termsql via a pipe on the command line or in scripts
  • termsql can also pipe into another termsql of course
  • you can quickly sort and extract data
  • creates string/integer/float column types automatically
  • gives you the syntax and power of SQL on the command line
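termsql's trick, lines become rows and whitespace-separated fields become columns you can query with SQL, can be imitated with Python's stdlib sqlite3. This is a sketch of the idea, not termsql's implementation:

```python
import sqlite3

def load_lines(lines):
    """Split each line on whitespace into columns COL0..COLn, load into sqlite."""
    rows = [line.split() for line in lines]
    ncols = max(len(r) for r in rows)
    con = sqlite3.connect(":memory:")
    cols = ", ".join(f"COL{i} TEXT" for i in range(ncols))
    con.execute(f"CREATE TABLE tbl ({cols})")
    placeholders = ", ".join(["?"] * ncols)
    for r in rows:
        r = r + [None] * (ncols - len(r))   # pad short lines
        con.execute(f"INSERT INTO tbl VALUES ({placeholders})", r)
    return con

con = load_lines(["alice 42", "bob 17", "carol 99"])
top = con.execute(
    "SELECT COL0 FROM tbl ORDER BY CAST(COL1 AS INTEGER) DESC LIMIT 1"
).fetchone()[0]
```

Everything on the feature list above is a variation on this: stdin becomes the line source, and the SQL does the sorting and extracting.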

Sometimes you need the esoteric and sometimes not!


I first saw this in a tweet by Christophe Lalanne.

TextBlob: Simplified Text Processing

Tuesday, February 24th, 2015

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.


  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Has anyone compared this head to head with NLTK?

Modelling Plot: On the “conversional novel”

Tuesday, January 20th, 2015

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, it’s a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).
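One crude way to make "a greater shift of language over the course of the novel" measurable: split the text into windows and track each window's vocabulary similarity to the opening window. A toy sketch, nothing like Piper's actual model, over an invented before/after text:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two word-count Counters."""
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def drift_from_opening(text, n_windows=4):
    """Similarity of each window's vocabulary to the first window's."""
    words = text.lower().split()
    size = max(len(words) // n_windows, 1)
    windows = [Counter(words[i * size:(i + 1) * size]) for i in range(n_windows)]
    return [cosine(windows[0], w) for w in windows]

# Invented "conversion" text: the vocabulary changes completely at the midpoint.
text = ("sin doubt sin doubt " * 10) + ("grace light grace light " * 10)
sims = drift_from_opening(text, n_windows=2)
```

A novel whose later windows diverge sharply from its opening vocabulary is, on this crude measure, a candidate "conversional" narrative; the interesting work is in everything this sketch leaves out.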

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards, though never a novel: try reading it from the back forwards, a sentence at a time. At least when proofing your own writing, it provides a radically different perspective than the normal front-to-back reading. The first thing you notice is that it interrupts your reading/skimming speed, so you will catch more errors as well as nuances in the text.

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the current “war on terrorism” narrative, so popular and unquestioned in the United States. Ask anyone inside the beltway in D.C. and they will blather on and on about the need to defend against terrorism. Yet there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sort of questions but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

Thoughts on Software Development Python NLTK/Neo4j:…

Saturday, January 10th, 2015

Python NLTK/Neo4j: Analysing the transcripts of How I Met Your Mother by Mark Needham.

From the post:

After reading Emil’s blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data and I thought How I met your mother’s transcripts would be a good place to start.

I found a website which has the transcripts for all the episodes and then having manually downloaded the two pages which listed all the episodes, wrote a script to grab each of the transcripts so I could use them on my machine.

Interesting intermarriage between NLTK and Neo4j. Perhaps even more so if NLTK were used to extract information from dialogue outside of fictional worlds and Neo4j was used to model dialogue roles, etc., as well as relationships and events outside of the dialogue.
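The NLTK-to-Neo4j idea, turning dialogue into a graph of who speaks after whom, can be prototyped without a database: parse "Speaker: line" transcripts and count speaker transitions. A sketch only (Mark's post builds a real Neo4j graph; the transcript lines here are invented):

```python
from collections import Counter

def interaction_edges(transcript_lines):
    """Count (speaker, next_speaker) transitions in a 'Name: line' transcript."""
    speakers = []
    for line in transcript_lines:
        if ":" in line:
            speakers.append(line.split(":", 1)[0].strip())
    # Each adjacent pair of speeches becomes a directed edge with a weight.
    return Counter(zip(speakers, speakers[1:]))

lines = [
    "Ted: Kids, I'm going to tell you an incredible story.",
    "Marshall: You have to listen to this.",
    "Ted: It's the story of how I met your mother.",
    "Marshall: Here we go.",
    "Lily: Again?",
]
edges = interaction_edges(lines)
```

The weighted edge list is exactly what you would load into Neo4j as relationships; for Congressional hearings, the same parse yields a questioner-to-witness graph.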

Congressional hearings (in the U.S., same type of proceedings outside the U.S.) would make an interesting target for analysis using NLTK and Neo4j.

Getting started with text analytics

Friday, January 2nd, 2015

Getting started with text analytics by Chris DuBois.

At GraphLab, we are helping data scientists go from inspiration to production. As part of that goal, we made sure that GraphLab Create is useful for manipulating text data, plugging the results into a machine learning model, and deploying a predictive service.

Text data is useful in a wide variety of applications:

  • Finding key phrases in online reviews that describe an attribute or aspect of a restaurant, product for sale, etc.
  • Detecting sentiment in social media, such as tweets and news article comments.
  • Predicting influential documents in large corpora, such as PubMed abstracts and arXiv articles


So how do data scientists get started with text data? Regardless of the ultimate goal, the first step in text processing is typically feature engineering. We make this work easy to do using GraphLab Create. Examples of features include:

Just in case you get tired of watching conference presentations this weekend, I found this post from early December 2014 that I have been meaning to mention. Take a break from the videos and enjoy working through this post.

Chris promises more posts on data science skills so stay tuned!

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data

Monday, December 15th, 2014

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data by Michael Cieply and Brooks Barnes.

From the article:

Sony Pictures Entertainment warned media outlets on Sunday against using the mountains of corporate data revealed by hackers who raided the studio’s computer systems in an attack that became public last month.

In a sharply worded letter sent to news organizations, including The New York Times, David Boies, a prominent lawyer hired by Sony, characterized the documents as “stolen information” and demanded that they be avoided, and destroyed if they had already been downloaded or otherwise acquired.

The studio “does not consent to your possession, review, copying, dissemination, publication, uploading, downloading or making any use” of the information, Mr. Boies wrote in the three-page letter, which was distributed Sunday morning.

Since I wrote about the foolish accusations against North Korea by Sony, I thought it only fair to warn you that the idlers at Sony have decided to threaten everyone else.

A rather big leap from trash talking about North Korea to accusing the rest of the world of being interested in their incestuous bickering.

I certainly don’t want a copy of their movies, released or unreleased. Too much noise and too little signal for the space they would take. But, since Sony has gotten on its “let’s threaten everybody” hobby-horse, I do hope the location of the Sony documents suddenly appears in many more inboxes. 😉

How would you display choice snippets and those who uttered them when a webpage loads?

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Inappropriate security clearances for management to networks is a sign of poor systems administration. I wonder when that shoe is going to drop?

Pride & Prejudice & Word Embedding Distance

Sunday, November 23rd, 2014

Pride & Prejudice & Word Embedding Distance by Lynn Cherny.

From the webpage:

An experiment: Train a word2vec model on Jane Austen’s books, then replace the nouns in P&P with the nearest word in that model. The graph shows a 2D t-SNE distance plot of the nouns in this book, original and replacement. Mouse over the blue words!

In her blog post, Visualizing Word Embeddings in Pride and Prejudice, Lynn explains more about the project and the process she followed.

From that post:

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word in a model trained on all of Jane Austen’s books’ text. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a “match” they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.

I don’t agree that: “The resulting text is pretty nonsensical.”

True, it’s not Jane Austen’s original text and it is challenging to read, but that may be because our assumptions about Pride and Prejudice and literature in general are being defeated by the similar word replacements.

The lack of familiarity and smoothness of a received text may (no guarantees) enable us to see the text differently than we would on a casual re-reading.
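Lynn's actual pipeline trains word2vec and plots with t-SNE, but the core replacement step (swap each noun for its nearest neighbor in vector space) can be sketched with a toy co-occurrence model. The two-sentence corpus below is invented for illustration:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            ctx = vectors.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    ctx[sent[j]] += 1
    return vectors

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_word(word, vectors):
    """The replacement step: the most similar other word, as (word, score)."""
    return max(((w, cosine(vectors[word], v))
                for w, v in vectors.items() if w != word),
               key=lambda pair: pair[1])

# Toy "corpus": 'lady' and 'woman' occur in identical contexts.
corpus = [["the", "young", "lady", "smiled"],
          ["the", "young", "woman", "smiled"]]
vectors = cooccurrence_vectors(corpus)
replacement, score = nearest_word("lady", vectors)
```

With a real model you would train word2vec (e.g. via gensim) on the full Austen corpus instead; the cosine-distance machinery stays the same.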

What novel corpus would you use for such an experiment?

Classifying Shakespearean Drama with Sparse Feature Sets

Tuesday, October 14th, 2014

Classifying Shakespearean Drama with Sparse Feature Sets by Douglas Duhaime.

From the post:

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: “Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage” (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

A quick reminder that not all text analysis is concerned with 140 character strings. 😉
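Duhaime builds his feature sets from the plays' actual text; as a rough sketch of the "characters on stage" idea (the scene rosters below are invented, not taken from any edition), counting distinct speakers per scene yields exactly this kind of one-number feature:

```python
def speakers_per_scene(scenes):
    """Average number of distinct speakers per scene: one sparse feature."""
    return sum(len(set(scene)) for scene in scenes) / len(scenes)

def guess_genre(scenes, threshold=3.0):
    """Toy rule from Smith's observation: crowded scenes suggest comedy."""
    return "comedy" if speakers_per_scene(scenes) >= threshold else "tragedy"

# Invented speaker rosters, not real play data.
crowded = [["Viola", "Orsino", "Feste", "Maria"],
           ["Toby", "Andrew", "Maria", "Feste"]]
isolated = [["Hamlet"], ["Hamlet", "Horatio"], ["Hamlet"]]
```

A real classifier would feed such per-scene counts, alongside other features, into a supervised model rather than a hand-set threshold.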

Do you prefer:

high dimensional

where every letter in “high dimensional” is a hyperlink with an unknown target or a fuller listing:

Allison, Sarah, and Ryan Heuser, Matthew Jockers, Franco Moretti, Michael Witmore. Quantitative Formalism: An Experiment

Jockers, Matthew. Machine-Classifying Novels and Plays by Genre

Hope, Jonathan and Michael Witmore. “The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches to Shakespeare’s Language of Genre

Hope, Jonathan. Shakespeare by the numbers: on the linguistic texture of the Late Plays

Hope, Jonathan and Michael Witmore. The Very Large Textual Object: A Prosthetic Reading of Shakespeare

Lenthe, Victor. Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

Stumpf, Mike. How Quickly Nature Falls Into Revolt: On Revisiting Shakespeare’s Genres

Stumpf, Mike. This Thing of Darkness (Part III)

Tootalian, Jacob A. Shakespeare, Without Measure: The Rhetorical Tendencies of Renaissance Dramatic Prose

Ullyot, Michael. Encoding Shakespeare

Witmore, Michael. A Genre Map of Shakespeare’s Plays from the First Folio (1623)

Witmore, Michael. Shakespeare Out of Place?

Witmore, Michael. Shakespeare Quarterly 61.3 Figures

Witmore, Michael. Visualizing English Print, 1530-1800, Genre Contents of the Corpus

Decompiling Shakespeare (Site is down. Was also down when the WayBack machine tried to archive the site in July of 2014)

I prefer the longer listing.

If you are interested in Shakespeare, Folger Digital Texts has free XML and PDF versions of his work.

I first saw this in a tweet by Gregory Piatetsky.

Mirrors for Princes and Sultans:…

Monday, October 13th, 2014

Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds by Lisa Blaydes, Justin Grimmer, and Alison McQueen.


Among the most significant forms of political writing to emerge from the medieval period are texts offering advice to kings and other high-ranking officials. Books of counsel varied considerably in their content and form; scholars agree, however, that such texts reflected the political exigencies of their day. As a result, writings in the “mirrors for princes” tradition offer valuable insights into the evolution of medieval modes of governance. While European mirrors (and Machiavelli’s Prince in particular) have been extensively studied, there has been less scholarly examination of a parallel political advice literature emanating from the Islamic world. We compare Muslim and Christian advisory writings from the medieval period using automated text analysis, identify sixty conceptually distinct topics that our method automatically categorizes into three areas of concern common to both Muslim and Christian polities, and examine how they evolve over time. We offer some tentative explanations for these trends.

If you don’t know the phrase, “mirrors for princes,”:

texts that seek to offer wisdom or guidance to monarchs and other high-ranking advisors.

Since nearly all bloggers and everyone with a byline in traditional media considers themselves qualified to offer advice to “…monarchs and other high-ranking advisors,” one wonders how the techniques presented would fare with modern texts?

Certainly a different style of textual analysis than is seen outside the humanities and so instructive for that purpose.

I do wonder about the comparison of texts in translation into English. Obviously easier but runs the risk of comparing translators to translators and not so much the thoughts of the original authors.

I first saw this in a tweet by Christopher Phipps.

Data Sciencing by Numbers:…

Wednesday, September 3rd, 2014

Data Sciencing by Numbers: A Walk-through for Basic Text Analysis by Jason Baldridge.

From the post:

My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” discusses a simple exploration I did into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how you can do it yourself, provided you have access to a Mac or Unix machine.

There are three main components to doing what I did for the blog post:

  • Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
  • Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
  • Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code
I’ll assume you can use the Unix command line at at least a basic level, and I’ve packaged up the topic modeling and language modeling code in the Github repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, by giving it the name “maul” I don’t want to convey that it is important or anything — it is just a name I gave the repository, which is just a wrapper around other people’s code.)
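Jason's walk-through wires together Mallet and BerkeleyLM; as a stand-in for the language-modeling piece (this is a toy, not the BerkeleyLM API), an add-one-smoothed bigram model is only a few lines:

```python
from collections import Counter
from math import log

class BigramModel:
    """Add-one-smoothed bigram language model (a toy BerkeleyLM stand-in)."""
    def __init__(self, lines):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for line in lines:
            tokens = ["<s>"] + line.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab = len(self.unigrams)

    def logprob(self, line):
        """Sum of smoothed bigram log probabilities for a line of text."""
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        return sum(
            log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab))
            for a, b in zip(tokens, tokens[1:]))

# Train on two tiny "titles"; in-domain word order scores higher.
model = BigramModel(["big data tools", "big data pipelines"])
```

Mallet and BerkeleyLM use far better smoothing and scale to real corpora; this only shows the shape of the scoring step.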

    Jason’s post should help get you started doing data exercises. It is up to you if you continue those exercises and branch out to other data and new tools.

    Like everything else, data exploration proficiency requires regular exercise.

    Are you keeping a data exercise calendar?

    I first saw this in a post by Jason Baldridge.

    Titillating Titles:…

    Wednesday, September 3rd, 2014

    Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics by Jason Baldridge.

    From the post:

    The proposals for SXSW 2015 have been posted for several weeks now, and the community portion of the process ends this week on Friday, September 5. As a proposer myself for Are You In A Social Media Experiment?, I’ve been meaning to find a chance to look into the titles and see whether some straight-forward Unix commands, text analytics and natural language processing can reveal anything interesting about them.

    People reportedly put a lot of thought into their titles since that is a big part of getting your proposal noticed in the community part of the voting process for panels. The creators of proposals for SXSW are given lots of feedback, including things like on their titles.

    “Vague, non-descriptive language is a common mistake on titles — but if readers can’t comprehend the basic focus of your proposal without also reading the description, then you probably need to re-think your approach. If you can make the title witty and attention-getting, then wonderful. But please don’t let wit sidetrack you from the more significant goals of simple, accurate and succinct.”

    In short, a title should stand out while remaining informative. It turns out that there has been research in computational linguistics into how to craft memorable quotes that is interesting with respect to standing out. Danescu-Niculescu-Mizil, Cheng, Kleinberg, and Lee’s (2012) “You had me at hello: How phrasing affects memorability” found that memorable movie quotes use less common words built on a scaffold of common syntactic patterns (BTW, the paper itself has great section titles). Chan, Lee and Pang (2014) go to the next step of building a model that predicts which of two versions of a tweet will have a better response (in terms of obtaining retweets) (see the demo).
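The finding that memorable quotes favor less common words suggests a crude first pass at scoring titles: average the rarity of each word against a background corpus. This sketch is nothing like the cited models, and the background counts below are invented:

```python
from collections import Counter
from math import log

def distinctiveness(title, background, total):
    """Mean negative log frequency of the title's words; rarer words score higher."""
    words = title.lower().split()
    return sum(-log((background[w] + 1) / (total + 1)) for w in words) / len(words)

# Invented background corpus word counts, not real frequencies.
background = Counter({"the": 1000, "of": 800, "future": 40, "media": 60,
                      "synergizing": 1, "paradigms": 2})
total = sum(background.values())

plain = distinctiveness("the future of media", background, total)
rare = distinctiveness("synergizing media paradigms", background, total)
```

The actual research also conditions on common syntactic scaffolds; pure rarity, as here, would reward gibberish just as much as wit.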

    Are you ready to take your titles beyond spell-check and grammar correction?

    What if you could check your titles at least to make them more memorable? Would you do it?

    Jason provides an example of how checking your title for “impact” may not be all that far fetched.

    PS: Be sure to try the demo for “better” tweets.

    You Say “Concepts” I Say “Subjects”

    Wednesday, August 27th, 2014

    Researchers are cracking text analysis one dataset at a time by Derrick Harris.

    From the post:

    Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

    As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

    Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.
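The "compared to the usual mention count" idea in the excerpt can be sketched as a log ratio: an entity is salient in an article when its in-document mention rate far exceeds its corpus-wide background rate. All numbers below are invented:

```python
from math import log

def salience(mentions_in_doc, doc_length, background_rate):
    """Log ratio of in-document mention rate to corpus-wide background rate."""
    return log((mentions_in_doc / doc_length) / background_rate)

# Invented numbers: both entities appear 5 times in a 1000-word article,
# but one is common corpus-wide and the other is rare.
common_entity = salience(5, 1000, background_rate=0.004)
rare_entity = salience(5, 1000, background_rate=0.0001)
```

Google's released dataset goes further by leveraging background knowledge of concepts, not just counts, but the ratio captures the baseline intuition.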

    A summary of some of the recent work on recognizing concepts in text and not just key words.

    As topic mappers know, there is no universal one to one correspondence between words and subjects (“concepts” in this article). Finding “concepts” means that whatever words triggered that recognition, we can supply other information that is known about the same concept.

    Certainly will make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.

    Text Coherence

    Tuesday, May 13th, 2014

    Christopher Phipps mentioned Automatic Evaluation of Text Coherence: Models and Representations by Mirella Lapata and Regina Barzilay in a tweet today. Running that article down, I discovered it was published in the proceedings of International Joint Conferences on Artificial Intelligence in 2005.

    Useful but a bit dated.

    A more recent resource: A Bibliography of Coherence and Cohesion, Wolfram Bublitz (Universität Augsburg). Last updated: 2010.

    The Bublitz bibliography is more recent but current bibliography would be even more useful.

    Can you suggest a more recent bibliography on text coherence/cohesion?

    I ask because while looking for such a bibliography, I encountered: Improving Topic Coherence with Regularized Topic Models by David Newman, Edwin V. Bonilla, and, Wray Buntine.

    The abstract reads:

    Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

    I don’t think the “…small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful” is a surprise to anyone. I take that as the traditional “garbage in, garbage out.”

    However, “regularizers” may be useful for automatic/assisted authoring of topics in the topic map sense of the word topic. Assuming you want to mine “small or small and noisy texts.” The authors say the technique should apply to large texts and promise future research on applying “regularizers” to large texts.

    I checked the authors’ recent publications but didn’t see anything I would call a “large” text application of “regularizers.” Open area of research if you want to take the lead.

    GATE 8.0

    Monday, May 12th, 2014

    GATE (general architecture for text engineering) 8.0

    From the download page:

    Release 8.0 (May 11th 2014)

    Most users should download the installer package (~450MB):

    If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

    The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

    Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

    Four major changes in this release:

    1. Requires Java 7 or later to run
    2. Tools for Twitter.
    3. ANNIE (named entity annotation pipeline) Refreshed.
    4. Tools for Crowd Sourcing.

    Not bad for a project that will turn twenty (20) next year!

    More resources:


    Nightly Snapshots

    Mastering a substantial portion of GATE should keep you in nearly constant demand.

    Word Tree [Standard Editor’s Delight]

    Monday, February 24th, 2014

    Word Tree by Jason Davies.

    From the webpage:

    The Word Tree visualisation technique was invented by the incredible duo Martin Wattenberg and Fernanda Viégas in 2007. Read their paper for the full details.

    Be sure to also check out various text analysis projects by Santiago Ortiz

    Created by Jason Davies. Thanks to Mike Bostock for comments and suggestions.

    This is excellent!

    I pasted in the URL from a specification I am reviewing and got this result:


    I then changed the focus to “server” and had this result:


    Granted I need to play with it a good bit more but not bad for throwing a URL at the page.

    I started to say this probably won’t work across multiple texts, in order to check consistency of the documents.

    But, I already have text versions of the files with various formatting and boilerplate stripped out. I could just cat all the files together and then run word tree on the resulting file.

    Would make checking for consistency a lot easier. True, tracking down the inconsistencies will be a pain but that’s going to be true in any event.

    Not feasible to do it manually with 600+ pages of text spread over twelve (12) documents. Well, could if I were in a monastery and had several months to complete the task. 😉
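The concatenate-and-explore idea maps onto a simple data structure: a word tree is just every occurrence of a focus word grouped by the phrase that follows it. A sketch (plain Python, not Davies' D3 code) in which repeated boilerplate surfaces as high-count branches:

```python
from collections import defaultdict

def word_tree(text, focus, depth=3):
    """Group each occurrence of `focus` by the (up to `depth`-word) phrase after it."""
    words = text.lower().split()
    branches = defaultdict(int)
    for i, w in enumerate(words):
        if w == focus:
            branches[tuple(words[i + 1:i + 1 + depth])] += 1
    return dict(branches)

# Two identical boilerplate runs and one variant, as in concatenated spec files.
text = "the server MUST reply the server MUST reply the server MAY log"
tree = word_tree(text, "server")
```

For consistency checking, the low-count branches are the interesting ones: they mark places where nominally identical boilerplate diverges across documents.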

    This also looks like a great data exploration tool for topic map authoring as well.

    I first saw this in a tweet by Elena Glassman.

    Word Storms:…

    Monday, February 24th, 2014

    Word Storms: Multiples of Word Clouds for Visual Comparison of Documents by Quim Castellà and Charles Sutton.


    Word clouds are popular for visualizing documents, but are not as useful for comparing documents, because identical words are not presented consistently across different clouds. We introduce the concept of word storms, a visualization tool for analyzing corpora of documents. A word storm is a group of word clouds, in which each cloud represents a single document, juxtaposed to allow the viewer to compare and contrast the documents. We present a novel algorithm that creates a coordinated word storm, in which words that appear in multiple documents are placed in the same location, using the same color and orientation, across clouds. This ensures that similar documents are represented by similar-looking word clouds, making them easier to compare and contrast visually. We evaluate the algorithm using an automatic evaluation based on document classification, and a user study. The results confirm that a coordinated word storm allows for better visual comparison of documents.

    I never have cared for word clouds all that much but word storms as presented by the authors looks quite useful.

    The paper examines the use of word storms at a corpus, document and single document level.

    You will find Word Storms: Multiples of Word Clouds for Visual Comparison of Documents (website) of particular interest, including its link to GitHub for the source code used in this project.

    Of particular interests for topic mappers is the observation:

    similar documents should be represented by visually similar clouds (emphasis in original)

    Now imagine for a moment visualizing topics and associations with “similar” appearances. Even if limited to colors that are easy to distinguish, that could be a very powerful display/discover tool for topic maps.

    Not the paper’s use case but one that comes to mind with regard to display/discovery in a heterogeneous data set (such as a corpus of documents).
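On the algorithmic side, the coordination property (a shared word rendered identically in every cloud) can be approximated far more crudely than the paper's algorithm by deriving each word's position and color deterministically from the word itself. A sketch under that assumption, not the authors' method:

```python
import hashlib

def word_style(word, width=800, height=600):
    """Deterministic (x, y, color) for a word: identical in every cloud."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    x = int(digest[:4], 16) % width    # x position from the first hash bytes
    y = int(digest[4:8], 16) % height  # y position from the next bytes
    color = "#" + digest[8:14]         # hex color from the bytes after that
    return x, y, color

# The same word lands in the same place, with the same color, in any cloud.
style_a = word_style("pipeline")
style_b = word_style("pipeline")
```

Hashing ignores layout quality entirely (the paper's algorithm also avoids overlaps and sizes words by frequency), but it shows why coordination makes clouds visually comparable.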

    qdap 1.1.0 Released on CRAN [Text Analysis]

    Monday, February 24th, 2014

    qdap 1.1.0 Released on CRAN by Tyler Rinker.

    From the post:

    We’re very pleased to announce the release of qdap 1.1.0

    This is the fourth installment of the qdap package available at CRAN. Major development has taken place since the last CRAN update.

    The qdap package automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse, including frequency counts of sentence types, words, sentence, turns of talk, syllable counts and other assorted analysis tasks. The package provides parsing tools for preparing transcript data but may be useful for many other natural language processing tasks. Many functions enable the user to aggregate data by any number of grouping variables providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text.

    Appropriate for chat rooms, IRC transcripts, plays (the sample data is Romeo and Juliet), etc.
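qdap itself is an R package; as a cross-language sketch of one task it automates, frequency counts of sentence types per speaker reduce to simple bookkeeping over (speaker, utterance) pairs. The transcript below is a made-up fragment:

```python
from collections import Counter, defaultdict

PUNCT_TYPES = {"?": "question", "!": "exclamation"}

def sentence_types(transcript):
    """Per-speaker counts of question/exclamation/statement utterances."""
    counts = defaultdict(Counter)
    for speaker, utterance in transcript:
        # Classify each utterance by its final punctuation mark.
        kind = PUNCT_TYPES.get(utterance.strip()[-1], "statement")
        counts[speaker][kind] += 1
    return counts

# Made-up transcript fragment in (speaker, utterance) form.
transcript = [("ROMEO", "But soft!"),
              ("ROMEO", "What light through yonder window breaks?"),
              ("JULIET", "Deny thy father and refuse thy name.")]
counts = sentence_types(transcript)
```

qdap layers syllable counts, turns of talk, and grouping variables on top of this kind of tally; the point here is only that the underlying data shape is a per-speaker counter.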

    Theory and Applications for Advanced Text Mining

    Monday, October 28th, 2013

    Theory and Applications for Advanced Text Mining edited by Shigeaki Sakurai.

    From the post:

    Book chapters include:

    • Survey on Kernel-Based Relation Extraction by Hanmin Jung, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song
    • Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents by Hidenao Abe
    • Text Clumping for Technical Intelligence by Alan Porter and Yi Zhang
    • A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining by Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino
    • Ontology Learning Using Word Net Lexical Expansion and Text Mining by Hiep Luong, Susan Gauch and Qiang Wang
    • Automatic Compilation of Travel Information from Texts: A Survey by Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa
    • Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques by Masaomi Kimura
    • Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools by David Campos, Sergio Matos and Jose Luis Oliveira
    • Toward Computational Processing of Less Resourced Languages: Primarily Experiments for Moroccan Amazigh Language by Fadoua Ataa Allah and Siham Boulaknadel

    Download the book or the chapters at:

    Is it just me, or have more data mining/analysis books been appearing as open texts alongside traditional print publication than, say, five years ago?

    The Irony of Obamacare:…

    Friday, October 4th, 2013

    The Irony of Obamacare: Republicans Thought of It First by Meghan Foley.

    From the post:

    “An irony of the Patient Protection and Affordable Care Act (Obamacare) is that one of its key provisions, the individual insurance mandate, has conservative origins. In Congress, the requirement that individuals to purchase health insurance first emerged in Republican health care reform bills introduced in 1993 as alternatives to the Clinton plan. The mandate was also a prominent feature of the Massachusetts plan passed under Governor Mitt Romney in 2006. According to Romney, ‘we got the idea of an individual mandate from [Newt Gingrich], and [Newt] got it from the Heritage Foundation.’” – Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach

    That irony led John Wilkerson of the University of Washington and his colleagues David Smith and Nick Stramp to study the legislative history of the health care reform law using a text-analysis system to understand its origins.

    Scholars rely almost exclusively on floor roll call voting patterns to assess partisan cooperation in Congress, according to findings in the paper, Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach. By that standard, the Affordable Care Act was a highly partisan bill. Yet a different story emerges when the source of the reform’s policy is analyzed. The authors’ findings showed that a number of GOP policy ideas overlap with provisions in the Affordable Care Act: Of the 906-page law, 3 percent of the “policy ideas” used wording similar to bills sponsored by House Republicans and 8 percent used wording similar to bills sponsored by Senate Republicans.

    In the paper, the authors say:

    Our approach is to focus on legislative text. We assume that two bills share a policy idea when they share similar text. Of course, this raises many questions about whether similar text does actually capture shared policy ideas. This paper constitutes an early cut at the question.

    The same thinking, similar text = similar ideas, permeates prior art searches on patents as well.
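Wilkerson and colleagues describe their own text reuse method in the paper; a common baseline for "similar text means a shared policy idea" (a sketch, not their approach, with invented bill language) is Jaccard similarity over word n-gram shingles:

```python
def shingles(text, n=4):
    """Set of word n-grams ('shingles') in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=4):
    """Overlap of two texts' shingle sets as a fraction of their union."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Invented bill language: one near-verbatim pair, one unrelated provision.
bill_a = "each individual shall maintain minimum essential health coverage"
bill_b = "every individual shall maintain minimum essential health coverage"
unrelated = "the secretary shall submit an annual report to congress"

reuse = jaccard(bill_a, bill_b)
baseline = jaccard(bill_a, unrelated)
```

Shingling is also the standard first pass in patent prior-art search, which is why the same caveat (shared wording is evidence of, not proof of, a shared idea) applies in both settings.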

    A more fruitful search would be over donor statements, proposals, and literature for similar language/ideas.

    In that regard, members of the United States Congress are just messengers.

    PS: Thanks to Sam Hunting for the pointer to this article!