Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 18, 2019

Data Mining Relevance Practice – Iraq War Report

Filed under: Data Mining,Relevance,Search Data,Text Mining — Patrick Durusau @ 9:47 pm

By now you realize how useless relevancy at the “document” level can be, considering documents can be ten, twenty, hundreds or even thousands of pages long.

Highly relevant “hits” are great, but are you going to read every page of every document?

The main report on the Iraq War, The U.S. Army in the Iraq War – Volume 1: Invasion – Insurgency – Civil War, 2003-2006 and The U.S. Army in the Iraq War — Volume 2: Surge and Withdrawal, 2007-2011, runs to more than 1,400 pages.

Along with the report, nearly 30,000 unclassified documents used in the writing of the report are also available.

Beyond being timely, the advantage for data miners is that the report, while long, is readable, and you know in advance that the ~30,000 documents are relevant to it. Ignoring footnotes (that’s cheating), which documents go with which pages of the report? You can check your answers against the footnotes.

For bonus points, which pages of the ~30,000 documents should go with which pages of the report? The authors weren’t citing entire documents, as some search engines do, but particular pages.

And no, I haven’t loaded these documents but hope to this weekend.

PS: The Army War College Publications office has a remarkable range of very high quality publications.

December 6, 2018

Basic Text [Leaked Email] Processing in R

Filed under: R,Text Mining — Patrick Durusau @ 10:08 am

Basic Text Processing in R by Taylor Arnold and Lauren Tilton.

From Learning Goals:

A substantial amount of historical data is now available in the form of raw, digitized text. Common examples include letters, newspaper articles, personal notes, diary entries, legal documents and transcribed speeches. While some stand-alone software applications provide tools for analyzing text data, a programming language offers increased flexibility to analyze a corpus of text documents. In this tutorial we guide users through the basics of text analysis within the R programming language. The approach we take involves only using a tokenizer that parses text into elements such as words, phrases and sentences. By the end of the lesson users will be able to:

  • employ exploratory analyses to check for errors and detect high-level patterns;
  • apply basic stylometric methods over time and across authors;
  • approach document summarization to provide a high-level description of the
    elements in a corpus.
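
If you want a taste of the tokenizer-based approach before working through the lesson, here is a minimal sketch in R (my example, not the tutorial’s own code; the authors use their own setup):

  library(tokenizers)

  text <- "A substantial amount of historical data is now available in the form of raw, digitized text."

  words <- tokenize_words(text)[[1]]        # lowercased word tokens
  sentences <- tokenize_sentences(text)[[1]]

  length(words)                             # how many word tokens?
  sort(table(words), decreasing = TRUE)     # crude word-frequency table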

The tutorial uses United States Presidential State of the Union Addresses, yawn, as its dataset.

Great tutorial but aren’t there more interesting datasets to use as examples?

Modulo that I haven’t prepared such a dataset or matched it to a tutorial such as this one.

Question: What would make a more interesting dataset than United States Presidential State of the Union Addresses?

“Anything” is not a helpful answer.

Suggestions?

August 1, 2018

Trucks and beer (Music)

Filed under: Music,Text Analytics,Text Mining — Patrick Durusau @ 6:13 pm

Trucks and beer by John W. Miller.

From the post:

Inspired by a post on Big-ish Data, I’ve started working on a textual analysis of popular country music.

More specifically, I scraped Ranker.com for a list of the top female and male country artists of the last 100 years and used my python wrapper for the Genius API to download the lyrics to each song by every artist on the list. After my script ran for about six hours I was left with the lyrics to 12,446 songs by 83 artists stored in a 105 MB JSON file. As a bit of an outsider to the world of country music, I was curious whether some of the preconceived notions I had about the genre were true.

Some pertinent questions:

  • Which artist mentions trucks in their songs most often?
  • Does an artist’s affinity for trucks predict any other features? Their gender for example? Or their favorite drink?
  • How has the genre’s vocabulary changed over time?
  • Of all the artists, whose language is most diverse? Whose is most repetitive?

You can find my code for this project on GitHub.

Miller focuses on popular country music but the lesson here could be applied to any collection of lyrics.
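
If you want to try the same sort of question on your own collection, here is a hedged R sketch (not Miller’s code); it assumes a hypothetical data frame named lyrics with columns artist, song and text, one row per song:

  library(dplyr)
  library(stringr)

  # 'lyrics' is a hypothetical data frame: artist, song, text (one row per song)
  truck_counts <- lyrics %>%
    mutate(trucks = str_count(str_to_lower(text), "\\btrucks?\\b")) %>%
    group_by(artist) %>%
    summarise(songs = n(), truck_mentions = sum(trucks)) %>%
    arrange(desc(truck_mentions))

  head(truck_counts)   # artists who mention trucks most often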

What’s your favorite genre or group?

Here’s a history/data question: Does popular (for some definition of popular) music change before revolutions? If so, in what way?

While you are at Miller’s site, browse around. There are a number of interesting posts in addition to this one.

April 30, 2018

Examining POTUS Executive Orders [Tweets < Executive Orders < Cern Data]

Filed under: Government Data,R,Text Mining,Texts — Patrick Durusau @ 8:12 pm

Examining POTUS Executive Orders by Bob Rudis.

From the post:

This week’s edition of Data is Plural had two really fun data sets. One is serious fun (the first comprehensive data set on U.S. evictions), and the other I knew about but had forgotten: the Federal Register Executive Order (EO) data set(s).

The EO data is also comprehensive as the summary JSON (or CSV) files have links to more metadata and even more links to the full-text in various formats.

What follows is a quick post to help bootstrap folks who may want to do some tidy text mining on this data. We’ll look at EOs-per-year (per-POTUS) and also take a look at the “top 5 ‘first words’” in the titles of the EOs (also by POTUS).

My estimate of the importance of executive orders by American Presidents, “Tweets < Executive Orders < Cern Data,” is only an approximation.

Rudis leaves you plenty of room to experiment with R and processing the text of executive orders.
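
A rough sketch of the kind of experiment Rudis invites, not his code: it assumes you have saved the Federal Register EO metadata as a local CSV with columns president, signing_date and title (check the real export’s column names before running):

  library(dplyr)
  library(readr)
  library(stringr)

  eo <- read_csv("executive_orders.csv")   # assumed local export of EO metadata

  # Executive orders per year, per president
  eo %>%
    mutate(year = format(as.Date(signing_date), "%Y")) %>%
    count(president, year, name = "orders")

  # Top five "first words" of EO titles, per president
  eo %>%
    mutate(first_word = word(str_to_lower(title), 1)) %>%
    count(president, first_word, sort = TRUE) %>%
    group_by(president) %>%
    slice_head(n = 5)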

Enjoy!

December 27, 2017

Game of Thrones DVDs for Christmas?

Filed under: R,Text Mining — Patrick Durusau @ 10:40 am

Mining Game of Thrones Scripts with R by Gokhan Ciflikli

If you are serious about defeating all comers to Game of Thrones trivia, then you need to know the scripts cold. (sorry)

Ciflikli introduces you to the quanteda package and the analysis of the Game of Thrones scripts in a single post, saying:

I meant to showcase the quanteda package in my previous post on the Weinstein Effect but had to switch to tidytext at the last minute. Today I will make good on that promise. quanteda is developed by Ken Benoit and maintained by Kohei Watanabe – go LSE! On that note, the first 2018 LondonR meeting will be taking place at the LSE on January 16, so do drop by if you happen to be around. quanteda v1.0 will be unveiled there as well.

Given that I have already used the data I had in mind, I have been trying to identify another interesting (and hopefully less depressing) dataset for this particular calling. Then it snowed in London, and the dire consequences of this supernatural phenomenon were covered extensively by the r/CasualUK/. One thing led to another, and before you know it I was analysing Game of Thrones scripts:

2018, with its mid-term congressional elections, will be a big year for leaked emails and documents, in addition to the usual follies of government.

Text mining/analysis skills you gain with the Game of Thrones scripts will be in high demand by partisans, investigators, prosecutors, just about anyone you can name.

From the quanteda documentation site:


quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what constitute the documents and features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
… (emphasis in original)
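
To make the corpus-to-matrix step concrete, here is a minimal quanteda sketch with placeholder text (my example, not Ciflikli’s; got_lines stands in for a character vector of script lines):

  library(quanteda)

  got_lines <- c("Winter is coming.",
                 "A Lannister always pays his debts.",
                 "The night is dark and full of terrors.")

  corp  <- corpus(got_lines)
  toks  <- tokens(corp, remove_punct = TRUE)
  dfmat <- dfm_remove(dfm(toks), stopwords("english"))

  topfeatures(dfmat, 10)   # most frequent features across the corpus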

Once you follow the analysis of the Game of Thrones scripts, what other texts or features of quanteda will catch your eye?

Enjoy!

November 8, 2017

eTRAP (electronic Text Reuse Acquisition Project) [Motif Identities]

Filed under: Text Analytics,Text Mining,Texts — Patrick Durusau @ 10:49 am

eTRAP (electronic Text Reuse Acquisition Project)

From the webpage:

As the name suggests, this interdisciplinary team studies the linguistic and literary phenomenon that is text reuse with a particular focus on historical languages. More specifically, we look at how ancient authors copied, alluded to, paraphrased and translated each other as they spread their knowledge in writing. This early career research group seeks to provide a basic understanding of the historical text reuse methodology (it being distinct from plagiarism), and so to study what defines text reuse, why some people reuse information, how text is reused and how this practice has changed over history. We’ll be investigating text reuse on big data or, in other words, datasets that, owing to their size, cannot be manually processed.

While primarily geared towards research, the team also organises events and seminars with the aim of learning more about the activities conducted by our scholarly communities, to broaden our network of collaborations and to simply come together to share our experiences and knowledge. Our Activities page lists our events and we provide project updates via the News section.

Should you have any comments, queries or suggestions, feel free to contact us!

A bit more specifically, the project Digital Breadcrumbs of Brothers Grimm is described in part as:

Described as “a great monument to European literature” (David and David, 1964, p. 180), Jacob and Wilhelm Grimm’s masterpiece Kinder- und Hausmärchen has captured adult and child imagination for over 200 years. International cinema, literature and folklore have borrowed and adapted the brothers’ fairy tales in multifarious ways, inspiring themes and characters in numerous cultures and languages.

Despite being responsible for their mainstream circulation, the brothers were not the minds behind all fairy tales. Indeed, Jacob and Wilhelm themselves collected and adapted their stories from earlier written and oral traditions, some of them dating back to as far as the seventh century BC, and made numerous changes to their own collection (ibid., p. 183) producing seven distinct editions between 1812 and 1857.

The same tale often appears in different forms and versions across cultures and time, making it an interesting case-study for textual and cross-lingual comparisons. Is it possible to compare the Grimm brothers’ Snow White and the Seven Dwarves to Pushkin’s Tale of the Dead Princess and the Seven Knights? Can we compare the Grimm brothers’ version of Cinderella to Charles Perrault’s Cinderella? In order to do so it is crucial to find those elements that both tales have in common. Essentially, one must find those measurable primitives that, if present in a high number – and in a similar manner – in both texts, make the stories comparable. We identify these primitives as the motifs of a tale. Prince’s Dictionary of Narratology describes motifs as “…minimal thematic unit[s]”, which can be recorded and have been recorded in the Thompson Motif-index. Hans-Jörg Uther, who expanded the Aarne-Thompson classification system (AT number system) in 2004, defined a motif as:

“…a broad definition that enables it to be used as a basis for literary and ethnological research. It is a narrative unit, and as such is subject to a dynamic that determines with which other motifs it can be combined. Thus motifs constitute the basic building blocks of narratives.” (Uther, 2004)

From a topic maps perspective, what do you “see” in a tale that supports your identification of one or more motifs?

Or for that matter, how do you search across multiple identifications of motifs to discover commonalities between identifications by different readers?

It’s all well and good to tally which motifs were identified by particular readers, but clues as to why they differ require more detail (read subjects).

Unlike the Panama Papers and Paradise Papers data held by the International Consortium of Investigative Journalists (ICIJ), the eTRAP data is available on GitHub.

There are only three stories, Snow White, Puss in Boots, and Fisherman and his Wife, in the data repository as of today.

August 7, 2017

Applications of Topic Models [Monograph, Free Until 12 August 2017]

Filed under: Latent Dirichlet Allocation (LDA),Text Mining,Topic Models (LDA) — Patrick Durusau @ 10:48 am

Applications of Topic Models by Jordan Boyd-Graber, Yuening Hu, and David Mimno. (Jordan Boyd-Graber, Yuening Hu and David Mimno (2017), “Applications of Topic Models”, Foundations and Trends® in Information Retrieval: Vol. 11: No. 2-3, pp. 143-296. http://dx.doi.org/10.1561/1500000030)

Abstract:

How can a single person understand what’s going on in a collection of millions of documents? This is an increasingly common problem: sifting through an organization’s e-mails, understanding a decade worth of newspapers, or characterizing a scientific field’s research. Topic models are a statistical framework that help users understand large document collections: not just to find individual documents but to understand the general themes present in the collection.

This survey describes the recent academic and industrial applications of topic models with the goal of launching a young researcher capable of building their own applications of topic models. In addition to topic models’ effective application to traditional problems like information retrieval, visualization, statistical inference, multilingual modeling, and linguistic understanding, this survey also reviews topic models’ ability to unlock large text collections for qualitative analysis. We review their successful use by researchers to help understand fiction, non-fiction, scientific publications, and political texts.

The authors discuss the use of topic models in chapters on 4. Historical Documents, 5. Understanding Scientific Publications, 6. Fiction and Literature, 7. Computational Social Science, and 8. Multilingual Data and Machine Translation, and provide further guidance in 9. Building a Topic Model.
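
The survey is tool-agnostic, but if you want to fit a first topic model while reading it, here is a hedged R sketch using the topicmodels package and its bundled AssociatedPress document-term matrix:

  library(topicmodels)

  data("AssociatedPress", package = "topicmodels")   # small bundled DTM

  ap_lda <- LDA(AssociatedPress[1:200, ], k = 5, control = list(seed = 1234))

  terms(ap_lda, 5)   # top five terms per topic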

If you have haystacks of documents to mine, Applications of Topic Models is a must have on your short reading list.

May 4, 2017

Text Mining For Lawyers (The 55% Google Weaned Lawyers Are Missing)

Filed under: eDiscovery,Law,Searching,Text Mining — Patrick Durusau @ 1:52 pm

Working the Mines: How Text Mining Can Help Create Value for Lawyers by Rees Morrison, Juris Datoris, Legaltech News.

From the post:

To most lawyers, text mining may sound like a magic wand or more hype regarding “artificial intelligence.” In fact, with the right input, text mining is a well-grounded genre of software that can find patterns and insights from large amounts of written material. So, if your law firm or law department has a sizable amount of text from various sources, it can extract value from that collection through powerful software tools.

To help lawyers recognize the potential of text mining and demystify it, this article digs through typical steps of a project. Terms of art related to this domain of software are in bold and, yes, there will be a quiz at the end.

Our example project assumes that your law firm (or law department) has gathered a raft of written comments through an internal survey of lawyers or from clients who have typed their views in a client satisfaction survey (perhaps in response to an open-ended question like “In what ways could we improve?”). All that writing is grist for the mill of text mining!

Great overview of the benefits and complexities of text mining!

I was recently assured by a Google-weaned lawyer that natural language searching enabled him and his friends to do a few quick searches to find relevant authorities.

I could not help but point out my review of Blair and Maron’s work, which demonstrated that while attorneys estimated they had recovered 75% of the relevant documents, in fact they had recovered barely 20%.

No solution returns 100% of the relevant documents for any non-trivial dataset, but leaving 55% on the floor doesn’t inspire confidence.

Especially when searchers count any relevant result as success. Whether it is depends on how many relevant authorities existed and whether any were closer to your facts than those found, among other things.

Is a relevant result your test for research success, or is it the best relevant result, with a measure of confidence in its quality?

February 16, 2017

“Tidying” Up Jane Austen (R)

Filed under: Literature,R,Text Mining — Patrick Durusau @ 9:29 am

Text Mining the Tidy Way by Julia Silge.

Thanks to Julia’s presentation I now know there is an R package with all of Jane Austen’s novels ready for text analysis.

OK, Austen may not be at the top of your reading list, but the Tidy techniques Julia demonstrates are applicable to a wide range of textual data.

Among those mentioned in the presentation, NASA datasets!
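
A minimal sketch of the tidy approach Julia demonstrates, using the janeaustenr package (my condensation, so check her slides for the full treatment):

  library(janeaustenr)
  library(tidytext)
  library(dplyr)

  austen_books() %>%
    unnest_tokens(word, text) %>%            # one token per row
    anti_join(stop_words, by = "word") %>%   # drop common stopwords
    count(book, word, sort = TRUE)           # most frequent words per novel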

Julia, along with Dave Robinson, wrote: Text Mining with R: A Tidy Approach, available online now and later this year from O’Reilly.

January 12, 2017

Stanford CoreNLP – a suite of core NLP tools (3.7.0)

Filed under: Natural Language Processing,Stanford NLP,Text Analytics,Text Mining — Patrick Durusau @ 9:16 pm

Stanford CoreNLP – a suite of core NLP tools

The beta is over and Stanford CoreNLP 3.7.0 is on the street!

From the webpage:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Available interfaces for most major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

What stream of noise, sorry, news are you going to pipe into the Stanford CoreNLP framework?

😉

Imagine a web service that offers levels of analysis alongside news text.

Or does the same with leaked emails and/or documents?

November 19, 2016

How to get superior text processing in Python with Pynini

Filed under: FSTs,Journalism,News,Python,Reporting,Text Mining — Patrick Durusau @ 9:35 pm

How to get superior text processing in Python with Pynini by Kyle Gorman and Richard Sproat.

From the post:

It’s hard to beat regular expressions for basic string processing. But for many problems, including some deceptively simple ones, we can get better performance with finite-state transducers (or FSTs). FSTs are simply state machines which, as the name suggests, have a finite number of states. But before we talk about all the things you can do with FSTs, from fast text annotation—with none of the catastrophic worst-case behavior of regular expressions—to simple natural language generation, or even speech recognition, let’s explore what a state machine is, and what state machines have to do with regular expressions.

Reporters, researchers and others will face a 2017 where the rate of information has increased, along with noise from media spasms over the latest taunt from president-elect Trump.

Robust text mining/filtering will be among your daily necessities, if it isn’t already.

Tagging text is the first example. Think about auto-generating graphs from emails with “to:,” “from:,” “date:,” and key terms in the email. Tagging the key terms is essential to that process.

Once tagged, you can slice and dice the text as more information is uncovered.

Interested?

November 3, 2016

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We’re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Interfaces available for various major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

  • It’s copy-n-paste, you didn’t have to write it
  • It’s an appeal to authority (Stanford)
  • It’s truthful

The truthful point is a throw-away these days but I thought I should mention it. 😉

October 1, 2016

Nuremberg Trial Verdicts [70th Anniversary]

Filed under: Text Analytics,Text Extraction,Text Mining,Texts,TF-IDF — Patrick Durusau @ 8:46 pm

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

January 5, 2016

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction [Gatekeeping]

Filed under: History,R,Text Mining — Patrick Durusau @ 7:43 pm

Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction by Cameron Blevins and Lincoln Mullen.

Abstract:

This article describes a new method for inferring the gender of personal names using large historical datasets. In contrast to existing methods of gender prediction that treat names as if they are timelessly associated with one gender, this method uses a historical approach that takes into account how naming practices change over time. It uses historical data to measure the likelihood that a name was associated with a particular gender based on the time or place under study. This approach generates more accurate results for sources that encompass changing periods of time, providing digital humanities scholars with a tool to estimate the gender of names across large textual collections. The article first describes the methodology as implemented in the gender package for the R programming language. It goes on to apply the method to a case study in which we examine gender and gatekeeping in the American historical profession over the past half-century. The gender package illustrates the importance of incorporating historical approaches into computer science and related fields.

An excellent introduction to the gender package for R, historical grounding of the detection of gender by name, with the highlight of the article being the application of this technique to professional literature in American history.
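
A minimal sketch of the historical approach the article describes (the year ranges here are purely illustrative, and the ssa method may prompt you to install the companion genderdata package):

  library(gender)

  # The same names, measured against different historical periods
  gender(c("leslie", "madison"), years = c(1930, 1950), method = "ssa")
  gender(c("leslie", "madison"), years = c(1990, 2010), method = "ssa")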

It isn’t uncommon to find statistical techniques applied to texts whose authors and editors are beyond the reach of any critic or criticism.

It is less than common to find statistical techniques applied to extant members of a profession.

Kudos to both Blevins and Mullen for refining the detection of gender and for applying that refinement to publishing in American history.

November 14, 2015

Querying Biblical Texts: Part 1 [Humanists Take Note!]

Filed under: Bible,Text Mining,XML,XQuery — Patrick Durusau @ 5:13 pm

Querying Biblical Texts: Part 1 by Jonathan Robie.

From the post:

This is the first in a series on querying Greek texts with XQuery. We will also look at the differences among various representations of the same text, starting with the base text, morphology, and three different treebank formats. As we will see, the representation of a text indicates what the producer of the text was most interested in, and it determines the structure and power of queries done on that particular representation. The principles discussed here also apply to other languages.

This is written as a tutorial, and it can be read in two ways. The first time through, you may want to simply read the text. If you want to really learn how to do this yourself, you should download an XQuery processor and some data (in your favorite biblical language) and try these queries and variations on them.

Humanists need to follow this series and pass it along to others.

Texts of interest to you will vary but the steps Jonathan covers are applicable to all texts (well, depending upon your encoding).

In exchange for learning a little XQuery, you can gain a good degree of mastery over XML encoded texts.

Enjoy!

November 10, 2015

Editors’ Choice: An Introduction to the Textreuse Package [+ A Counter Example]

Filed under: R,Similarity,Similarity Retrieval,Text Mining — Patrick Durusau @ 5:58 pm

Editors’ Choice: An Introduction to the Textreuse Package by Lincoln Mullen.

From the post:

A number of problems in digital history/humanities require one to calculate the similarity of documents or to identify how one text borrows from another. To give one example, the Viral Texts project, by Ryan Cordell, David Smith, et al., has been very successful at identifying reprinted articles in American newspapers. Kellen Funk and I have been working on a text reuse problem in nineteenth-century legal history, where we seek to track how codes of civil procedure were borrowed and modified in jurisdictions across the United States.

As part of that project, I have recently released the textreuse package for R to CRAN. (Thanks to Noam Ross for giving this package a very thorough open peer review for rOpenSci, to whom I’ve contributed the package.) This package is a general purpose implementation of several algorithms for detecting text reuse, as well as classes and functions for investigating a corpus of texts. Put most simply, full text goes in and measures of similarity come out. (emphasis added)

Kudos to Lincoln on this important contribution to the digital humanities! Not to mention the package will also be useful for researchers who want to compare the “similarity” of texts as “subjects” for purposes of elimination of duplication (called merging in some circles) for presentation to a reader.

I highlighted

Put most simply, full text goes in and measures of similarity come out.

to offer a cautionary tale about the assumption that a high measure of similarity is an indication of the “source” of a text.

Louisiana, my home state, is the only civilian jurisdiction in the United States. Louisiana law, more at one time than now, is based upon Roman law.

Roman law and laws based upon it have a very deep and rich history that I won’t even attempt to summarize.

It is sufficient for present purposes to say the Digest of the Civil Laws now in Force in the Territory of Orleans (online version, English/French) was enacted in 1808.

A scholarly dispute arose (1971-1972) between Professor Batiza (Tulane), who considered the Digest to reflect the French civil code, and Professor Pascal (LSU), who argued that, despite quoting the French civil code quite liberally, the redactors intended to codify the Spanish civil law in force at the time of the Louisiana Purchase.

The Batiza vs. Pascal debate was carried out at length and in public:

Batiza, The Louisiana Civil Code of 1808: Its Actual Sources and Present Relevance, 46 TUL. L. REV. 4 (1971); Pascal, Sources of the Digest of 1808: A Reply to Professor Batiza, 46 TUL.L.REV. 603 (1972); Sweeney, Tournament of Scholars over the Sources of the Civil Code of 1808, 46 TUL. L. REV. 585 (1972); Batiza, Sources of the Civil Code of 1808, Facts and Speculation: A Rejoinder, 46 TUL. L. REV. 628 (1972).

I could not find any freely available copies of those articles online. (Don’t encourage paywalls by paying to access such material. Find it at your local law library.)

There are a couple of secondary articles that discuss the dispute: A.N. Yiannopoulos, The Civil Codes of Louisiana, 1 CIV. L. COMMENT. 1, 1 (2008) at http://www.civil-law.org/v01i01-Yiannopoulos.pdf, and John W. Cairns, The de la Vergne Volume and the Digest of 1808, 24 Tulane European & Civil Law Forum 31 (2009), which are freely available online.

You won’t get the full details from the secondary articles but they do capture some of the flavor of the original dispute. I can report (happily) that over time, Pascal’s position has prevailed. Textual history is more complex than rote counting techniques can capture.

A far more complex case of “text similarity” than Lincoln addresses in the Textreuse package, but once you move beyond freshman/doctoral plagiarism, the “interesting cases” are all complicated.
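
If you want to experiment before tackling anything as tangled as the Digest of 1808, here is a minimal textreuse sketch with toy strings (a starting point only, not Lincoln’s code):

  library(textreuse)

  docs <- c(code_a = "The law of the place where the contract was made governs its form.",
            code_b = "The form of a contract is governed by the law of the place where it was made.")

  corpus <- TextReuseCorpus(text = docs, tokenizer = tokenize_ngrams, n = 3)

  pairwise_compare(corpus, jaccard_similarity)   # word three-gram similarity matrix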

October 28, 2015

Text Mining Meets Neural Nets: Mining the Biomedical Literature

Filed under: Bioinformatics,Neural Networks,Semantics,Text Mining — Patrick Durusau @ 4:28 pm

Text Mining Meets Neural Nets: Mining the Biomedical Literature by Dan Sullivan.

From the webpage:

Text mining and natural language processing employ a range of techniques from syntactic parsing, statistical analysis, and more recently deep learning. This presentation presents recent advances in dense word representations, also known as word embedding, and their advantages over sparse representations, such as the popular term frequency-inverse document frequency (tf-idf) approach. It also discusses convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Reference papers and tools are included for those interested in further details. Examples are drawn from the bio-medical domain.

Basically an abstract for the 58 slides you will find here: http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets.

The best thing about these slides is the wealth of additional links to other resources. There is only so much you can say on a slide so links to more details should be a standard practice.

Slide 53, “Formalize a Mathematical Model of Semantics,” seems a bit ambitious to me, considering mathematics is a subset of natural language. It is difficult to see how the lesser could model the greater.

You could create a mathematical model of some semantics and say it was all that is necessary, but that’s been done before. Always strive to make new mistakes.

October 18, 2015

Text Analysis Without Programming

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 8:53 pm

Text Analysis Without Programming by Lynn Cherny.

My favorite line in the slideshow reads:

PDFs are a sad text data reality

The slides give a good overview of a number of simple tools for text analysis.

And Cherny doesn’t skimp on pointing out issues with tools such as word clouds, where she says:

People don’t know what they indicate (and at the bottom of the slide: “But geez do people love them.”)

I suspect her observation on the uncertainty of what word clouds indicate is partially responsible for their popularity.

No matter what conclusion you draw about a word cloud, how could anyone offer a contrary argument?

A coding talk is promised and I am looking forward to it.

Enjoy!

October 17, 2015

Document Summarization via Markov Chains

Filed under: Algorithms,Markov Decision Processes,Summarization,Text Mining — Patrick Durusau @ 12:58 pm

Document Summarization via Markov Chains by Atabey Kaygun.

From the post:

Description of the problem

Today’s question is this: we have a long text and we want a machine generated summary of the text. Below, I will describe a statistical (hence language agnostic) method to do just that.

Sentences, overlaps and Markov chains.

In my previous post I described a method to measure the overlap between two sentences in terms of common words. Today, we will use the same measure, or a variation, to develop a discrete Markov chain whose nodes are labeled by individual sentences appearing in our text. This is essentially page rank applied to sentences.

Atabey says the algorithm (code supplied) works well on:

news articles, opinion pieces and blog posts.

Not so hot on Supreme Court decisions.

In commenting on a story from the New York Times, Obama Won’t Seek Access to Encrypted User Data, Atabey notes that we have no referent for “what frustrated him” in the text summary.

If you consider the relevant paragraph from the New York Times story:

Mr. Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. What frustrated him was that Apple had designed the system to ensure that the company never held on to the keys, putting them entirely in the hands of users through the codes or fingerprints they use to get into their phones. As a result, if Apple is handed a court order for data — until recently, it received hundreds every year — it could not open the coded information.

The reference is clear. Several other people are mentioned in the New York Times article but none rank high enough to appear in the summary.

Not a sure bet, but with testing, try attributing such references to the people who rank high enough to appear in the summary.
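
For readers who want to try the idea without leaving R, here is a rough sketch of the sentence-overlap-plus-PageRank approach (my approximation, not Kaygun’s implementation):

  library(tokenizers)
  library(igraph)

  text <- paste("Comey had expressed alarm a year ago.",
                "Apple introduced an operating system that encrypted the iPhone.",
                "Apple designed the system so the company never held the keys.",
                "Users hold the keys through codes or fingerprints.")

  sents <- tokenize_sentences(text)[[1]]
  toks  <- tokenize_words(sents)

  # Edge weight = number of words two sentences share
  overlap <- outer(seq_along(toks), seq_along(toks),
                   Vectorize(function(i, j) length(intersect(toks[[i]], toks[[j]]))))
  diag(overlap) <- 0

  g    <- graph_from_adjacency_matrix(overlap, mode = "undirected", weighted = TRUE)
  rank <- page_rank(g)$vector

  sents[order(rank, decreasing = TRUE)][1:2]   # two highest-ranked sentences as the summary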

October 5, 2015

International Hysteria Over American Gun Violence

Filed under: Data Mining,News,Text Mining — Patrick Durusau @ 7:56 pm

Australia’s call for a boycott on U.S. travel until gun reform is passed may be the high point of the international hysteria over gun violence in the United States. Or it may not be. Hard to say at this point.

Social media has been flooded with hand wringing over the loss of “innocent” lives, etc., you know the drill.

The victims in Oregon were no doubt “innocent,” but innocence alone isn’t the criterion by which “mass murder” is judged.

At least not according to the United States government, other Western governments, and their affiliated news organizations.

Take the Los Angeles Times for example, which has an updated list of mass shootings, 1984 – 2015.

Or the breathless prose of The Chicagoist in Chicago Dominates The U.S. In Mass Shootings Count.

Based on data compiled by the crowd-sourced Mass Shooting Tracker site, the Guardian discovered that there were 994 mass shootings—defined as an incident in which four or more people are shot—in 1,004 days since Jan. 1, 2013. The Oregon shooting happened on the 274th day of 2015 and was the 294th mass shooting of the year in the U.S.

Some 294 mass shootings since January 1, 2015 in the U.S.?

Chump change my friend, chump change.

No disrespect to the innocent dead, the wounded, or their grieving families, but as I said, innocence isn’t the criterion for judging mass violence. Not by Western governments, not by the Western press.

You will have to do a little data mining to come to that conclusion but if you have the time, follow along.

First, of course, we have to find acts of violence that gave no warning to their innocent victims, who were just going about their lives. At least until pain and death came raining out of the sky.

Let’s start with Operation Inherent Resolve: Targeted Operations Against ISIL Terrorists.

If you select a country name (your options are Syria and Iraq), a pop-up will display the latest news briefing on “Airstrikes in Iraq and Syria.” Under the current summary, you will see “View Information on Previous Airstrikes.”

Selecting “View Information on Previous Airstrikes” will give you a very long drop down page with previous air strike reports. It doesn’t list human casualties or the number of bombs dropped, but it does recite the number of airstrikes.

Capture that information down to January 1, 2015 and save it to a text file. I have already captured it and you can download us-airstrikes-iraq-syria.txt.

You will notice that the file has text other than the air strikes, but air strikes are reported in a common format:

 - Near Al Hasakah, three strikes struck three separate ISIL tactical units 
   and destroyed three ISIL structures, two ISIL fighting positions, and an 
   ISIL motorcycle.
 - Near Ar Raqqah, one strike struck an ISIL tactical unit.
 - Near Mar’a, one strike destroyed an ISIL excavator.
 - Near Washiyah, one strike damaged an ISIL excavator.

Your first task is to extract just the lines that start with: “- Near” and save them to a file.

I used: grep '\- Near' us-airstrikes-iraq-syria.txt > us-airstrikes-iraq-syria-strikes.txt

Since I now have all the lines with airstrike count data, how do I add up all the numbers?

I am sure there is an XQuery solution but it’s throw-away data, so I took the easy way out:

grep 'one airstrike' us-airstrikes-iraq-syria-strikes.txt | wc -l

Which gave me a count of all the lines with “one airstrike,” or 629 if you are interested.

Just work your way up through “ten airstrikes” and after that, nothing but zeroes. Multiply the number of lines by the number in the search expression and you have the number of airstrikes for that count. One I found was 132 lines for “four airstrikes,” so that was 528 airstrikes for that count.

Oh, I forgot to mention, some of the reports don’t use names for numbers but digits. Yeah, inconsistent data.

The dirty answer to that was:

grep '[0-9] airstrikes' us-airstrikes-iraq-syria-strikes.txt > us-airstrikes-iraq-syria-strikes-digits.txt

The “[0-9]” detects any digit between zero and nine. I could have matched a two-digit number, but any two-digit number starts with one digit, so why bother?

Anyway, that found another 305 airstrikes that were reported in digits.
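
If you would rather not multiply by hand, a rough R sketch can do the whole tally from the same extracted file, handling both spelled-out numbers and digits (assuming the file and wording shown above):

  library(stringr)

  lines <- readLines("us-airstrikes-iraq-syria-strikes.txt")

  word_nums <- c(one = 1, two = 2, three = 3, four = 4, five = 5,
                 six = 6, seven = 7, eight = 8, nine = 9, ten = 10)

  spelled <- str_match(lines, "\\b(one|two|three|four|five|six|seven|eight|nine|ten) (?:air)?strikes?\\b")[, 2]
  digits  <- str_match(lines, "\\b([0-9]+) (?:air)?strikes?\\b")[, 2]

  counts <- ifelse(!is.na(spelled), word_nums[spelled],
                   ifelse(!is.na(digits), as.numeric(digits), 0))

  sum(counts)   # total airstrikes reported in the file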

Ah, total number of airstrikes, not bombs but airstrikes since January 1, 2015?

4,207 airstrikes as of today.

That’s four thousand, two hundred and seven times (a minimum, since each airstrike involves more than one bomb) that innocent civilians may have been murdered or at least terrorized by violence falling out of the sky.

Those 4,207 events were not the work of marginally functional, disturbed or troubled individuals. No, those events were orchestrated by highly trained, competent personnel, backed by the largest military machine on the planet and a correspondingly large military industrial complex.

I puzzle over the international hysteria over American gun violence when the acts are random, unpredictable and departures from the norm. Think of all the people with access to guns in the United States who didn’t go on violent rampages.

The other puzzlement is that the crude data mining I demonstrated above establishes the practice of violence against innocents is a long standing and respected international practice.

Why stress over 294 mass shootings in the U.S. when 4,207 airstrikes in 2015 have killed or endangered equally innocent civilians who are non-U.S. citizens?

What is fair for citizens of one country should be fair for citizens of every country. The international community seems to be rather selective when applying that principle.

October 2, 2015

Workflow for R & Shakespeare

Filed under: Literature,R,Text Corpus,Text Mining — Patrick Durusau @ 2:00 pm

A new data processing workflow for R: dplyr, magrittr, tidyr, ggplot2

From the post:

Over the last year I have changed my data processing and manipulation workflow in R dramatically. Thanks to some great new packages like dplyr, tidyr and magrittr (as well as the less-new ggplot2) I've been able to streamline code and speed up processing. Up until 2014, I had used essentially the same R workflow (aggregate, merge, apply/tapply, reshape etc) for more than 10 years. I have added a few improvements over the years in the form of functions in packages doBy, reshape2 and plyr and I also flirted with the package data.table (which I found to be much faster for big datasets but the syntax made it difficult to work with) — but the basic flow has remained remarkably similar. Until now…

Given how much I've enjoyed the speed and clarity of the new workflow, I thought I would share a quick demonstration.

In this example, I am going to grab data from a sample SQL database provided by Google via Google BigQuery and then give examples of manipulation using dplyr, magrittr and tidyr (and ggplot2 for visualization).

This is a great introduction to a work flow in R that you can generalize for your own purposes.

Word counts won’t impress your English professor but you will have a base for deeper analysis of Shakespeare.
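
A hedged sketch of the pipeline style the post demonstrates (not its code): shakespeare stands in for the BigQuery sample word-count table, assumed to be already pulled into R with columns word, word_count and corpus:

  library(dplyr)
  library(ggplot2)

  # 'shakespeare' is a hypothetical data frame: word, word_count, corpus
  top_words <- shakespeare %>%
    group_by(corpus) %>%
    slice_max(word_count, n = 5) %>%
    ungroup()

  ggplot(top_words, aes(x = reorder(word, word_count), y = word_count)) +
    geom_col() +
    coord_flip() +
    facet_wrap(~ corpus, scales = "free_y")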

I first saw this in a tweet by Christophe Lalanne.

September 28, 2015

Discovering Likely Mappings between APIs using Text Mining [Likely Merging?]

Filed under: Programming,Text Mining — Patrick Durusau @ 8:23 pm

Discovering Likely Mappings between APIs using Text Mining by Rahul Pandita, Raoul Praful Jetley, Sithu D Sudarsan, Laurie Williams.

Abstract:

Developers often release different versions of their applications to support various platform/programming-language application programming interfaces (APIs). To migrate an application written using one API (source) to another API (target), a developer must know how the methods in the source API map to the methods in the target API. Given a typical platform or language exposes a large number of API methods, manually writing API mappings is prohibitively resource-intensive and may be error prone. Recently, researchers proposed to automate the mapping process by mining API mappings from existing codebases. However, these approaches require as input a manually ported (or at least functionally similar) code across source and target APIs. To address the shortcoming, this paper proposes TMAP: Text Mining based approach to discover likely API mappings using the similarity in the textual description of the source and target API documents. To evaluate our approach, we used TMAP to discover API mappings for 15 classes across: 1) Java and C# API, and 2) Java ME and Android API. We compared the discovered mappings with state-of-the-art source code analysis based approaches: Rosetta and StaMiner. Our results indicate that TMAP on average found relevant mappings for 57% more methods compared to previous approaches. Furthermore, our results also indicate that TMAP on average found exact mappings for 6.5 more methods per class with a maximum of 21 additional exact mappings for a single class as compared to previous approaches.

From the introduction:

Our intuition is: since the API documents are targeted towards developers, there may be an overlap in the language used to describe similar concepts that can be leveraged.

There are a number of insights in this paper but this statement of intuition alone is enough to justify reading the paper.

What if instead of API documents we were talking about topics that had been written for developers? Isn’t it fair to assume that concepts would have the same or similar vocabularies?

The evidence from this paper certainly suggests that to be the case.
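
A toy sketch of that intuition (TMAP itself is considerably more sophisticated): compare two method descriptions by cosine similarity over their word counts, in base R:

  desc_java   <- "Appends the specified element to the end of this list"
  desc_csharp <- "Adds an object to the end of the list"

  bag <- function(x) table(strsplit(tolower(x), "\\W+")[[1]])

  cosine <- function(a, b) {
    terms <- union(names(a), names(b))
    va <- as.numeric(a[terms]); va[is.na(va)] <- 0
    vb <- as.numeric(b[terms]); vb[is.na(vb)] <- 0
    sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
  }

  cosine(bag(desc_java), bag(desc_csharp))   # higher score = likelier mapping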

Of course, merging rules would have to allow for “likely” merging of topics, which could then be refined by readers.

Readers who hopefully contribute more information to make “likely” merging more “precise.” (At least in their view.)

That’s one of the problems with most semantic technologies isn’t it?

“Precision” can only be defined from a point of view, which by definition varies from user to user.

What would it look like to allow users to determine their desired degree of semantic precision?

Suggestions?

July 29, 2015

Unix™ for Poets

Filed under: Linux OS,Text Mining — Patrick Durusau @ 1:41 pm

Unix™ for Poets by Kenneth Ward Church.

A very delightful take on using basic Unix tools for text processing.

Exercises cover:

1. Count words in a text

2. Sort a list of words in various ways

  • ascii order
  • dictionary order
  • ‘‘rhyming’’ order

3. Extract useful info from a dictionary

4. Compute ngram statistics

5. Make a Concordance

Fifty-three (53) pages of pure Unix joy!

Enjoy!

Text Processing in R

Filed under: R,Text Mining — Patrick Durusau @ 1:08 pm

Text Processing in R by Matthew James Denny.

From the webpage:

This tutorial goes over some basic concepts and commands for text processing in R. R is not the only way to process text, nor is it really the best way. Python is the de-facto programming language for processing text, with a lot of builtin functionality that makes it easy to use, and pretty fast, as well as a number of very mature and full featured packages such as NLTK and textblob. Basic shell scripting can also be many orders of magnitude faster for processing extremely large text corpora — for a classic reference see Unix for Poets. Yet there are good reasons to want to use R for text processing, namely that we can do it, and that we can fit it in with the rest of our analyses. I primarily make use of the stringr package for the following tutorial, so you will want to install it:

Perhaps not the best tool for text processing but if you are inside R and have text processing needs, this will get you started.
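
A few stringr basics of the kind the tutorial covers (examples mine, not Denny’s):

  library(stringr)

  s <- "R is not the only way to process text, nor is it really the best way."

  str_count(s, "\\bway\\b")              # count occurrences of a word
  str_detect(s, "process")               # does the string contain a pattern?
  str_split(s, boundary("word"))[[1]]    # split into word tokens
  str_replace_all(s, "R", "Python")      # substitute a pattern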

May 1, 2015

Practical Text Analysis using Deep Learning

Filed under: Deep Learning,Natural Language Processing,Text Mining — Patrick Durusau @ 4:34 pm

Practical Text Analysis using Deep Learning by Michael Fire.

From the post:

Deep Learning has become a household buzzword these days, and I have not stopped hearing about it. In the beginning, I thought it was another rebranding of Neural Network algorithms or a fad that will fade away in a year. But then I read Piotr Teterwak’s blog post on how Deep Learning can be easily utilized for various image analysis tasks. A powerful algorithm that is easy to use? Sounds intriguing. So I decided to give it a closer look. Maybe it will be a new hammer in my toolbox that can later assist me to tackle new sets of interesting problems.

After getting up to speed on Deep Learning (see my recommended reading list at the end of this post), I decided to try Deep Learning on NLP problems. Several years ago, Professor Moshe Koppel gave a talk about how he and his colleagues succeeded in determining an author’s gender by analyzing his or her written texts. They also released a dataset containing 681,288 blog posts. I found it remarkable that one can infer various attributes about an author by analyzing the text, and I’ve been wanting to try it myself. Deep Learning sounded very versatile. So I decided to use it to infer a blogger’s personal attributes, such as age and gender, based on the blog posts.

If you haven’t gotten into deep learning, here’s another opportunity focused on natural language processing. You can follow Michael’s general directions to learn on your own or follow more detailed instructions in his IPython notebook.

Enjoy!

April 7, 2015

q – Text as Data

Filed under: CSV,SQL,Text Mining — Patrick Durusau @ 5:03 pm

q – Text as Data by Harel Ben-Attia.

From the webpage:

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

q’s web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.

I’m not looking for an alternative to awk or sed for CSV/TSV files but you may be.

From the examples I suspect it would be “easier” in some sense of the word to teach than either awk or sed.

Give it a try and let me know what you think.

I first saw this in a tweet by Scott Chamberlain.

March 12, 2015

Detecting Text Reuse in Nineteenth-Century Legal Documents:…

Filed under: History,Law - Sources,Text Analytics,Text Mining,Texts — Patrick Durusau @ 6:32 pm

Detecting Text Reuse in Nineteenth-Century Legal Documents: Methods and Preliminary Results by Lincoln Mullen.

From the post:

How can you track changes in the law of nearly every state in the United States over the course of half a century? How can you figure out which states borrowed laws from one another, and how can you visualize the connections among the legal system as a whole?

Kellen Funk, a historian of American law, is writing a dissertation on how codes of civil procedure spread across the United States in the second half of the nineteenth century. He and I have been collaborating on the digital part of this project, which involves identifying and visualizing the borrowings between these codes. The problem of text reuse is a common one in digital history/humanities projects. In this post I want to describe our methods and lay out some of our preliminary results. To get a fuller picture of this project, you should read the four posts that Kellen has written about his project:

Quite a remarkable project with many aspects that will be relevant to other projects.

Lincoln doesn’t use the term but this would be called textual criticism, if it were being applied to the New Testament. Of course here, Lincoln and Kellen have the original source document and the date of its adoption. New Testament scholars have copies of copies in no particular order and no undisputed evidence of the original text.

Did I mention that all the source code for this project is on Github?

January 21, 2015

TM-Gen: A Topic Map Generator from Text Documents

Filed under: Authoring Topic Maps,Text Mining,Topic Maps — Patrick Durusau @ 4:55 pm

TM-Gen: A Topic Map Generator from Text Documents by Angel L. Garrido, et al.

From the post:

The vast amount of text documents stored in digital format is growing at a frantic rhythm each day. Therefore, tools able to find accurate information by searching in natural language information repositories are gaining great interest in recent years. In this context, there are especially interesting tools capable of dealing with large amounts of text information and deriving human-readable summaries. However, one step further is to be able not only to summarize, but to extract the knowledge stored in those texts, and even represent it graphically.

In this paper we present an architecture to generate automatically a conceptual representation of knowledge stored in a set of text-based documents. For this purpose we have used the topic maps standard and we have developed a method that combines text mining, statistics, linguistic tools, and semantics to obtain a graphical representation of the information contained therein, which can be coded using a knowledge representation language such as RDF or OWL. The procedure is language-independent, fully automatic, self-adjusting, and it does not need manual configuration by the user. Although the validation of a graphic knowledge representation system is very subjective, we have been able to take advantage of an intermediate product of the process to make an experimental validation of our proposal.

Of particular note on the automatic construction of topic maps:

Addition of associations:

TM-Gen adds to the topic map the associations between topics found in each sentence. These associations are given by the verbs present in the sentence. TM-Gen performs this task by searching the subject included as topic, and then it adds the verb as its association. Finally, it links its verb complement with the topic and with the association as a new topic.

Depending on the archive, one would expect associations between authors and articles, but also between topics within articles, to say nothing of dates, publications, etc. Once established, a user can request a view with more or less detail. If detail is not captured, however, it will not be available.

There is only a general description of TM-Gen but enough to put you on the way to assembling something quite similar.

January 20, 2015

Modelling Plot: On the “conversional novel”

Filed under: Language,Literature,Text Analytics,Text Mining — Patrick Durusau @ 11:11 am

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, its a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards but never a novel: try reading a text from back to front, a sentence at a time. At least when proofing your own writing, it provides a radically different perspective from the usual front-to-back reading. The first thing you notice is that it interrupts your reading/skimming speed, so you catch more errors as well as nuances in the text.

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the current popular “war on terrorism” narrative that is so popular and unquestioned in the United States. Ask anyone inside the beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sorts of questions, but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

January 12, 2015

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

Filed under: Machine Learning,PDF,Tables,Text Mining — Patrick Durusau @ 8:17 pm

A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles by Stefan Klampfl, Kris Jack, Roman Kern.

Abstract:

In digital scientific articles tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods based on unsupervised learning techniques and heuristics which automatically detect both the location and the structure of tables within a article stored as PDF. For both algorithms the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

Excellent article if you have ever struggled with the endless tables in government documents.

I first saw this in a tweet by Anita de Waard.

