Archive for the ‘TF-IDF’ Category

Nuremberg Trial Verdicts [70th Anniversary]

Saturday, October 1st, 2016

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bolden und Halbach, who was seriously ill; and Martin Borman, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, searching challenge than twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

arXiv Sanity Preserver

Sunday, November 29th, 2015

arXiv Sanity Preserver by Andrej Karpathy.

From the webpage:

There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.

It’s super hacky and was written in 4 hours. I’ll keep polishing it a bit over time perhaps but it serves its purpose for me already. The code uses Arxiv API to download the most recent papers (as many as you want – I used the last 1100 papers over last 3 months), and then downloads all papers, extracts text, creates tfidf vectors for each paper, and lastly is a flask interface for searching through and filtering similar papers using the vectors.

Main functionality is a search feature, and most useful is that you can click “sort by tfidf similarity to this”, which returns all the most similar papers to that one in terms of tfidf bigrams. I find this quite useful.


You can see this rather remarkable tool online at:

Beyond its obvious utility for researchers, this could be used as a framework for experimenting with other similarity measures.


I first saw this in a tweet by Lynn Cherny.

TF-IDF using flambo

Wednesday, August 13th, 2014

TF-IDF using flambo by Muslim Baig.

From the post:

flambo is a Clojure DSL for Spark created by the data team at Yieldbot. It allows you to create and manipulate Spark data structures using idiomatic Clojure. The following tutorial demonstrates typical flambo API usage and facilities by implementing the classic tf-idf algorithm.

The complete runnable file of the code presented in this tutorial is located under the flambo.example.tfidf namespace, under the flambo /test/flambo/example directory. We recommend you download flambo and follow along in your REPL.

Working through the Clojure code you will get a better understanding of the TF-IDF algorithm.

I don’t know if it was intentional, but the division of the data into “documents” illustrates one of the fundamental questions for most indexing techniques:

What do you mean by document?

It is a non-trivial question and one that has a major impact on the results of the algorithm.

If I get to choose what is considered a “document,” then I can weight the results while using the same algorithm as everyone else.

Think about it. My “documents” may have the term “example” in each one, as opposed to “example” appearing three times in a single document. See the last section in the Wikipedia article tf-idf for the impact of such splitting.

Other algorithms are subject to similar manipulation. It isn’t ever enough to know the algorithms applied to data, you need to see the data itself.

Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig

Monday, October 1st, 2012

Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig by Russell Jurney.

From the post:

In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.

In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:

Any Pig macros in your trick bag?

Scalding for the Impatient

Sunday, August 12th, 2012

Scalding for the Impatient by Sujit Pal.

From the post:

Few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues). But there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I’ve provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.

A very nice side by side comparison and likely to make you interested in Scalding.

Explicit Semantic Analysis (ESA) using Wikipedia

Sunday, May 6th, 2012

Explicit Semantic Analysis (ESA) using Wikipedia

From the webpage:

The Explicit Semantic Analysis (ESA) method (Gabrilovich and Markovitch, 2007) is a measure to compute the semantic relatedness (SR) between two arbitrary texts. The Wikipedia-based technique represents terms (or texts) as high-dimensional vectors, each vector entry presenting the TF-IDF weight between the term and one Wikipedia article. The semantic relatedness between two terms (or texts) is expressed by the cosine measure between the corresponding vectors.

I have no objection to machine-based techniques, or human-based ones for that matter, so long as the limitations of both are kept firmly in mind.

Some older resources on Explicit Semantic Analysis.