Inter-Document Similarity with Scikit-Learn and NLTK by Sujit Pal.
From the post:
Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, i.e., comparing student transcripts and flagging documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and to come up with a code template for doing this.
I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.
I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code. I first found this in Dr Manu Konchady's TextMine project, and I have used it off and on since. For convenience, I have made it available at the GitHub location for the sub-project.
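The pipeline Pal describes can be sketched in a few lines: build TF-IDF vectors with scikit-learn, then score document pairs with NLTK's cosine distance and flag the unusually similar ones. This is a minimal illustration, not Pal's actual code; the toy documents and the 0.8 threshold are placeholders standing in for the coffee-sugar-cocoa corpus and whatever cutoff a real cheating-detection run would calibrate.

```python
# Sketch of the approach: scikit-learn for TF-IDF vectorization,
# NLTK's cosine_distance for pairwise similarity scoring.
from itertools import combinations

from nltk.cluster.util import cosine_distance
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real corpus (hypothetical documents)
docs = [
    "coffee prices rose sharply this quarter",
    "coffee prices rose sharply in the quarter",
    "cocoa harvests in west africa fell slightly",
]

# One row per document, one column per term, TF-IDF weighted
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs).toarray()

# Similarity = 1 - cosine distance; flag pairs above a chosen cutoff
THRESHOLD = 0.8  # arbitrary; "abnormally high" needs real calibration
for i, j in combinations(range(len(docs)), 2):
    sim = 1.0 - cosine_distance(X[i], X[j])
    flag = "FLAG" if sim > THRESHOLD else "ok"
    print(f"doc {i} vs doc {j}: similarity {sim:.3f} [{flag}]")
```

With these toy inputs the first two documents collapse to the same term set once stop words are removed, so they score near 1.0 and get flagged, while the cocoa document scores near zero against both.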
Similarity measures are fairly well understood, but interesting data sets for testing similarity code are harder to come by.
Here are some random suggestions:
- Speeches by Republicans on Benghazi
- Speeches by Democrats on Gun Control
- TV reports on any particular disaster
- News reports of sporting events
- Dialogue from popular TV shows
With a five-to-ten-second lag, perhaps streams of speech could be monitored for plagiarism or repetition and simply dropped.
😉