Archive for the ‘Concordance’ Category

Saving Output of nltk Text.Concordance()

Friday, April 18th, 2014

Saving Output of NLTK Text.Concordance() by Kok Hua.

From the post:

In NLP, sometimes users would like to search for series of phrases that contain particular keyword in a passage or web page.

NLTK provides the function concordance() to locate and print series of phrases that contain the keyword. However, the function only print the output. The user is not able to save the results for further processing unless redirect the stdout.

Below function will emulate the concordance function and return the list of phrases for further processing. It uses the NLTK concordance Index which keeps track of the keyword index in the passage/text and retrieve the surrounding words.
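The idea in the quoted passage can be sketched without NLTK at all. What follows is not Kok Hua's function and not NLTK's code; it is a minimal, dependency-free illustration that assumes a pre-tokenized text. The names `build_offsets` and `concordance_list` and the `window` parameter are hypothetical. The offsets dictionary stands in for the role NLTK's ConcordanceIndex plays: remembering where each keyword occurs so the surrounding words can be retrieved and returned instead of printed.

```python
from collections import defaultdict


def build_offsets(tokens):
    """Map each lowercased token to the positions where it occurs.

    Hypothetical stand-in for the keyword index described in the post.
    """
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[token.lower()].append(position)
    return offsets


def concordance_list(tokens, keyword, window=3):
    """Return (rather than print) the phrase around each keyword hit."""
    offsets = build_offsets(tokens)
    phrases = []
    for position in offsets.get(keyword.lower(), []):
        start = max(0, position - window)  # do not run off the front
        phrases.append(" ".join(tokens[start:position + window + 1]))
    return phrases


tokens = "the cat sat on the mat and the dog sat on the rug".split()
results = concordance_list(tokens, "sat", window=2)
```

Because `results` is an ordinary list of strings, it can be filtered, written to a file, or passed to another tool, which is the point of the original post.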

Text mining is a very common part of topic map construction so tools that help with that task are always welcome.

To be honest, I am citing this because it may become part of several small tools for processing standards drafts. Concordance software is not rare, but a full concordance of a document seems to frighten some proofreaders.

The current thinking is that if only the “important” terms are highlighted in context, some proofreaders will be more likely to use the work product.

The same principle applies to the authoring of topic maps as well.

AntConc

Thursday, May 9th, 2013

AntConc by Laurence Anthony.

From the help file:

Concordance

The Concordance tool generates KWIC (key word in context) concordance lines from one or more target texts chosen by the user.

Concordance Plot

The Concordance Plot tool generates an alternative view of search term hits in a corpus compared with the Concordance tool. Here the relative position of each hit in a file is displayed as a line in bar chart. (Search terms can be inputted in an identical way to that in the Concordance Tool.)

File View

The File View tool is used to display the original files of the corpus. It can also be used to search for terms within individual files in a similar way to searches using the Concordance and Concordance Plot tools.

Word Clusters

The Word Clusters tool is used to generate an ordered list of clusters that appear around a search term in the target files listed in the left frame of the main window.

N-Grams

The N-grams tool is used to generate an ordered list of N-grams that appear in the target files listed in the left frame of the main window. N-grams are word N-grams, and therefore, large files will create huge numbers of N-grams. For example, N-grams of size 2 for the sentence “this is a pen”, are ‘this is’, ‘is a’ and ‘a pen’.
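The help file's bigram example is easy to reproduce. This is only a sketch of word N-grams over a whitespace-tokenized sentence, not AntConc's implementation; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("this is a pen".split(), 2)
```

A text of k tokens yields k - n + 1 N-grams, which is why large files produce such long lists.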

Collocates

The Collocates tool is used to generate an ordered list of collocates that appear near a search term in the target files listed in the left frame of the main window.
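As a rough illustration of what "collocates near a search term" means, the sketch below counts the words found within a fixed span on either side of each hit and orders them by raw frequency. AntConc's own ordering and statistics may differ; the function name `collocates` and the `span` parameter are assumptions made for this example.

```python
from collections import Counter


def collocates(tokens, keyword, span=2):
    """Count words within `span` tokens on either side of each keyword hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - span):i]     # words before the hit
            right = tokens[i + 1:i + 1 + span]    # words after the hit
            counts.update(left + right)
    return counts.most_common()  # most frequent collocates first


tokens = "the cat sat on the mat and the dog sat on the rug".split()
near_sat = collocates(tokens, "sat", span=1)
```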

Word List

The Word List feature is used to generate a list of ordered words that appear in the target files listed in the left frame of the main window.

Keyword List

In addition to generating word lists using the Word List tool, AntConc can compare the words that appear in the target files with the words that appear in a ‘reference corpus’ to generate a list of “Keywords”, that are unusually frequent (or infrequent) in the target files.

The 1.0 version appeared in 2002 and the current beta version is 3.3.5.

Great for exploring texts!

Did I mention it is freeware?