Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 2, 2019

Constructing Stoplists for Historical Languages [Hackers?]

Filed under: Classics,Cybersecurity,Hacking,Natural Language Processing — Patrick Durusau @ 9:50 am

Constructing Stoplists for Historical Languages by Patrick J. Burns.

Abstract

Stoplists are lists of words that have been filtered from documents prior to text analysis tasks, usually words that are either high frequency or that have low semantic value. This paper describes the development of a generalizable method for building stoplists in the Classical Language Toolkit (CLTK), an open-source Python platform for natural language processing research on historical languages. Stoplists are not readily available for many historical languages, and those that are available often offer little documentation about their sources or method of construction. The development of a generalizable method for building historical-language stoplists offers the following benefits: 1. better support for well-documented, data-driven, and replicable results in the use of CLTK resources; 2. reduction of arbitrary decision-making in building stoplists; 3. increased consistency in how stopwords are extracted from documents across multiple languages; and 4. clearer guidelines and standards for CLTK developers and contributors, a helpful step forward in managing the complexity of a multi-language open-source project.

I post this in part to spread the word about these stoplists for humanists.

At the same time, I’m curious about the use of stoplists by hackers to filter cruft from disassembled files. Disassembled files are “texts” of a sort and it seems to me that many of the tools used by humanists could, emphasis on could, be relevant.
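To make the idea concrete, here is a minimal sketch of what a frequency-based stoplist for disassembly might look like. Everything in it is an assumption for illustration: the two listings are made up, the function names (`mnemonics`, `build_stoplist`, `filter_stopwords`) are hypothetical, and the scoring (mean relative frequency across documents) is my own simplification, not necessarily the method in Burns's paper and not the CLTK API.

```python
from collections import Counter

# Hypothetical disassembly listings, invented for this sketch.
LISTING_A = """
mov eax, ebx
push ebp
mov ebp, esp
nop
nop
call decrypt
ret
"""

LISTING_B = """
mov ecx, 1
nop
mov edx, ecx
xor eax, eax
nop
ret
"""


def mnemonics(listing):
    """Treat the first token of each non-empty line as a 'word'."""
    return [line.split()[0].lower()
            for line in listing.splitlines() if line.strip()]


def build_stoplist(documents, size=10):
    """Rank tokens by relative frequency summed over documents
    (equivalent to ranking by the mean) and return the top `size`."""
    totals = Counter()
    for tokens in documents:
        if not tokens:
            continue
        for token, count in Counter(tokens).items():
            totals[token] += count / len(tokens)
    # Sort by score (descending), then alphabetically so ties are stable.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:size]]


def filter_stopwords(tokens, stoplist):
    """Drop every token that appears in the stoplist."""
    stop = set(stoplist)
    return [t for t in tokens if t not in stop]


docs = [mnemonics(LISTING_A), mnemonics(LISTING_B)]
stoplist = build_stoplist(docs, size=2)
remaining = filter_stopwords(docs[0], stoplist)
print(stoplist)
print(remaining)
```

Whether the most frequent mnemonics are really "cruft" is exactly the open question: in natural language a high-frequency word usually carries little meaning, but in a binary a run of `nop` instructions may itself be the interesting part.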

Suggestions/pointers?
