RTextTools: A Supervised Learning Package for Text Classification

RTextTools: A Supervised Learning Package for Text Classification by Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt.


Social scientists have long hand-labeled texts to create datasets useful for studying topics from congressional policymaking to media reporting. Many social scientists have begun to incorporate machine learning into their toolkits. RTextTools was designed to make machine learning accessible by providing a start-to-finish product in less than 10 steps. After installing RTextTools, the initial step is to generate a document term matrix. Second, a container object is created, which holds all the objects needed for further analysis. Third, users can use up to nine algorithms to train their data. Fourth, the data are classified. Fifth, the classification is summarized. Sixth, functions are available for performance evaluation. Seventh, ensemble agreement is conducted. Eighth, users can cross-validate their data. Finally, users write their data to a spreadsheet, allowing for further manual coding if required.
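
To get a feel for the first few steps in the abstract (build a document term matrix, split it into training and test sets, train, classify, evaluate), here is a minimal sketch in plain Python. It is not RTextTools and does not use its API. The toy texts, the labels, and every function name are my own hypothetical choices, and a simple multinomial naive Bayes stands in for the algorithms the package offers.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(texts):
    """Return (vocabulary, rows); each row holds the term counts for one text."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


def train_naive_bayes(rows, labels):
    """Fit multinomial naive Bayes with add-one smoothing on count rows."""
    n_terms = len(rows[0])
    term_totals = defaultdict(lambda: [0] * n_terms)
    label_counts = Counter(labels)
    for row, label in zip(rows, labels):
        totals = term_totals[label]
        for i, count in enumerate(row):
            totals[i] += count
    model = {}
    for label, totals in term_totals.items():
        denom = sum(totals) + n_terms
        log_prior = math.log(label_counts[label] / len(labels))
        log_likelihood = [math.log((t + 1) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model


def classify(model, row):
    """Return the label with the highest log posterior for one count row."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
    return max(model, key=score)


# Hypothetical hand-labeled bill titles (illustrative only, not real data).
texts = [
    "A bill to amend the tax code and reduce income tax rates",
    "A bill to provide tax credits for small business income",
    "A bill to expand health care coverage for veterans",
    "A bill to fund health clinics and hospital care",
    "A bill to lower the income tax for families",
    "A bill to improve hospital care for children",
]
labels = ["tax", "tax", "health", "health", "tax", "health"]

# Step 1: one matrix over all texts. Step 2: split rows into train and test.
vocabulary, rows = document_term_matrix(texts)
train_rows, train_labels = rows[:4], labels[:4]
test_rows, test_labels = rows[4:], labels[4:]

# Steps 3 to 5: train, classify the held-out rows, and measure accuracy.
model = train_naive_bayes(train_rows, train_labels)
predictions = [classify(model, row) for row in test_rows]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The matrix is built over all texts before splitting so that training and test rows share one vocabulary, which is the role the abstract gives to the container object.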

Another software package that comes with a sample data set!

The congressional bills example reminds me of a comment by Trey Grainger in Building a Real-time, Big Data Analytics Platform with Solr.

Trey makes the point that what counts as a “document” in Solr depends on how you define it, which enables processing and retrieval at a much lower level than a traditional “document.”

If the congressional bills were broken down at a clause level, would the results be different?
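
To make the question concrete, here is a small Python sketch of how the choice of unit changes what the classifier sees: the same text treated as one document versus one document per clause. The sample bill text is invented, and splitting on semicolons is only an assumption for illustration; real bill text would need a proper parser.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_into_clauses(bill_text):
    """Naive, hypothetical clause splitter: break the text on semicolons."""
    return [part.strip() for part in bill_text.split(";") if part.strip()]


def term_counts(documents):
    """Return one Counter of term frequencies per document."""
    return [Counter(tokenize(d)) for d in documents]


# Invented bill text with three clauses on different topics.
bill = (
    "to reduce income tax rates for families; "
    "to fund rural health clinics; "
    "to require an annual report to Congress"
)

# Bill-level: a single row that mixes every topic in the bill.
whole_counts = term_counts([bill])

# Clause-level: one row per clause, each with its own narrower vocabulary.
clause_docs = split_into_clauses(bill)
clause_counts = term_counts(clause_docs)

print(len(whole_counts), "bill-level row;", len(clause_counts), "clause-level rows")
for clause, counts in zip(clause_docs, clause_counts):
    print(clause, "->", sorted(counts))
```

At the bill level a single label has to cover tax, health, and reporting terms at once. At the clause level each row could carry its own label, although that would also mean hand-labeling clauses rather than bills.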

That is not something I am going to pursue today, but I would appreciate comments and suggestions if you have seen it tried in other contexts.
