Applied Natural Language Processing by Jason Baldridge.
Description:
This class will provide instruction on applying algorithms in natural language processing and machine learning for experimentation and for real-world tasks, including clustering, classification, part-of-speech tagging, named entity recognition, topic modeling, and more. The approach will be practical and hands-on: for example, students will program common classifiers from the ground up, use existing toolkits such as OpenNLP, Chalk, StanfordNLP, Mallet, and Breeze, construct NLP pipelines with UIMA, and get some initial experience with distributed computation with Hadoop and Spark. Guidance will also be given on software engineering, including build tools, git, and testing. It is assumed that students are already familiar with machine learning and/or computational linguistics and that they are already competent programmers. The programming language used in the course will be Scala; no explicit instruction will be given in Scala programming, but resources and assistance will be provided for those new to the language.
From the syllabus:
The foremost goal of this course is to provide practical exposure to the core techniques and applications of natural language processing. By the end, students will understand the motivations for and capabilities of several core natural language processing and machine learning algorithms and techniques used in text analysis, including:
- regular expressions
- vector space models
- clustering
- classification
- deduplication
- n-gram language models
- topic models
- part-of-speech tagging
- named entity recognition
- PageRank
- label propagation
- dependency parsing
We will show, on a few chosen topics, how natural language processing builds on and uses the fundamental data structures and algorithms presented in this course. In particular, we will discuss:
- authorship attribution
- language identification
- spam detection
- sentiment analysis
- influence
- information extraction
- geolocation
Students will learn to write non-trivial programs for natural language processing that take advantage of existing open source toolkits. The course will involve significant guidance and instruction in software engineering practices and principles, including:
- functional programming
- distributed version control systems (git)
- build systems
- unit testing
- distributed computing (Hadoop)
The course will help prepare students both for jobs in industry and for doing original research that involves natural language processing.
A great start to one aspect of being a “data scientist.”
I encountered this course via the Nak (Scala library for NLP) project. Version 1.1.1 was just released, and I saw a tweet from Jason Baldridge about it.
The course materials have exercises and a rich set of links to other resources.
You may also enjoy:
Bcomposes (Jason’s blog).