Using machine learning to extract quotes from text by Chase Davis.
From the post:
Since we launched our Politics Verbatim project a couple of years ago, I’ve been hung up on what should be a simple problem: How can we automate the extraction of quotes from news articles, so it doesn’t take a squad of bored-out-of-their-minds interns to keep track of what politicians say in the news?
You’d be surprised at how tricky this is. At first glance, it looks like something a couple of regular expressions could solve. Just find the text with quotes in it, then pull out the words in between! But what about “air quotes?” Or indirect quotes (“John said he hates cheeseburgers.”)? Suffice it to say, there are plenty of edge cases that make this problem harder than it looks.
When I took over management of the combined Center for Investigative Reporting/Bay Citizen technology team a couple of months ago, I encouraged everyone to have a personal project on the back burner – an itch they wanted to scratch either during slow work days or (in this case) on nights and weekends.
This is mine: the citizen-quotes project, an app that uses simple machine learning techniques to extract more than 40,000 quotes from every article that ran on The Bay Citizen since it launched in 2010. The goal was to build something that accounts for the limitations of the traditional method of solving quote extraction – regular expressions and pattern matching. And sure enough, it does a pretty good job.
Illustrates the application of machine learning to a non-trivial text analysis problem.