spaCy: Industrial-strength NLP

spaCy: Industrial-strength NLP by Matthew Honnibal.

From the post:

spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology.

To do great NLP, you have to know a little about linguistics, a lot about machine learning, and almost everything about the latest research. The people who fit this description seldom join small companies. Most are broke — they’ve just finished grad school. If they don’t want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed little in the last ten years. In academia, it’s changed entirely. Amazing improvements in quality. Orders of magnitude faster. But the academic code is always GPL, undocumented, unuseable, or all three. You could implement the ideas yourself, but the papers are hard to read, and training data is exorbitantly expensive. So what are you left with? A common answer is NLTK, which was written primarily as an educational resource. Nothing past the tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate its findings to software engineers. So I wrote two blog posts, explaining how to write a part-of-speech tagger and parser. Both were well received, and there’s been a bit of interest in my research software — even though it’s entirely undocumented, and mostly unuseable to anyone but me.

So six months ago I quit my post-doc, and I’ve been working day and night on spaCy since. I’m now pleased to announce an alpha release.

If you’re a small company doing NLP, I think spaCy will seem like a minor miracle. It’s by far the fastest NLP software ever released. The full processing pipeline completes in 7ms per document, including accurate tagging and parsing. All strings are mapped to integer IDs, tokens are linked to embedded word representations, and a range of useful features are pre-calculated and cached.

Matthew uses an example based on Stephen King’s admonition “the adverb is not your friend“, which immediately brought to mind the utility of tagging all adverbs and adjectives in a standards draft and then generating comments that identify its parent <p> element and the offending phrase.

I haven’t verified the performance comparisons, but as you know, the real question is how well spaCy works on your data, work flow, etc.?

Thanks to Matthew for the reminder of: On writing : a memoir of the craft by Stephen King. Documentation will never be as gripping as a King novel, but it shouldn’t be painful to read.

I first saw this in a tweet by Jason Baldridge.

Comments are closed.