Open Sourcing Duckling, our probabilistic (date) parser [Clojure]

Open Sourcing Duckling, our probabilistic (date) parser

From the post:

We’ve previously discussed ambiguity in natural language. What’s really fascinating is that even the simplest, seemingly most structured parts of natural language, like the way we humans describe dates and times, are actually so difficult to turn into structured data.

The wild world of temporal expressions in human language

All the following expressions describe the same point in time (at least in some contexts):

  • “December 30th, at 3 in the afternoon”
  • “The day before New Year’s Eve at 3pm”
  • “At 1500 three weeks from now”
  • “The last Tuesday of December at 3pm”

But wait… is it really equivalent to say 3pm and 1500? In the latter case, it seems that speaker meant to be more precise. Is it OK to drop this information?

And what about “next Tuesday”? If today is Monday, is that tomorrow or in 8 days? When I say “last month”, is it the last full month or the last 30 days?

A last example: “one month” looks like a well defined duration. That is, until you try to normalize durations in seconds, and you realize different months have anywhere between 28 and 31 days! Even “one day” is difficult. Yes, a day can last between 23 and 25 hours, because of daylight savings. Oh, and did I mention that at midnight at the end of 1927 in Shanghai, the clocks went back 5 minutes and 52 seconds? So “1927-12-31 23:54:08” actually happened twice there.

There are hundreds of hard things like these, and the more you dig into this, believe me, the more you’ll encounter. But that’s out of the scope of this post.

An introduction to the vagaries of date statements in natural language, a probabilistic (date) parser in Clojure, and an opportunity to extend said parser to other data types.

Nice way to end the week!

Comments are closed.