Our Tagged Ingredients Data is Now on GitHub by Erica Greene and Adam McKaig.
From the post:
Since publishing our post about “Extracting Structured Data From Recipes Using Conditional Random Fields,” we’ve received a tremendous number of requests to release the data and our code. Today, we’re excited to release the roughly 180,000 labeled ingredient phrases that we used to train our machine learning model.
You can find the data and code in the ingredient-phrase-tagger GitHub repo. Instructions are in the README and the raw data is in nyt-ingredients-snapshot-2015.csv.
…
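Before building anything on the data, it helps to look at what the CSV actually contains. What follows is a minimal sketch, not code from the repo: it assumes the file named in the post sits in the working directory, is UTF-8 encoded, and has a header row. The helper name `summarize_csv` is my own invention, and I make no assumption about the column names; the script simply reports whatever header it finds.

```python
import csv
from pathlib import Path


def summarize_csv(lines, sample_size=3):
    """Return (column names, row count, first few rows) for CSV text.

    `lines` is any iterable of text lines, such as an open file.
    The first line is assumed to be a header row.
    """
    reader = csv.DictReader(lines)
    sample = []
    count = 0
    for row in reader:
        if count < sample_size:
            sample.append(row)
        count += 1
    return reader.fieldnames, count, sample


# File name taken from the post; location and encoding are assumptions.
path = Path("nyt-ingredients-snapshot-2015.csv")
if path.exists():
    with path.open(newline="", encoding="utf-8") as f:
        columns, n_rows, sample = summarize_csv(f)
    print("columns:", columns)
    print("rows:", n_rows)
    for row in sample:
        print(row)
```

Reading the header instead of hard-coding it means the sketch keeps working if the column layout differs from what you expect. Check the README for the authoritative description of each field.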
Reaching critical mass in a domain is a stumbling block for any topic map. Erica and Adam kick-start your foodie topic map adventures with ~180,000 labeled ingredient phrases.
You are looking at the end result of six years of data mining and some clever programming, so be sure to:
- Always acknowledge this project, along with Erica and Adam, in your work.
- Contribute back improved data.
- Contribute back improvements on the conditional random fields (CRF).
- Have a great time extending this data set!
Possible extensions include automatic translation (with mapping of “equivalent” terms), melding in the USDA food database (formally, the USDA National Nutrient Database for Standard Reference) with nutrient content information on ~8,800 foods, and, of course, the “correct” way to make a roux as reflected in your mother’s cookbook.
It is, unfortunately, true that you can buy a mix for roux in a cardboard box. You would need a food processor to chop up the cardboard so you can enjoy it with the roux that came in it. I’m originally from Louisiana and the thought of a roux mix is depressing, if not heretical.