A Probabilistic Model for Ingredients in the Kaggle What's Cooking Challenge
In this notebook, we'll try to solve a problem that arose while I was working on the Kaggle What's Cooking challenge (see my previous posts on the subject here and here).
In this challenge, the data, which consists of recipes, contains a few quirks. For example, some ingredients contain brand names, such as knorr garlic minicub, while others mention weights and quantities that should have been cleaned up beforehand, such as (10 oz.) frozen chopped spinach, thawed and squeezed dry.
The goal of this notebook is therefore to develop a method that suggests better ingredient names in an automatic way, based on a given recipe.
The outline of this blog post is as follows: we will first look at some overall statistics about the data, then develop a probabilistic model inspired by the principle behind spellcheckers and Google autocorrect (as explained by Peter Norvig), and finally apply it to the machine learning algorithm used in the previous posts, logistic regression.
Statistics about the data, at ingredient and word level¶
In this section, we'll do some basic data exploration. In particular, let's try to answer the following questions:
- how many ingredients do we find in the data? how many are unique?
- how many individual words make up these ingredients?
First, let's load the data. We read it using codecs because pandas doesn't support specifying the encoding in its JSON loading routine (see http://blog.nexusger.de/2015/11/python-pandas-and-json_read-with-utf-8-encoding/).
import pandas as pd
import codecs
df_train = pd.read_json(codecs.open('train.json', 'r', 'utf-8'))
The head of the dataframe looks like this:
df_train.head()
cuisine | id | ingredients | |
---|---|---|---|
0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |
Now, let's build a single list with all the ingredients found in the dataset.
all_ingredients_text = []
for ingredient_list in df_train.ingredients:
all_ingredients_text += [ing.lower() for ing in ingredient_list]
Let's have a look at the stats. We have the following number of ingredients in our recipes:
len(all_ingredients_text)
428275
Among those, the number of unique ingredients is:
len(set(all_ingredients_text))
6703
How about looking at these ingredients in terms of words? We can split each ingredient at a word boundary using a regexp and then count them:
import re
re.split(re.compile('[,. ]+'), 'KRAFT Shredded Pepper Jack Cheese with a TOUCH OF PHILADELPHIA')
['KRAFT', 'Shredded', 'Pepper', 'Jack', 'Cheese', 'with', 'a', 'TOUCH', 'OF', 'PHILADELPHIA']
Let's do that and compute the number of words and then the unique words:
splitter = re.compile('[,. ]+')
all_words = []
for ingredient in all_ingredients_text:
all_words += re.split(splitter, ingredient)
len(all_words)
807802
How about unique ones?
len(set(all_words))
3152
So to conclude, our dataset consists of:
- 428275 ingredients
- among which 6703 are unique
- all these ingredients are made of 807802 words
- among which 3152 are unique
Let's now turn to the problems one can see with these ingredients.
Why some ingredients found in the data are strange¶
The longest ingredient names found in the dataset are:
sorted(set(all_ingredients_text), key=len, reverse=True)[:50]
['pillsbury™ crescent recipe creations® refrigerated seamless dough sheet', 'kraft mexican style shredded four cheese with a touch of philadelphia', 'bertolli vineyard premium collect marinara with burgundi wine sauc', 'kraft shredded pepper jack cheese with a touch of philadelphia', 'hidden valley® farmhouse originals italian with herbs dressing', 'hidden valley® original ranch salad® dressing & seasoning mix', 'condensed reduced fat reduced sodium cream of mushroom soup', 'hellmann’ or best food canola cholesterol free mayonnais', 'condensed reduced fat reduced sodium cream of chicken soup', "i can't believ it' not butter! made with olive oil spread", 'wish-bone light asian sesame ginger vinaigrette dressing', '(10 oz.) frozen chopped spinach, thawed and squeezed dry', 'kraft mexican style 2% milk finely shredded four cheese', 'kraft shredded low-moisture part-skim mozzarella cheese', 'hurst family harvest chipotle lime black bean soup mix', 'reduced fat reduced sodium tomato and herb pasta sauce', 'frozen orange juice concentrate, thawed and undiluted', 'crystal farms reduced fat shredded marble jack cheese', "i can't believe it's not butter!® all purpose sticks", 'sargento® traditional cut shredded mozzarella cheese', 'lipton sparkling diet green tea with strawberry kiwi', 'sargento® traditional cut shredded 4 cheese mexican', 'hidden valley® original ranch® spicy ranch dressing', 'hidden valley® greek yogurt original ranch® dip mix', 'conimex woksaus specials vietnamese gember knoflook', 'honeysuckle white® hot italian turkey sausage links', 'ragu old world style sweet tomato basil pasta sauc', 'sargento® artisan blends® shredded parmesan cheese', 'reduced fat reduced sodium cream of mushroom soup', 'frozen lemonade concentrate, thawed and undiluted', 'reduced sodium reduced fat cream of mushroom soup', 'condensed reduced fat reduced sodium tomato soup', 'shredded reduced fat reduced sodium swiss cheese', 'knorr reduc sodium chicken flavor bouillon cube', '2 1/2 to 3 lb. chicken, cut into serving pieces', 'kraft mexican style finely shredded four cheese', 'frozen chopped spinach, thawed and squeezed dry', "frank's® redhot® original cayenne pepper sauce", 'reduced sodium condensed cream of chicken soup', 'foster farms boneless skinless chicken breasts', 'knorr tomato bouillon with chicken flavor cube', 'pompeian canola oil and extra virgin olive oil', 'hidden valley® original ranch® light dressing', "uncle ben's ready rice whole grain brown rice", 'soy vay® veri veri teriyaki® marinade & sauce', 'taco bell® home originals® taco seasoning mix', 'pillsbury™ refrigerated crescent dinner rolls', 'kraft reduced fat shredded mozzarella cheese', 'reduced sodium italian style stewed tomatoes', 'skinless and boneless chicken breast fillet']
What's going on here? Several observations can be made from this ingredient list:
- some ingredients contain special symbols that are not relevant (trademark, copyright, ...)
- some ingredients feature brand names, for instance KRAFT, Pillsbury, or Hidden Valley, which are not relevant to identifying the ingredient itself and which I would thus want to exclude
- some ingredients contain sentences of English words instead of ingredient names: i can't believ it' not butter! made with olive oil spread
- some ingredients contain spelling errors: burgundi wine sauc should actually be spelled burgundy wine sauce
- some ingredients contain quantities that shouldn't be there: 2 1/2 to 3 lb. chicken, cut into serving pieces
Ideally, we want a function that returns a more relevant version of these ingredients. For example:
original ingredient | improved ingredient |
---|---|
pillsbury™ crescent recipe creations® refrigerated seamless dough sheet | refrigerated seamless dough sheet |
i can't believe it's not butter!® all purpose sticks | all purpose sticks |
2 1/2 to 3 lb. chicken, cut into serving pieces | chicken pieces |
These suggested improved ingredients are not unique and depend on the knowledge of the person formulating them. The problem is that they are not easy to generate systematically. It would therefore be great if we could use a model based on the data itself to suggest them. This is what we're going to do in the next section.
Building a simple model¶
Theory¶
Given an ingredient, we can choose to throw out some of its components (i.e. words). Judging whether a component of an ingredient is useful or not is difficult. However, by following Peter Norvig's approach to spelling correction (or see these slides from Cornell University https://courses.cit.cornell.edu/info4300_2011fa/slides/25.pdf that explain Peter's approach in detail), we can formulate the problem of throwing out ingredient components in a probabilistic way. Although not stated in Norvig's write-up, this approach is close to something called a Naive Bayes classifier.
Let's take a closer look at what we want to do. We want our improved ingredient to maximize its probability given our input ingredient:
$$ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{b}P\left({b \mid i}\right) $$
Here, I denoted by $b$ the better ingredient and by $i$ the original ingredient.
The previous expression can be transformed according to Bayes Rule (following Peter Norvig) to yield:
$$ \argmax_{b} {P\left({i \mid b}\right) P(b) \over P(i)} $$
Given that the ingredient $i$ is fixed, we neglect the denominator (as Peter does) and end up with:
$$ \argmax_{b} P\left({i \mid b}\right) P(b) $$
This leaves us with three parts that balance each other out:
- the probability of occurrence of the better ingredient in our data, $P(b)$ (this favors commonly used ingredients)
- the error model $P\left({i \mid b}\right)$, which says how likely it is that the input ingredient $i$ is a modified version of the better ingredient $b$
- the argmax, our control mechanism, which chooses the $b$ that gives the best combined probability score
Our error model will be that we can only delete words from our ingredient. So what we need is to generate a list of possible ingredients from the original ingredient by subtracting words. We will also assume that word order doesn't matter, so we can represent an ingredient by a set.
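To make this concrete: since our error model treats every deletion candidate as equally likely, $P\left({i \mid b}\right)$ is the same constant for all candidates, and the argmax reduces to picking the most frequent candidate. Here is a toy sketch of that reduction (the count of 867 for tomato sauce matches the data we'll see below; the other counts are made up for illustration):

```python
from collections import Counter

# Hypothetical counts for three candidate ingredients (867 for
# 'tomato sauce' matches the dataset; the others are made up).
counts = Counter({'tomato sauce': 867, 'tomato': 42, 'sauce': 17})

# Uniform deletion-only error model: P(i | b) is the same constant for
# every candidate b, so argmax_b P(i | b) * P(b) == argmax_b P(b).
error_probability = 1.0  # identical for all candidates by assumption

best = max(counts, key=lambda b: error_probability * counts[b])
print(best)  # tomato sauce
```

In other words, under this simple error model the whole machinery boils down to counting ingredient frequencies, which is what we do next.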
Modelling ingredients using sets¶
An ingredient contains a fixed number of words. We will therefore model it as a frozenset of words, which will allow us to manipulate ingredients more easily in the rest of this document.
Let's define a function that creates an ingredient from a text string:
import re
def to_ingredient(text):
    "Transforms text into an ingredient."
    return frozenset(re.split(re.compile('[,. ]+'), text))
Let's build a list of all our ingredients using this function:
all_ingredients = [to_ingredient(text) for text in all_ingredients_text]
all_ingredients[:10]
[frozenset({'lettuce', 'romaine'}), frozenset({'black', 'olives'}), frozenset({'grape', 'tomatoes'}), frozenset({'garlic'}), frozenset({'pepper'}), frozenset({'onion', 'purple'}), frozenset({'seasoning'}), frozenset({'beans', 'garbanzo'}), frozenset({'cheese', 'crumbles', 'feta'}), frozenset({'flour', 'plain'})]
Let's now implement our algorithm given our model of an ingredient.
Implementation¶
We can now write a function that generates all possible candidate ingredients that can be built from a starting ingredient, using itertools.combinations from the Python standard library.
import itertools
def candidates(ingredient):
    "Returns the candidate ingredients that keep at least one word of the original ingredient."
    n = len(ingredient)
    possible = []
    for i in range(1, n + 1):
        possible += [frozenset(combi) for combi in itertools.combinations(ingredient, i)]
    return possible
Let's see how this works on examples:
candidates(to_ingredient("tomato and herb pasta sauce"))
[frozenset({'pasta'}), frozenset({'herb'}), frozenset({'sauce'}), frozenset({'and'}), frozenset({'tomato'}), frozenset({'herb', 'pasta'}), frozenset({'pasta', 'sauce'}), frozenset({'and', 'pasta'}), frozenset({'pasta', 'tomato'}), frozenset({'herb', 'sauce'}), frozenset({'and', 'herb'}), frozenset({'herb', 'tomato'}), frozenset({'and', 'sauce'}), frozenset({'sauce', 'tomato'}), frozenset({'and', 'tomato'}), frozenset({'herb', 'pasta', 'sauce'}), frozenset({'and', 'herb', 'pasta'}), frozenset({'herb', 'pasta', 'tomato'}), frozenset({'and', 'pasta', 'sauce'}), frozenset({'pasta', 'sauce', 'tomato'}), frozenset({'and', 'pasta', 'tomato'}), frozenset({'and', 'herb', 'sauce'}), frozenset({'herb', 'sauce', 'tomato'}), frozenset({'and', 'herb', 'tomato'}), frozenset({'and', 'sauce', 'tomato'}), frozenset({'and', 'herb', 'pasta', 'sauce'}), frozenset({'herb', 'pasta', 'sauce', 'tomato'}), frozenset({'and', 'herb', 'pasta', 'tomato'}), frozenset({'and', 'pasta', 'sauce', 'tomato'}), frozenset({'and', 'herb', 'sauce', 'tomato'}), frozenset({'and', 'herb', 'pasta', 'sauce', 'tomato'})]
candidates(to_ingredient('knorr chicken flavor bouillon cube'))
[frozenset({'flavor'}), frozenset({'chicken'}), frozenset({'cube'}), frozenset({'knorr'}), frozenset({'bouillon'}), frozenset({'chicken', 'flavor'}), frozenset({'cube', 'flavor'}), frozenset({'flavor', 'knorr'}), frozenset({'bouillon', 'flavor'}), frozenset({'chicken', 'cube'}), frozenset({'chicken', 'knorr'}), frozenset({'bouillon', 'chicken'}), frozenset({'cube', 'knorr'}), frozenset({'bouillon', 'cube'}), frozenset({'bouillon', 'knorr'}), frozenset({'chicken', 'cube', 'flavor'}), frozenset({'chicken', 'flavor', 'knorr'}), frozenset({'bouillon', 'chicken', 'flavor'}), frozenset({'cube', 'flavor', 'knorr'}), frozenset({'bouillon', 'cube', 'flavor'}), frozenset({'bouillon', 'flavor', 'knorr'}), frozenset({'chicken', 'cube', 'knorr'}), frozenset({'bouillon', 'chicken', 'cube'}), frozenset({'bouillon', 'chicken', 'knorr'}), frozenset({'bouillon', 'cube', 'knorr'}), frozenset({'chicken', 'cube', 'flavor', 'knorr'}), frozenset({'bouillon', 'chicken', 'cube', 'flavor'}), frozenset({'bouillon', 'chicken', 'flavor', 'knorr'}), frozenset({'bouillon', 'cube', 'flavor', 'knorr'}), frozenset({'bouillon', 'chicken', 'cube', 'knorr'}), frozenset({'bouillon', 'chicken', 'cube', 'flavor', 'knorr'})]
The final step is to compute the probabilities of candidate ingredients. This is done using a counter of ingredients: the higher an ingredient's count, the higher its probability (we don't bother normalizing here).
from collections import Counter
c = Counter(all_ingredients)
c.most_common(20)
[(frozenset({'salt'}), 18049), (frozenset({'oil', 'olive'}), 7972), (frozenset({'onions'}), 7972), (frozenset({'water'}), 7457), (frozenset({'garlic'}), 7380), (frozenset({'sugar'}), 6434), (frozenset({'cloves', 'garlic'}), 6237), (frozenset({'butter'}), 4848), (frozenset({'black', 'ground', 'pepper'}), 4785), (frozenset({'all-purpose', 'flour'}), 4632), (frozenset({'pepper'}), 4438), (frozenset({'oil', 'vegetable'}), 4385), (frozenset({'eggs'}), 3388), (frozenset({'sauce', 'soy'}), 3296), (frozenset({'kosher', 'salt'}), 3113), (frozenset({'green', 'onions'}), 3078), (frozenset({'tomatoes'}), 3058), (frozenset({'eggs', 'large'}), 2948), (frozenset({'carrots'}), 2814), (frozenset({'butter', 'unsalted'}), 2782)]
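As an aside, here is why skipping normalization is harmless: dividing every count by the same total rescales all probabilities by a constant, so the argmax is unchanged. A quick check using three counts from the table above (with plain strings standing in for the frozensets, for brevity):

```python
from collections import Counter

# Counts taken from the most common ingredients above.
counts = Counter({'salt': 18049, 'water': 7457, 'sugar': 6434})
total = sum(counts.values())

# The winner is the same whether we rank by raw counts
# or by normalized probabilities.
by_count = max(counts, key=lambda w: counts[w])
by_probability = max(counts, key=lambda w: counts[w] / total)
print(by_count, by_probability)  # salt salt
```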
Now we're all set to compute the best candidate for a given input.
First, let's build the probability lookup for a possible ingredient using a defaultdict, so that asking for an ingredient that doesn't exist returns a default count of 1 instead of raising an error:
from collections import defaultdict
probability = defaultdict(lambda: 1, c.most_common())
Let's test the probability:
probability[to_ingredient('pasta and herb')]
1
probability[to_ingredient('tomato sauce')]
867
This is what we expect: tomato sauce has a higher probability than pasta and herb, which doesn't appear in our initial ingredients.
Let's now write the function that yields the most probable replacement ingredient among all possible candidates.
def best_replacement(ingredient):
    "Computes the best replacement ingredient for a given input."
    return max(candidates(ingredient), key=lambda c: probability[c])
Let's now look at some examples:
best_replacement(to_ingredient("tomato sauce"))
frozenset({'sauce', 'tomato'})
best_replacement(to_ingredient("pasta and herb"))
frozenset({'pasta'})
best_replacement(to_ingredient("kraft mexican style shredded four cheese with a touch of philadelphia"))
frozenset({'cheese'})
These examples all look good. What about the less frequent ingredients we had in our data?
pd.DataFrame([(text,
" ".join(best_replacement(to_ingredient(text))))
for text in sorted(set(all_ingredients_text), key=len, reverse=True)[:50]],
columns=['original ingredient', 'improved ingredient'])
original ingredient | improved ingredient | |
---|---|---|
0 | pillsbury™ crescent recipe creations® refriger... | dough |
1 | kraft mexican style shredded four cheese with ... | cheese |
2 | bertolli vineyard premium collect marinara wit... | wine |
3 | kraft shredded pepper jack cheese with a touch... | pepper |
4 | hidden valley® farmhouse originals italian wit... | herbs |
5 | hidden valley® original ranch salad® dressing ... | seasoning |
6 | condensed reduced fat reduced sodium cream of ... | cream |
7 | hellmann’ or best food canola cholesterol fr... | canola |
8 | condensed reduced fat reduced sodium cream of ... | chicken |
9 | i can't believ it' not butter! made with olive... | oil olive |
10 | wish-bone light asian sesame ginger vinaigrett... | ginger |
11 | (10 oz.) frozen chopped spinach, thawed and sq... | spinach |
12 | kraft mexican style 2% milk finely shredded fo... | milk |
13 | kraft shredded low-moisture part-skim mozzarel... | mozzarella cheese shredded |
14 | hurst family harvest chipotle lime black bean ... | lime |
15 | reduced fat reduced sodium tomato and herb pas... | sauce tomato |
16 | frozen orange juice concentrate, thawed and un... | orange |
17 | crystal farms reduced fat shredded marble jack... | cheese |
18 | i can't believe it's not butter!® all purpose ... | believe can't butter!® purpose it's not all st... |
19 | sargento® traditional cut shredded mozzarella ... | mozzarella cheese shredded |
20 | lipton sparkling diet green tea with strawberr... | kiwi |
21 | sargento® traditional cut shredded 4 cheese me... | cheese |
22 | hidden valley® original ranch® spicy ranch dre... | ranch dressing |
23 | hidden valley® greek yogurt original ranch® di... | greek yogurt |
24 | conimex woksaus specials vietnamese gember kno... | woksaus |
25 | honeysuckle white® hot italian turkey sausage ... | sausage italian |
26 | ragu old world style sweet tomato basil pasta ... | pasta |
27 | sargento® artisan blends® shredded parmesan ch... | parmesan cheese |
28 | reduced fat reduced sodium cream of mushroom soup | cream |
29 | frozen lemonade concentrate, thawed and undiluted | frozen lemonade concentrate |
30 | reduced sodium reduced fat cream of mushroom soup | cream |
31 | condensed reduced fat reduced sodium tomato soup | fat |
32 | shredded reduced fat reduced sodium swiss cheese | cheese |
33 | knorr reduc sodium chicken flavor bouillon cube | chicken |
34 | 2 1/2 to 3 lb. chicken, cut into serving pieces | chicken |
35 | kraft mexican style finely shredded four cheese | cheese |
36 | frozen chopped spinach, thawed and squeezed dry | spinach |
37 | frank's® redhot® original cayenne pepper sauce | pepper |
38 | reduced sodium condensed cream of chicken soup | chicken |
39 | foster farms boneless skinless chicken breasts | breasts boneless skinless chicken |
40 | knorr tomato bouillon with chicken flavor cube | chicken |
41 | pompeian canola oil and extra virgin olive oil | oil olive |
42 | hidden valley® original ranch® light dressing | dressing |
43 | uncle ben's ready rice whole grain brown rice | rice |
44 | soy vay® veri veri teriyaki® marinade & sauce | soy sauce |
45 | taco bell® home originals® taco seasoning mix | seasoning taco |
46 | pillsbury™ refrigerated crescent dinner rolls | rolls |
47 | kraft reduced fat shredded mozzarella cheese | mozzarella cheese shredded |
48 | reduced sodium italian style stewed tomatoes | tomatoes |
49 | skinless and boneless chicken breast fillet | chicken |
This is interesting: a lot of the ingredients get simplified well, which is exactly what we want. Two comments:
- some simplifications are not the ones expected: i can't believe it's not butter!® all purpose sticks, reduced sodium reduced fat cream of mushroom soup
- some get simplified too much: skinless and boneless chicken breast fillet becomes chicken, and reduced sodium italian style stewed tomatoes becomes tomatoes
This model fails to take one thing into account: if we leave out too many words in our replacement ingredient, the distance to the original ingredient increases. This observation is analogous to Peter Norvig's spell checker, where he considers the possible candidates by increasing edit distance: first words that differ by only one modification from the original word, then two modifications, then three...
To compensate for this oversimplification, let's modify our procedure to keep a new ingredient that is as close as possible to the original.
Building a slightly more elaborate model¶
The only thing we have to change is the way our candidates function works. Instead of generating all possibilities, it should return only those that exist in our vocabulary of recipes, with the least possible number of modifications. Here a modification means leaving out a word, so we search by increasing the number of words left out.
def candidates_increasing_distance(ingredient, vocabulary):
    "Returns candidate ingredients obtained from the original ingredient by subtracting words, largest number of words first."
    n = len(ingredient)
    for i in range(n - 1, 1, -1):
        possible = [frozenset(combi) for combi in itertools.combinations(ingredient, i)
                    if frozenset(combi) in vocabulary]
        if len(possible) > 0:
            return possible
    return [ingredient]
We will define our vocabulary using the ingredient counter we already built:
vocabulary = dict(c.most_common())
list(vocabulary.keys())[:10]
[frozenset({'chicken', 'cubes', 'stock'}), frozenset({'creations®', 'crescent', 'dough', 'pillsbury™', 'recipe', 'refrigerated', 'seamless', 'sheet'}), frozenset({'nori', 'shredded'}), frozenset({'eye', 'of', 'roast', 'round'}), frozenset({'gumbo', 'mixture', 'vegetable'}), frozenset({'curry', 'green', 'paste', 'thai'}), frozenset({'bouillon', 'chicken', 'cube', 'flavor', 'knorr'}), frozenset({'beef', 'roast', 'shoulder'}), frozenset({'grained'}), frozenset({'dried', 'ear', 'mushrooms', 'wood'})]
Let's test our function on a few examples:
candidates_increasing_distance(to_ingredient("bottled clam juice"), vocabulary)
[frozenset({'clam', 'juice'})]
candidates_increasing_distance(to_ingredient('hidden valley original ranch spicy ranch dressing'), vocabulary)
[frozenset({'dressing', 'ranch'})]
candidates_increasing_distance(to_ingredient("skinless and boneless chicken breast fillet"), vocabulary)
[frozenset({'boneless', 'breast', 'chicken', 'skinless'})]
candidates_increasing_distance(to_ingredient("reduced sodium italian style stewed tomatoes"), vocabulary)
[frozenset({'italian', 'stewed', 'style', 'tomatoes'})]
As we see, this works better on the examples that got simplified too much using the previous model.
Let's now write a new function for computing the best replacement, that takes into account the probabilities:
def best_replacement_increasing_distance(ingredient, vocabulary):
    "Computes the best replacement ingredient for a given input."
    return max(candidates_increasing_distance(ingredient, vocabulary), key=lambda w: vocabulary[w])
And let's apply this to our lesser used ingredients:
pd.DataFrame([(text,
" ".join(best_replacement_increasing_distance(to_ingredient(text), vocabulary)))
for text in sorted(set(all_ingredients_text), key=len, reverse=True)[:50]],
columns=['original ingredient', 'improved ingredient'])
original ingredient | improved ingredient | |
---|---|---|
0 | pillsbury™ crescent recipe creations® refriger... | refrigerated seamless dough crescent |
1 | kraft mexican style shredded four cheese with ... | cheese shredded |
2 | bertolli vineyard premium collect marinara wit... | wine marinara bertolli vineyard with burgundi ... |
3 | kraft shredded pepper jack cheese with a touch... | kraft jack pepper cheese shredded |
4 | hidden valley® farmhouse originals italian wit... | herbs italian |
5 | hidden valley® original ranch salad® dressing ... | ranch dressing |
6 | condensed reduced fat reduced sodium cream of ... | mushroom fat cream reduced soup of sodium |
7 | hellmann’ or best food canola cholesterol fr... | best canola cholesterol mayonnais food or free... |
8 | condensed reduced fat reduced sodium cream of ... | cream chicken reduced soup condensed of sodium |
9 | i can't believ it' not butter! made with olive... | oil olive |
10 | wish-bone light asian sesame ginger vinaigrett... | vinaigrette dressing |
11 | (10 oz.) frozen chopped spinach, thawed and sq... | chopped frozen spinach thawed squeezed and dry |
12 | kraft mexican style 2% milk finely shredded fo... | mexican shredded finely style cheese kraft four |
13 | kraft shredded low-moisture part-skim mozzarel... | kraft mozzarella cheese shredded |
14 | hurst family harvest chipotle lime black bean ... | bean soup mix |
15 | reduced fat reduced sodium tomato and herb pas... | sauce tomato |
16 | frozen orange juice concentrate, thawed and un... | orange frozen juice concentrate |
17 | crystal farms reduced fat shredded marble jack... | cheese reduced shredded fat |
18 | i can't believe it's not butter!® all purpose ... | believe can't butter!® sticks not it's all i p... |
19 | sargento® traditional cut shredded mozzarella ... | mozzarella cheese shredded |
20 | lipton sparkling diet green tea with strawberr... | tea green |
21 | sargento® traditional cut shredded 4 cheese me... | cheese shredded |
22 | hidden valley® original ranch® spicy ranch dre... | hidden valley® ranch® original dressing |
23 | hidden valley® greek yogurt original ranch® di... | greek yogurt |
24 | conimex woksaus specials vietnamese gember kno... | woksaus specials knoflook gember conimex vietn... |
25 | honeysuckle white® hot italian turkey sausage ... | hot sausage turkey italian links |
26 | ragu old world style sweet tomato basil pasta ... | pasta world old ragu style sauc |
27 | sargento® artisan blends® shredded parmesan ch... | parmesan cheese shredded |
28 | reduced fat reduced sodium cream of mushroom soup | mushroom fat cream reduced soup of |
29 | frozen lemonade concentrate, thawed and undiluted | frozen lemonade concentrate |
30 | reduced sodium reduced fat cream of mushroom soup | mushroom fat cream reduced soup of |
31 | condensed reduced fat reduced sodium tomato soup | condensed soup tomato |
32 | shredded reduced fat reduced sodium swiss cheese | swiss reduced cheese fat |
33 | knorr reduc sodium chicken flavor bouillon cube | bouillon reduc flavor knorr chicken sodium |
34 | 2 1/2 to 3 lb. chicken, cut into serving pieces | pieces chicken |
35 | kraft mexican style finely shredded four cheese | cheese shredded |
36 | frozen chopped spinach, thawed and squeezed dry | chopped frozen spinach |
37 | frank's® redhot® original cayenne pepper sauce | pepper cayenne sauce |
38 | reduced sodium condensed cream of chicken soup | condensed soup cream chicken of |
39 | foster farms boneless skinless chicken breasts | breasts boneless skinless chicken |
40 | knorr tomato bouillon with chicken flavor cube | bouillon tomato with flavor knorr chicken |
41 | pompeian canola oil and extra virgin olive oil | oil virgin olive |
42 | hidden valley® original ranch® light dressing | hidden valley® ranch® original dressing |
43 | uncle ben's ready rice whole grain brown rice | whole grain rice |
44 | soy vay® veri veri teriyaki® marinade & sauce | soy sauce |
45 | taco bell® home originals® taco seasoning mix | seasoning taco mix |
46 | pillsbury™ refrigerated crescent dinner rolls | refrigerated rolls crescent |
47 | kraft reduced fat shredded mozzarella cheese | cheese reduced shredded fat |
48 | reduced sodium italian style stewed tomatoes | tomatoes style italian stewed |
49 | skinless and boneless chicken breast fillet | skinless breast chicken boneless |
This result is interesting: the refined labels given here seem much better than the originals. However, there are still problems with the new ingredients; in particular, special symbols pollute the data. It is therefore a good idea to clean the data before applying our method.
Hopefully, this leads us to better ingredient names and, in turn, to better classification. To assess this, we'll train again on the dataset and see if we can measure an improvement compared to our previous model.
Training a machine learning model¶
First, let's train the previous model:
df_train['all_ingredients'] = df_train['ingredients'].map(";".join)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(df_train['all_ingredients'].values)
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df_train.cuisine)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Let's evaluate the accuracy of our model with K-fold cross-validation (training the model several times and checking the error each time).
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem
import numpy as np

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default, the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
evaluate_cross_validation(logistic, X, y, 5)
[ 0.79195475 0.78830924 0.78290383 0.79019485 0.78124214] Mean score: 0.787 (+/-0.002)
Now that we have a baseline evaluation of our algorithm, let's feed it our improved ingredients. All we have to do is rebuild the X matrix using our new ingredient features. We also perform some string substitution to take care of the special characters.
def clean_string(ingredient_text):
    "Removes special characters from the ingredient and returns it as lowercase."
    return "".join([char for char in ingredient_text.lower() if char.isalnum() or char.isspace()])
Let's test this:
clean_string("frank's® redhot® original cayenne pepper sauce")
'franks redhot original cayenne pepper sauce'
clean_string("2 1/2 to 3 lb. chicken, cut into serving pieces")
'2 12 to 3 lb chicken cut into serving pieces'
clean_string("i can't believe it's not butter!® all purpose sticks")
'i cant believe its not butter all purpose sticks'
clean_string("extra-virgin olive oil")
'extravirgin olive oil'
Finally, we can put together our function for improving a list of ingredients from the dataframe. We'll use a threshold for deciding which ingredients to improve, because very popular ingredients should just stay the same and not get cleaned. To choose it, let's plot the value counts of the different ingredients:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('bmh')
plt.figure(figsize=(10, 6))
pd.Series(vocabulary).sort_values().plot(logy=True)
(plot: ingredient counts sorted in increasing order, on a logarithmic y-axis)
This gives us a criterion: let's only improve ingredients whose counts are lower than 100. The others are assumed to be already well defined.
def improve_ingredients(ingredient_list):
    "Improves the list of ingredients given as input."
    better_ingredients = []
    for ingredient in ingredient_list:
        cleaned_ingredient = clean_string(ingredient)
        if vocabulary[to_ingredient(cleaned_ingredient)] <= 100:
            better_ingredients.append(" ".join(best_replacement_increasing_distance(
                to_ingredient(cleaned_ingredient), vocabulary)))
        else:
            better_ingredients.append(cleaned_ingredient)
    return ";".join(better_ingredients)
Before using our function, we also need to rebuild our vocabulary using our new string cleaning function:
df_test = pd.read_json(codecs.open('test.json', 'r', 'utf-8'))
all_ingredients_text = []
for df in [df_train, df_test]:
for ingredient_list in df.ingredients:
all_ingredients_text += [clean_string(ing) for ing in ingredient_list]
all_ingredients = [to_ingredient(text) for text in all_ingredients_text]
c = Counter(all_ingredients)
vocabulary = dict(c.most_common())
df_train['better_ingredients'] = df_train['ingredients'].map(improve_ingredients)
df_train.head(10)
cuisine | id | ingredients | all_ingredients | better_ingredients | |
---|---|---|---|---|---|
0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce;black olives;grape tomatoes;ga... | romaine lettuce;black olives;grape tomatoes;ga... |
1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour;ground pepper;salt;tomatoes;ground... | plain flour;ground pepper;salt;tomatoes;ground... |
2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs;pepper;salt;mayonaise;cooking oil;green c... | eggs;pepper;salt;mayonaise;cooking oil;green c... |
3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water;vegetable oil;wheat;salt | water;vegetable oil;wheat;salt |
4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper;shallots;cornflour;cayenne pepper... | black pepper;shallots;cornflour;cayenne pepper... |
5 | jamaican | 6602 | [plain flour, sugar, butter, eggs, fresh ginge... | plain flour;sugar;butter;eggs;fresh ginger roo... | plain flour;sugar;butter;eggs;fresh ginger roo... |
6 | spanish | 42779 | [olive oil, salt, medium shrimp, pepper, garli... | olive oil;salt;medium shrimp;pepper;garlic;cho... | olive oil;salt;medium shrimp;pepper;garlic;cho... |
7 | italian | 3735 | [sugar, pistachio nuts, white almond bark, flo... | sugar;pistachio nuts;white almond bark;flour;v... | sugar;nuts pistachio;white bark almond;flour;v... |
8 | mexican | 16903 | [olive oil, purple onion, fresh pineapple, por... | olive oil;purple onion;fresh pineapple;pork;po... | olive oil;purple onion;fresh pineapple;pork;po... |
9 | italian | 12734 | [chopped tomatoes, fresh basil, garlic, extra-... | chopped tomatoes;fresh basil;garlic;extra-virg... | chopped tomatoes;fresh basil;garlic;extravirgi... |
Let's now see the performance of this new version of the ingredients.
X = cv.fit_transform(df_train['better_ingredients'].values)
evaluate_cross_validation(logistic, X, y, 5)
[ 0.78830924 0.78680075 0.78214959 0.78918919 0.77923058] Mean score: 0.785 (+/-0.002)
Well, that's too bad: our "clever" ingredients don't improve our cross-validation score. Still, let's submit a solution, given that the test data might have other issues that our method can help with (unknown ingredients that it can simplify, for instance).
df_test['better_ingredients'] = df_test['ingredients'].map(improve_ingredients)
And now, let's build our feature matrix:
X_submit = cv.transform(df_test['better_ingredients'].values)
logistic.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
y_submit = logistic.predict(X_submit)
with open("submission_better_words.csv", 'w') as f:
    f.write("id,cuisine\n")
    for i, cuisine in zip(df_test.id, enc.inverse_transform(y_submit)):
        f.write("{},{}\n".format(i, cuisine))
Conclusion¶
My submission (to the already closed contest) achieves a ranking of 576th, nothing to be thrilled about. That's too bad: I would have hoped it would somehow improve my score, even though the K-fold evaluation had shown worse performance. Nevertheless, I think what is valuable here is that we have devised a way to improve a given text feature by using simple probabilities. This technique might be useful in other machine learning settings where one can't fully trust the reliability of the data. As demonstrated on the most exotic ingredients found in the dataset, the method can transform a strange-looking ingredient into something that still makes sense.