The Kaggle What's Cooking challenge
In this blog post, we'll have a look at the Kaggle What's Cooking data challenge.
This competition is all about predicting which cuisine a recipe comes from, given the list of its ingredients. For example, assume you have a recipe that reads:
plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil
Can you guess which cuisine this is from? As this recipe comes from the training set of this challenge, I can tell you that the expected answer is Southern US.
Without further ado, let's dive in. First, let us have a look at the training data.
Exploring the training data
We'll use pandas to go through the data. First, let's read the json file containing the recipes and the cuisines:
import pandas as pd
df_train = pd.read_json('train.json')
Let's look at the head of the data:
df_train.head()
| | cuisine | id | ingredients |
|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |
We can see the structure of the data: a cuisine, an id for the recipe, and the ingredients.
As a first step, let's look at the cuisines in the dataset. How many cuisines are there, and how many recipes do we have for each?
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df_train['cuisine'].value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x10dce51d0>
As can be seen in this figure, there are a lot of Italian, Mexican and Southern US recipes, and noticeably fewer recipes for the other cuisines.
To get a little insight into the data itself, we can look at a couple of recipes. In particular, we can count the most frequent ingredients for each cuisine. To do that, we can use Python's Counter objects (found in the collections module of the standard library).
from collections import Counter
counters = {}
for cuisine in df_train['cuisine'].unique():
    counters[cuisine] = Counter()
    indices = (df_train['cuisine'] == cuisine)
    for ingredients in df_train[indices]['ingredients']:
        counters[cuisine].update(ingredients)
Let's look at a result:
counters['italian'].most_common(10)
[('salt', 3454),
('olive oil', 3111),
('garlic cloves', 1619),
('grated parmesan cheese', 1580),
('garlic', 1471),
('ground black pepper', 1444),
('extra-virgin olive oil', 1362),
('onions', 1240),
('water', 1052),
('butter', 1030)]
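To make the counting step concrete, here is a small self-contained sketch on made-up data. The three toy recipes and the helper name `count_ingredients` are invented for illustration and are not part of the challenge data. It uses `groupby` instead of boolean masks, which should give the same counts as the loop above.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for df_train; these recipes are invented for illustration.
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'thai'],
    'ingredients': [['salt', 'olive oil'], ['salt', 'garlic'], ['fish sauce', 'salt']],
})


def count_ingredients(df):
    """Return {cuisine: Counter of ingredient occurrences} for a recipe table."""
    counters = {}
    for cuisine, group in df.groupby('cuisine'):
        counter = Counter()
        for ingredients in group['ingredients']:
            counter.update(ingredients)
        counters[cuisine] = counter
    return counters


toy_counters = count_ingredients(toy)
print(toy_counters['italian'].most_common(1))
```

Since `Counter.update` accepts any iterable, each ingredient list can be fed to it directly.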
We can easily convert the top 10 ingredients for each cuisine to a separate dataframe for nicer viewing:
top10 = pd.DataFrame([[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
index=[cuisine for cuisine in counters],
columns=['top{}'.format(i) for i in range(1, 11)])
top10
| | top1 | top2 | top3 | top4 | top5 | top6 | top7 | top8 | top9 | top10 |
|---|---|---|---|---|---|---|---|---|---|---|
| jamaican | salt | onions | water | garlic | ground allspice | pepper | scallions | dried thyme | black pepper | garlic cloves |
| moroccan | salt | olive oil | ground cumin | onions | ground cinnamon | garlic cloves | water | ground ginger | carrots | paprika |
| irish | salt | all-purpose flour | butter | onions | potatoes | sugar | baking soda | baking powder | milk | carrots |
| russian | salt | sugar | onions | all-purpose flour | sour cream | eggs | water | butter | unsalted butter | large eggs |
| greek | salt | olive oil | dried oregano | garlic cloves | feta cheese crumbles | extra-virgin olive oil | fresh lemon juice | ground black pepper | garlic | pepper |
| french | salt | sugar | all-purpose flour | unsalted butter | olive oil | butter | water | large eggs | garlic cloves | ground black pepper |
| italian | salt | olive oil | garlic cloves | grated parmesan cheese | garlic | ground black pepper | extra-virgin olive oil | onions | water | butter |
| korean | soy sauce | sesame oil | garlic | green onions | sugar | salt | water | sesame seeds | onions | scallions |
| thai | fish sauce | garlic | salt | coconut milk | vegetable oil | soy sauce | sugar | water | garlic cloves | fresh lime juice |
| vietnamese | fish sauce | sugar | salt | garlic | water | carrots | soy sauce | shallots | garlic cloves | vegetable oil |
| southern_us | salt | butter | all-purpose flour | sugar | large eggs | baking powder | water | unsalted butter | milk | buttermilk |
| cajun_creole | salt | onions | garlic | green bell pepper | butter | olive oil | cayenne pepper | cajun seasoning | all-purpose flour | water |
| indian | salt | onions | garam masala | water | ground turmeric | garlic | cumin seed | ground cumin | vegetable oil | oil |
| british | salt | all-purpose flour | butter | milk | eggs | unsalted butter | sugar | onions | baking powder | large eggs |
| mexican | salt | onions | ground cumin | garlic | olive oil | chili powder | jalapeno chilies | sour cream | avocado | corn tortillas |
| brazilian | salt | onions | olive oil | lime | water | garlic cloves | garlic | cachaca | sugar | tomatoes |
| spanish | salt | olive oil | garlic cloves | extra-virgin olive oil | onions | water | tomatoes | ground black pepper | red bell pepper | pepper |
| filipino | salt | garlic | water | onions | soy sauce | pepper | oil | sugar | carrots | ground black pepper |
| chinese | soy sauce | sesame oil | salt | corn starch | sugar | garlic | water | green onions | vegetable oil | scallions |
| japanese | soy sauce | salt | mirin | sugar | water | sake | rice vinegar | vegetable oil | scallions | ginger |
An even better visualization would show images instead of words. We can do this by exporting the previous table to HTML and replacing the ingredient names with HTML image tags for the selected ingredients. This is done using regular expression matching, with each image base64-encoded directly into the HTML source.
import re
import base64

def repl(m):
    # Look up the image file for the matched ingredient and inline it as base64
    ingredient = m.groups()[0]
    image_path = 'img/' + ingredient + '.png'
    with open(image_path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read())
    return '<td><img width=100 src="data:image/png;base64,{}"></td>'.format(encoded_string.decode('utf-8'))

# Raw string so that the regex escapes reach the re module unchanged
table_with_images = re.sub(r"<td>([ \-\w]+)</td>", repl, top10.to_html())
We can easily display this HTML output in our notebook:
from IPython.display import HTML
HTML(table_with_images)
[Table: the top 10 ingredients for each of the 20 cuisines, as above, with every ingredient name replaced by its image. The images are not reproduced here.]
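The replacement trick relies on `re.sub` accepting a function as its second argument: the function receives each match object and returns the replacement text. The sketch below shows that mechanism in isolation, without any image files. The function name `to_upper_cell` and the sample HTML are made up for illustration.

```python
import re


def to_upper_cell(match):
    """Rewrite the text of one <td> cell; stands in for the image lookup."""
    ingredient = match.group(1)
    return '<td><b>{}</b></td>'.format(ingredient.upper())


html = '<tr><td>olive oil</td><td>extra-virgin olive oil</td></tr>'
# Same pattern as in the post: cells made of spaces, hyphens and word characters.
cell_pattern = r'<td>([ \-\w]+)</td>'
result = re.sub(cell_pattern, to_upper_cell, html)
print(result)
```

Because the pattern only allows spaces, hyphens and word characters, cells containing other characters are left untouched.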
This visualization allows us to determine a couple of things. For instance, we can see that the top ingredient for each cuisine is a salty one. This salty ingredient already allows us to group the cuisines:
- salt is the standard for most cuisines
- soy sauce is number one for chinese, japanese and korean cuisines
- fish sauce is number one for thai and vietnamese cuisines
Another thing that is easily seen from this table is that many ingredients have more than one name:
- garlic cloves, garlic
- olive oil, extra-virgin olive oil
- ...
Judging from this table, it can be interesting to see which of the top 10 ingredients are highly specific to a certain cuisine. A way to do this is to count the number of recipes of a given cuisine in which an ingredient appears and divide by the total number of recipes of that cuisine.
To do this, we first create a new column in our dataframe by simply concatenating the ingredients into a single string:
df_train['all_ingredients'] = df_train['ingredients'].map(";".join)
df_train.head()
| | cuisine | id | ingredients | all_ingredients |
|---|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce;black olives;grape tomatoes;ga... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... | plain flour;ground pepper;salt;tomatoes;ground... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs;pepper;salt;mayonaise;cooking oil;green c... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] | water;vegetable oil;wheat;salt |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... | black pepper;shallots;cornflour;cayenne pepper... |
We can now take advantage of the powerful string processing functions of pandas to check for the presence of an ingredient in a recipe:
df_train['all_ingredients'].str.contains('garlic cloves')
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 True
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
...
39744 True
39745 False
39746 False
39747 False
39748 False
39749 False
39750 False
39751 False
39752 False
39753 True
39754 True
39755 False
39756 False
39757 True
39758 False
39759 False
39760 False
39761 True
39762 False
39763 False
39764 False
39765 False
39766 False
39767 False
39768 False
39769 False
39770 False
39771 False
39772 False
39773 False
Name: all_ingredients, dtype: bool
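One thing to keep in mind is that `str.contains` matches substrings of the joined string, so a search for 'garlic' would also hit 'garlic cloves'. The sketch below contrasts this with an exact membership test on the original lists. The toy recipes are invented for illustration, and the exact-match variant is only a possible alternative, not what the post uses.

```python
import pandas as pd

# Toy recipes, invented for illustration.
toy = pd.DataFrame({
    'ingredients': [['garlic cloves', 'salt'], ['garlic', 'salt'], ['minced garlic', 'water']],
})
toy['all_ingredients'] = toy['ingredients'].map(';'.join)

# Substring match: any recipe whose joined string contains 'garlic'.
substring_hits = toy['all_ingredients'].str.contains('garlic', regex=False)

# Exact match: the ingredient list contains exactly 'garlic'.
exact_hits = toy['ingredients'].map(lambda items: 'garlic' in items)

print(substring_hits.tolist())
print(exact_hits.tolist())
```

The rest of this post keeps working with the substring-based boolean series shown above.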
This can be used to group our recipes by the presence of that ingredient:
indices = df_train['all_ingredients'].str.contains('garlic cloves')
df_train[indices]['cuisine'].value_counts().plot(kind='bar',
title='garlic cloves as found per cuisine')
<matplotlib.axes._subplots.AxesSubplot at 0x10df36c18>
However, we have to keep in mind that there are a lot of Italian recipes in our database, so it's appropriate to divide by that number before presenting the result:
relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
relative_freq.sort_values(inplace=True)
relative_freq.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x10e036630>
This way of looking at the data lets us see which cuisines use garlic cloves a lot in the recipes found in the dataset. As expected, Mediterranean and Asian cuisines are at the top, and British at the bottom.
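The normalization can also be written as a group-wise mean of the boolean series, since the mean of True/False values is the share of True values. This is a sketch on invented toy data, offered as an equivalent formulation rather than the code used for the plot above. One difference is that a cuisine that never uses the ingredient gets 0.0 here, whereas dividing two `value_counts()` results yields NaN for it.

```python
import pandas as pd

# Toy data, invented for illustration: 4 Italian, 1 Greek and 1 British recipe.
toy = pd.DataFrame({
    'cuisine': ['italian', 'italian', 'italian', 'italian', 'greek', 'british'],
    'all_ingredients': ['garlic cloves;salt', 'salt;butter', 'garlic cloves;basil',
                        'water;salt', 'garlic cloves;feta', 'butter;milk'],
})

has_ingredient = toy['all_ingredients'].str.contains('garlic cloves', regex=False)

# Mean of a boolean column per group = share of recipes containing the ingredient.
relative_freq = has_ingredient.groupby(toy['cuisine']).mean().sort_values()
print(relative_freq)
```

In this toy example Italian has the most recipes with garlic cloves in absolute terms (2), yet Greek ranks higher once the counts are divided by the number of recipes per cuisine.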
We can do this sort of plot for all top 10 ingredients. First let's determine the unique ingredients:
import numpy as np
unique = np.unique(top10.values.ravel())
unique
array(['all-purpose flour', 'avocado', 'baking powder', 'baking soda',
'black pepper', 'butter', 'buttermilk', 'cachaca',
'cajun seasoning', 'carrots', 'cayenne pepper', 'chili powder',
'coconut milk', 'corn starch', 'corn tortillas', 'cumin seed',
'dried oregano', 'dried thyme', 'eggs', 'extra-virgin olive oil',
'feta cheese crumbles', 'fish sauce', 'fresh lemon juice',
'fresh lime juice', 'garam masala', 'garlic', 'garlic cloves',
'ginger', 'grated parmesan cheese', 'green bell pepper',
'green onions', 'ground allspice', 'ground black pepper',
'ground cinnamon', 'ground cumin', 'ground ginger',
'ground turmeric', 'jalapeno chilies', 'large eggs', 'lime', 'milk',
'mirin', 'oil', 'olive oil', 'onions', 'paprika', 'pepper',
'potatoes', 'red bell pepper', 'rice vinegar', 'sake', 'salt',
'scallions', 'sesame oil', 'sesame seeds', 'shallots', 'sour cream',
'soy sauce', 'sugar', 'tomatoes', 'unsalted butter',
'vegetable oil', 'water'], dtype=object)
It turns out that these 63 ingredients fit in an 8 by 8 subplot grid:
fig, axes = plt.subplots(8, 8, figsize=(20, 20))
for ingredient, ax_index in zip(unique, range(64)):
    indices = df_train['all_ingredients'].str.contains(ingredient)
    relative_freq = (df_train[indices]['cuisine'].value_counts() / df_train['cuisine'].value_counts())
    relative_freq.plot(kind='bar', ax=axes.ravel()[ax_index], fontsize=7, title=ingredient)
The previous diagram, even if it's not very clear, allows us to spot ingredients which have a high degree of uniqueness. Among them, I'd list:
- soy sauce (Asian cuisines)
- sake (Japanese)
- sesame oil (Asian cuisines)
- feta cheese crumbles (Greek)
- garam masala (Indian)
- ground ginger (Moroccan)
- avocado (Mexican)
Others are quite common:
- salt
- oil
- pepper
- sugar
This nicely concludes our data exploration. At the same time, it allows us to form a little intuition about how we could categorize a recipe's cuisine based on the ingredients:
- are there highly specific ingredients in the recipe that clearly point it to a given country?
In the next section, we will train a logistic regression classifier on the data we have analyzed so far and look at the results.
Training a logistic regression classifier
We will use scikit-learn to perform our classification. First, we will need to encode our features as a matrix that the machine learning algorithms in scikit-learn can use. This is done using a count vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
We can build the feature matrix in a single step. With its default settings, the count vectorizer lowercases each recipe string, splits it into words of two or more characters, and counts how often each word occurs in each recipe:
cv = CountVectorizer()
X = cv.fit_transform(df_train['all_ingredients'].values)
We can check the shape of that matrix:
X.shape
(39774, 3010)
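We see that the vectorizer has retained 3010 features and processed the 39,774 recipes in the training dataset. Note that these features are individual words rather than whole ingredients, because the default tokenizer splits each string into words. We can inspect them through the vectorizer's vocabulary_ attribute (which is a dictionary):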
print(list(cv.vocabulary_.keys())[:100])
['lipton', 'hand', 'branzino', 'mayonnaise', 'tip', 'gouda', 'tamarind', 'meat', 'lea', 'saltine', 'style', 'ajinomoto', 'greek', 'chuck', 'ducklings', 'korean', 'tokyo', 'tender', 'shoots', 'prik', 'bitters', 'believ', 'satsuma', 'cardamon', 'pilaf', 'roll', 'crusts', 'brisée', 'penn', 'tamale', 'boneless', 'skate', 'picante', 'raw', 'swerve', 'sponge', 'filling', 'choi', 'sharp', 'cranberry', 'salt', 'manicotti', 'atar', 'quorn', 'ale', 'pound', 'nectar', 'iron', 'licorice', 'daiya', 'mince', 'chiffonade', 'japanese', 'yuzu', 'fleshed', 'picholine', 'ragu', 'buds', 'loosely', 'peppermint', 'cashews', 'yaki', 'stir', 'nam', 'blackberries', 'heirloom', 'terrine', 'crumbles', 'bonnet', 'clam', 'katsuo', 'cauliflowerets', 'crust', 'callaloo', 'doughs', 'wraps', 'pimentos', 'shaved', 'oregano', 'tomatillo', 'lump', 'cups', 'marin', 'olives', 'nectarines', 'greekstyl', 'bottom', 'fresca', 'cardoons', 'dijonnaise', 'delicata', 'lite', 'accompaniment', 'sprouts', 'flora', 'sansho', 'groundnut', 'seitan', 'tapioca', 'short']
Each feature gets assigned a column number, and each entry of the matrix holds the number of times that word appears in the recipe (0 when it is absent).
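If one wanted one 0/1 feature per whole ingredient instead of word counts, a possible approach is to pass a custom tokenizer and `binary=True`. The sketch below compares both settings on two made-up recipe strings. The helper `split_ingredients` and the variable names are hypothetical, and this variant is not the one used for the results in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up recipes in the same ';'-joined format as all_ingredients.
docs = ['olive oil;salt;garlic cloves', 'soy sauce;salt;salt']


def split_ingredients(text):
    """Tokenizer that keeps each ';'-separated ingredient as one token."""
    return text.split(';')


# Default settings: tokens are individual words, entries are counts.
word_cv = CountVectorizer()
X_words = word_cv.fit_transform(docs)

# Hypothetical alternative: one binary feature per whole ingredient.
# token_pattern=None only signals that the default pattern is unused.
ingredient_cv = CountVectorizer(tokenizer=split_ingredients, token_pattern=None, binary=True)
X_ingredients = ingredient_cv.fit_transform(docs)

print(sorted(word_cv.vocabulary_))
print(sorted(ingredient_cv.vocabulary_))
print(X_words.toarray())
print(X_ingredients.toarray())
```

Whether whole ingredients or single words work better as features is an empirical question. Word features let 'garlic' and 'garlic cloves' share a column, which may partly compensate for the naming variations noticed earlier.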
Now that we have our feature matrix, we still need to encode the labels that represent the cuisine of each recipe. This is done with a label encoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df_train.cuisine)
The variable y is now a vector with numbers instead of strings for each cuisine:
y[:100]
array([ 6, 16, 4, 7, 7, 10, 17, 9, 13, 9, 9, 3, 9, 13, 9, 7, 1,
9, 18, 19, 18, 13, 16, 3, 9, 3, 2, 9, 3, 13, 9, 2, 13, 18,
9, 2, 9, 4, 16, 16, 9, 0, 13, 7, 13, 3, 5, 16, 16, 16, 11,
16, 9, 16, 9, 10, 11, 7, 9, 8, 18, 18, 7, 10, 9, 18, 12, 5,
5, 16, 17, 7, 14, 9, 9, 14, 14, 19, 11, 13, 2, 16, 5, 7, 7,
9, 9, 7, 12, 17, 9, 16, 16, 6, 13, 13, 16, 7, 9, 9])
We can check the result by inspecting the encoder's classes:
enc.classes_
array(['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',
'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',
'japanese', 'korean', 'mexican', 'moroccan', 'russian',
'southern_us', 'spanish', 'thai', 'vietnamese'], dtype=object)
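The integer codes are simply positions in this alphabetically sorted array, and `inverse_transform` maps them back to names. A tiny sketch with made-up labels (the name `toy_enc` is only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

toy_enc = LabelEncoder()
codes = toy_enc.fit_transform(['thai', 'greek', 'thai', 'indian'])

print(codes.tolist())                               # positions in the sorted class list
print(toy_enc.classes_.tolist())                    # sorted unique labels
print(toy_enc.inverse_transform(codes).tolist())    # back to the original strings
```

This ordering matters later on: anything indexed by class code, such as the rows of a confusion matrix, follows the order of `classes_` and not the order of appearance or frequency.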
Let's now train a logistic regression on the dataset. We'll split the dataset so that we can also test our classifier on data that it hasn't seen before:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
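Because the cuisines are quite unbalanced, it may be worth keeping the class proportions identical in both parts and fixing the random seed so the split is reproducible. The sketch below illustrates the `stratify` and `random_state` parameters on toy arrays. The data and the seed value are arbitrary, and the split above does not use these options.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 16 samples of class 0 and 4 samples of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

# Class counts in each part keep the original 4:1 ratio.
print(np.bincount(y_tr), np.bincount(y_te))
```

Without a fixed `random_state`, every run produces a different split, so accuracy figures vary slightly from run to run.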
Now, let's train a logistic regression:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
We can evaluate our classifier on our test set:
logistic.score(X_test, y_test)
0.77737272155876802
It turns out it performs quite nicely, with a 78% accuracy.
However, this doesn't tell the whole story about what's happening. Let's inspect the classification results using a confusion matrix.
Inspecting the classification results using a confusion matrix
A confusion matrix shows which cuisines the classifier mixes up. Each row corresponds to a true cuisine and each column to a predicted cuisine. Because we normalize each row by the number of recipes of that true cuisine, the color on the diagonal is the fraction of that cuisine's recipes that were classified correctly, while the off-diagonal squares in a column show which other cuisines get mistaken for that column's cuisine.
from sklearn.metrics import confusion_matrix
plt.figure(figsize=(10, 10))
cm = confusion_matrix(y_test, logistic.predict(X_test))
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm_normalized, interpolation='nearest')
plt.title("confusion matrix")
plt.colorbar(shrink=0.3)
# Label the axes in the encoder's order (enc.classes_), which is the order confusion_matrix uses;
# value_counts() sorts by frequency and would attach the wrong names to the rows and columns.
cuisines = enc.classes_
tick_marks = np.arange(len(cuisines))
plt.xticks(tick_marks, cuisines, rotation=90)
plt.yticks(tick_marks, cuisines)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
<matplotlib.text.Text at 0x119f8c208>
Here, we see that some cuisines are really well predicted (Mexican, Indian, Italian) while some suffer from confusion (fewer than half of the Russian, British, Brazilian, Irish and Vietnamese test recipes are classified correctly).
Another way to look at the results is the classification report from scikit-learn:
from sklearn.metrics import classification_report
y_pred = logistic.predict(X_test)
print(classification_report(y_test, y_pred, target_names=cuisines))
              precision    recall  f1-score   support
   brazilian       0.78      0.47      0.58        96
     british       0.58      0.41      0.48       165
cajun_creole       0.80      0.66      0.73       289
     chinese       0.78      0.85      0.82       542
    filipino       0.70      0.54      0.61       140
      french       0.62      0.63      0.63       555
       greek       0.78      0.70      0.74       228
      indian       0.85      0.90      0.87       596
       irish       0.69      0.47      0.56       123
     italian       0.80      0.89      0.84      1587
    jamaican       0.84      0.73      0.78        97
    japanese       0.81      0.72      0.76       289
      korean       0.81      0.76      0.78       191
     mexican       0.89      0.91      0.90      1262
    moroccan       0.80      0.72      0.76       157
     russian       0.57      0.40      0.47       100
 southern_us       0.69      0.81      0.74       860
     spanish       0.67      0.50      0.57       195
        thai       0.74      0.72      0.73       321
  vietnamese       0.69      0.48      0.57       162
 avg / total       0.77      0.78      0.77      7955
This allows us to see the different performance measurements (precision, recall, f1-score) all in a single place.
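These measurements can all be derived from the confusion matrix itself. The following sketch does so with NumPy on a made-up 2-class matrix, using the same convention as scikit-learn (rows are true classes, columns are predicted classes). The numbers are invented and unrelated to the results above.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
cm_toy = np.array([[8, 2],
                   [1, 9]])

true_positives = np.diag(cm_toy)
recall = true_positives / cm_toy.sum(axis=1)       # share of each true class found (per row)
precision = true_positives / cm_toy.sum(axis=0)    # share of each prediction that is right (per column)
f1 = 2 * precision * recall / (precision + recall) # harmonic mean of precision and recall
accuracy = true_positives.sum() / cm_toy.sum()     # overall share of correct predictions

print(recall, precision, f1, accuracy)
```

Recall is exactly the diagonal of the row-normalized matrix plotted earlier, which is why the colors on the diagonal and the recall column of the report tell the same story.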
From the previous analyses, we can come up with a number of ways to improve our machine learning approach and reach better classification results.
Conclusions
In this post, we've gone through different stages of machine learning: we first explored in depth the data that came with the challenge and then went on to train a model, whose results we tried to analyze. It's not clear from the results what we can easily improve in our classification, but they give us quite a lot of information to analyze.