The Kaggle What's Cooking challenge

In this blog post, we'll have a look at the Kaggle What's Cooking data challenge.

This competition is all about predicting which cuisine a recipe belongs to, given a list of its ingredients. For example, assume you have a recipe that reads:

plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, milk, vegetable oil

Can you guess which cuisine this is from? As this recipe comes from the training set of this challenge, I can tell you that the expected answer is Southern US.

Without further ado, let's dive in. First, let us have a look at the training data.

Exploring the training data

We'll use pandas to go through the data. First, let's read the JSON file containing the recipes and the cuisines:

In [1]:
import pandas as pd
In [2]:
df_train = pd.read_json('train.json')
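For reference, here is a minimal sketch of the kind of input `pd.read_json` handles here: a JSON list with one object per recipe. The layout is an assumption inferred from the columns shown in this post, not something checked against the actual file. The first record is the short Indian recipe that appears in the head of the data; the second record and the name `df_sample` are made up for illustration.

```python
import io

import pandas as pd

# Assumed layout: a list of records, one per recipe.
# Record 1 is the complete 'indian' recipe visible in the post's output;
# record 2 is hypothetical.
sample_json = """
[
  {"id": 22213, "cuisine": "indian",
   "ingredients": ["water", "vegetable oil", "wheat", "salt"]},
  {"id": 1, "cuisine": "italian",
   "ingredients": ["olive oil", "garlic", "salt"]}
]
"""

# read_json accepts a file-like object as well as a path.
df_sample = pd.read_json(io.StringIO(sample_json))

print(df_sample.shape)
print(df_sample.dtypes)
```

Each row keeps its ingredients as a plain Python list, which is why the ingredient lists can be iterated over directly later on.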

Let's look at the head of the data:

In [3]:
df_train.head()
   cuisine      id     ingredients
0  greek        10259  [romaine lettuce, black olives, grape tomatoes...
1  southern_us  25693  [plain flour, ground pepper, salt, tomatoes, g...
2  filipino     20130  [eggs, pepper, salt, mayonaise, cooking oil, g...
3  indian       22213  [water, vegetable oil, wheat, salt]
4  indian       13162  [black pepper, shallots, cornflour, cayenne pe...

We can see the structure of the data: a cuisine, an id for the recipe, and the ingredients.

As a first step, let's look at the cuisines in the dataset. How many cuisines are there, and how many recipes do we have for each?

In [4]:
%matplotlib inline
In [5]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [6]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dce51d0>

As can be seen in this figure, there are a lot of Italian, Mexican and Southern US recipes, and fewer recipes from each of the other cuisines.
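One way to obtain such per-cuisine counts is `value_counts`, sketched here on a toy frame rather than the real data. The frame `df_toy` and its rows are made up for illustration, and this is not necessarily how the figure above was produced.

```python
import pandas as pd

# Toy stand-in for df_train; the rows are illustrative, not the real data.
df_toy = pd.DataFrame({
    "cuisine": ["italian", "mexican", "italian", "greek", "italian", "mexican"],
})

# value_counts() returns one entry per distinct cuisine,
# sorted by descending count.
counts = df_toy["cuisine"].value_counts()

print(len(counts))  # number of distinct cuisines
print(counts)       # number of recipes per cuisine

# In a notebook, counts.plot(kind="bar") would draw these counts as a bar chart.
```

The length of the result answers "how many cuisines", and the values answer "how many recipes of each".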

To get a little insight into the data itself, we can look at a couple of recipes. In particular, we can count the most frequent ingredients for each cuisine. To do that, we can use Python's Counter objects (found in the collections module of the standard library).

In [7]:
from collections import Counter
In [8]:
counters = {}
for cuisine in df_train['cuisine'].unique():
    counters[cuisine] = Counter()
    indices = (df_train['cuisine'] == cuisine)
    for ingredients in df_train[indices]['ingredients']:
        counters[cuisine].update(ingredients)

Let's look at the result for one cuisine, Italian:

In [9]:
counters['italian'].most_common(10)
[('salt', 3454),
 ('olive oil', 3111),
 ('garlic cloves', 1619),
 ('grated parmesan cheese', 1580),
 ('garlic', 1471),
 ('ground black pepper', 1444),
 ('extra-virgin olive oil', 1362),
 ('onions', 1240),
 ('water', 1052),
 ('butter', 1030)]
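To make the counting step concrete, here is a toy version that runs without the dataset. The three recipes are invented for illustration, and `setdefault` is used here only to keep the sketch short; it is a variation on the per-cuisine loop, not the code used in this post.

```python
from collections import Counter

# Toy recipes as (cuisine, ingredients) pairs; illustrative only.
recipes = [
    ("italian", ["salt", "olive oil", "garlic"]),
    ("italian", ["salt", "olive oil", "butter"]),
    ("thai", ["fish sauce", "garlic", "salt"]),
]

counters = {}
for cuisine, ingredients in recipes:
    # setdefault creates an empty Counter the first time a cuisine is seen;
    # update() then adds 1 for every ingredient in the list.
    counters.setdefault(cuisine, Counter()).update(ingredients)

# most_common(n) returns the n (ingredient, count) pairs with the highest counts.
print(counters["italian"].most_common(2))
print(counters["thai"].most_common(1))
```

Because `update` takes any iterable, each recipe's ingredient list can be passed in as is.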

We can easily convert the top 10 ingredients for each cuisine to a separate dataframe for nicer viewing:

In [10]:
top10 = pd.DataFrame([[items[0] for items in counters[cuisine].most_common(10)] for cuisine in counters],
            index=[cuisine for cuisine in counters],
            columns=['top{}'.format(i) for i in range(1, 11)])
top10
cuisine | top1 | top2 | top3 | top4 | top5 | top6 | top7 | top8 | top9 | top10
jamaican | salt | onions | water | garlic | ground allspice | pepper | scallions | dried thyme | black pepper | garlic cloves
moroccan | salt | olive oil | ground cumin | onions | ground cinnamon | garlic cloves | water | ground ginger | carrots | paprika
irish | salt | all-purpose flour | butter | onions | potatoes | sugar | baking soda | baking powder | milk | carrots
russian | salt | sugar | onions | all-purpose flour | sour cream | eggs | water | butter | unsalted butter | large eggs
greek | salt | olive oil | dried oregano | garlic cloves | feta cheese crumbles | extra-virgin olive oil | fresh lemon juice | ground black pepper | garlic | pepper
french | salt | sugar | all-purpose flour | unsalted butter | olive oil | butter | water | large eggs | garlic cloves | ground black pepper
italian | salt | olive oil | garlic cloves | grated parmesan cheese | garlic | ground black pepper | extra-virgin olive oil | onions | water | butter
korean | soy sauce | sesame oil | garlic | green onions | sugar | salt | water | sesame seeds | onions | scallions
thai | fish sauce | garlic | salt | coconut milk | vegetable oil | soy sauce | sugar | water | garlic cloves | fresh lime juice
vietnamese | fish sauce | sugar | salt | garlic | water | carrots | soy sauce | shallots | garlic cloves | vegetable oil
southern_us | salt | butter | all-purpose flour | sugar | large eggs | baking powder | water | unsalted butter | milk | buttermilk
cajun_creole | salt | onions | garlic | green bell pepper | butter | olive oil | cayenne pepper | cajun seasoning | all-purpose flour | water
indian | salt | onions | garam masala | water | ground turmeric | garlic | cumin seed | ground cumin | vegetable oil | oil
british | salt | all-purpose flour | butter | milk | eggs | unsalted butter | sugar | onions | baking powder | large eggs
mexican | salt | onions | ground cumin | garlic | olive oil | chili powder | jalapeno chilies | sour cream | avocado | corn tortillas
brazilian | salt | onions | olive oil | lime | water | garlic cloves | garlic | cachaca | sugar | tomatoes
spanish | salt | olive oil | garlic cloves | extra-virgin olive oil | onions | water | tomatoes | ground black pepper | red bell pepper | pepper
filipino | salt | garlic | water | onions | soy sauce | pepper | oil | sugar | carrots | ground black pepper
chinese | soy sauce | sesame oil | salt | corn starch | sugar | garlic | water | green onions | vegetable oil | scallions
japanese | soy sauce | salt | mirin | sugar | water | sake | rice vinegar | vegetable oil | scallions | ginger

An even better visualisation would use images instead of words. We can do this by exporting the previous table to HTML and replacing the ingredient names with HTML image tags for the selected ingredients. This is done using regular expression matching, with each image base64-encoded directly in the source (thanks).

In [11]:
import re
import base64
In [12]:
import pdb
In [13]:
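def repl(m):
    # The captured group is the ingredient name inside the <td> cell.
    ingredient = m.groups()[0]
    image_path = 'img/' + ingredient + '.png'
    with open(image_path, "rb") as image_file:
        # Read the raw PNG bytes and base64-encode them for inlining.
        encoded_string = base64.b64encode(image_file.read())
    return '<td><img width=100 src="data:image/png;base64,{}"></td>'.format(encoded_string.decode('utf-8'))

# Raw string so that the backslashes reach the regex engine unchanged.
table_with_images = re.sub(r"<td>([ \-\w]+)</td>", repl, top10.to_html())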
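The key mechanism is that `re.sub` accepts a function as the replacement: it is called once per match and its return value is substituted in. Here is a small self-contained sketch of that idea. The names `fake_images` and `repl_sketch` are hypothetical, the byte strings stand in for real PNG files, and leaving cells without an image untouched is an assumption of this sketch rather than the behaviour of the code above.

```python
import base64
import re

# Hypothetical in-memory "images"; in the post these bytes come from
# img/<ingredient>.png files on disk.
fake_images = {"salt": b"salt-bytes", "olive oil": b"oil-bytes"}


def repl_sketch(match):
    """Return an <img> cell for known ingredients, else the original cell."""
    ingredient = match.group(1)
    data = fake_images.get(ingredient)
    if data is None:
        # No image available: keep the matched text unchanged.
        return match.group(0)
    encoded = base64.b64encode(data).decode("utf-8")
    return '<td><img width=100 src="data:image/png;base64,{}"></td>'.format(encoded)


html = "<tr><th>italian</th><td>salt</td><td>olive oil</td><td>butter</td></tr>"

# re.sub calls repl_sketch once for every <td>...</td> cell that matches.
result = re.sub(r"<td>([ \-\w]+)</td>", repl_sketch, html)
print(result)
```

The character class only allows spaces, hyphens and word characters, so header cells (`<th>`) and anything containing other markup are left alone.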

We can easily display this HTML output in our notebook:

In [14]:
from IPython.display import HTML
In [15]:
HTML(table_with_images)
[Rendered table: columns top1 to top10 for each cuisine, with images in place of ingredient names]