Visualizing the Kaggle What's Cooking recipes using Bokeh and QQ plots

I'm currently reading Dataclysm, a book by Christian Rudder, one of the OkCupid founders. He's the one behind the OkTrends blog, which gives you a taste of the sort of data analysis the book is about. About halfway through the book, Rudder analyzes essays that users wrote about themselves. To find meaning in the data across the different categories (white, black, Asian, Hispanic), he makes use of quantile-quantile plots. This struck me as an excellent application for interactive visualization using Bokeh and the Kaggle What's Cooking challenge data, which I have previously investigated.

Loading the data and counting it

We will start by loading the data, as usual:

In [1]:
import pandas as pd
In [2]:
df = pd.read_json('train.json')
In [3]:
df.head()
Out[3]:
cuisine id ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...
3 indian 22213 [water, vegetable oil, wheat, salt]
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...

To produce the sort of plot we want, we need to select one of the categories, say Greek cuisine, compute the count and the rank of every ingredient in it, and then do the same with the data from all other cuisines. To do this, we'll build ourselves a helper function that counts all ingredients in a given part of the dataframe.

In [4]:
from collections import Counter

def count_ingredients(df):
    """Counts ingredients in given df."""
    
    c = Counter()
    for recipe in df['ingredients']:
        # update() accepts any iterable, so each recipe's list can be passed directly
        c.update(recipe)
    
    return c
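
As an aside, the nested loop can also be avoided by flattening the recipes first. The following is only a minimal sketch, not the version used in the rest of this post: `count_ingredients_flat` is a hypothetical name, and it assumes it is handed an iterable of ingredient lists (such as `df['ingredients']`) rather than a dataframe.

```python
from collections import Counter
from itertools import chain


def count_ingredients_flat(recipes):
    """Counts ingredients across an iterable of ingredient lists."""
    # chain.from_iterable flattens the list of lists lazily,
    # and Counter tallies the flattened stream in one pass.
    return Counter(chain.from_iterable(recipes))


# Two made-up recipes, only for illustration.
toy_recipes = [['salt', 'olive oil'], ['salt', 'feta cheese crumbles']]
toy_counts = count_ingredients_flat(toy_recipes)
```

With the real data, `count_ingredients_flat(df['ingredients'])` should give the same counts as the helper above.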

Let's test this on a single recipe, using the first recipe in the dataset:

In [5]:
count_ingredients(df.iloc[0:1])
Out[5]:
Counter({'black olives': 1,
         'feta cheese crumbles': 1,
         'garbanzo beans': 1,
         'garlic': 1,
         'grape tomatoes': 1,
         'pepper': 1,
         'purple onion': 1,
         'romaine lettuce': 1,
         'seasoning': 1})

The input recipe was:

In [6]:
df.iloc[0]['ingredients']
Out[6]:
['romaine lettuce',
 'black olives',
 'grape tomatoes',
 'garlic',
 'pepper',
 'purple onion',
 'seasoning',
 'garbanzo beans',
 'feta cheese crumbles']

So this seems to work. Let's count the ingredients of all Greek recipes using this mechanism:

In [7]:
is_greek = df['cuisine'] == 'greek'
greek_counts = count_ingredients(df[is_greek])
In [8]:
list(greek_counts.items())[:10]
Out[8]:
[('chopped green bell pepper', 8),
 ('fresh bay leaves', 2),
 ('canned low sodium chicken broth', 5),
 ('cucumber', 187),
 ('thin pizza crust', 2),
 ('cooking spray', 67),
 ('top sirloin steak', 1),
 ('bone in skin on chicken thigh', 1),
 ('granulated garlic', 3),
 ('chopped parsley', 18)]

The last thing we need is the rank of each ingredient, which is simply its position once the counts are sorted in decreasing order. We can get this ordering from the counter's most_common method:

In [9]:
greek_counts.most_common(10)
Out[9]:
[('salt', 572),
 ('olive oil', 504),
 ('dried oregano', 267),
 ('garlic cloves', 254),
 ('feta cheese crumbles', 252),
 ('extra-virgin olive oil', 229),
 ('fresh lemon juice', 222),
 ('ground black pepper', 221),
 ('garlic', 216),
 ('pepper', 203)]
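
To turn this ordering into explicit ranks, one option is to enumerate the sorted list. This is a small sketch under that assumption; `ranks_from_counts` is a hypothetical helper, and the toy counter only reuses the top three Greek counts shown above.

```python
from collections import Counter


def ranks_from_counts(counts):
    """Maps each ingredient to its rank, 1 being the most common."""
    # most_common() with no argument returns every item, sorted by
    # decreasing count; ties keep the order in which they were first seen.
    return {ingredient: rank
            for rank, (ingredient, _) in enumerate(counts.most_common(), start=1)}


# Toy counts taken from the top of the Greek list above.
toy_counts = Counter({'salt': 572, 'olive oil': 504, 'dried oregano': 267})
toy_ranks = ranks_from_counts(toy_counts)
```

Here `toy_ranks['salt']` is 1 and `toy_ranks['dried oregano']` is 3.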

We also need the ingredient counts for all non-Greek recipes. Let's compute them here:

In [10]:
non_greek_counts = count_ingredients(df[~is_greek])
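
With both counters in hand, one way to obtain points for the plot is to pair up, for every ingredient that occurs in both groups, its rank inside the cuisine and its rank in the rest. The following is a sketch under that assumption, not necessarily the exact construction used for the final plot; `paired_ranks` is a hypothetical helper and the toy counts are made up.

```python
from collections import Counter


def paired_ranks(counts_a, counts_b):
    """Returns (ingredient, rank in a, rank in b) for shared ingredients."""
    rank_a = {k: r for r, (k, _) in enumerate(counts_a.most_common(), start=1)}
    rank_b = {k: r for r, (k, _) in enumerate(counts_b.most_common(), start=1)}
    # Keep only ingredients that appear in both groups, ordered by rank in a.
    shared = sorted(rank_a.keys() & rank_b.keys(), key=rank_a.get)
    return [(k, rank_a[k], rank_b[k]) for k in shared]


# Made-up counts, only for illustration.
toy_greek = Counter({'feta': 5, 'salt': 9, 'oregano': 3})
toy_other = Counter({'salt': 50, 'soy sauce': 20, 'feta': 2})
toy_points = paired_ranks(toy_greek, toy_other)
```

On the real data this would be called as `paired_ranks(greek_counts, non_greek_counts)`. Ingredients that appear in only one group are dropped here, which is a design choice of this sketch.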

Visualization using Bokeh

Let's now move on to the QQ plot. To do this interactively, I want to use Bokeh. It turns out that Bokeh can easily plot points as glyphs on a plane and add hover labels (see here):

In [11]:
from bokeh.plotting import output_notebook, figure, show
from bokeh.models import HoverTool, BoxSelectTool

output_notebook()

TOOLS = [BoxSelectTool(), HoverTool()]

# Note: Bokeh 3.0 and later renamed plot_width/plot_height to width/height.
p = figure(plot_width=600, plot_height=400, title='A test scatter plot with hover labels', tools=TOOLS)

p.circle([1, 2, 3, 4, 5], [2, 5, 8, 2, 7], size=10)

show(p)
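
The hover tool in the test plot above only shows its default information. To display a custom label per point, such as an ingredient name, the data can be passed through a `ColumnDataSource` and referenced in the tooltip. This is a minimal sketch with made-up points and labels; `make_labeled_scatter` is a hypothetical helper, and it uses `scatter` and the default figure size so that it does not depend on the size keywords of a particular Bokeh version.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure


def make_labeled_scatter():
    """Builds a scatter plot whose hover tooltip shows a per-point label."""
    # Made-up points and labels, only for illustration.
    source = ColumnDataSource(data={
        'x': [1, 2, 3],
        'y': [2, 5, 8],
        'label': ['salt', 'olive oil', 'dried oregano'],
    })
    # "@label" refers to the 'label' column of the data source.
    hover = HoverTool(tooltips=[('ingredient', '@label')])
    p = figure(title='Scatter plot with custom hover labels', tools=[hover])
    p.scatter('x', 'y', source=source, size=10)
    return p


p = make_labeled_scatter()
```

In a notebook, calling `show(p)` after `output_notebook()` should display the plot with the ingredient name appearing on hover.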