Exploring Japanese characters with principal component analysis

Japanese Machine Learning

In this post, we'll apply a little principal component analysis, a technique widely used in machine learning, to images of Japanese characters, the kanji.

We'll go through several steps:

  • first, we produce rasterized images of the individual characters using matplotlib
  • next, we use scikit-learn to compute a principal component analysis (PCA) on the image data obtained in the previous step
  • finally, we'll explore some of the applications of the data obtained through this analysis

Making the images

As a first step, let's produce rasterized images of Japanese characters. You might ask, which characters? It turns out that Japan has a standardized list of the characters usually encountered in newspapers, movies, posters, and books. It is called the Joyo kanji list, and it has a page on Wikipedia.

Using this webpage, I copied and pasted the content of the character table into a file, which I placed on my hard drive. It is aptly named kanji_list.csv, although its columns are separated by tabs. I'll import the character data using pandas, which seems to be everyone's favorite when it comes to reading csv files.

In [2]:
import pandas as pd
In [9]:
df = pd.read_csv("kanji_list.csv", sep='\t', header=None)

To get a glimpse of the content of the file, we can display its head:

In [424]:
df.head(10)
0 1 2 3 4 5 6 7 8
0 1 7 S NaN sub-
1 a NaN NaN NaN NaN NaN NaN NaN NaN
2 2 NaN 9 S NaN pathetic アイ、あわ-れ、あわ-れむ
3 ai, awa-re, awa-remu NaN NaN NaN NaN NaN NaN NaN NaN
4 3 NaN 10 S 2010 push open アイ
5 ai NaN NaN NaN NaN NaN NaN NaN NaN
6 4 NaN 13 4 NaN love アイ
7 ai NaN NaN NaN NaN NaN NaN NaN NaN
8 5 NaN 17 S 2010 not clear アイ
9 ai NaN NaN NaN NaN NaN NaN NaN NaN

The kanji we are looking for are located in column 1, mixed with other information (NaNs for example). We can create the character list by simply dropping NaNs and keeping the column contents as a numpy array.

In [425]:
kanji = df[1].dropna().values
In [426]:
In [427]:
kanji[:10]
array(['亜', '哀', '挨', '愛', '曖', '悪', '握', '圧', '扱', '宛'], dtype=object)
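The same extraction can be tried on a tiny in-memory table. This is only a sketch: the two-entry excerpt below is hypothetical and merely mimics the layout of kanji_list.csv, where each kanji row is followed by a row holding only the romanized reading.

```python
import io
import pandas as pd

# Hypothetical excerpt, tab-separated like kanji_list.csv:
# rows with a kanji alternate with rows that only hold a reading.
sample = (
    "1\t亜\t7\tS\tsub-\n"
    "a\n"
    "2\t哀\t9\tS\tpathetic\n"
    "ai, awa-re, awa-remu\n"
)

sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None)

# Short rows are padded with NaN, so column 1 is NaN on the reading rows.
sample_kanji = sample_df[1].dropna().values
print(sample_kanji)
```

Dropping the NaNs works here because the reading rows have nothing in column 1; a file with a different layout would need a different filter.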

The next step is generating rasterized images. We'll do this using matplotlib, which can render text with any custom font that supports Japanese characters. In this case, I followed the advice given on Stack Overflow:

  • I downloaded a Japanese font (see link) and saved it to my working folder
  • and I then created a custom FontProperties object allowing me to use this font within matplotlib
In [1]:
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
In [2]:
%matplotlib inline
In [68]:
prop = fm.FontProperties(fname='ipam.ttc', size=50)
In [431]:
plt.figure(figsize=(1, 1))
plt.text(0, 0, kanji[0], ha='center', va='center', fontproperties=prop)
plt.xlim(-0.1, 0.1)
plt.ylim(-0.1, 0.1)
(-0.1, 0.1)

The proof of concept working, let's write a function that creates a kanji image from a character and saves it to disk:

In [89]:
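def rasterize_kanji(kanji, save_to):
    # draw a single character on a 1x1 inch figure and save it as a PNG
    plt.figure(figsize=(1, 1))
    prop = fm.FontProperties(fname='ipam.ttc', size=70)
    plt.text(0, 0, kanji[0], ha='center', va='center', fontproperties=prop)
    plt.xlim(-0.1, 0.1)
    plt.ylim(-0.1, 0.1)
    plt.axis('off')  # hide the axes so that only the character is rendered
    plt.savefig(save_to, dpi=72)  # 1 inch at 72 dpi gives a 72x72 image
    plt.close()  # free the figure, since this function is called in a loop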

Let's test the function and look at its output:

In [432]:
rasterize_kanji(kanji[0], "1.png")
In [434]:
from IPython.display import Image
Image("1.png")

The previous function produces 72x72 PNG images, which is well suited to what we want to do.
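The 72x72 size follows from the figure size in inches multiplied by the resolution in dots per inch. Here is a minimal check, assuming a 1x1 inch figure at 72 dpi. The default dpi depends on the matplotlib version and backend, so it is set explicitly here.

```python
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

# Build a figure without going through pyplot, so notebook state is untouched.
fig = Figure(figsize=(1, 1), dpi=72)   # 1 inch x 1 inch at 72 dots per inch
canvas = FigureCanvasAgg(fig)          # off-screen raster canvas

width_px, height_px = canvas.get_width_height()
print(width_px, height_px)
```

If the saved images come out at another size, passing an explicit dpi to plt.savefig is one way to pin it down.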

To conclude this first step, we loop over all the kanji in the list and produce a PNG image of each for further analysis.

In [91]:
for i, k in enumerate(kanji):
    rasterize_kanji(k, "img/{0:04}.png".format(i));

Computing the principal component analysis on our images

To compute our PCA, we will use scikit-learn, the reference machine learning library in Python. Our first step is to load the images we have produced in the previous section as vectors. We do this by keeping all PNG files in our img/ directory, loading them as grayscale numpy arrays and reshaping them to vectors:

In [3]:
import numpy as np
import sklearn
import os
from scipy import ndimage
In [4]:
# os.listdir returns names in arbitrary order, so sort them to keep rows aligned with file numbers
image_names = sorted(s for s in os.listdir('img/') if s.endswith('.png'))
In [5]:
X = np.array([ndimage.imread(os.path.join('img/', fname), flatten=True).ravel() for fname in image_names])
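Note that ndimage.imread was deprecated in SciPy 1.0 and removed in SciPy 1.2, so the cell above needs an older SciPy. On recent installations, one possible substitute is Pillow. The sketch below is an assumption, not the code used for this post. The helper name load_grayscale_vectors is made up, and convert('L') gives 8-bit grayscale, which should be close to, but not necessarily identical to, what flatten=True returned.

```python
import os
import tempfile
import numpy as np
from PIL import Image as PILImage  # aliased to avoid clashing with IPython.display.Image

def load_grayscale_vectors(folder):
    """Load every PNG in `folder` as a flattened grayscale float vector."""
    names = sorted(f for f in os.listdir(folder) if f.endswith('.png'))
    rows = []
    for name in names:
        with PILImage.open(os.path.join(folder, name)) as img:
            gray = np.asarray(img.convert('L'), dtype=np.float32)
            rows.append(gray.ravel())
    return names, np.array(rows)

# Self-check on two synthetic 72x72 images written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for i, value in enumerate([0, 255]):
        path = os.path.join(tmp, '{0:04}.png'.format(i))
        PILImage.new('L', (72, 72), color=value).save(path)
    demo_names, X_demo = load_grayscale_vectors(tmp)

print(demo_names, X_demo.shape)
```

With the real data, the call would be load_grayscale_vectors('img/'), and the second return value would play the role of X.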

The array X contains our data. Its shape is easy to explain:

  • it should have a row for each image
  • and 72*72 pixels in columns

Let's check that we're correct:

In [6]:
X.shape
(2138, 5184)
In [7]:

Even if we're now handling row vectors in the matrix X, it's quite easy to get an image back from one of our vectors. We just need to reshape the vector to a (72, 72) matrix. For example:

In [8]:
plt.imshow(X[0, :].reshape((72, 72)), cmap='gray')
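The round trip between an image and its row vector can be checked on a synthetic array. This is a small sketch that uses a stand-in image rather than one of the kanji:

```python
import numpy as np

# Stand-in for one 72x72 kanji image.
image = np.arange(72 * 72, dtype=float).reshape((72, 72))

vector = image.ravel()                 # row-major flattening
restored = vector.reshape((72, 72))    # back to a 2D image

print(vector.shape, np.array_equal(image, restored))
```

Both ravel and reshape use row-major order by default, which is why the reshaped vector matches the original image pixel for pixel.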

We can now perform the PCA using sklearn's decomposition module. As with most things in sklearn, we first need to fit the model on the data in the previously created matrix before we can use it. There is an excellent introduction to PCA with scikit-learn that was written for the Pycon 2015 conference by the amazing Jake VanderPlas (http://nbviewer.ipython.org/github/jakevdp/sklearn_pycon2015/blob/master/notebooks/04.1-Dimensionality-PCA.ipynb).

In [9]:
from sklearn.decomposition import PCA
In [10]:
pca = PCA(n_components=100)
pca.fit(X)  # the attributes below only exist once the model has been fitted
pca_score = pca.explained_variance_ratio_
V = pca.components_
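To get a feel for what these two attributes contain, here is a minimal sketch of PCA written directly with numpy's SVD on synthetic data. It illustrates the technique under the usual textbook definition (center the data, then decompose). It is not scikit-learn's actual implementation, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 200 samples, 5 features with unequal spread.
data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

centered = data - data.mean(axis=0)    # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

components = Vt                        # one principal direction per row
explained_variance = S ** 2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(explained_ratio.round(3))
```

Here all five components are kept, so the ratios sum to 1. When only some of the components are kept, as with n_components=100 above, the ratios are still taken relative to the total variance, so their sum is below 1.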

Usually, one first plots the explained variance of each principal component after performing a PCA. This shows how much the principal vectors allow us to simplify the data:

In [22]:
plt.plot(np.cumsum(pca_score))  # running total of the explained variance ratios
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

We can see that each new component explains less variance than the one before it. This is expected: each successive component is chosen as the direction that maximizes the variance remaining in the dataset.
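Because the components are ordered by decreasing variance, their running total tells how many of them are needed to reach a given share of the variance. Here is a minimal sketch with made-up ratios. The helper name components_for and the numbers are illustrative and are not taken from the kanji data.

```python
import numpy as np

def components_for(ratios, threshold=0.99):
    """Smallest number of leading components whose ratios sum to at least `threshold`."""
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))  # cap if the threshold is never reached

toy_ratios = np.array([0.7, 0.2, 0.08, 0.015, 0.005])  # made-up values
print(components_for(toy_ratios))  # 4, since the first three only reach 0.98
```

scikit-learn can also make this choice itself: passing a float between 0 and 1 as n_components asks PCA to keep enough components to explain that share of the variance.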

One of the criteria for selecting the number of eigenvectors is to stop once 99% of the variance is explained by the number of components kept. We haven't done this here, as I just wanted 100 vectors to do some interesting visualizations, but it's still interesting to look at how much variance is explained by the first 100 components:

In [23]:
pca_score.sum()

It's not that much. Now comes the most graphic part of our analysis so far. What do the principal components look like? They are stored in the matrix V. Let's look at what our first component looks like:

In [13]:
plt.imshow(V[0, :].reshape((72, 72)), cmap='gray')
plt.title('first principal component of kanji dataset');

This is interesting! What one basically sees is that an image representing a stroke on the left part captures a lot of the variance of the kanji dataset. For those who have learnt kanji, this is not surprising: many characters are split into a left part and a right part! For reference, let's plot a 10x10 grid of characters chosen at random from our initial data:

In [14]:
def plot_random_kanji():
    # draw 100 kanji picked at random (with replacement) on a 10x10 grid
    for i, ind in enumerate(np.random.choice(X.shape[0], 100)):
        plt.subplot(10, 10, i + 1)
        plt.imshow(X[ind, :].reshape((72, 72)), cmap='gray')
plt.figure(figsize=(10, 10))
plot_random_kanji()

And now, let's compare the global visual impression with that of the principal components:

In [15]:
def plot_principal_components():
    for i in range(100):
        plt.subplot(10, 10, i + 1)
        plt.imshow(V[i, :].reshape((72, 72)), cmap='gray')
In [16]:
plt.figure(figsize=(10, 10))
plot_principal_components()