# Exploring Japanese characters with principal component analysis

In this post, we'll apply a little principal component analysis, a technique widely used in machine learning, to images of Japanese characters, the *kanji*.

We'll go through several steps:

- first, we produce rasterized images of the individual characters using `matplotlib`
- next, we use `scikit-learn` to compute a principal component analysis (PCA) on the image data obtained in the previous step
- finally, we'll explore some of the applications of the data obtained through this analysis

# Making the images

As a first step, let's produce rasterized images of Japanese characters. You might ask: which characters? It turns out that the Japanese have standardized the list of characters usually encountered in newspapers, movies, posters, and books. This list is called the Joyo kanji list, which has a page on Wikipedia.

Using this webpage, I copied and pasted the content of the character table into a file on my hard drive, aptly named `kanji_list.csv`. I'll import the character data using `pandas`, which seems to be everyone's favorite when it comes to reading CSV files.

```
import pandas as pd
```

```
df = pd.read_csv("kanji_list.csv", sep='\t', header=None)
```

To get a glimpse of the content of the file, we can display its *head*:

```
df.head(10)
```

The kanji we are looking for are located in column 1, mixed with other information (NaNs for example). We can create the character list by simply dropping NaNs and keeping the column contents as a numpy array.

```
kanji = df[1].dropna().values
```

```
type(kanji)
```

```
kanji[:10]
```

The next step is generating the rasterized images. We'll do this using matplotlib, which can render text with any custom font that supports Japanese characters. In this case, I followed the advice given on Stack Overflow:

- I downloaded a Japanese font (see link) and saved it to my working folder
- I then created a custom `FontProperties` object that lets me use this font within matplotlib

```
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
```

```
%matplotlib inline
```

```
prop = fm.FontProperties(fname='ipam.ttc', size=50)
```

```
plt.figure(figsize=(1, 1))
plt.text(0, 0, kanji[0], ha='center', va='center', fontproperties=prop)
plt.xlim(-0.1, 0.1)
plt.ylim(-0.1, 0.1)
```

The proof of concept working, let's write a function that creates a kanji image from a character and saves it to disk:

```
def rasterize_kanji(kanji, save_to):
    plt.figure(figsize=(1, 1))
    prop = fm.FontProperties(fname='ipam.ttc', size=70)
    plt.text(0, 0, kanji, ha='center', va='center', fontproperties=prop)
    plt.xlim(-0.1, 0.1)
    plt.ylim(-0.1, 0.1)
    plt.axis("off")
    plt.savefig(save_to, dpi=72)  # 1-inch figure at 72 dpi -> 72x72 pixels
    plt.close()
```

Let's test the function and look at its output:

```
rasterize_kanji(kanji[0], "1.png")
```

```
from IPython.display import Image
Image(filename='1.png')
```

The previous function produces 72x72 pixel PNG images, which is well suited to what we want to do.
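
The 72x72 size comes from matplotlib's geometry: a saved figure has figsize times dpi pixels in each direction. Here's a quick sanity check of that relationship (the filename `probe.png` is just a scratch file for this check):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside the notebook
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

fig = plt.figure(figsize=(1, 1))   # 1 inch x 1 inch
fig.savefig("probe.png", dpi=72)   # pin the dpi so the pixel size is predictable
plt.close(fig)

img = mpimg.imread("probe.png")
print(img.shape[:2])  # (72, 72)
```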

To conclude this first step, we loop over all the kanji in the list and produce a PNG image of each for further analysis.

```
for i, k in enumerate(kanji):
    rasterize_kanji(k, "img/{0:04}.png".format(i))
```

# Computing the principal component analysis on our images

To compute our PCA, we will use `scikit-learn`, the reference machine learning library in Python. Our first step is to load the images produced in the previous section as vectors. We do this by keeping all the PNG files in our *img/* directory, loading them as grayscale numpy arrays, and reshaping them to vectors:

```
import numpy as np
import sklearn
import os
from scipy import ndimage
```

```
# sort so the rows of X follow the kanji list order
image_names = sorted(s for s in os.listdir('img/') if s.endswith('.png'))
```

```
# note: scipy.ndimage.imread was removed in SciPy 1.2; with modern SciPy,
# imageio.imread(path, as_gray=True) works as a replacement
X = np.array([ndimage.imread(os.path.join('img/', fname), flatten=True).ravel()
              for fname in image_names])
```

The array `X` contains our data. Its shape is easy to explain:

- it should have a row for each image
- and a column for each of the 72*72 pixels

Let's check that we're correct:

```
X.shape
```

```
72*72
```

Even though we're now handling row vectors in the matrix `X`, it's quite easy to get an image back from one of our vectors: we just need to reshape it to a (72, 72) matrix. For example:

```
plt.imshow(X[0, :].reshape((72, 72)), cmap='gray')
```

We can now perform the PCA using `sklearn`'s `decomposition` module. As with most things in `sklearn`, we first need to fit it on the data in the previously created matrix. There is an excellent introduction to PCA with scikit-learn written for the PyCon 2015 conference by the amazing Jake VanderPlas (http://nbviewer.ipython.org/github/jakevdp/sklearn_pycon2015/blob/master/notebooks/04.1-Dimensionality-PCA.ipynb).

```
from sklearn.decomposition import PCA
```

```
pca = PCA(n_components=100)
pca.fit(X)
pca_score = pca.explained_variance_ratio_
V = pca.components_
```

Usually, the first thing one plots after performing a PCA is the explained variance of the principal components. This shows how much the data can be simplified by keeping only the leading components:

```
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
```

We can see that the explained variance decreases with each new component. This is expected: each successive component is chosen as the direction that captures the most variance left in the dataset at that step.
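
This monotone decrease is easy to check on synthetic data. Here's a small sketch (the per-direction variance scales are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic dataset whose directions have very different variances
data = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])

pca_check = PCA().fit(data)
ratios = pca_check.explained_variance_ratio_
print(ratios)        # sorted in decreasing order by construction
print(ratios.sum())  # keeping every component explains 100% of the variance
```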

One common criterion for selecting the number of components is to stop once 99% of the variance is explained by the components kept. We haven't done that here, as I just wanted 100 vectors to do some interesting visualizations, but it's still interesting to look at how much variance is explained by the first 100 components:

```
np.cumsum(pca_score)[-1]
```
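
The 99% criterion mentioned above can be computed directly from the explained variance ratios. Here's a sketch using a hypothetical helper, `components_for_variance` (the toy ratios below are made up for illustration):

```python
import numpy as np

def components_for_variance(pca_score, threshold=0.99):
    """Smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`."""
    cum = np.cumsum(pca_score)
    # index of the first cumulative value >= threshold, converted to a count
    return int(np.searchsorted(cum, threshold) + 1)

# toy example: four components explaining 60%, 25%, 10% and 5% of the variance
print(components_for_variance([0.60, 0.25, 0.10, 0.05], threshold=0.85))  # 2
print(components_for_variance([0.60, 0.25, 0.10, 0.05], threshold=0.99))  # 4
```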

It's not that much. Now comes the most graphic part of our analysis so far. What do the principal components look like? They are stored in the matrix `V`. Let's look at the first one:

```
plt.imshow(V[0, :].reshape((72, 72)), cmap='gray')
plt.colorbar()
plt.title('first principal component of kanji dataset');
```

This is interesting! What one sees is that an image with a stroke on the left-hand side captures a lot of the variance of the kanji dataset. For those who have learnt kanji, this is not surprising: many characters follow a left/right layout! For reference, let's plot a grid of 10x10 characters chosen at random from our initial data:

```
def plot_random_kanji():
    for i, ind in enumerate(np.random.choice(np.arange(X.shape[0]), 100)):
        plt.subplot(10, 10, i + 1)
        plt.imshow(X[ind, :].reshape((72, 72)), cmap='gray')
        plt.axis('off')

plt.figure(figsize=(10, 10))
plot_random_kanji()
```

And now, let's compare the global visual impression with that of the principal components:

```
def plot_principal_components():
    for i in range(100):
        plt.subplot(10, 10, i + 1)
        plt.imshow(V[i, :].reshape((72, 72)), cmap='gray')
        plt.axis('off')
```

```
plt.figure(figsize=(10, 10))
plot_principal_components()
```
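
One natural application of these components is compression: a kanji image can be approximated from its coordinates along the first k components alone. Here's a sketch of such a reconstruction, assuming `pca` and `X` are the fitted PCA object and data matrix from above (the helper name `reconstruct` is my own):

```python
import numpy as np

def reconstruct(pca, x, k):
    """Project a flattened image onto the first k principal components
    and map it back to pixel space."""
    V_k = pca.components_[:k]           # shape (k, n_pixels)
    coords = (x - pca.mean_) @ V_k.T    # coordinates in component space
    return pca.mean_ + coords @ V_k     # approximate image, back in pixel space

# usage with the kanji data from above:
# approx = reconstruct(pca, X[0], 20)
# plt.imshow(approx.reshape((72, 72)), cmap='gray')
```

With k equal to the full number of components the reconstruction is exact; with small k it keeps only the coarse left/right and top/bottom structure the leading components encode.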