Exploring Japanese characters with principal component analysis
In this post, we'll apply a little principal component analysis, a technique widely used in machine learning, to images of Japanese characters, the kanji.
We'll go through several steps:
- first, we produce rasterized images of the individual characters using `matplotlib`
- next, we use `scikit-learn` to compute a principal component analysis (PCA) on the image data obtained in the previous step
- finally, we'll explore some of the applications of the data obtained through this analysis
Making the images
As a first step, let's produce rasterized images of Japanese characters. You might ask: which characters? It turns out that the Japanese have standardized the list of characters usually encountered in newspapers, movies, posters, and books. This list is called the Joyo kanji list, which has a page on Wikipedia.
Using this webpage, I copied and pasted the content of the character table into a file on my hard drive, aptly named `kanji_list.csv`. I'll import the character data using `pandas`, which seems to be everyone's favorite when it comes to reading csv files.
import pandas as pd
df = pd.read_csv("kanji_list.csv", sep='\t', header=None)
To get a glimpse of the content of the file, we can display its head:
df.head(10)
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 亜 | 亞 | 二 | 7 | S | NaN | sub- | ア |
| 1 | a | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 哀 | NaN | 口 | 9 | S | NaN | pathetic | アイ、あわ-れ、あわ-れむ |
| 3 | ai, awa-re, awa-remu | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3 | 挨 | NaN | 手 | 10 | S | 2010 | push open | アイ |
| 5 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 4 | 愛 | NaN | 心 | 13 | 4 | NaN | love | アイ |
| 7 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 5 | 曖 | NaN | 日 | 17 | S | 2010 | not clear | アイ |
| 9 | ai | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The kanji we are looking for are located in column 1, mixed with other information (NaNs for example). We can create the character list by simply dropping NaNs and keeping the column contents as a numpy array.
kanji = df[1].dropna().values
type(kanji)
numpy.ndarray
kanji[:10]
array(['亜', '哀', '挨', '愛', '曖', '悪', '握', '圧', '扱', '宛'], dtype=object)
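As a quick sanity check, we can count how many characters we extracted; this should match the number of images (and hence the number of rows in the data matrix we build below).
len(kanji)
2138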
The next step is generating the rasterized images. We'll do this using matplotlib, which can render text with any custom font that supports Japanese characters. In this case, I followed advice given on Stack Overflow:
- I downloaded a Japanese font (see link) and saved it to my working folder
- I then created a custom `FontProperties` object allowing me to use this font within matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
%matplotlib inline
prop = fm.FontProperties(fname='ipam.ttc', size=50)
plt.figure(figsize=(1, 1))
plt.text(0, 0, kanji[0], ha='center', va='center', fontproperties=prop)
plt.xlim(-0.1, 0.1)
plt.ylim(-0.1, 0.1)
(-0.1, 0.1)
With the proof of concept working, let's write a function that creates a kanji image from a character and saves it to disk:
def rasterize_kanji(kanji, save_to):
    # kanji is a single character string; render it centered in a 1x1 inch figure
    plt.figure(figsize=(1, 1))
    prop = fm.FontProperties(fname='ipam.ttc', size=70)
    plt.text(0, 0, kanji, ha='center', va='center', fontproperties=prop)
    plt.xlim(-0.1, 0.1)
    plt.ylim(-0.1, 0.1)
    plt.axis("off")
    plt.savefig(save_to)
    plt.close()
Let's test the function and look at its output:
rasterize_kanji(kanji[0], "1.png")
from IPython.display import Image
Image(filename='1.png')
The previous function produces 72x72 pixel PNG images, which is well suited to what we want to do.
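We can double-check the dimensions by reading the file back with matplotlib's imread, which returns the image as a numpy array (RGBA for PNGs):
plt.imread("1.png").shape  # expected: (72, 72, 4) -- height, width, RGBA channels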
To conclude this first step, we loop over all the kanji in the list and produce a PNG image of each for further analysis.
for i, k in enumerate(kanji):
    rasterize_kanji(k, "img/{0:04}.png".format(i))
Computing the principal component analysis on our images
To compute our PCA, we will use `scikit-learn`, the reference machine learning library in Python. Our first step is to load the images we produced in the previous section as vectors. We do this by listing all PNG files in our img/ directory, loading them as grayscale numpy arrays, and reshaping each one into a vector:
import numpy as np
import sklearn
import os
from scipy import ndimage
# sort so that 0000.png comes first and rows align with the kanji list
image_names = sorted(s for s in os.listdir('img/') if s.endswith('.png'))
X = np.array([ndimage.imread(os.path.join('img/', fname), flatten=True).ravel() for fname in image_names])
The array `X` contains our data. Its shape is easy to explain:
- it should have a row for each image
- and a column for each of the 72*72 pixels
Let's check that we're correct:
X.shape
(2138, 5184)
72*72
5184
Even if we're now handling row vectors in the matrix `X`, it's quite easy to get an image back from one of our vectors: we just need to reshape it to a (72, 72) matrix. For example:
plt.imshow(X[0, :].reshape((72, 72)), cmap='gray')
<matplotlib.image.AxesImage at 0x10760ed68>
We can now perform the PCA using `sklearn`'s `decomposition` module. As with most things in `sklearn`, we first need to fit it on the data in the previously created matrix before using it. There is an excellent introduction to PCA with scikit-learn, written for the PyCon 2015 conference by the amazing Jake VanderPlas (http://nbviewer.ipython.org/github/jakevdp/sklearn_pycon2015/blob/master/notebooks/04.1-Dimensionality-PCA.ipynb).
from sklearn.decomposition import PCA
pca = PCA(n_components=100)
pca.fit(X)
pca_score = pca.explained_variance_ratio_
V = pca.components_
Usually, the first thing one plots after performing a PCA is the explained variance of each principal component. This shows how well a small number of principal vectors can summarize the data:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
We can see that each successive component explains less variance than the previous one, which is why the cumulative curve flattens out. This is expected: the successive components are chosen as the vector directions that maximize the variance left in the dataset at each step.
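To see this decrease directly, we can also plot the per-component (non-cumulative) explained variance ratio:
plt.plot(pca_score)
plt.xlabel('component index')
plt.ylabel('explained variance ratio');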
A common criterion for selecting the number of eigenvectors to keep is to stop once they explain 99% of the variance. We haven't done this here, as I just wanted 100 vectors to do some interesting visualizations, but it's still interesting to look at how much variance the first 100 components explain:
np.cumsum(pca_score)[-1]
0.66575521
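As an aside, here's a minimal sketch of that selection criterion. A 99% threshold is out of reach with only 100 fitted components (they explain about 67% of the variance, as we just saw), so we illustrate the idea with a 60% threshold:
# smallest number of components whose cumulative explained
# variance exceeds the threshold
np.searchsorted(np.cumsum(pca_score), 0.60) + 1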
It's not that much. Now comes the most graphic part of our analysis so far: what do the principal components look like? They are stored in the matrix `V`. Let's look at the first one:
plt.imshow(V[0, :].reshape((72, 72)), cmap='gray')
plt.colorbar()
plt.title('first principal component of kanji dataset');
This is interesting! What we basically see is that an image representing a stroke on the left captures a lot of the variance in the kanji dataset. For those who have learnt kanji, this is not surprising: many characters have a left/right structure! For reference, let's plot a grid of 10x10 characters chosen at random from our initial data:
def plot_random_kanji():
    # draw 100 random row indices and show the corresponding characters
    for i, ind in zip(range(100), np.random.choice(np.arange(X.shape[0]), 100)):
        plt.subplot(10, 10, i + 1)
        plt.imshow(X[ind, :].reshape((72, 72)), cmap='gray')
        plt.axis('off')
plt.figure(figsize=(10, 10))
plot_random_kanji()
And now, let's compare the global visual impression with that of the principal components:
def plot_principal_components():
    for i in range(100):
        plt.subplot(10, 10, i + 1)
        plt.imshow(V[i, :].reshape((72, 72)), cmap='gray')
        plt.axis('off')
plt.figure(figsize=(10, 10))
plot_principal_components()
It's difficult to see patterns in the previous image, except for the fact that each principal component looks like a Japanese character with lots of oscillations in between.
To get a better understanding of how the principal components can be used, let's decompose an existing character into its eigenvector components:
def decompose_character(kanji):
    # project the character onto each component and sort by weight magnitude
    weights = [(np.dot(kanji, V[i, :]), i) for i in range(100)]
    weights.sort(key=lambda s: abs(s[0]), reverse=True)
    for i, components in enumerate([1, 10, 50, 100]):
        # rebuild the character from its strongest components
        approximation = np.zeros_like(kanji)
        for c in range(components):
            w, comp = weights[c]
            approximation += w * V[comp, :]
        plt.subplot(2, 2, i + 1)
        plt.imshow(approximation.reshape((72, 72)), cmap='gray')
        plt.axis('off')
decompose_character(X[0, :])
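To quantify, rather than just eyeball, how the approximation improves with more components, here is a small sketch computing the mean squared reconstruction error, using the same decomposition as above:
def reconstruction_mse(kanji, n_components):
    # keep the n strongest components and measure the approximation error
    weights = [(np.dot(kanji, V[i, :]), i) for i in range(100)]
    weights.sort(key=lambda s: abs(s[0]), reverse=True)
    approximation = np.zeros_like(kanji)
    for c in range(n_components):
        w, comp = weights[c]
        approximation += w * V[comp, :]
    return np.mean((kanji - approximation) ** 2)

for n in [1, 10, 50, 100]:
    print(n, reconstruction_mse(X[0, :], n))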
These last images start to look like one of our characters, but I'd like them to look like real characters. Therefore, we'll write a function that thresholds the previously obtained image.
We will use the Otsu thresholding technique for selecting an appropriate threshold.
import skimage
skimage.__version__
'0.11.3'
from skimage.filters import threshold_otsu
def decompose_character_threshold(kanji):
    weights = [(np.dot(kanji, V[i, :]), i) for i in range(100)]
    weights.sort(key=lambda s: abs(s[0]), reverse=True)
    for i, components in enumerate([1, 10, 25, 50, 100]):
        approximation = np.zeros_like(kanji)
        for c in range(components):
            w, comp = weights[c]
            approximation += w * V[comp, :]
        # binarize the approximation with an automatically chosen threshold
        thresh = threshold_otsu(approximation)
        binary = approximation > thresh
        plt.subplot(2, 3, i + 1)
        plt.imshow(binary.reshape((72, 72)), cmap='gray')
        plt.axis('off')
    # show the original character in the last panel for comparison
    plt.subplot(2, 3, 6)
    plt.imshow(kanji.reshape((72, 72)), cmap='gray')
    plt.axis('off')
decompose_character_threshold(X[0, :])
We can make this analysis interactive for any Japanese character in the dataset by using the IPython widgets:
from IPython.html.widgets import interact
interact(lambda index: decompose_character_threshold(X[index, :]),
         index=(0, X.shape[0] - 1))
Inspired by Jake VanderPlas' work, we can look at a grid of characters reconstructed using only the first 100 principal components:
def approximate_reconstruction(kanji, n_components):
    weights = [(np.dot(kanji, V[i, :]), i) for i in range(100)]
    weights.sort(key=lambda s: abs(s[0]), reverse=True)
    approximation = np.zeros_like(kanji)
    for c in range(n_components):
        w, comp = weights[c]
        approximation += w * V[comp, :]
    thresh = threshold_otsu(approximation)
    binary = approximation > thresh
    plt.imshow(binary.reshape((72, 72)), cmap='gray')
    plt.axis('off')
plt.figure(figsize=(10, 10))
for i in range(100):
    plt.subplot(10, 10, i + 1)
    approximate_reconstruction(X[np.random.choice(np.arange(X.shape[0])), :], 100)
This is quite recognizable and shows how PCA can be used for information compression. Just for fun, what does this look like if we approximate each character with only 20 principal components?
plt.figure(figsize=(10, 10))
for i in range(100):
    plt.subplot(10, 10, i + 1)
    approximate_reconstruction(X[np.random.choice(np.arange(X.shape[0])), :], 20)
That's a lot less recognizable, at least to me. So clearly, 20 components are not enough for accurate recognition of the characters.
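Still, it is worth putting a number on the compression claim made above. A back-of-the-envelope sketch: each character is described by a handful of weights instead of 72*72 pixel values, the components matrix `V` being shared across all characters.
n_pixels = 72 * 72
for n in [20, 100]:
    print("{0} components: {1:.0f}x fewer values per character".format(n, n_pixels / n))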
Bonus: how to create new characters from the dataset
To finish this post, we're going to look into creating new characters from scratch, by combining principal components normally not found together and smoothing the results. The basic algorithm is a loop that draws random weights for randomly chosen principal components, sums the weighted components, and thresholds the result.
We first compute, for each principal component, the mean weight over the whole dataset as well as the associated standard deviation.
transformed = pca.transform(X)  # weights of each character on each component
means = transformed.mean(axis=0)
stds = transformed.std(axis=0)
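Note that since `pca.transform` works on centered data, the per-component means should come out numerically close to zero; the spread of the weights is carried by the standard deviations. A quick check:
print(np.abs(means).max())  # tiny compared to the stds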
We can now draw Gaussian weights using the means and standard deviations just defined.
new_kanji = np.zeros_like(X[0, :])
# we select between 5 and 50 random components for our resulting kanji
for i in range(np.random.randint(5, 50)):
    component = np.random.choice(np.arange(V.shape[0]))
    weight = np.random.normal(means[component], stds[component])
    new_kanji += weight * V[component, :]
plt.imshow(new_kanji.reshape((72, 72)), cmap='gray')
plt.axis('off')
(-0.5, 71.5, 71.5, -0.5)
The previous image is nice, but it still needs binarization to look like a real character. Thus we perform Otsu thresholding again:
thresh = threshold_otsu(new_kanji)
binary = new_kanji > thresh
plt.imshow(binary.reshape((72, 72)), cmap='gray')
plt.axis('off')
(-0.5, 71.5, 71.5, -0.5)
This doesn't look like a character yet. To improve its appearance, we can apply a little smoothing before the binarization.
new_kanji_smooth = ndimage.gaussian_filter(new_kanji.reshape((72, 72)), 1.5)
plt.imshow((new_kanji_smooth > threshold_otsu(new_kanji_smooth)).reshape((72, 72)), cmap='gray')
plt.axis('off')
(-0.5, 71.5, 71.5, -0.5)
But how much should we actually smooth the character so that it looks nice? Let's explore this using the interactive widgets from IPython:
def examine_smoothing(factor):
    new_kanji_smooth = ndimage.gaussian_filter(new_kanji.reshape((72, 72)), factor)
    plt.imshow((new_kanji_smooth > threshold_otsu(new_kanji_smooth)).reshape((72, 72)), cmap='gray')
    plt.axis('off')
interact(examine_smoothing,
         factor=(0.5, 5.5, 0.1))
A value between 1.8 and 2.5 seems like a good choice.
A further improvement could be to apply a morphological filter (a grayscale dilation) between two smoothing passes.
new_kanji_smooth = ndimage.gaussian_filter(new_kanji.reshape((72, 72)), 1.8)
new_kanji_smooth = ndimage.grey_dilation(new_kanji_smooth, size=2)
new_kanji_smooth = ndimage.gaussian_filter(new_kanji_smooth, 1.8)
plt.imshow((new_kanji_smooth > threshold_otsu(new_kanji_smooth)).reshape((72, 72)), cmap='gray')
plt.axis('off')
(-0.5, 71.5, 71.5, -0.5)
Putting all these ideas together, we can generate galleries of random characters:
def make_gallery(func):
    for i in range(100):
        plt.subplot(10, 10, i + 1)
        func()

def generate_new_kanji(factor):
    new_kanji = np.zeros_like(X[0, :])
    # we select between 5 and 50 random components for our resulting kanji
    for i in range(np.random.randint(5, 50)):
        component = np.random.choice(np.arange(V.shape[0]))
        weight = np.random.normal(means[component], stds[component])
        new_kanji += weight * V[component, :]
    new_kanji_smooth = ndimage.gaussian_filter(new_kanji.reshape((72, 72)), factor)
    plt.imshow((new_kanji_smooth > threshold_otsu(new_kanji_smooth)).reshape((72, 72)), cmap='gray')
    plt.axis('off')
plt.figure(figsize=(10, 10))
make_gallery(lambda : generate_new_kanji(1.8))
plt.figure(figsize=(10, 10))
make_gallery(lambda : generate_new_kanji(2.3))
Applying a grayscale erosion to the created kanji instead, we can obtain the following results:
def generate_new_kanji_binary_op(factor):
    new_kanji = np.zeros_like(X[0, :])
    # we select between 5 and 50 random components for our resulting kanji
    for i in range(np.random.randint(5, 50)):
        component = np.random.choice(np.arange(V.shape[0]))
        weight = np.random.normal(means[component], stds[component])
        new_kanji += weight * V[component, :]
    # smooth, erode to thin the strokes, then smooth again
    new_kanji_smooth = ndimage.gaussian_filter(new_kanji.reshape((72, 72)), factor)
    new_kanji_smooth = ndimage.grey_erosion(new_kanji_smooth, size=2)
    new_kanji_smooth = ndimage.gaussian_filter(new_kanji_smooth, factor)
    plt.imshow((new_kanji_smooth > threshold_otsu(new_kanji_smooth)).reshape((72, 72)), cmap='gray')
    plt.axis('off')
plt.figure(figsize=(10, 10))
make_gallery(lambda : generate_new_kanji_binary_op(2.1))
plt.figure(figsize=(10, 10))
make_gallery(lambda : generate_new_kanji_binary_op(2.8))
plt.figure(figsize=(10, 10))
make_gallery(lambda : generate_new_kanji_binary_op(1.5))
Conclusions
I hope you've liked this little investigation into the complexity of Japanese (or Chinese) characters. One thing that I conclude from this analysis is that the characters are intrinsically complex and varied and that it is difficult to reduce them to something more tractable.
I would say that looking at the characters from the point of view of principal component analysis does not really capture their structure, and thus the approach followed in this post has failed. However, there should be better ways to apply dimensionality reduction to these characters, for instance by creating new features from the distinct parts of the characters. If you have an idea about how to do this, feel free to comment below.
This post was entirely written using the IPython notebook. Its content is BSD-licensed. You can see a static view or download this notebook with the help of nbviewer at 20150417_KanjiPCA.ipynb.