Analyzing verb endings in the Japanese language

Goal of this study

According to Tae Kim's guide to Japanese grammar and its introduction to Japanese verbs, the following holds true:

All ru-verbs end in 「る」 while u-verbs can end in a number of u-vowel sounds including 「る」. Therefore, if a verb does not end in 「る」, it will always be an u-verb. For verbs ending in 「る」, if the vowel sound preceding the 「る」 is an /a/, /u/ or /o/ vowel sound, it will always be an u-verb. Otherwise, if the preceding sound is an /i/ or /e/ vowel sound, it will be a ru-verb in most cases. A list of common exceptions are at the end of this section.

In this exploration, we're going to check that this description and try to find the statistics (i.e. the percentages for the different categories).

Sample file extraction

The first step in our study is to read the dictionary XML and browse it for verbs.

We're starting small, with a sample file that contains one word so that we can extract:

  • the pronunciation in Hiragana
  • the grammatical function of the word
In [1]:
import xml.etree.cElementTree as ET
In [2]:
tree = ET.ElementTree(file='./JMdict_example.xml')
tree
Out[2]:
<ElementTree at 0x3d630d0>
In [3]:
root = tree.getroot()
root
Out[3]:
<Element 'entry' at 0x03DC45D8>
In [4]:
for elem in root:
    print elem
<Element 'ent_seq' at 0x03DC4608>
<Element 'k_ele' at 0x03DC4488>
<Element 'r_ele' at 0x03DC4548>
<Element 'sense' at 0x03DC4770>
<Element 'sense' at 0x03DC4998>

From the evaluation above, we see that an element of the XML structure contains a pronunciation that we can extract from the r_ele/reb tag:

In [5]:
print root.find('r_ele/reb').text
うよく

The grammatical function comes from the sense/pos tag. It should be noted that a word can have more than one grammatical function in Japanese. In the case below, the word uyoku can be an adjective-noun and a noun.

In [8]:
print [elem.text for elem in root.findall('sense/pos')]
['adj-no;', 'n;']

We can now define a function that acts as a shortcut for the two operations done above (pronunciation and grammatical function):

In [9]:
def extract_information(elem):
    return (elem.find('r_ele/reb').text,
            [sense.text for sense in elem.findall('sense/pos')])
In [10]:
extract_information(root)
Out[10]:
(u'\u3046\u3088\u304f', ['adj-no;', 'n;'])

Where are the verbs?

If we take a look at the DTD of the JMdict XML file we find the following entities corresponding to verbs:

As we can see, this is quite complex. Instead of parsing every single verb type and differentiating between it, we will focus on four verb types only:

  • Ichidan
  • Nidan
  • Yodan
  • Godan

As a side note, the Ichidan verbs are the ru verbs Tae Kim is talking about, while the Godan verbs are the u verbs.

Below, we're looping over the dictionary items and extracting words that possibly contain verbs. To do this, we're checking for the presence of either of the four above verb types.

In [11]:
tree = ET.ElementTree(file='../JMdict.xml')
tree
Out[11]:
<ElementTree at 0x3d63bf0>
In [19]:
verb_list = []
verb_types = [u'Ichidan', u'Nidan', u'Yodan', u'Godan']
for elem in tree.getroot().findall('entry'):
    item_info = extract_information(elem)
    for verb_type in verb_types:
        if " ".join(item_info[1]).find(verb_type) != -1:
                verb_list.append((item_info[0], verb_type))
                break
                
In [20]:
len(verb_list)
Out[20]:
10266
In [21]:
verb_list[:10]
Out[21]:
[(u'\u3042\u3052\u3064\u3089\u3046', u'Godan'),
 (u'\u3042\u3057\u3089\u3046', u'Godan'),
 (u'\u3042\u3076\u308c\u308b', u'Ichidan'),
 (u'\u3042\u3084\u3059', u'Godan'),
 (u'\u3044\u304b\u3059', u'Godan'),
 (u'\u3044\u3058\u3051\u308b', u'Ichidan'),
 (u'\u3044\u3058\u308a\u307e\u308f\u3059', u'Godan'),
 (u'\u3044\u3061\u3083\u3064\u304f', u'Godan'),
 (u'\u3044\u306a\u306a\u304f', u'Godan'),
 (u'\u3044\u3073\u308b', u'Godan')]

As the unicode is a bit hard to read, we can format the previous output to the following effect:

In [23]:
for elem in map(": ".join, verb_list[:10]):
    print elem
あげつらう: Godan
あしらう: Godan
あぶれる: Ichidan
あやす: Godan
いかす: Godan
いじける: Ichidan
いじりまわす: Godan
いちゃつく: Godan
いななく: Godan
いびる: Godan

First statistical breakdown

As we now have a list of verb types, we can calculate their relative frequency:

In [38]:
for verb_type in verb_types:
    print "%s: %i out of %i verbs" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)), len(verb_list))
Ichidan: 3415 out of 10266 verbs
Nidan: 38 out of 10266 verbs
Yodan: 25 out of 10266 verbs
Godan: 6788 out of 10266 verbs
In [257]:
for verb_type in verb_types:
    print "%s: %.1f percent" % (verb_type, len(filter(lambda v: v[1] == verb_type, verb_list)) / float(len(verb_list)) * 100)
Ichidan: 33.3 percent
Nidan: 0.4 percent
Yodan: 0.2 percent
Godan: 66.1 percent

According to this output, there's a majority of u verbs in the Japanese language.

Verb types by last character

Following Tae Kim's data,we're going to count number of verbs of a given type by classifying them according to their last character:

In [40]:
last_char_dict = {}
for verb in verb_list:
    last_char = verb[0][-1] 
    if last_char not in last_char_dict:
        last_char_dict[last_char] = {verb[1] : 1}
    else:
        if verb[1] in last_char_dict[last_char]:
            last_char_dict[last_char][verb[1]] += 1
        else:
            last_char_dict[last_char][verb[1]] = 1
        
In [41]:
last_char_dict
Out[41]:
{u'\u3046': {u'Godan': 719, u'Nidan': 6, u'Yodan': 4},
 u'\u304f': {u'Godan': 911, u'Nidan': 2, u'Yodan': 4},
 u'\u3050': {u'Godan': 137, u'Nidan': 2},
 u'\u3059': {u'Godan': 1803, u'Nidan': 1, u'Yodan': 3},
 u'\u305a': {u'Ichidan': 1, u'Nidan': 2},
 u'\u3064': {u'Godan': 244, u'Nidan': 2, u'Yodan': 1},
 u'\u306c': {u'Godan': 9, u'Nidan': 2},
 u'\u3075': {u'Nidan': 1},
 u'\u3076': {u'Godan': 102, u'Nidan': 2, u'Yodan': 2},
 u'\u3080': {u'Godan': 596, u'Nidan': 4},
 u'\u3086': {u'Nidan': 7},
 u'\u308b': {u'Godan': 2267, u'Ichidan': 3414, u'Nidan': 7, u'Yodan': 11}}
In [53]:
", ".join(map(lambda s: ": ".join((s[0], str(s[1]))), last_char_dict[last_char_dict.keys()[0]].items()))
Out[53]:
u'Nidan: 4, Godan: 596'
In [54]:
for key in last_char_dict:
    print "verbs ending in %s" % key, ", ".join(map(lambda s: ": ".join((s[0], str(s[1]))), last_char_dict[key].items()))
verbs ending in む Nidan: 4, Godan: 596
verbs ending in つ Yodan: 1, Nidan: 2, Godan: 244
verbs ending in う Yodan: 4, Nidan: 6, Godan: 719
verbs ending in る Yodan: 11, Nidan: 7, Ichidan: 3414, Godan: 2267
verbs ending in ゆ Nidan: 7
verbs ending in ぬ Nidan: 2, Godan: 9
verbs ending in く Yodan: 4, Nidan: 2, Godan: 911
verbs ending in ぐ Nidan: 2, Godan: 137
verbs ending in ふ Nidan: 1
verbs ending in ぶ Yodan: 2, Nidan: 2, Godan: 102
verbs ending in す Yodan: 3, Nidan: 1, Godan: 1803
verbs ending in ず Nidan: 2, Ichidan: 1

As we can see from the output the first idea given by Tae Kim seems correct: if it doesn't end in ru it's a Godan verb (except for one ず Ichidan verb). Let's check the second idea:

if it ends in ru, the second before last hiragana mostly determines the verb type

In [57]:
second_last_char_dict = {}
for verb in verb_list:
    last_char = verb[0][-1] 
    if last_char == u'る' and len(verb[0]) >= 2:
        second_last_char = verb[0][-2] 
        if second_last_char not in second_last_char_dict:
            second_last_char_dict[second_last_char] = {verb[1] : 1}
        else:
            if verb[1] in second_last_char_dict[second_last_char]:
                second_last_char_dict[second_last_char][verb[1]] += 1
            else:
                second_last_char_dict[second_last_char][verb[1]] = 1
In [58]:
for key in second_last_char_dict:
    print key, second_last_char_dict[key]
ゃ {u'Yodan': 2, u'Godan': 5}
わ {u'Godan': 130}
ェ {u'Godan': 1}
カ {u'Godan': 1}
ク {u'Godan': 5}
コ {u'Godan': 4}
シ {u'Godan': 3}
い {u'Ichidan': 32, u'Godan': 76}
え {u'Yodan': 1, u'Ichidan': 398, u'Godan': 41}
ニ {u'Godan': 3}
が {u'Nidan': 1, u'Godan': 162}
ぐ {u'Godan': 25}
ビ {u'Godan': 1}
ご {u'Nidan': 1, u'Godan': 3}
じ {u'Ichidan': 104, u'Godan': 21}
ぜ {u'Ichidan': 11}
ミ {u'Godan': 1}
だ {u'Nidan': 1, u'Godan': 18}
ャ {u'Godan': 1}
つ {u'Godan': 34}
ョ {u'Godan': 1}
と {u'Godan': 161}
ぬ {u'Godan': 6}
ば {u'Godan': 45}
へ {u'Ichidan': 3, u'Godan': 5}
ぼ {u'Godan': 32}
む {u'Godan': 13}
や {u'Godan': 28}
よ {u'Godan': 36}
れ {u'Ichidan': 548, u'Godan': 1}
グ {u'Godan': 5}
ジ {u'Godan': 2}
ツ {u'Godan': 1}
ト {u'Godan': 2}
か {u'Yodan': 2, u'Godan': 149}
く {u'Godan': 69}
バ {u'Godan': 2}
こ {u'Godan': 35}
ピ {u'Godan': 2}
し {u'Ichidan': 1, u'Godan': 37}
せ {u'Ichidan': 297, u'Godan': 9}
ボ {u'Godan': 2}
た {u'Godan': 72}
ム {u'Godan': 2}
で {u'Ichidan': 79, u'Godan': 1}
に {u'Ichidan': 3}
レ {u'Ichidan': 2}
は {u'Yodan': 1, u'Godan': 44}
び {u'Ichidan': 40, u'Godan': 7}
ほ {u'Godan': 6}
み {u'Ichidan': 61}
め {u'Ichidan': 470, u'Godan': 7}
ら {u'Ichidan': 1}
キ {u'Godan': 2}
ケ {u'Godan': 1}
ス {u'Godan': 4}
チ {u'Godan': 2}
あ {u'Yodan': 1, u'Godan': 28}
う {u'Godan': 13}
ド {u'Godan': 1}
お {u'Godan': 45}
ぎ {u'Ichidan': 29, u'Godan': 40}
パ {u'Godan': 1}
げ {u'Ichidan': 251, u'Godan': 5}
フ {u'Godan': 2}
ざ {u'Yodan': 2, u'Godan': 11}
ず {u'Ichidan': 79, u'Godan': 21}
ぞ {u'Godan': 3}
メ {u'Ichidan': 1}
て {u'Ichidan': 185, u'Godan': 3}
な {u'Nidan': 2, u'Godan': 147}
ロ {u'Godan': 3}
の {u'Godan': 31}
ひ {u'Ichidan': 3, u'Godan': 4}
ぶ {u'Nidan': 1, u'Godan': 70}
ま {u'Yodan': 2, u'Godan': 147}
も {u'Godan': 43}
ゆ {u'Godan': 1}
り {u'Ichidan': 24, u'Godan': 2}
ズ {u'Godan': 1}
テ {u'Godan': 2}
ナ {u'Godan': 1}
き {u'Ichidan': 30, u'Godan': 108}
け {u'Ichidan': 630, u'Godan': 22}
さ {u'Godan': 58}
ブ {u'Godan': 6}
す {u'Godan': 16}
そ {u'Nidan': 1, u'Godan': 10}
ち {u'Ichidan': 38, u'Godan': 9}
モ {u'Godan': 3}
づ {u'Godan': 2}
ど {u'Godan': 35}
ね {u'Ichidan': 61, u'Godan': 15}
ぱ {u'Godan': 5}
ふ {u'Godan': 18}
べ {u'Ichidan': 33, u'Godan': 9}

This still needs a little sorting!

In [248]:
def vowel_type(character):
    vowels = ['a', 'i', 'u', 'e', 'o']
    chars = [u'あかさたなはまやらわゃカがだャばバパざナぱ', 
             u'シいニビじミジピしにびみキチぎひりきち',
             u'クぐつぬむグツくムスうフずぶゆズブすづふ',
             u'べェえぜへれせでレめケげメてテけね',
             u'コごョとぼよトこボほドおぞロのもそモど']
    for ind, hiragana in enumerate(chars):
        if character in hiragana:
            return vowels[ind]
    print character
    raise Exception('character not supported')
In [250]:
final_classification = {}
for key in second_last_char_dict:
    vowel = vowel_type(key)
    if vowel not in final_classification:
        final_classification[vowel] = second_last_char_dict[key]
    else:
        for verb_group in second_last_char_dict[key]:
            if verb_group in final_classification[vowel]:
                final_classification[vowel][verb_group] += second_last_char_dict[key][verb_group]
            else:
                final_classification[vowel][verb_group] = second_last_char_dict[key][verb_group]
In [251]:
final_classification
Out[251]:
{'a': {u'Godan': 1055, u'Ichidan': 1, u'Nidan': 4, u'Yodan': 10},
 'e': {u'Godan': 122, u'Ichidan': 2969, u'Yodan': 1},
 'i': {u'Godan': 320, u'Ichidan': 365},
 'o': {u'Godan': 456, u'Nidan': 2},
 'u': {u'Godan': 314, u'Ichidan': 79, u'Nidan': 1}}
In [255]:
for key in final_classification.keys():
    print 'vowel sound ending in %s' % key
    print ", ".join(map(lambda s: ": ".join((s[0], str(s[1]))), final_classification[key].items()))    
vowel sound ending in a
Yodan: 10, Nidan: 4, Ichidan: 1, Godan: 1055
vowel sound ending in u
Nidan: 1, Ichidan: 79, Godan: 314
vowel sound ending in e
Yodan: 1, Ichidan: 2969, Godan: 122
vowel sound ending in i
Ichidan: 365, Godan: 320
vowel sound ending in o
Nidan: 2, Godan: 456
In [263]:
for key in final_classification.keys():
    print 'vowel sound ending in %s' % key
    total = sum(map(lambda i: i[1], final_classification[key].items()))
    print ", ".join(map(lambda s: ": ".join((s[0], format(s[1] / float(total) * 100, ".2f"))), final_classification[key].items()))
vowel sound ending in a
Yodan: 0.93, Nidan: 0.37, Ichidan: 0.09, Godan: 98.60
vowel sound ending in u
Nidan: 0.25, Ichidan: 20.05, Godan: 79.70
vowel sound ending in e
Yodan: 0.03, Ichidan: 96.02, Godan: 3.95
vowel sound ending in i
Ichidan: 53.28, Godan: 46.72
vowel sound ending in o
Nidan: 0.44, Godan: 99.56

Concluding remarks

High certainty

  • if vowel sound is o, it's a u verb in 100% of cases
  • if vowel sound is a, it's a u verb in 99% of cases
  • if vowel sound is e, it's a ru verb in 96% of cases

Lower certainty

  • if vowel sound is u, it's a u verb in 80% of cases and a ru verb in 20% of cases
  • if vowel sound is i, it's ru verb in 53% of cases and u verb in 47% of cases

Side note

The more I'm computing this sort of thing, the more I'm thinking I could do this in an easier way with Pandas.

Comments