Exploring Finnish words using the interactive IPython HTML widgets

In this post, we're gonna take a look at the Finnish language. Our starting point is a frequency list, downloaded in the first cell below, which contains the 10000 most common Finnish words.

In [37]:
import urllib2
response = urllib2.urlopen('http://www.csc.fi/english/research/sciences/linguistics/taajuussanasto-B9996/download')
words = response.read()

Because the file is encoded in UTF-8, we first need to decode it, that is, convert it to unicode, before we can use it. A great help in figuring this out is the fantastic tutorial presentation at http://farmdev.com/talks/unicode/.

In [38]:
words = words.decode('utf-8')
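
For readers on Python 3, here is a minimal sketch of the same decoding step. It assumes Python 3 (the post itself runs on Python 2), and the byte string `raw` is a hand-written sample line rather than the real download.

```python
# Python 3 sketch; `raw` is a hypothetical sample, not the downloaded file.
raw = b'   5     443301 0,753118 ett\xc3\xa4 (konjunktio)'

# bytes -> str (unicode): interpret the bytes as UTF-8 encoded text
text = raw.decode('utf-8')

print(type(raw).__name__, len(raw))    # bytes: one element per byte
print(type(text).__name__, len(text))  # str: one element per character
print(text)
```

The decoded string is one element shorter than the byte string because the letter ä takes two bytes in UTF-8 but is a single character once decoded.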

We can now split the text into its component lines in order to start exploring it.

In [39]:
words = words.splitlines()
In [40]:
words[:10]
Out[40]:
[u'                   Sanahakemisto (laskevan taajuuden mukaan)',
 u'',
 u'   N        Abs   Rel    Uppslagsord',
 u'   1    2716396 4,614851 olla (verbi)',
 u'   2    1566108 2,660641 ja (konjunktio)',
 u'   3     593462 1,008225 ei (verbi)',
 u'   4     538609 0,915036 se (pronomini)',
 u'   5     443301 0,753118 ett\xe4 (konjunktio)',
 u'   6     417984 0,710108 joka (pronomini)',
 u'   7     344927 0,585992 vuosi (substantiivi)']

As we can see, each line holds several fields, and the word has to be separated from the rest:

  • rank
  • absolute count of word in corpus
  • relative frequency
  • the word we're talking about

This can be parsed in the following way: we write a simple lambda function for each field, which we can then apply to every line of the list. We'll use the line at index 10 as a test case.

In [41]:
words[10]
Out[41]:
u'   8     302803 0,514428 h\xe4n (pronomini)'
In [42]:
print words[10]
   8     302803 0,514428 hän (pronomini)

First, the rank.

In [43]:
rank = lambda w: int(w[:8])
rank(words[10])
Out[43]:
8

The absolute word count.

In [44]:
abs_count = lambda w: int(w[8:15])
abs_count(words[10])
Out[44]:
302803

The relative count.

In [45]:
rel_count = lambda w: float(w[15:25].replace(',', '.'))
rel_count(words[10])
Out[45]:
0.514428

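Finally, the word itself, in unicode form. Note that the result keeps the trailing space that precedes the parenthesis, which is why later cells use word[:-1] when building URLs.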

In [46]:
the_word = lambda w: w[25:].split('(')[0]
print the_word(words[10])
hän 

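Having set up these functions, we can apply them to each row that we want to parse from this text file. The slice words[3:-6] skips the three header lines at the top and the last six lines of the file.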

In [47]:
word_dict = dict(
    [(the_word(w), 
      (rank(w), abs_count(w), rel_count(w))) for w in words[3:-6]])
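
As an alternative to fixed column offsets, the same fields can be pulled apart by splitting on whitespace. The following is a minimal Python 3 sketch, not the code used in this post: `parse_line` is a hypothetical helper, and `sample` holds three lines copied from the output shown earlier.

```python
# Python 3 sketch; sample lines copied from the output shown earlier.
sample = [
    '   1    2716396 4,614851 olla (verbi)',
    '   5     443301 0,753118 ett\xe4 (konjunktio)',
    '   8     302803 0,514428 h\xe4n (pronomini)',
]

def parse_line(line):
    """Split one line into (word, (rank, abs_count, rel_count))."""
    # maxsplit=3 keeps everything after the third field in one piece
    rank, abs_count, rel_count, rest = line.split(None, 3)
    # drop the part of speech in parentheses and the surrounding whitespace
    word = rest.split('(')[0].strip()
    # the file uses a decimal comma, which float() does not accept
    return word, (int(rank), int(abs_count), float(rel_count.replace(',', '.')))

parsed = dict(parse_line(line) for line in sample)
print(parsed['h\xe4n'])  # (8, 302803, 0.514428)
```

Unlike the lambdas above, this sketch strips the trailing space from each word, so its keys would not match the keys of word_dict exactly.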

With this new word dictionary, we can build a sort of widget that lets us play with the words.

In [48]:
# Note: in IPython 4 and later, interact lives in the separate ipywidgets package.
from IPython.html.widgets import interact
In [49]:
from IPython.display import HTML, display
In [50]:
def show_word(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    # labels follow the order of the stored tuple: (rank, abs_count, rel_count)
    for k,v in zip(('rank', 'absolute count', 'relative count'),
                   word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k,v)
    s += '</table>'
    display(HTML(s))

Let's test this function by calling it on the item at index 3.

In [51]:
show_word(3)

Word: arvokisa

rank            4188
absolute count  1360
relative count  0.00231
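
One way to make the HTML generation easier to check is to build the string in a pure function and leave the display call to the notebook. The sketch below assumes Python 3; `build_word_html` is a hypothetical name, the sample values are taken from the output above, and the labels are listed in the same order as the tuple stored in word_dict (rank, absolute count, relative count).

```python
# Python 3 sketch; the function name and sample data are illustrative.
from html import escape

def build_word_html(word, stats):
    """Return an HTML snippet for a word and its (rank, abs_count, rel_count)."""
    labels = ('rank', 'absolute count', 'relative count')
    rows = ''.join(
        '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(label, value)
        for label, value in zip(labels, stats)
    )
    # escape() guards against characters such as < or & in the word
    return '<h3>Word: {0}</h3><table>\n{1}</table>'.format(escape(word), rows)

snippet = build_word_html('arvokisa', (4188, 1360, 0.00231))
print(snippet)
```

In a notebook, the returned string could then be passed to display(HTML(...)) exactly as in show_word.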

And now, let's make this interactive with the latest IPython widget machinery.

In [52]:
interact(show_word,
         n=(0, len(word_dict.keys()) - 1))

Word: diplomaattinen

rank            7201
absolute count  672
relative count  0.001142
Out[52]:
<function __main__.show_word>

Adding a translation from an external website

We can easily add a translation of the displayed word by embedding an external website's search page for it:

In [53]:
from IPython.display import IFrame
IFrame('http://www.fincd.com/index.php?txtSearch=tunti&lang=fi', width='100%', height=350)
Out[53]:

But how do we encode unicode strings for use in URLs?

Let's use an example of what we want to do: the encoded url for the word kyllä is http://www.fincd.com/finnish/kyll%E4.html

In [54]:
my_word = u'kyllä'

This word has to be encoded somehow. Looking at the website, we find it uses the iso-8859-1 encoding. Let's try that on our word.

In [55]:
my_word.encode('iso-8859-1')
Out[55]:
'kyll\xe4'

Once encoded, we can use the quote function to make it ready for URLs:

In [56]:
urllib2.quote(my_word.encode('iso-8859-1'))
Out[56]:
'kyll%E4'

Finally, we can put this together:

In [57]:
print 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(my_word.encode('iso-8859-1'))
http://www.fincd.com/index.php?txtSearch=kyll%E4&lang=fi
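
For comparison, here is a minimal sketch of the same URL construction on Python 3, where quote lives in urllib.parse and accepts the target encoding directly. This is an assumption about how one might port the cell, not code from the original notebook.

```python
# Python 3 sketch of the same URL construction.
from urllib.parse import quote

my_word = 'kyll\xe4'  # the word kyllä

# quote() encodes the str with the given codec before percent-escaping it
encoded = quote(my_word, encoding='iso-8859-1')
url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % encoded

print(encoded)
print(url)
```

Without the encoding argument, quote falls back to UTF-8 and produces kyll%C3%A4 instead, which is not the form this website uses.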

Good! Now, let's write a function that shows both the word and its translation, and interact with it.

In [58]:
def show_word_and_translation(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    # labels follow the order of the stored tuple: (rank, abs_count, rel_count)
    for k,v in zip(('rank', 'absolute count', 'relative count'),
                   word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k,v)
    s += '</table>'
    display(HTML(s))
    # word[:-1] drops the trailing space left over from parsing
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    display(IFrame(url, width='100%', height=350))
In [59]:
interact(show_word_and_translation,
         n=(0, len(word_dict.keys()) - 1))

Word: enso

rank            2690
absolute count  2351
relative count  0.003994

Can we do better than that? Let's see if we can just take the table with the content from the website.

In [60]:
response = urllib2.urlopen('http://www.fincd.com/finnish/kyll%E4.html')
source = response.read()
In [61]:
source_split = source.decode("iso-8859-1").splitlines()

The really interesting part is here:

In [62]:
source_split[85:120]
Out[62]:
[u'\t<a href="/friends/">[Links]</a>&nbsp;',
 u'\t<a href="javascript:bookmark()">[Bookmark]</a>',
 u'  </td></tr>',
 u'</table>',
 u'',
 u'<table width="728" align="center" id="tbDotBorder">',
 u'  <tr>',
 u'    <td id="lang_cell" width="20%">Finnish:</td>',
 u'    <td id="helper_cell" width="80%"><a href = "/finnish/kyll%E4.html">kyll\xe4</a>\t',
 u'\t</td>',
 u'  </tr>',
 u'  <tr>',
 u'    <td id="lang_cell" width="20%">English:</td>',
 u'    <td id="content_cell" width="80%">',
 u'\t<li><a href="/english/yes.html">yes</a></li>',
 u'\t<p id="msg"></p>\t</td>',
 u'  </tr>',
 u'  <tr>',
 u'    <td colspan="2" id="suggestion_cell"><!--Write your own explain here--></td>',
 u'  </tr>',
 u'  <tr>',
 u'  <td colspan="2">',
 u'  <table width="100%">',
 u'    <td width="50%" id="discuss_cell"> </td>',
 u'    <td width="50%" id="discuss_cell"> </td>',
 u'  </tr>',
 u'  </table>',
 u'  </td>',
 u'  <tr><td colspan="3" align="right" id="copy_right"><small><a href = "/old/" title="Suomi Englanti sanakirja ">Suomi Englanti Suomi sanakirja Beta5</a></small></td></tr>',
 u'</table>',
 u'',
 u'<br>',
 u'',
 u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">',
 u'<tr>']

We can find the exact indices of the table that holds this information:

In [63]:
source_split.index(u'<table width="728" align="center" id="tbDotBorder">')
Out[63]:
90
In [64]:
source_split.index(u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">')
Out[64]:
118

This slice can easily be rendered as HTML.

In [65]:
HTML("".join(source_split[90:118]))
Out[65]:

Let's try to extract only the meaningful information using regular expressions.

In [90]:
src = "".join(source_split[90:118])
In [91]:
import re
In [92]:
p = re.compile('<tr>')
In [93]:
iterator = p.finditer(src)
for match in iterator:
    print match.span()
(53, 57)
(201, 205)
(368, 372)
(461, 465)
(619, 623)

Judging from these matches, we only need the first two table rows to extract our data. The data therefore lies between character 53 and character 367, just before the third row starts.

In [94]:
HTML(src[53:367])
Out[94]:
Finnish: kyllä
English:
  • yes

Let's design a function that extracts exactly this last part:

In [99]:
def extract_word_definition(source):
    source_split = source.decode("iso-8859-1").splitlines()
    start = source_split.index(u'<table width="728" align="center" id="tbDotBorder">')
    stop = source_split.index(u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">')
    src = "".join(source_split[start:stop])
    p = re.compile('<tr>')
    iterator = p.finditer(src)
    spans = [match.span() for match in iterator]
    start = spans[0][0]
    stop = spans[2][0]
    return src[start:stop]
In [100]:
HTML(extract_word_definition(source))
Out[100]:
Finnish: kyllä
English:
  • yes
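
Slicing on line indices and <tr> positions depends on the exact layout of the page. As a hedged alternative, the sketch below uses the standard library's html.parser on Python 3 to collect only the translations. `TranslationParser` is a hypothetical class, and `sample` is abridged from the page source shown above rather than fetched from the site.

```python
# Python 3 sketch; `sample` is abridged from the page source shown above.
from html.parser import HTMLParser

class TranslationParser(HTMLParser):
    """Collect the text of every <li> inside the cell with id="content_cell"."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.in_item = False
        self.translations = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            # a new cell starts; remember whether it is the content cell
            self.in_cell = dict(attrs).get('id') == 'content_cell'
        elif tag == 'li' and self.in_cell:
            self.in_item = True
            self.translations.append('')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_item = False
        elif tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # only keep text that sits inside a list item of the content cell
        if self.in_item:
            self.translations[-1] += data

sample = '''
<td id="lang_cell" width="20%">English:</td>
<td id="content_cell" width="80%">
<li><a href="/english/yes.html">yes</a></li>
<p id="msg"></p></td>
'''

parser = TranslationParser()
parser.feed(sample)
parser.close()
print(parser.translations)  # ['yes']
```

This returns a plain list of strings instead of an HTML fragment, so it would need its own formatting step before being displayed next to the word statistics.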

We can integrate this into the existing code:

In [101]:
def show_word_and_translation_html_only(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    # labels follow the order of the stored tuple: (rank, abs_count, rel_count)
    for k,v in zip(('rank', 'absolute count', 'relative count'),
                   word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k,v)
    # word[:-1] drops the trailing space left over from parsing
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    s += extract_word_definition(urllib2.urlopen(url).read())
    s += '</table>'
    display(HTML(s))
In [102]:
interact(show_word_and_translation_html_only,
         n=(0, len(word_dict.keys()) - 1))

Word: toimisto

rank            1561
absolute count  4584
relative count  0.007788
Finnish: toimisto
English:
  • board
  • bureau
  • office

Finally, I can present the same tool but with a sorted ranking. The dictionary is keyed by word rather than by rank, so we can't simply index it with a number:

In [112]:
word_dict[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-112-547a459bb01d> in <module>()
----> 1 word_dict[0]

KeyError: 0

Instead, we sort the keys by rank, which is the first element of each stored tuple:

In [113]:
sorted_keys = sorted(word_dict.keys(), key=lambda n:word_dict[n][0])
In [115]:
sorted_keys[:10]
Out[115]:
[u'olla ',
 u'ja ',
 u'ei ',
 u'se ',
 u'ett\xe4 ',
 u'joka ',
 u'h\xe4n ',
 u'saada ',
 u'mutta ',
 u't\xe4m\xe4 ']
In [116]:
def show_word_and_translation_html_only_sorted(n):
    word = sorted_keys[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    # labels follow the order of the stored tuple: (rank, abs_count, rel_count)
    for k,v in zip(('rank', 'absolute count', 'relative count'),
                   word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k,v)
    # word[:-1] drops the trailing space left over from parsing
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    s += extract_word_definition(urllib2.urlopen(url).read())
    s += '</table>'
    display(HTML(s))
In [117]:
interact(show_word_and_translation_html_only_sorted,
         n=(0, len(word_dict.keys()) - 1))

Word: joukkue

rank            180
absolute count  31653
relative count  0.053775
Finnish: joukkue
English:
  • team

To make things easier, we can also restrict the slider to the 200 or so most common words:

In [118]:
interact(show_word_and_translation_html_only_sorted,
         n=(0, 200))

Word: tulla

rank            14
absolute count  192327
relative count  0.326742
Finnish: tulla
English:
  • come
  • get
  • grow
  • show up

Conclusions

In this post, I've tried to develop a simple dictionary tool for learning Finnish that can be used interactively through the IPython Notebook HTML widgets. I found this fun to write, and I hope to use it again for learning purposes.

One possible extension is to keep track of the words I already know, in a sort of vocabulary database, and then suggest new words to learn based on how similar their spelling is to words in that database, so as to expand my current vocabulary.
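
As a rough illustration of that idea, the sketch below uses difflib from the Python 3 standard library to suggest candidates whose spelling is close to words already known. Everything in it is an assumption made for the example: the `suggest` helper, the cutoff value, and both word lists are made up and are not part of the tool built in this post.

```python
# Python 3 sketch; both word lists are made-up examples.
import difflib

known_words = ['olla', 'tulla', 'saada']
candidates = ['tuli', 'joukkue', 'sanoa', 'tulos', 'toimisto']

def suggest(known, pool, cutoff=0.6):
    """Map each known word to the candidates that are spelled similarly."""
    return {
        word: difflib.get_close_matches(word, pool, n=3, cutoff=cutoff)
        for word in known
    }

suggestions = suggest(known_words, candidates)
for word, similar in suggestions.items():
    print(word, '->', similar)
```

Similarity in spelling says nothing about meaning, so this would only be a starting point for picking learning candidates.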
