Exploring Finnish words using the interactive IPython HTML widgets
In this post, we're going to take a look at the Finnish language. Our starting point is a file found here, which contains the 10,000 most common words in Finnish.
import urllib2
response = urllib2.urlopen('http://www.csc.fi/english/research/sciences/linguistics/taajuussanasto-B9996/download')
words = response.read()
As the file is encoded in utf-8, we first need to decode it, i.e. convert it to unicode, before we can use it. A great help in figuring this out is the fantastic tutorial presentation at http://farmdev.com/talks/unicode/.
words = words.decode('utf-8')
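To see what the decoding step actually does, here is a minimal offline sketch on a hypothetical byte string (not read from the real file):

```python
# UTF-8 stores 'ä' as two bytes (0xc3 0xa4); decoding recovers one code point.
# (Hypothetical byte string, not taken from the actual download.)
raw = b'h\xc3\xa4n (pronomini)'
text = raw.decode('utf-8')
# the decoded string is one character shorter than the raw byte string
```
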
We can now split the text to its component lines in order to start exploring it.
words = words.splitlines()
words[:10]
[u' Sanahakemisto (laskevan taajuuden mukaan)', u'', u' N Abs Rel Uppslagsord', u' 1 2716396 4,614851 olla (verbi)', u' 2 1566108 2,660641 ja (konjunktio)', u' 3 593462 1,008225 ei (verbi)', u' 4 538609 0,915036 se (pronomini)', u' 5 443301 0,753118 ett\xe4 (konjunktio)', u' 6 417984 0,710108 joka (pronomini)', u' 7 344927 0,585992 vuosi (substantiivi)']
As we can see, each line combines the word with additional information:
- rank
- absolute count of the word in the corpus
- relative frequency
- the word itself
These fields can be parsed in the following way: we write a simple lambda function for each one, which we can then apply to the whole list.
words[10]
u' 8 302803 0,514428 h\xe4n (pronomini)'
print words[10]
8 302803 0,514428 hän (pronomini)
First, the rank.
rank = lambda w: int(w[:8])
rank(words[10])
8
The absolute word count.
abs_count = lambda w: int(w[8:15])
abs_count(words[10])
302803
The relative count.
rel_count = lambda w: float(w[15:25].replace(',', '.'))
rel_count(words[10])
0.514428
Finally, the word itself, in unicode form.
the_word = lambda w: w[25:].split('(')[0]
print the_word(words[10])
hän
Having set up these functions, we can apply them to each row of the text file that we want to parse.
word_dict = dict(
    [(the_word(w),
      (rank(w), abs_count(w), rel_count(w))) for w in words[3:-6]])
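To check the pipeline end to end without downloading anything, the same lambdas can be applied to a couple of sample lines. These lines are reconstructed from the output shown earlier, so the exact column widths (8 / 7 / 10 characters) are an assumption:

```python
# A simple lambda function per fixed-width field of each line
rank = lambda w: int(w[:8])
abs_count = lambda w: int(w[8:15])
rel_count = lambda w: float(w[15:25].replace(',', '.'))
the_word = lambda w: w[25:].split('(')[0]

# Hypothetical sample lines, reconstructed from the output shown above
sample = [u'       8 302803 0,514428 h\xe4n (pronomini)',
          u'       7 344927 0,585992 vuosi (substantiivi)']
word_dict_demo = dict((the_word(w), (rank(w), abs_count(w), rel_count(w)))
                      for w in sample)
# note that the parsed keys keep a trailing space, e.g. u'h\xe4n '
```
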
With this new word dictionary, we can build a sort of widget that lets us play with the words.
from IPython.html.widgets import interact
from IPython.display import HTML, display
def show_word(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    # the value tuple is (rank, absolute count, relative count)
    for k, v in zip(('rank', 'absolute count', 'relative count'),
                    word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k, v)
    s += '</table>'
    display(HTML(s))
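Setting the display call aside, the HTML assembly is plain string formatting; a standalone sketch of the same table construction, with a made-up entry, looks like this:

```python
def word_table(word, stats):
    # stats is assumed to be a (rank, absolute count, relative count) tuple
    s = '<h3>Word: %s</h3><table>\n' % word
    for k, v in zip(('rank', 'absolute count', 'relative count'), stats):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k, v)
    return s + '</table>'

html = word_table(u'vuosi', (7, 344927, 0.585992))
```
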
Let's test this function by calling it on the item at index 3.
show_word(3)
Word: arvokisa
rank | 4188 |
absolute count | 1360 |
relative count | 0.00231 |
And now, let's make this interactive with the latest IPython widget machinery.
interact(show_word,
n=(0, len(word_dict.keys()) - 1))
Word: diplomaattinen
rank | 7201 |
absolute count | 672 |
relative count | 0.001142 |
<function __main__.show_word>
Adding a translation from an external website
We can easily supplement a translation of the word displayed by using an external website and displaying a search page for the word we're using:
from IPython.display import IFrame
IFrame('http://www.fincd.com/index.php?txtSearch=tunti&lang=fi', width='100%', height=350)
But how do we encode unicode strings for use in URLs?
Let's use an example of what we want to do: the encoded url for the word kyllä is http://www.fincd.com/finnish/kyll%E4.html
my_word = u'kyllä'
This word has to be encoded somehow. Looking at the website, we find it uses the iso-8859-1 encoding. Let's try that on our word.
my_word.encode('iso-8859-1')
'kyll\xe4'
Once encoded, we can use the quote method to make it ready for URLs:
urllib2.quote(my_word.encode('iso-8859-1'))
'kyll%E4'
Finally, we can put this together:
print 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(my_word.encode('iso-8859-1'))
http://www.fincd.com/index.php?txtSearch=kyll%E4&lang=fi
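urllib2.quote here is just the standard library's percent-encoder applied to the latin-1 bytes; for reference, the Python 3 equivalent lives in urllib.parse, and the import fallback below handles both versions:

```python
try:
    from urllib import quote        # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3

encoded = u'kyll\xe4'.encode('iso-8859-1')   # 'ä' becomes the single byte 0xE4
url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % quote(encoded)
```
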
Good! Now, let's write a function that shows the word together with its translation, and interact with it.
def show_word_and_translation(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    for k, v in zip(('rank', 'absolute count', 'relative count'),
                    word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k, v)
    s += '</table>'
    display(HTML(s))
    # word keeps a trailing space from parsing, hence the [:-1]
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    display(IFrame(url, width='100%', height=350))
interact(show_word_and_translation,
n=(0, len(word_dict.keys()) - 1))
Word: enso
rank | 2690 |
absolute count | 2351 |
relative count | 0.003994 |
Can we do better than that? Let's see if we can extract just the table with the relevant content from the website.
response = urllib2.urlopen('http://www.fincd.com/finnish/kyll%E4.html')
source = response.read()
source_split = source.decode("iso-8859-1").splitlines()
The really interesting part is here:
source_split[85:120]
[u'\t<a href="/friends/">[Links]</a> ', u'\t<a href="javascript:bookmark()">[Bookmark]</a>', u' </td></tr>', u'</table>', u'', u'<table width="728" align="center" id="tbDotBorder">', u' <tr>', u' <td id="lang_cell" width="20%">Finnish:</td>', u' <td id="helper_cell" width="80%"><a href = "/finnish/kyll%E4.html">kyll\xe4</a>\t', u'\t</td>', u' </tr>', u' <tr>', u' <td id="lang_cell" width="20%">English:</td>', u' <td id="content_cell" width="80%">', u'\t<li><a href="/english/yes.html">yes</a></li>', u'\t<p id="msg"></p>\t</td>', u' </tr>', u' <tr>', u' <td colspan="2" id="suggestion_cell"><!--Write your own explain here--></td>', u' </tr>', u' <tr>', u' <td colspan="2">', u' <table width="100%">', u' <td width="50%" id="discuss_cell"> </td>', u' <td width="50%" id="discuss_cell"> </td>', u' </tr>', u' </table>', u' </td>', u' <tr><td colspan="3" align="right" id="copy_right"><small><a href = "/old/" title="Suomi Englanti sanakirja ">Suomi Englanti Suomi sanakirja Beta5</a></small></td></tr>', u'</table>', u'', u'<br>', u'', u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">', u'<tr>']
We can find the exact indices for the table with the information here:
source_split.index(u'<table width="728" align="center" id="tbDotBorder">')
90
source_split.index(u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">')
118
This slice can easily be rendered as HTML.
HTML("".join(source_split[90:118]))
Let's try to extract only the meaningful information using regular expressions.
src = "".join(source_split[90:118])
import re
p = re.compile('<tr>')
iterator = p.finditer(src)
for match in iterator:
    print match.span()
(53, 57)
(201, 205)
(368, 372)
(461, 465)
(619, 623)
Judging from these matches, we only need the first two table rows to extract our data. The data therefore lies between characters 53 and 368, the start of the first and of the third <tr>.
Let's design a function that extracts exactly this last part:
def extract_word_definition(source):
    source_split = source.decode("iso-8859-1").splitlines()
    start = source_split.index(u'<table width="728" align="center" id="tbDotBorder">')
    stop = source_split.index(u'<table width="728" align="center" id="tbDotBorder" style="border-style:none">')
    src = "".join(source_split[start:stop])
    p = re.compile('<tr>')
    iterator = p.finditer(src)
    spans = [match.span() for match in iterator]
    start = spans[0][0]
    stop = spans[2][0]
    return src[start:stop]
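The span-slicing logic can be exercised without a network round trip on a small synthetic snippet (this HTML is made up and much shorter than the real page):

```python
import re

# Made-up miniature of the dictionary page: keep everything from the
# first <tr> up to, but not including, the third <tr>.
src = ('<table><tr><td>Finnish:</td><td>kyll\xe4</td></tr>'
       '<tr><td>English:</td><td>yes</td></tr>'
       '<tr><td>comments</td></tr></table>')
spans = [m.span() for m in re.finditer('<tr>', src)]
definition = src[spans[0][0]:spans[2][0]]
```
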
We can integrate this into the existing code:
def show_word_and_translation_html_only(n):
    word = word_dict.keys()[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    for k, v in zip(('rank', 'absolute count', 'relative count'),
                    word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k, v)
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    s += extract_word_definition(urllib2.urlopen(url).read())
    s += '</table>'
    display(HTML(s))
interact(show_word_and_translation_html_only,
n=(0, len(word_dict.keys()) - 1))
Finally, I can present the same tool, but with the words sorted by rank. Note that the dictionary itself is not indexed by rank:
word_dict[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-112-547a459bb01d> in <module>()
----> 1 word_dict[0]

KeyError: 0
sorted_keys = sorted(word_dict.keys(), key=lambda n:word_dict[n][0])
sorted_keys[:10]
[u'olla ', u'ja ', u'ei ', u'se ', u'ett\xe4 ', u'joka ', u'h\xe4n ', u'saada ', u'mutta ', u't\xe4m\xe4 ']
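Sorting by rank is the usual sorted(..., key=...) pattern: the key function looks up each word's value tuple and returns its first element. On a tiny hypothetical dictionary:

```python
# Hypothetical miniature of word_dict: word -> (rank, abs count, rel count)
demo = {u'vuosi ': (7, 344927, 0.585992),
        u'olla ': (1, 2716396, 4.614851),
        u'h\xe4n ': (8, 302803, 0.514428)}
# sort the keys by the rank stored as the first element of each value tuple
by_rank = sorted(demo, key=lambda w: demo[w][0])
```
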
def show_word_and_translation_html_only_sorted(n):
    word = sorted_keys[n]
    s = '<h3>Word: %s</h3><table>\n' % word
    for k, v in zip(('rank', 'absolute count', 'relative count'),
                    word_dict[word]):
        s += '<tr><td>{0}</td><td>{1}</td></tr>\n'.format(k, v)
    url = 'http://www.fincd.com/index.php?txtSearch=%s&lang=fi' % urllib2.quote(word[:-1].encode('iso-8859-1'))
    s += extract_word_definition(urllib2.urlopen(url).read())
    s += '</table>'
    display(HTML(s))
interact(show_word_and_translation_html_only_sorted,
n=(0, len(word_dict.keys()) - 1))
To make things easier, we can also just consider the first 200 words:
interact(show_word_and_translation_html_only_sorted,
n=(0, 200))
Conclusions
In this post, I've developed a simple dictionary tool for learning the Finnish language, one that can be used interactively with the IPython Notebook HTML widgets. I found this fun to write, and I hope to use it again for learning purposes.
One thing that could be added is a notion of the words I already know, a sort of vocabulary database; the tool could then suggest new learning candidates based on spelling similarity, so as to expand my current vocabulary.