Training an Unknown Word Guesser for Swedish

Words that do not appear in the lexicon - so-called unknown words - present a problem for part-of-speech tagging. A guesser is the module in a part-of-speech tagger that deals with this problem by trying to guess the part of speech of unknown words. In this exercise, we will see that it is straightforward to train a guesser in the µ-TBL system.

Brill addressed this problem (Brill 1994), and what we are going to experiment with here is strongly influenced by his approach. The idea is simply to let rules inspect the prefixes and suffixes of a word and take a guess from those - a guess that may later be overridden by rules further down in the sequence.
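
To make the idea concrete, here is a minimal sketch in Python of how an ordered sequence of affix rules might be applied to a single word. It is not µ-TBL code and does not use the µ-TBL rule format: the function name apply_rules, the tuple layout of a rule, the default tag and the two example rules are all assumptions invented for illustration, not rules learned from the exercise data.

```python
# Hypothetical rule layout: (affix_kind, affix, from_tag, to_tag).
# A rule fires when the word carries the affix and currently has from_tag.

def apply_rules(word, rules, default_tag="NN"):
    """Start from a default guess and let each rule, in order, revise it."""
    tag = default_tag
    for kind, affix, from_tag, to_tag in rules:
        if tag != from_tag:
            continue  # rule does not apply to the current guess
        if len(word) <= len(affix):
            continue  # the affix must be a proper part of the word
        if kind == "suffix" and word.endswith(affix):
            tag = to_tag
        elif kind == "prefix" and word.startswith(affix):
            tag = to_tag
    return tag

# Invented example rules: the second one overrides the first for words in -ka.
RULES = [
    ("suffix", "a", "NN", "VB"),
    ("suffix", "ka", "VB", "NN"),
]

for w in ["tala", "flicka", "bil"]:
    print(w, apply_rules(w, RULES))
```

With these invented rules, "flicka" is first retagged by the rule for -a and then changed back by the later, more specific rule for -ka, which is the kind of overriding described above.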

We will use a subset of the templates that Brill used, converted into the µ-TBL system's template formalism.

Train and test, and see what happens! From the OS prompt, run:

> ./mutbl -f examples/guessing.script

Inspect the script for information about where templates and training and test data are located.

In your report, I would like you to consider the following questions:

  1. If you look at the data files, you'll notice that the learning problem is represented differently than in the PoS tagging task. In what way does it differ, and why do you think that is?
  2. The template file generates candidate rules that look only at suffixes and prefixes up to N=2 characters long. The learning result will improve if longer suffixes and prefixes are considered as well. Run a few experiments and try to find the optimal setting of N. Do you think that this is a language-dependent setting?
  3. Look at the distribution of rules between the ones looking at prefixes and the ones looking at suffixes. One kind is more common than the other. Why do you think that is? Would that be language-dependent?
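
When experimenting with N in question 2, it may help to see which affixes become available to candidate rules as N grows. The following is a minimal sketch in Python, not µ-TBL template syntax; the function name affixes and the choice to skip affixes that cover the whole word are assumptions made for illustration.

```python
def affixes(word, max_len):
    """Return the prefixes and suffixes of word up to max_len characters.

    Affixes as long as the word itself are skipped (an assumption here),
    so that something of the word always remains outside the affix.
    """
    lengths = [k for k in range(1, max_len + 1) if k < len(word)]
    prefixes = [word[:k] for k in lengths]
    suffixes = [word[-k:] for k in lengths]
    return prefixes, suffixes

# Compare what is visible at N=2 and at N=4 for the same word.
for n in (2, 4):
    print(n, affixes("flickorna", n))
```

Each extra character in N adds at most one prefix and one suffix per word, but the number of distinct affixes over a whole training set can grow much faster.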