In this exercise, we will use the µ-TBL system to train a simple transformation-based part-of-speech tagger for English. We take our training and test corpora from the Wall Street Journal corpus, annotated with the Penn Treebank tagset. However, since the purpose of this exercise is just to demonstrate the technique, we will use only between 7,500 and 60,000 words for training. The distribution also contains a suitable test corpus, distinct from the training corpus, consisting of almost 10,000 words.
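To make the idea of a transformation-based tagger concrete, here is a minimal Python sketch of how an ordered sequence of learned rules might be applied to an initial tagging. It is only an illustration: the `Rule` type and the functions `apply_rule` and `apply_rules` are hypothetical names, the single rule template (change a tag when the previous tag has a given value) is an assumption, and none of this reflects the rule format or internals of µ-TBL.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule: change from_tag to to_tag if the previous tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Apply one rule to a tag sequence; contexts are read from the input tags."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply rules in order; each rule sees the output of the one before it."""
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Toy data (assumed for illustration): an initial tagging that assigns
# each word its most likely tag, followed by one correcting rule.
words = ["They", "want", "to", "race"]
initial = ["PRP", "VBP", "TO", "NN"]
rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

final = apply_rules(initial, rules)
print(list(zip(words, final)))
```

The order of the rules matters, since a later rule operates on the tags produced by earlier ones. This is why editing a learned rule sequence by hand can have effects further down the sequence.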
If all went well with the test (see main page), you are ready to proceed. There are two parts to this exercise.
As can be seen from the test run, the result on the test corpus is 96.1%. (In the exercises, recall, precision and F-score will always be the same number, which we will refer to as "accuracy".) Your task in this exercise is to improve upon this performance. There are several means to accomplish this:
Given your best rule sequence so far, try to make use of the HTML error analysis tool, documented as follows in the user's manual:
What are the most common kinds of errors that your tagger makes? Try to correct some of the errors by manually manipulating (one of) the learned rule sequences. Describe the difficulties that you encounter.
A short report describing your experiments and their outcome, with links to templates, scripts, learned rules, etc., should be available on your web page on the 9th of November. Try to describe general tendencies in how the setting of the above parameters relates to the performance of the learned rule sequences. Give some sort of qualitative error analysis (perhaps in the form of a confusion matrix). Try to come up with suggestions for improving the performance further. Relate your investigations to what is said in the literature.
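For the qualitative error analysis, a confusion count over (gold tag, predicted tag) pairs is one simple starting point. The Python sketch below is independent of µ-TBL and its error analysis tool; the function names `accuracy` and `confusions` and the toy tag sequences are assumptions made for illustration. Because every token receives exactly one tag, the number of proposed tags equals the number of gold tags, so precision and recall share the same numerator and denominator and coincide with accuracy.

```python
from collections import Counter
from typing import List


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty tag sequences of equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def confusions(gold: List[str], predicted: List[str]) -> Counter:
    """Count (gold, predicted) pairs, for mistagged tokens only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


# Toy data (assumed for illustration, not taken from the corpus).
gold = ["DT", "NN", "VBD", "IN", "DT", "NN", "VB", "JJ"]
pred = ["DT", "NN", "VBN", "IN", "DT", "VB", "NN", "JJ"]

acc = accuracy(gold, pred)
errors = confusions(gold, pred)

print(f"accuracy: {acc:.3f}")
for (g, p), n in errors.most_common():
    print(f"gold {g} tagged as {p}: {n}")
```

Sorting the pairs by frequency shows which tag confusions dominate, and these are natural candidates for new templates or hand-written corrections.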