Training a Part of Speech Tagger with the µ-TBL System

Introduction

In this exercise, we will use the µ-TBL system to train a simple transformation-based part of speech tagger for English. We pick our training and test corpus from the Wall Street Journal corpus, annotated with the Penn Treebank tagset. However, since the purpose of this exercise is just to demonstrate the technique, we will use only between 7,500 and 60,000 words for training. The distribution also contains a suitable test corpus, distinct from the training corpus, consisting of almost 10,000 words.

Exercises

If all went well with the test (see main page), you are ready to proceed. There are two parts to this exercise.

Part A

As can be seen from the test run, the result on the test corpus is 96.1%. (In the exercises, recall, precision and F-score will always be the same number, which we will refer to as "accuracy".) Your task in this exercise is to improve upon this performance. There are several ways to accomplish this:

  1. Modify the set of templates. The template file pointed to in the test script contains only eight templates. Your task is to come up with a better set of templates; take inspiration from (Brill 1995). Note that there is a trade-off between the number of templates used and the time it takes to train with them.
  2. Increase the size of the training data. The test script points to training data containing only (roughly) 7,500 words. The distribution also contains training data containing 15,000, 30,000 and 60,000 words. Look in the 'data' directory. Experiment with different sizes.
  3. Lower the score threshold. The flag 'score_threshold' is set to 6. You may use a value as low as 1, which may improve the result but will increase the learning time. Try different values.
  4. Change the accuracy threshold. The accuracy threshold is a floating point number between 0.5 and 1.0. The default is 0.5. The effect of using a higher number is that certain rules (the ones that are not accurate enough) will not appear in the learned sequence. The effect of setting the accuracy threshold to 1.0 is that only rules that are 100% accurate are learned. (It is part of this exercise to determine whether this is a good idea.)
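
To get a feeling for how the two thresholds in items 3 and 4 interact, the following Python sketch may help. It is not part of the µ-TBL system. It assumes the definitions that are usual in the transformation-based learning literature: the score of a candidate rule is the number of errors it corrects minus the number of errors it introduces, and its accuracy is the fraction of its applications that are corrections. Whether µ-TBL uses exactly these definitions, and whether it compares with ">=" or ">", should be checked against the user's manual. The rule names and counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateRule:
    """A hypothetical candidate rule with its counts on the training corpus."""
    name: str
    good: int  # number of wrong tags the rule would correct
    bad: int   # number of correct tags the rule would spoil

    @property
    def score(self) -> int:
        # Assumed definition: net number of errors removed.
        return self.good - self.bad

    @property
    def accuracy(self) -> float:
        # Assumed definition: share of applications that are corrections.
        total = self.good + self.bad
        return self.good / total if total else 0.0


def select(rules, score_threshold, accuracy_threshold):
    """Return the names of the rules that pass both thresholds (assumed '>=')."""
    return [
        r.name
        for r in rules
        if r.score >= score_threshold and r.accuracy >= accuracy_threshold
    ]


# Invented counts, for illustration only.
candidates = [
    CandidateRule("A", good=10, bad=2),  # score 8, accuracy about 0.83
    CandidateRule("B", good=4, bad=0),   # score 4, accuracy 1.0
    CandidateRule("C", good=7, bad=5),   # score 2, accuracy about 0.58
]

print(select(candidates, score_threshold=6, accuracy_threshold=0.5))
print(select(candidates, score_threshold=1, accuracy_threshold=0.5))
print(select(candidates, score_threshold=1, accuracy_threshold=1.0))
```

Under these assumptions, lowering the score threshold admits more rules, while raising the accuracy threshold removes rules that correct many errors but also introduce some. Note that a real learner re-counts after every learned rule, which this static sketch does not do.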

Part B

Given your best rule sequence so far, try to make use of the HTML error analysis tool, documented as follows in the user's manual:

write_html_error_data
Computes error data and saves it in a HTML-based format. Load (or reload) the file "error_data.html" into a HTML browser to view it. The templates must be loaded in order to run this command.

What are the most common kinds of errors that your tagger makes? Try to correct some of the errors by manually manipulating (one of) the learned rule sequences. Describe the difficulties that you encounter.
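
One reason manual editing can be harder than it looks is that the rules in a learned sequence are applied in order, so a rule may depend on changes made by the rules before it. The Python sketch below illustrates this with two invented rules of the form "change tag X to Y if the preceding tag is Z". It is not µ-TBL code and does not use the µ-TBL rule format. It also assumes that each rule checks its context against the tags as they stood before that rule started, which may differ from what your system does.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag.

    The context is checked against the tags as they were before this
    rule was applied (an assumption made for this sketch).
    """
    old = list(tags)
    return [
        to_tag if i > 0 and old[i] == from_tag and old[i - 1] == prev_tag else tag
        for i, tag in enumerate(old)
    ]


def apply_sequence(tags, rules):
    """Apply the rules one after another, each to the output of the previous one."""
    for from_tag, to_tag, prev_tag in rules:
        tags = apply_rule(tags, from_tag, to_tag, prev_tag)
    return tags


# Two invented rules, for illustration only.
r1 = ("NN", "VB", "TO")  # NN -> VB if the preceding tag is TO
r2 = ("NN", "JJ", "NN")  # NN -> JJ if the preceding tag is NN

initial = ["TO", "NN", "NN"]

print(apply_sequence(initial, [r1, r2]))  # ['TO', 'VB', 'NN']
print(apply_sequence(initial, [r2, r1]))  # ['TO', 'VB', 'JJ']
```

In the first order, r1 removes the context that r2 needs, so r2 never fires. Moving, deleting or inserting a rule by hand can therefore change the behaviour of every rule that follows it.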

Report

A short report describing your experiments and their outcome, with links to templates, scripts, learned rules, etc., should be available on your web page on the 9th of November. Try to describe general tendencies in how the setting of the above parameters relates to the performance of the learned rule sequences. Give some sort of qualitative error analysis (perhaps in the form of a confusion matrix). Try to come up with suggestions for improving the performance further. Relate your investigations to what is said in the literature.
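
If you want to build a confusion matrix for the report yourself, a minimal Python sketch along the following lines may be a starting point. It assumes that you have already extracted two equally long lists of tags, the gold-standard tags and the tags assigned by your tagger. How to get these lists out of the µ-TBL files is not shown, and the function names and the toy data are invented for illustration.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each (gold tag, predicted tag) pair occurs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def top_errors(counts, n=5):
    """Return the n most frequent (gold, predicted) pairs that are errors."""
    errors = Counter({pair: c for pair, c in counts.items() if pair[0] != pair[1]})
    return errors.most_common(n)


# Toy data, invented for illustration only.
gold = ["DT", "NN", "VBD", "IN", "NN", "VBD", "JJ", "NN"]
pred = ["DT", "NN", "VBN", "IN", "NN", "VBN", "NN", "NN"]

counts = confusion_counts(gold, pred)
print(accuracy(gold, pred))   # 0.625
print(top_errors(counts, 1))  # [(('VBD', 'VBN'), 2)]
```

Since every token receives exactly one tag, the number of proposed tags equals the number of gold tags, which is why precision, recall and F-score coincide with this accuracy figure.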