This series of lectures and exercises has been developed for a course in Machine Learning to be held at Göteborg University in the fall of 2005. The full course consists of four parts; this page is concerned only with the second part, which deals specifically with Transformation-Based Learning (TBL).
Before you do anything else, you need to install the learner and the material relevant to the assignment. See below. Note that even if you already have an installation of the system, you will still need the course material.
Can be found here (PDF).
The assignment consists of six parts, where each part involves the training and testing of rules for a particular NLP task (each of which will have been touched upon in the lectures):
So, are you expected to do all of them? No. If you have already taken the GSLT course NLP1, then you have already done 1), and I of course do not expect you to do it again. In that case you must do 2) - 6). If you haven't done 1) before, then you must do it, but you are then not required to do 5) and 6). I encourage you to have a go at them anyway.
All material in the form of training and test data, and in some cases small sets of templates, will be made available to you. Your task is to - with an eye to the suggested literature (see below) - perform a few training-test-modification runs in order to get as good a result as possible. For practical reasons, the training material will be small, so you should not expect to reach top-notch performance figures. Once you know how to work, you should not have to spend more than an hour, and maybe only half an hour, on each task (except for 1, which takes longer). Note, however, that you should also read the papers. Preferably, you will already have browsed the papers before the course starts.
Your short report should specify - for each task, and at a minimum - the set of templates that you ended up using, the (initial part of the) resulting rule sequence, and the performance (in terms of accuracy, recall and precision, etc.) that the resulting rule sequence has. (The report that I expect you to write for task 1) is a bit more involved - see the instructions for that particular task.) Sometimes I have given you some additional questions to consider as well, and any other kind of comments or relevant discussion is of course also welcome.
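As a rough aid for the performance part of the report, here is a minimal Python sketch of how recall, precision and F-score can be derived from three counts. The definitions are the standard ones and are an assumption on my part, not taken from the µ-TBL manual; the function name `tagging_scores` is purely illustrative. The example counts are those printed by the sample run in the installation test further down this page.

```python
def tagging_scores(corpus_size: int, num_tags: int, num_correct: int) -> dict:
    """Return recall, precision and F-score as percentages (one decimal).

    Assumed (standard) definitions, not taken from the µ-TBL manual:
      recall    = correct tags / corpus size
      precision = correct tags / number of assigned tags
      F-score   = harmonic mean of precision and recall
    """
    recall = num_correct / corpus_size
    precision = num_correct / num_tags
    if precision + recall == 0:
        f_score = 0.0  # no correct tags at all
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {
        "recall": round(100 * recall, 1),
        "precision": round(100 * precision, 1),
        "f_score": round(100 * f_score, 1),
    }


# Counts from the sample run: before and after applying the learned rules.
before = tagging_scores(corpus_size=9625, num_tags=9625, num_correct=9228)
after = tagging_scores(corpus_size=9625, num_tags=9625, num_correct=9245)
print(before)
print(after)
```

With exactly one tag per word, the number of tags equals the corpus size, so recall, precision and F-score coincide, as they do in the sample output.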
In case you choose to carry out your individual course project within the TBL framework, here are some suggestions:
If you work from home, you need to install the learner plus the material for the course yourself. Download one of the following:
Unpack the distribution. The resulting directory contains all you need. The distribution also contains a user's manual. Make sure you read it!
To test the distribution, invoke the µ-TBL system with the -f flag and the script 'pos_tagging.script' from the examples directory.
> ./mutbl -fexamples/pos_tagging.script
If everything is installed correctly, you should see the following printed to the screen:
*****************************************************
The µ-TBL system, version 1.0
Copyright © Torbjörn Lager 2000
Department of Linguistics, Uppsala University, Sweden
The µ-TBL system comes with absolutely no warranty.
Type "help." to list all available commands.
*****************************************************
Learning a rule sequence...
Loading data: data/wsj_7500 ... done! Size is 7494.
Loading algorithm: algorithms/brill ... done!
Loading templates: templates/test_templates ... done!
11 1.00 tag:'VBP'>'VB' <- tag:'MD'@[-1,-2]
11 1.00 tag:'VBN'>'VBD' <- tag:'PP'@[-1]
8 1.00 tag:'NN'>'VB' <- tag:'MD'@[-1]
7 0.77 tag:'JJ'>'RB' <- wd:due@[0]
7 1.00 tag:'VBP'>'VB' <- tag:'TO'@[-1]
6 1.00 tag:'VB'>'VBP' <- tag:'NNS'@[-1]
6 0.88 tag:'VB'>'NN' <- tag:'DT'@[-1,-2]
6 1.00 tag:'IN'>'WDT' <- tag:'VBD'@[1]
8 rule(s) for feature(s) [tag]
Testing the learned rule sequence...
Loading templates: templates/pos_tagging_templates ... done!
Loading data: data/wsj_test ... done! Size is 9625.
DATA STATISTICS:
Corpus Size: 9625
Number of Tags: 9625
Number of Correct Tags: 9228
Number of Errors: 397
Recall: 95.9%
Precision: 95.9%
F-Score: 95.9%
Number of Tags per Word: 1.000
Applied 8 rule(s) for feature(s) [tag] in 0.220 seconds
DATA STATISTICS:
Corpus Size: 9625
Number of Tags: 9625
Number of Correct Tags: 9245
Number of Errors: 380
Recall: 96.1%
Precision: 96.1%
F-Score: 96.1%
Number of Tags per Word: 1.000
Saving the rule sequence(s) in file 'rules/test.pl'.
Generating data for the Error Browser...
Load (or reload) the file "error_data.html"
into a HTML browser to view error data.
Finished!
bash-2.05$
It is a good idea to study the script 'examples/pos_tagging.script' together with the manual, and to try to connect the different commands in it with what is printed when the script runs.
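To help with reading the rule lines in the output above, here is a small Python sketch of how a single learned rule might be applied to a tagged sentence. It is not part of µ-TBL. It assumes that a rule such as tag:'VBP'>'VB' <- tag:'MD'@[-1,-2] means "change VBP to VB if the tag MD occurs at one of the positions -1 or -2", and that conditions are checked against the tags as they were before the rule fired. Check the manual for the exact semantics. The names `Rule` and `apply_rule`, and the example sentence, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    from_tag: str             # tag to be replaced
    to_tag: str               # replacement tag
    feature: str              # "tag" or "wd": what the condition looks at
    value: str                # required tag or word
    offsets: Tuple[int, ...]  # relative positions; any one may match (assumed)


def apply_rule(rule: Rule, words: List[str], tags: List[str]) -> List[str]:
    """Apply one rule to a whole sentence and return the new tag list.

    Conditions are evaluated on the input tags, not on tags already
    changed by this rule (an assumption of this sketch).
    """
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        if tag != rule.from_tag:
            continue
        for offset in rule.offsets:
            j = i + offset
            if not 0 <= j < len(tags):
                continue  # position falls outside the sentence
            seen = tags[j] if rule.feature == "tag" else words[j]
            if seen == rule.value:
                new_tags[i] = rule.to_tag
                break
    return new_tags


# tag:'VBP'>'VB' <- tag:'MD'@[-1,-2]
modal_rule = Rule("VBP", "VB", "tag", "MD", (-1, -2))
# tag:'JJ'>'RB' <- wd:due@[0]
due_rule = Rule("JJ", "RB", "wd", "due", (0,))

words = ["They", "can", "not", "go"]
tags = ["PP", "MD", "RB", "VBP"]
print(apply_rule(modal_rule, words, tags))
```

A learned rule sequence would then be applied by calling `apply_rule` once per rule, in the order the rules were learned, feeding each result into the next call.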