Finite state grammar for finding grammatical errors in Swedish text


This project is part of the project Integrated language tools for writing and document handling sponsored by the HSFR/NUTEK Language Technology programme.

Other participating groups are:

AIM

The aim of this part of the project is to use finite state tools to find grammatical errors in Swedish texts.

Grammatical errors, as opposed to spelling errors, have to do with the syntactic structure of language. Examples that we are interested in include agreement and selection phenomena: "det stora hus", "ett stora huset", "statsrådet är sjuk", "han vill kommer"; word order phenomena: "om han kommer inte idag, så ..."; missing sentence boundaries: "Han kommer inte idag han är sjuk."

Such grammatical errors can arise in texts written by native speakers of Swedish not only as a result of typing errors but also as a result of incomplete editing. For example, if the writer decided to change "Han är sjuk" to "Statsrådet är sjukt" she may have changed "Han" to "Statsrådet" and omitted to edit the adjective to agree. Similarly, the writer may have written "Han kommer inte idag" as a main sentence and then decided that it should be an if-clause and inserted "om" and decapitalized "Han", but omitted to change the order of "kommer" and "inte". Such errors can also occur as a result of lack of knowledge of the language, as in writing by non-native speakers, and as a result of language deficiencies, as in writing by dyslectics.

METHOD

We seek to build a general resource tool that will find such errors in free Swedish text and can therefore be used in a number of different applications. We will use finite state grammar techniques. The particular technique is described for example in Karttunen et al. (1996) and exploits the fact that one can subtract one finite state automaton from another and still have a finite state automaton remaining. The idea is to write two finite state grammars: one "broad" and the other "narrow". By subtracting the narrow grammar from the broad grammar we create a machine that will find NPs in a text that contain agreement errors. We can diagnose different kinds of errors by subtracting different fragments of "narrow grammar" from the broad grammar.
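The closure property used here can be illustrated outside the Xerox tools. The following is a minimal Python sketch, not the project's implementation: it assumes two hand-built toy automata over whole words (the state numbers, dictionary layout and names such as `difference` are hypothetical) and builds the automaton for "broad minus narrow" with the standard product construction.

```python
def step(dfa, state, symbol):
    """Follow one transition; None plays the role of the dead state."""
    if state is None:
        return None
    return dfa["delta"].get((state, symbol))


def accepts(dfa, symbols):
    """Return True if the automaton accepts the given sequence of words."""
    state = dfa["start"]
    for symbol in symbols:
        state = step(dfa, state, symbol)
    return state in dfa["accept"]


def difference(a, b, alphabet):
    """Product construction for the language of a minus the language of b."""
    start = (a["start"], b["start"])
    delta, accept = {}, set()
    todo, seen = [start], {start}
    while todo:
        p, q = todo.pop()
        # Accept when a accepts and b does not (q may be the dead state None).
        if p in a["accept"] and q not in b["accept"]:
            accept.add((p, q))
        for symbol in alphabet:
            p2 = step(a, p, symbol)
            if p2 is None:  # a is dead here, so nothing of a's language remains
                continue
            nxt = (p2, step(b, q, symbol))
            delta[((p, q), symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {"start": start, "accept": accept, "delta": delta}


ALPHABET = ["det", "ett", "stora", "stort", "hus", "huset"]

# Broad toy grammar: Det Adj* N, with no agreement checking.
BROAD = {"start": 0, "accept": {2}, "delta": {
    (0, "det"): 1, (0, "ett"): 1,
    (1, "stora"): 1, (1, "stort"): 1,
    (1, "hus"): 2, (1, "huset"): 2}}

# Narrow toy grammar: all words agree in definiteness.
NARROW = {"start": 0, "accept": {2, 4}, "delta": {
    (0, "det"): 1, (1, "stora"): 1, (1, "huset"): 2,
    (0, "ett"): 3, (3, "stort"): 3, (3, "hus"): 4}}

ERRORS = difference(BROAD, NARROW, ALPHABET)

print(accepts(ERRORS, ["det", "stora", "hus"]))    # True: agreement error
print(accepts(ERRORS, ["det", "stora", "huset"]))  # False: well-formed NP
```

The resulting machine is again a finite state automaton, and it accepts exactly those word sequences that the broad grammar allows and the narrow grammar rejects.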

An advantage of this simple finite state technique of subtraction is that the grammars one needs to write to find errors are always positive grammars rather than grammars written to find specific errors. This means that they will find a large class of errors without our having to specify each error type individually. Also, the fact that we are writing positive grammars rather than negative grammars for errors means that the grammars can be reused for robust processing in other applications that do not necessarily have anything to do with error detection. We thus see the building of a reasonably large finite state grammar as a contribution to the general resource base for Swedish language technology that could, for example, be included in the SVENSK toolbox.

RESEARCH PLAN

In the project we will first build on an electronic version of Lexin which has been made available to us by Språkdata at Göteborgs universitet. We will use the Xerox lexicon compiler to convert this to a finite state morphological analyzer, and we will complement it with finite state morphological rules to analyze those forms in the morphological paradigms that are not included in the dictionary. At a later stage in the project we hope to use the lexicon which will be developed in Daniel Ridings' lexicon project in the Language Technology Program.

To develop our finite state grammars we will use the finite state tools from Xerox, which we have on a free academic license. We assume that the results of our project will be usable by researchers for non-profit purposes provided they enter into a similar agreement with Xerox, and we are currently verifying this with Xerox Research Centre Europe.

We will test our grammars on data collected in previous projects and, if necessary, on data that we will collect ourselves. The grammars will be based on existing feature based grammatical analyses of relevant phenomena in Swedish (including Cooper, 1984, 1986a, 1986b) and additional analyses that we will carry out as part of the project.

The Project Plan in more detail

1. Establishing an electronic version of Lexin, developed at Språkdata, Göteborg University - the lexicon will be used as a finite state transducer; the first stage will be to use it as a finite state word analyzer. At a later stage in the project we hope to use the lexicon which will be developed in Daniel Ridings' lexicon project in the Language Technology Program.

2. Consulting the work done on GRANSKA, developed at NADA, KTH, and the work of Eva Ejerhed and her project group at the Dept. of Linguistics, Umeå University - for GRANSKA, the modules dealing with grammar are of particular interest, and we will analyze how far they have been developed; at Umeå University, the development of a finite-state grammar for Swedish and other applications of finite-state technology are of interest.

3. Detection of the types of errors made by users - we will use finite state parsing techniques that allow us to check only small portions of the text at a time and omit the rest. The particular finite state grammar technique we use is described for example in Karttunen et al. (1996). It exploits the fact that one can subtract one finite state automaton from another and still have a finite state automaton remaining. The idea is to write two finite state grammars: one "broad", which accepts both the well-formed and the ill-formed fragments, and the other "narrow", which accepts only the well-formed fragments. Subtracting the narrow grammar from the broad grammar leaves a machine that finds the ill-formed fragments. We can diagnose different kinds of errors by subtracting different fragments of "narrow grammar" from the "broad grammar". The types of errors to be included are those discussed under Data and Proposed Solutions: noun phrase errors, selection errors, verb tense errors, word order errors and missing sentence boundaries.

MILESTONES AND DELIVERABLES

1998-09-01 Preliminary version of the dictionary, based on Lexin. (Report-9808) Report on relationship to this project of GRANSKA and work on finite-state grammar for Swedish at Umeå. (Report-9809)

1999-03-01 A system that finds errors in noun phrases and selection errors, together with a report documenting it. (Report-9903)

1999-09-01 Further development of the system to cover verb tense phenomena and word order phenomena, together with a report documenting it. (Report-9904)

2000-03-01 Evaluation of the system. Attempts to extend to cover missing sentence boundaries. If successful an extension of the system to cover these cases. A report documenting the system or the attempts we have made. (Report-9909)

DATA AND PROPOSED SOLUTIONS

A. Noun phrase errors

A common error taken from a text written by a ten-year-old is shown in (1). It involves an error in definiteness agreement within an NP:

(1) jag tar den närmsta handuk och slänger
I take the[def] nearest[def] towel[indef] and throw

den i vasken
it in the sink

The correct form of the NP requires the noun to be definite as in den närmsta handuken 'the nearest towel'. We assume that for a native speaker this kind of error arises because of editing problems rather than difficulty with the language. Perhaps the author started with an indefinite article, changed the article to a definite one and changed the form of the adjective accordingly, but omitted to change the noun. However, this kind of error also occurs in the writing of non-native speakers and in the writing of dyslectics. In the case of non-native speakers the error may arise because of lack of knowledge or because of a performance error (in addition, of course, to the possibility of editing problems of the same kind as for native speakers). In dyslectics such errors may arise simply because of the performance difficulties in creating and reading text. Whatever the reasons for the errors, the software techniques that can be used to find and diagnose them can be the same.

By subtracting the narrow grammar from the broad grammar we create a machine that will find NPs in a text that contain agreement errors. For example, the broad grammar may have a finite state approximation of:

NP --> Det Adj* N

whereas the narrow grammar may have the equivalent of:

NP --> Det[+def] Adj[+def]* N[+def]

NP --> Det[-def] Adj[-def]* N[-def]

The actual analysis is a good deal more complex than this. See Cooper (1984, 1986a, 1986b) for detailed discussion of such agreement features. To illustrate the subtraction of grammars given above, here is a toy example written using the Xerox finite state tool:

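# broad grammar: any determiner, any number of adjectives, any noun

define Det [d e t | e t t];
define Adj [s t o r a | s t o r t];
define N [h u s | h u s e t];

# square brackets group; parentheses would mark optionality in this notation
define NP [Det " " [Adj " "]* N];

# narrow grammar

# definiteness agreement: all words in the NP share the same definiteness

define DetDef [d e t];
define DetIndef [e t t];
define AdjDef [s t o r a];
define AdjIndef [s t o r t];
define NDef [h u s e t];
define NIndef [h u s];

define NPDef [DetDef " " [AdjDef " "]* NDef];
define NPIndef [DetIndef " " [AdjIndef " "]* NIndef];

# errors in definiteness agreement: every broad NP that is not a narrow NP

define NPErrDef [NP - [NPDef | NPIndef]];

# mark each erroneous NP, together with the space or period that follows it

regex [NPErrDef [" " | "." ]] @-> " "%[ ... " " e r r d e f %];

# read the input text and apply the marking rule to it

read text < agr.txt

compose net
lower-side net
print words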

This program will take the following (constructed) text as input:

(2) Han såg det stora hus. Men han trodde inte att det var
he saw the[def] big[def] house[indef] But he believed not that it was

ett stort huset. Det stora huset var fint.
a[indef] big[indef] house[def] The big house was nice

and yield:

(3) Han såg [det stora hus. errdef] Men han trodde inte att det var [ett stort huset. errdef] Det stora huset var fint.

In this example we do not have to say that the incorrect noun-phrase can consist of a definite determiner followed by an indefinite noun or an indefinite determiner followed by a definite noun, or any of the other incorrect combinations that are possible. Thus, with this technique of subtraction the grammars one needs to write to find errors are always positive grammars rather than grammars written to find specific errors.
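For readers without access to the Xerox tools, the same idea can be approximated with ordinary regular expressions. The following Python sketch is only an illustration under stated assumptions: it uses the `re` module instead of automaton subtraction, the names `BROAD`, `NARROW` and `mark_errors` are hypothetical, matching is case-insensitive, and the period is left outside the brackets, so the output differs slightly from (3). An NP is marked when it matches the broad pattern but not the narrow one.

```python
import re

# Broad pattern: Det Adj* N with no agreement checking.
BROAD = re.compile(r"\b(?:det|ett)(?: (?:stora|stort))* (?:huset|hus)\b",
                   re.IGNORECASE)

# Narrow pattern: determiner, adjectives and noun agree in definiteness.
NARROW = re.compile(r"det(?: stora)* huset|ett(?: stort)* hus", re.IGNORECASE)


def mark_errors(text):
    """Bracket every NP that the broad pattern finds and the narrow one rejects."""
    def check(match):
        np = match.group(0)
        if NARROW.fullmatch(np):
            return np  # well-formed: leave untouched
        return "[" + np + " errdef]"
    return BROAD.sub(check, text)


TEXT = ("Han såg det stora hus. Men han trodde inte att det var "
        "ett stort huset. Det stora huset var fint.")
print(mark_errors(TEXT))
```

As in the xfst version, only the well-formed combinations are written down; the incorrect combinations fall out as whatever the broad pattern matches and the narrow pattern does not.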

B. Selection errors

An error made by non-native speakers when writing Swedish involves the selection of tensed or infinitive verbs. (4) is an example of this phenomenon.

(4) Kan ni skickar ut en påminnelse om detta möte
Can you send[pres] out a reminder about this meeting

Here the auxiliary verb should be followed by an infinitive, kan skicka 'can send', but in this case the present tense is used.

A similar approach to that used for noun phrases may be taken for selection phenomena, where we need to define the constraints on the auxiliary verb and the verb that follows it, i.e. the distinction between finite and non-finite verb forms. The broad grammar for the case where the infinitive immediately follows the auxiliary verb will then have this form:

VP --> Aux V

whereas the narrow grammar may have the equivalent of:

VP --> Aux[-inf] V[+inf]

In this case too we are looking at small portions of text and subtracting the narrow definitions from the broad ones. But the verbs need not be adjacent (as in example (4)), and the detection work will then concern single verb strings rather than phrases with other types of strings between the verbs.
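The broad and narrow VP rules can be sketched on word tokens as follows. This is a hedged Python illustration, not the project's grammar: the word lists `AUX`, `PRONOUNS` and `VERB_FORMS` are tiny hypothetical lexicons, and the sketch assumes that at most one subject pronoun may stand between the auxiliary and the verb, as in example (4).

```python
# Hypothetical toy lexicons; a real system would use the Lexin-based analyzer.
AUX = {"kan", "vill", "ska", "måste"}
PRONOUNS = {"jag", "du", "han", "hon", "vi", "ni"}
VERB_FORMS = {"skickar": "pres", "skicka": "inf",
              "kommer": "pres", "komma": "inf"}


def selection_errors(sentence):
    """Return (auxiliary, verb) pairs where the verb is not an infinitive."""
    tokens = sentence.lower().split()
    errors = []
    for i, token in enumerate(tokens):
        if token not in AUX:
            continue
        j = i + 1
        # Skip one intervening subject pronoun, as in "Kan ni skickar".
        if j < len(tokens) and tokens[j] in PRONOUNS:
            j += 1
        # Broad rule: Aux V. Narrow rule: Aux V[+inf].
        if j < len(tokens) and tokens[j] in VERB_FORMS:
            if VERB_FORMS[tokens[j]] != "inf":
                errors.append((token, tokens[j]))
    return errors


print(selection_errors("Kan ni skickar ut en påminnelse om detta möte"))
```

Here too the error case is never listed explicitly: a pair is reported when it fits the broad rule but not the narrow one.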

C. Verb tense

Another error often found in texts by both children and dyslectics is shown in example (5) (here from a text written by a nine-year-old). The verb stanna 'stop' should be in the past tense, i.e. stannade 'stopped'.

(5) långt åkte jag tills jag stanna vid en port
far went I until I stop[inf] by a front-door

This type of error is rather frequent among these users and arises from the fact that their writing is highly influenced by spoken language. In spoken Swedish, regular weak verbs in the past tense often lack the appropriate ending, and the spoken form then coincides with the infinitive form of the verb.

The problem of incorrect verb forms as in example (5) requires techniques that can scan a sentence for finite verbs and detect their absence.
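One very rough way to picture such a check is sketched below in Python. It is an assumption-laden toy, not a proposed solution: `CLAUSE_LINKS` and `FINITE` are hypothetical word lists, clauses are split naively at a few subordinating words, and a clause is reported when it contains no form from the finite verb list.

```python
# Hypothetical toy word lists for illustration only.
CLAUSE_LINKS = {"tills", "när", "eftersom"}
FINITE = {"åkte", "stannade", "kommer", "gick"}


def clauses_without_finite_verb(sentence):
    """Split at clause-linking words and return clauses lacking a finite verb."""
    clauses, current = [], []
    for token in sentence.lower().split():
        if token in CLAUSE_LINKS:
            clauses.append(current)
            current = []
        else:
            current.append(token)
    clauses.append(current)
    return [" ".join(c) for c in clauses if c and not FINITE.intersection(c)]


print(clauses_without_finite_verb("långt åkte jag tills jag stanna vid en port"))
```

On example (5) the second clause is reported, because stanna is not in the list of finite forms, whereas the corrected stannade would pass.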

D. Word order

Word order may also be a problem, especially for the non-native speakers. An example is shown in (6):

(6) Om han kommer inte...
If he come not

In grammatical subordinate clauses, sentence adverbials always precede the verb, i.e. Om han inte kommer... 'If he not come...' The error is probably due to the fact that in Swedish main clauses, the tensed verb precedes sentence adverbials. Such grammatical errors can arise in texts written by native speakers of Swedish not only as a result of typing errors but also as a result of incomplete editing. For example, the writer may have written Han kommer inte idag. 'He will not come today.' as a main sentence and then decided that it should be an if-clause, inserted om 'if' and decapitalized Han 'he', but omitted to change the order of kommer 'come' and inte 'not'.

In order to detect these errors in word order, techniques that find the finite verb in a sentence and manage to analyse the constituents around the verb have to be applied. Further, we need to identify the type of sentence. Thus we need to define grammars of local order variations.
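A local order check of this kind can be sketched as follows. The Python code is a simplified illustration under explicit assumptions: the word lists are hypothetical, the subject is assumed to be a single word, and only the pattern "subordinating word, subject, finite verb, sentence adverbial" is reported.

```python
# Hypothetical toy word lists for illustration only.
SUBJUNCTIONS = {"om", "eftersom"}
SENTENCE_ADVERBIALS = {"inte", "aldrig", "alltid"}
FINITE_VERBS = {"kommer", "är", "vill"}


def word_order_errors(sentence):
    """Report subordinate clauses where the finite verb precedes the adverbial."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    errors = []
    for i, token in enumerate(tokens):
        if token not in SUBJUNCTIONS:
            continue
        # Small window: skip the one-word subject, then look at two tokens.
        window = tokens[i + 2:i + 4]
        if (len(window) == 2 and window[0] in FINITE_VERBS
                and window[1] in SENTENCE_ADVERBIALS):
            errors.append(" ".join(tokens[i:i + 4]))
    return errors


print(word_order_errors("Om han kommer inte idag, så ..."))
```

The main clause Han kommer inte idag is not reported, since the check only applies after a subordinating word; this is why the type of sentence has to be identified first.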

E. Sentence boundaries

The last phenomenon concerns sentence boundaries. Analysis of texts written by children and dyslectics also shows difficulties with punctuation, as in (7), where the boundary between the first and second sentence is marked neither by a period nor by capitalization.

(7) nasse blev arg han gick och la sig med dom andra syskonen.
nasse became angry he went and lay himself with the other siblings

These errors are rather frequent. The texts are characterised by irregular use of punctuation, i.e. they show only some attempts to mark sentence endings. Thus, it is often difficult to identify the sentence domain. One idea for tackling the problem is to categorise verbs by their transitivity and the types of complement they take. This information (although not exhaustive) is already stored in Lexin, ready to be retrieved. Consider again the example, here broken down into sentences:

(8) a. Nasse blev arg.
Nasse became angry.

b. Han gick och la sig med dom andra syskonen.
he went and lay himself with the other siblings

Sentence (8) a. contains a copula verb that has an adjective as its complement. The second sentence, (8) b., consists of a conjunction of two intransitive verbs modified by a prepositional phrase. With valency information stored in the lexicon, we could identify the first sentence and inform the user about the inconsistency in the text. The fact that the verb blev 'became' can only take a single noun or adjective phrase complement is enough information to detect this error. Thus, we can detect the missing sentence marker by looking at the text following the verb and noting that the number of complements exceeds the number allowed.
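The valency idea can be pictured with a small Python sketch. It is a toy under stated assumptions, not a description of how Lexin stores valency: the word lists are hypothetical, and the only pattern checked is a copula verb with its single adjective complement followed directly by a subject pronoun and a finite verb.

```python
# Hypothetical toy word lists; real valency information would come from the lexicon.
COPULAS = {"blev", "är", "var"}  # verbs taking a single predicative complement
ADJECTIVES = {"arg", "sjuk", "glad"}
SUBJECT_PRONOUNS = {"jag", "han", "hon", "vi"}
FINITE_VERBS = {"gick", "kom", "blev", "är"}


def missing_boundaries(text):
    """Return token positions where a new sentence seems to start unmarked."""
    tokens = text.lower().rstrip(".").split()
    hits = []
    for i in range(len(tokens) - 3):
        copula, complement, subject, verb = tokens[i:i + 4]
        # The copula's one complement slot is filled; a new subject and a
        # finite verb after it suggest an unmarked sentence boundary.
        if (copula in COPULAS and complement in ADJECTIVES
                and subject in SUBJECT_PRONOUNS and verb in FINITE_VERBS):
            hits.append(i + 2)
    return hits


print(missing_boundaries("nasse blev arg han gick och la sig med dom andra syskonen."))
```

For example (7) the sketch points at the position of han, which is where the boundary shown in (8) falls.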

PEOPLE INVOLVED

CONCLUSION

We are sure that we can achieve a fairly large degree of success with agreement phenomena. Word order phenomena we see as more problematic in general, although we expect success with local order variations such as Swedish negative placement. Finding sentence boundaries we see as a challenge which is of considerable theoretical and practical interest but for which we cannot guarantee success in the space of the project.

While work has been carried out previously on spelling correction for Swedish, there has to our knowledge been little work on grammatical errors apart from that carried out at NADA and within the SCARRIE project. Work carried out in connection with GRANSKA has already been successful in treating a number of agreement phenomena within Swedish noun phrases of the kind we have used for illustration here, and this will provide an important benchmark for this alternative approach. But many of the harder problems in syntactic error detection remain unsolved, even for other languages in the international community.

BIBLIOGRAPHY

Cooper, Robin (1984) Svenska nominalfraser och kontext-fri grammatik, Nordic Journal of Linguistics, Vol. 7, No. 2, pp. 115-144.

Cooper, Robin (1986a) Swedish and the Head-Feature Convention, in Topics in Scandinavian Syntax, ed. by Lars Hellan and Kirsti Koch Christensen, Reidel Publishing Company, pp. 31-52.

Cooper, Robin (1986b) Verb Second - Predication or Unification? Nordic Journal of Linguistics, Vol. 9, No. 2, pp. 163-180.

Karttunen, Lauri, Jean-Pierre Chanod, Gregory Grefenstette, Anne Schiller (1996) Regular Expressions for Language Engineering, Natural Language Engineering 2 (4), pp. 305-328.


sylvana@ling.gu.se

Last modified: Fri Apr 26 14:14:03 MET DST 2002