The Swedish Spoken Language Corpus at Göteborg University







The Swedish Spoken Language Corpus at Göteborg University is an incrementally growing corpus of spoken language from different social activities. It is part of the Göteborg spoken language corpora. Besides these corpora there are also several written corpora. Because spoken language varies considerably across social activities with regard to pronunciation, vocabulary and grammar, the goal of the corpus is to include spoken language from as many social activities as possible. For an overview of the activities transcribed so far, see Overview. You can also view a complete listing. Some of the transcriptions are also available for download. You may use the Corpus Browser for research purposes; to do so, send a signed copy of the agreement to Leif.

The spoken language material has been transcribed according to the transcription standard Modified Standard Orthography (MSO; in Swedish, Modifierad Standardortografi; available as ps, pdf, appendix - ps). MSO is a transcription standard that is more faithful to spoken language than Swedish standard orthography but less detailed than a phonetic or phonematic transcription would be.

In MSO, standard orthography is used unless a word has several spoken language pronunciations. When there are several variants, these are kept apart graphically. According to this principle, the Swedish word "jag" (I), which is mostly pronounced "ja" but occasionally "jag", is written in both these ways, depending on which form is actually used. Which variants can be distinguished is, however, to some extent arbitrary and has therefore in some cases been decided on a stipulative basis. Thus, we have not, in general, distinguished words on the basis of vowel length.

For an example of a transcription with a short explanation see example.

Through this practice, words that are pronounced the same way but kept apart in standard orthography will sometimes coincide. This happens, for example, to "jag" (I) pronounced as "ja" and "ja" (yes). When this happens, the words are disambiguated by brackets or numerical indexes; in this case, "ja{g}" (jag) and "ja" (yes). If the spoken form is produced by just removing letters from the standard form, brackets are used to indicate the corresponding standard form. If the spoken forms cannot be disambiguated by brackets, numerical indexes are used. For example, the spoken form "å" can mean "och" ("and") or "att" ("to", the infinitive marker), so the transcribed form is "å0" for "och" and "å1" for "att". Thus, MSO maintains the same degree of disambiguation as standard written orthography but adds the disambiguations that are actually made in spoken language, e.g. for Swedish standard orthography "att" (that, to), which can be pronounced as "å" ("to", the infinitive marker) or "att" ("that", the conjunction). However, no attempt is made to separate homonyms that are separated in neither written nor spoken language. This means that one cannot tell from a word form like "springa" (run, chink) whether it is a verb or a noun.
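The bracket and index conventions described above can be illustrated with a small sketch. This is not part of the corpus tools; the function names and the lookup table are hypothetical, and the table contains only the two indexed forms mentioned in the text. A real mapping would need the full list of indexed forms from the MSO documentation.

```python
import re

# Hypothetical lookup table for index-disambiguated forms.
# Only the two entries given in the text above are included.
INDEXED_FORMS = {"å0": "och", "å1": "att"}


def mso_to_standard(token: str) -> str:
    """Map one MSO token to its standard-orthography form (sketch)."""
    if token in INDEXED_FORMS:
        return INDEXED_FORMS[token]
    # Brackets enclose letters that belong to the standard form but
    # were not pronounced, so the standard form keeps those letters.
    return re.sub(r"[{}]", "", token)


def mso_to_spoken(token: str) -> str:
    """Recover the spoken form: drop bracketed letters and any index.

    Assumption: a trailing digit is always a disambiguation index.
    """
    without_brackets = re.sub(r"\{[^}]*\}", "", token)
    return re.sub(r"\d+$", "", without_brackets)


if __name__ == "__main__":
    for tok in ["ja{g}", "ja", "å0", "å1"]:
        print(tok, "->", mso_to_standard(tok), "/", mso_to_spoken(tok))
```

Under these assumptions, "ja{g}" maps to the standard form "jag" and the spoken form "ja", while "å0" and "å1" share the spoken form "å" but map to "och" and "att" respectively.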

As a first analysis of the corpus, we have produced a book of frequencies of Swedish spoken language. The book contains word frequencies both for the words in MSO format and in standard format. It also contains comparisons between word frequencies in spoken and written language. These lists are given in alphabetical and frequency order. There are lists of frequencies for collocations in MSO, standard orthography and written language. Connected with the word frequencies, there are lists of words which are unique to, or very much more common in, spoken language in MSO, spoken language rendered in standard orthography, or written language. Finally, there are statistics on the parts of speech represented in the corpus, based on an automatic probabilistic tagging, yielding a 96% correct classification.
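As a rough illustration of what "alphabetical and frequency order" means for such word lists, the following sketch counts a handful of MSO tokens. It is not the procedure used for the book; the function name and the sample tokens are made up for the example. Note that Python's default string sorting uses code point order, which is only an approximation of Swedish alphabetical order.

```python
from collections import Counter


def frequency_lists(tokens):
    """Return (alphabetical, by_frequency) lists of (word, count) pairs."""
    counts = Counter(tokens)
    # Code point order; a real Swedish word list would need
    # locale-aware collation.
    alphabetical = sorted(counts.items())
    # Most frequent first; ties are broken by the word itself.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


if __name__ == "__main__":
    # Made-up sample of MSO tokens, for illustration only.
    sample = ["ja{g}", "ja", "å0", "ja{g}", "å1", "ja{g}", "ja"]
    alphabetical, by_frequency = frequency_lists(sample)
    print(alphabetical)
    print(by_frequency)
```

Because "ja{g}" and "ja" are counted as separate MSO word forms, the two lists keep apart forms that a count over the spoken forms alone would merge.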

Further, there has been work on the corpus using various kinds of manual coding for communication management (including hesitations, changes, feedback and turn-taking), speech acts, obligations, maximal grammatical units, etc. For this work, a transcription with coding and the coding manuals are available via a link.

Talspråksklubben is an arena for contact with interested persons in the general public who would like to help us make recordings and transcriptions of spoken language in different social activities.

THE LINKS MENTIONED ON THIS PAGE ARE:

Göteborg spoken language corpora

Göteborg written language corpora

Overview of the Swedish spoken language corpus

MSO, Modifierad Standardortografi ps, pdf

MSO Appendix

Allwood, J: Some frequency based differences between spoken and written Swedish

Transcription example

Göteborg dialogue coding - Types of analysis and coding

Transcription with coding and manuals

Talspråksklubben

OTHER LINKS:

Linguistic Annotation (LDC, University of Pennsylvania)


<leifg@ling.gu.se>
Last modified: Fri Nov 8 23:47:02 MET 2002