Minutes from Lemmatisation colloquium
Bergen 3–4 February 2005

Locale: Radisson SAS hotel Bryggen

Participants: Haraldur Bernhardsson (Reykjavík), Hans Fix-Bonner (Greifswald), Florian Grammel (København), Odd Einar Haugen (Bergen), Karl G. Johansson (Oslo), Paul Meurer (Bergen), Christian-Emil Ore (Oslo), Sindre Sørensen (Bergen), Andrea de Leeuw van Weenen (Leiden).

Chair: Odd Einar Haugen
Minutes: Florian Grammel and Odd Einar Haugen

From left to right: Andrea de Leeuw van Weenen, Hans Fix-Bonner, Florian Grammel, Christian-Emil Ore, Karl G. Johansson, Haraldur Bernhardsson, Paul Meurer, Sindre Sørensen and Odd Einar Haugen

 

Thursday 3 February

1. Presentations

Odd Einar Haugen welcomed the participants to Bergen. He said that the purpose of the meeting was to exchange ideas for lemmatisation of Old Norse texts and to outline possible areas of cooperation. He hoped that the participants would focus on the problems they had encountered and on the work still to be done. The first day of the colloquium would be spent on the status of the present projects, and the second on practical discussions.

As a background for the meeting, he gave a short presentation of the Menota project, Menota TVB.

Menota presentation (PDF file, 3.2 MB)

The CHLT project on Old Norse lexicography was briefly discussed, but the participants did not have sufficient knowledge of its present status to assess what contribution this project could offer. From what some of the participants had seen, the parser did not yet work properly. The project is based on the Old Norse dictionary of Geir T. Zoega.

The participants then gave a short presentation of their work with lemmatisation.

 

Hans Fix-Bonner reported on the Töluspá project in Greifswald. Over the years, he had worked with a number of texts, especially legal texts – Grágás, Járnsíða and Jónsbók. A part of the Töluspá project aims to generate all Old Norse morphological paradigms in order to supplement the present grammars. It presently has approx. 60,000 entries (head words), but might be expanded to approx. 80,000 entries. The morphological analysis has started with the strong verbs, which form the most difficult and demanding class. The paradigms of all strong verbs are generated on the basis of Germanic roots and subsequent Old Norse sound laws. Presently, approx. 95% of all verb forms are generated correctly.

The Töluspá project has a strong morphological focus. Although the morphology in grammars like Adolf Noreen's Altnordische Grammatik is correct in general, the total variation in Old Norse is not yet covered satisfactorily. For the time being, work is also being done on the verbs in Codex Regius of the Eddic poems.

Several texts were deposited in the computer bank in Copenhagen in the 1980s, but unfortunately Hans Fix-Bonner did not have any copies of the files, and he feared that they might be inaccessible.

Note 1 (1 June 2005): Although the computer bank was discontinued around 1990, the files are still kept at the Arnamagnæan Institute in Copenhagen. It is not known how well they have been preserved and how much work will be needed in order to convert them to another format, e.g. XML.

 

Andrea de Leeuw van Weenen has also worked on a number of texts over the years, and published several editions. Texts include Möðruvallabók (AM 132 fol) and The Icelandic homily book (Holm perg 15 4to). She has recently been working on Alexanders saga (AM 519 a 4to), and is now working on AM 677 4to. This is one of the oldest Icelandic texts (c. 1200–1225), but has not been included in Ludvig Larsson's Ordförrådet i de älsta islänska handskrifterna (1891). She began her work in the late 1970s, and has also been working with Armenian texts (Deuteronomy).

A completely automatic lemmatisation of Old Norse texts is not possible, in her view, but the computer can assist in the process and speed it up considerably. She outlined four methods:
(a) Manual lemmatisation. Tends to introduce errors. Requires a lot of typing. Only advisable for very short texts.
(b) Sorting approach. Works fine for non-ambiguous words, but requires much disambiguation (depending on the level of syncretism in the language).
(c) Word list approach. Based on a pseudo-normalisation of the word forms. After the list has been sorted and duplicates have been checked, lemmata can be assigned to the words. The word list must be continuously compared with the text.
(d) Word analysis approach. Assigning words to lemmata on the basis of their endings. This works fine for Armenian (due to the morphological structure of this language), but is less suitable for Old Norse.
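
As an illustration of the word list approach (c), the following is a minimal sketch and not a reconstruction of the procedure actually used. It collects the distinct forms of a text with their positions, assigns a lemma where a form is unambiguous, and leaves the rest for manual review. The function names, the toy sentence and the small form-to-lemma table are all assumptions made for the example.

```python
from collections import defaultdict


def build_word_list(tokens):
    """Return a sorted mapping from each distinct form to its positions in the text."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token.lower()].append(index)
    return dict(sorted(positions.items()))


def assign_lemmata(tokens, form_to_lemmata):
    """Pair each token with its lemma, or with None if it needs manual review."""
    assigned = []
    for token in tokens:
        candidates = form_to_lemmata.get(token.lower(), [])
        # Only unambiguous forms are lemmatised automatically.
        lemma = candidates[0] if len(candidates) == 1 else None
        assigned.append((token, lemma))
    return assigned


# Toy data, invented for illustration (hypothetical lexicon, not a project resource).
tokens = ["hann", "á", "bú", "á", "Íslandi"]
form_to_lemmata = {
    "hann": ["hann"],
    "á": ["á (preposition)", "á (noun)", "eiga"],  # ambiguous form
    "bú": ["bú"],
    "íslandi": ["Ísland"],
}

word_list = build_word_list(tokens)
assigned = assign_lemmata(tokens, form_to_lemmata)
```

Here both occurrences of "á" stay unassigned, which mirrors the point that ambiguous forms must be checked against the text.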

When lemmatising Möðruvallabók (AM 132 fol), she had found it useful to start with the function words (conjunctions, prepositions and also the verb vera) in order to have a sort of skeleton of the text. The word list established for this text could be re-used for other texts.

She recommended starting the work with a pseudo-normalisation. This is especially important if the work is being done on a text transcribed on the facsimile level (as in the edition of Möðruvallabók). A pseudo-normalisation is not a complete normalisation (as e.g. in Old Norse grammars, dictionaries and series like Íslensk fornrit), but entails the assignment of variant graphical forms to a smaller number of partly normalised forms, performed in successive steps.

Pseudo-normalisation page 1 | page 2 | page 3 (jpg files, approx. 200 kB each)
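
The idea of pseudo-normalisation in successive steps can be sketched as follows. This is only an illustration: the replacement rules below are assumptions chosen for the example and are not taken from the pages linked above, and the names RULE_STEPS, pseudo_normalise and group_variants are hypothetical.

```python
# Hypothetical rules, applied step by step. Each step maps variant
# spellings to a partly normalised form; they are not the rules used
# in any actual edition.
RULE_STEPS = [
    [("ſ", "s"), ("ꝛ", "r")],    # step 1: special letter shapes
    [("qu", "kv"), ("c", "k")],  # step 2: spelling variants
]


def pseudo_normalise(form, steps=RULE_STEPS):
    """Apply every replacement of every step, in order, to one word form."""
    result = form.lower()
    for step in steps:
        for variant, target in step:
            result = result.replace(variant, target)
    return result


def group_variants(forms):
    """Group the attested forms under their pseudo-normalised form."""
    groups = {}
    for form in forms:
        groups.setdefault(pseudo_normalise(form), set()).add(form)
    return {key: sorted(variants) for key, variants in groups.items()}


forms = ["oc", "ok", "quað", "kvað", "ſagði", "sagði"]
groups = group_variants(forms)
```

With these toy rules the six attested forms collapse into three groups, so that lemmata need only be assigned to three entries instead of six.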

 

Christian-Emil Ore said that his involvement with lemmatisation had started with “rescue work” on the slip archives at Gammalnorsk Ordboksverk in the early 1990s as part of the Documentation project. As a result, the slip archives had been converted to an electronic database, and would now be made available through the Menota project. All words in the database have been lemmatised and assigned to grammatical forms. The database now contains approx. 500,000 lemmatised words from Old Norwegian prose works.

The texts themselves are now being proofread and converted to XML according to the Menota guidelines. Thómass saga erkibyskups (Holm perg 17 4to) will probably be the first text to be deposited in the Menota archive.

Johan Fritzner's dictionary of Old Norse was encoded a few years ago and is now accessible for search on the web: Fritzners Ordbog.

The lemma list of Fritzner has been compared with the lemma list of Ordbog over det norrøne prosasprog (ONP). Surprisingly, of the 41,125 words in Fritzner, only approx. 30,000 match with the 65,000 words in the ONP lemma list.
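
The figures above mean that roughly 73% of the Fritzner headwords (30,000 of 41,125) have a match in ONP. A comparison of this kind can be sketched with plain set operations. The sketch below is an illustration only: the headwords are toy data, and the matching key (lower-casing and dropping hyphens) is an assumption, since the minutes do not say how the two lists were compared.

```python
def match_key(headword):
    """Hypothetical matching key: ignore case and hyphens."""
    return headword.lower().replace("-", "")


def compare_lemma_lists(list_a, list_b):
    """Return the headword keys shared by both lists and those unique to each."""
    keys_a = {match_key(word) for word in list_a}
    keys_b = {match_key(word) for word in list_b}
    return {
        "shared": sorted(keys_a & keys_b),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Toy lists, invented for illustration (not actual dictionary extracts).
list_a = ["konungr", "land", "ok", "jarl"]
list_b = ["Konungr", "land", "ok", "skip", "maðr"]

comparison = compare_lemma_lists(list_a, list_b)
coverage = len(comparison["shared"]) / len(set(map(match_key, list_a)))
```

How strict the matching key is will directly affect the number of matches, so any reported overlap should state the key that was used.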

The Menotic texts could easily be encoded so that they link to the lemmatised database as well as to Fritzner's dictionary (and, if implemented, to ONP).

 

Karl G. Johansson reported from the Swedish point of view. A large number of modern Swedish texts have been lemmatised, but Old Swedish texts are still waiting. The texts in the Källtext archive in Göteborg are not up to the mark for the time being due to lack of proofreading. The Swedish National Archive has a large project on charters (diplomas). The Vadstena project, on which he is working, is preparing a sizeable number of transcriptions of Old Swedish texts. These texts will be encoded according to the Menota guidelines on two levels, the facsimile level and the diplomatic level.

 

Friday 4 February

Haraldur Bernhardsson presented the new edition of the Eddic poems in Codex Regius (GKS 2365 4to). This text will be encoded according to the Menota guidelines and probably published on the web as well as on a DVD. The text will be transcribed on three levels: facsimile, diplomatic and normalised. It will also contain digitised images of the manuscript and a discussion of paleography and codicology. Finally, it will be fully lemmatised and include a lemmatised concordance. It should be noted that the present printed concordances have been based on the Eddic text as such, not on the manuscript GKS 2365 4to.

The project has received funding from the Icelandic research council, and cooperates with the Menota project.

 

Paul Meurer said that he has been working with lemmatisation of Modern Norwegian for several years, notably in a collaborative project between the universities in Oslo and Bergen. He has designed part of the Oslo-Bergen-taggeren, a utility for lemmatising and analysing Modern Norwegian texts. For this work, he has been using routines developed by the Corpus Workbench project at the University of Stuttgart. Although he has not been working with Old Norse texts previously, he believed that these tools would be equally useful for that language.

A tagger should be able to help with disambiguation. A minimal requirement is that it lists the possible alternatives (extracted from the available lexicon). It should also list these forms in order of likelihood (e.g. on the basis of frequency, including contextual frequency).
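
The minimal requirement described above, listing the possible alternatives in order of likelihood, can be sketched as follows. This is an illustration under stated assumptions: the lexicon, the counts and the function name rank_candidates are invented for the example, and contextual frequency is not modelled.

```python
from collections import Counter


def rank_candidates(form, lexicon, reading_counts):
    """List the readings of a form, most frequent first.

    Readings never seen before count as zero; ties keep lexicon order.
    """
    candidates = lexicon.get(form, [])
    return sorted(candidates, key=lambda reading: reading_counts[reading], reverse=True)


# Hypothetical lexicon: form -> possible (lemma, word class) readings.
lexicon = {
    "á": [("á", "noun"), ("á", "preposition"), ("eiga", "verb")],
}

# Invented counts, standing in for frequencies from already lemmatised texts.
reading_counts = Counter({
    ("á", "preposition"): 120,
    ("eiga", "verb"): 15,
    ("á", "noun"): 4,
})

ranked = rank_candidates("á", lexicon, reading_counts)
unknown = rank_candidates("xyz", lexicon, reading_counts)
```

The user would then see the preposition reading first and could confirm or override it, while an unknown form yields an empty list and must be lemmatised by hand.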

The next step would be to implement rules for organising the choice of alternative forms. He would recommend using rather simple Constraint Grammar (CG) rules for this purpose. In many cases, they were as helpful and efficient as more complicated rules based on a full syntactical analysis.

Using the Oslo-Bergen-tagger, he gave a demonstration of how this would work.
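
The spirit of a simple Constraint Grammar rule can be sketched in a few lines. This is not the rule formalism or rule set of the Oslo-Bergen tagger; the data structure, the function name remove_after and the single toy rule ("discard a noun reading directly after an unambiguous pronoun") are assumptions made for the example.

```python
def remove_after(cohorts, target_class, previous_class):
    """Remove readings of target_class when the preceding word is
    unambiguously of previous_class. The last remaining reading of a
    word is never removed."""
    result = []
    for index, cohort in enumerate(cohorts):
        readings = cohort["readings"]
        if index > 0:
            previous = cohorts[index - 1]["readings"]
            if len(previous) == 1 and previous[0][1] == previous_class:
                kept = [r for r in readings if r[1] != target_class]
                if kept:  # keep at least one reading
                    readings = kept
        result.append({"form": cohort["form"], "readings": readings})
    return result


# Toy sentence with hypothetical readings as (lemma, word class) pairs.
sentence = [
    {"form": "hann", "readings": [("hann", "pronoun")]},
    {"form": "á", "readings": [("á", "noun"), ("á", "preposition"), ("eiga", "verb")]},
    {"form": "bú", "readings": [("bú", "noun")]},
]

reduced = remove_after(sentence, target_class="noun", previous_class="pronoun")
```

The rule does not fully disambiguate "á", but it shortens the list of alternatives, which is the typical effect of a single constraint of this kind.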

Note 2 (10 June 2005): A prototype of an Old Norse tagger, Menota Lemmatisation Assistant (MLA), has been developed by Paul Meurer and is being tested by people in Copenhagen and Reykjavík. MLA will probably be made publicly available later this year. News and links will be posted on this page.

Sindre Sørensen focused on the practical implementation of a tagger for Menota. The tagger could be set up as a web service, so that it was accessible for users everywhere (although with password protection). With a suitable interface, users could upload their XML files, work on the lemmatisation of the text, and download an updated file. This would be a service parallel to e.g. the XML validator at Brown University.
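
The processing step of such a service (file in, updated file out) can be sketched with the Python standard library. The sketch makes strong simplifying assumptions: words are plain, un-namespaced <w> elements with the form as text and the lemma in a lemma attribute, whereas actual Menota files are more richly structured. The function name add_lemmata and the toy lexicon are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_lemmata(xml_text, form_to_lemma):
    """Fill in the lemma attribute of <w> elements that lack one.

    Existing lemma attributes are left untouched, and words not found
    in the lexicon stay unlemmatised for manual work.
    """
    root = ET.fromstring(xml_text)
    for word in root.iter("w"):
        if word.get("lemma") is None:
            form = (word.text or "").lower()
            if form in form_to_lemma:
                word.set("lemma", form_to_lemma[form])
    return ET.tostring(root, encoding="unicode")


# Simplified toy input (not a valid Menota document).
uploaded = '<text><w>hann</w> <w lemma="vera">var</w> <w>konungr</w> <w>xyz</w></text>'
form_to_lemma = {"hann": "hann", "konungr": "konungr"}

updated = add_lemmata(uploaded, form_to_lemma)
lemmata = [w.get("lemma") for w in ET.fromstring(updated).iter("w")]
```

A web service would wrap a function like this behind an upload form and password check, returning the updated file for download.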

 

2. Discussion

The meeting agreed that in order to enable multi-file search, it is important to have a single lemma list. The lemma list of Ordbog over det norrøne prosasprog is the most likely candidate for Old Icelandic and Old Norwegian (Old Norse). For Old Danish, Old Swedish and Middle Norwegian, however, there is no similar list or dictionary.

Christian-Emil Ore pointed out that it would be of great interest to establish a pan-Nordic lemma list based on existing Medieval Nordic lexicographical resources (Fritzner, ONP, Söderwall, Kalkar). This list would be similar in scope to the concept of a Meta Dictionary established for the new Norwegian Dictionary in Oslo.

Note 3 (20 May 2005): At a meeting held in Copenhagen 29–30 April 2005, the question of pooling resources was discussed at some length. Two work groups have been set up to continue with this collaborative project. See the minutes from the meeting.

The meeting continued by discussing whether the actual lemmatisation should be based on texts transcribed on a facsimile level (i.e. very close to the manuscript) or at a more normalised level, e.g. after having performed a process of pseudo-normalisation. Although the idea of transcribing texts at a level close to the manuscript was favoured by several participants, it was pointed out that the number of variant forms is very high on this level. It would thus be necessary to do some kind of pre-processing before starting with the lemmatisation, e.g. in the form of a pseudo-normalisation of the word forms.

The meeting agreed that the work being done in various lemmatisation projects should be shared as far as possible, and that the dictionaries established should be coordinated.

The meeting agreed that the available resources should be collected at Aksis in Bergen in order to establish a common database of Old Norse lemmatised words. In addition to the corpus from Oslo, which is currently being prepared, Andrea de Leeuw van Weenen and Hans Fix-Bonner would contribute to this lexicographical database with their resources. Also, the results of the lemmatisation of Codex Regius of the Eddic poems would be added to the database.

The common ON lexicographical database would contain texts encoded on <facs> level (from Andrea de Leeuw van Weenen and Hans Fix-Bonner) as well as on <dipl> level (the Oslo material). The meeting discussed various strategies for integrating these resources and for the (pseudo)-normalisation of the <facs> material.

Note 4 (10 June 2005): Paul Meurer and Odd Einar Haugen have discussed the integration of <facs> and <dipl> readings in the Menota Lemmatisation Assistant (see note 2 above) and believe both levels could fairly easily be integrated.

Towards the end of the meeting, Hans Fix-Bonner pointed out that there might be a possibility for funding in Greifswald (Stiftung Alfried Krupp). If one could set up a larger project, perhaps on the Old Norse law texts, one might look for cooperation with the Law faculty in Greifswald (which is thinking of establishing a master programme including a historical perspective). Karl G. Johansson said that the law material had so far not been studied as a whole, only in a number of individual projects.

The meeting ended with a short practical discussion on how to encode compounds if TEI P5 should discourage the use of attributes. The present solution in the Menota handbook has been published in chapter 2.3 of version 2.0 beta.

Note 5 (10 June 2005): TEI P5 will probably be published towards the end of 2005. The Menota handbook v. 2.0 will be TEI P5 compatible and will return to these questions.

 


Comments on these minutes should be sent by 20 June 2005.




Created 27 May 2005 by OEH. Last update 10 June 2005.