Ch. 8. Lemmatisation of manuscript text

Version 1.1 (5 May 2004)

8.1 Introduction
8.2 The attribute lemma
8.3 The attribute pos
8.4 General problems
8.5 Word classes

8.1 Introduction

In ch. 2.3 we suggested that the unit word, <w>, should be marked in the transcription of manuscript text so that abbreviations and their expansions can be treated consistently. The element <w> can also carry the lemma and a grammatical analysis of every word in the manuscript text. This information is preferably given in the two attributes lemma and pos. This chapter treats the basic principles for the lemmatisation of manuscript text. It is important to note, however, that this presentation should be seen as a sketch rather than as definitive guidelines. The elements and attributes that will be discussed are:

Elements	Contents
<w>	Delimits a grammatical word.
lemma	Gives the lexical form of the grammatical word.
pos	Gives the morphosyntactic analysis of the grammatical word.

It is essential that the lemmatisation of medieval Nordic manuscript text adheres to the principles developed for handling large corpora in linguistic research. We therefore recommend that the guidelines provided by EAGLES (1996) are used as a starting point. In some respects, however, the principles presented in the following diverge from those suggested by EAGLES, as the medieval Nordic languages present particular problems for the encoder.

The model provided here is adapted to Old Norse-Icelandic grammar. For Old Swedish and Old Danish texts we can expect a radical levelling in the grammatical system, e.g. in the nominal and verbal inflections. The model provided here will therefore overgenerate when applied to Swedish and Danish texts from the period.

8.2 The attribute lemma

Within the element <w> it is possible to provide a variety of information for every graphic word. With the attribute lemma we can, for example, give the lexical form of the graphic word, which makes it possible to search for all graphic and grammatical forms of the word. The lemma should preferably be identical to the form found in the lexicon. For Old Norse-Icelandic texts we suggest that the word-list produced by the Arnamagnæan Commission's Ordbog over det norrøne prosasprog (ONP) at the University of Copenhagen is used to create the lemma base. The attribute would then be marked up as follows:

<w lemma="hafa">ha&fins;i</w>

In ch. 2.3 the use of <w> for the markup of graphic words and information concerning their description is treated. In the example used here the graphic word contains an Old Norse-Icelandic character which is not used in modern script. In a transcription of manuscript text this character is given with an entity name "&fins;" as described in ch. 5, but in the lemmatized form the character is normalised according to the principles of the ONP. The resulting structure will look as follows:

<w lemma="hafa">

<facs>ha&fins;i</facs>

<dipl>ha&fins;i</dipl>

<norm>hafi</norm>

</w>

In the following example we can see how a more complicated form with abbreviations and expansions can be presented in the elements <facs> and <dipl> respectively, included in the element <w>, and thereby all be related to the attribute lemma.

<w lemma="koma">

<facs>co&bar;</facs>

<dipl>co<expan>m</expan></dipl>

<norm>kom</norm>

</w>

In cases where a graphic word is included partially or completely in the element <unclear> this can be marked within the element <w> and be related to the attribute lemma.

<w lemma="svá">

<facs><unclear reason="faded">s&ra;</unclear></facs>

<dipl><unclear>s<expan>ua</expan></unclear></dipl>

<norm>svá</norm>

</w>

Text included within the element <supplied> is not lemmatized. The following example shows how a character, word or phrase that has been supplied is marked with the element <w>, but without markup of the lemma as the text is not transcribed from the manuscript text.

<w>

<facs><supplied reason="illegible" resp="KGJ">lei</supplied>kti</facs>

<dipl><supplied reason="illegible" resp="KGJ">lei</supplied>kti</dipl>

<norm>leikti</norm>

</w>

This means that forms without a lemma will not be included in the searchable database under the category lemma. We thereby avoid contamination between forms that come from the manuscript text and forms that have been supplied by a transcriber or encoder of the text. A basic principle is that the lemmatized text should be from the manuscript text.
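
As a rough illustration of this principle, the following Python sketch builds a lemma index from a small, made-up sample. The function name build_lemma_index and the sample text are assumptions made for the example, and entities have been replaced by plain characters so that the fragment parses without a DTD. It is a sketch under these assumptions, not part of the guidelines.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: two lemmatised words and one supplied word
# that carries no lemma attribute.
SAMPLE = """<text>
<w lemma="hafa"><facs>hafi</facs><dipl>hafi</dipl><norm>hafi</norm></w>
<w lemma="ok">ok</w>
<w><facs><supplied reason="illegible" resp="KGJ">lei</supplied>kti</facs>
<dipl><supplied reason="illegible" resp="KGJ">lei</supplied>kti</dipl>
<norm>leikti</norm></w>
</text>"""


def build_lemma_index(xml_text):
    """Map each lemma to the forms of the <w> elements that carry it."""
    index = defaultdict(list)
    for w in ET.fromstring(xml_text).iter("w"):
        lemma = w.get("lemma")
        if lemma is None:
            continue  # no lemma attribute: the word is left out of the index
        norm = w.find("norm")
        # Prefer the normalised form; fall back to the text of <w> itself.
        source = norm if norm is not None else w
        index[lemma].append("".join(source.itertext()))
    return dict(index)


index = build_lemma_index(SAMPLE)
```

In this sample the supplied word "leikti" is marked as a word but does not appear under any lemma in the index.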

8.3 The attribute pos

With the attribute pos we can add information about the morphosyntactic form of the individual representation of a lemma, i.e. the form provided in the element <facs> is described morphosyntactically. To be able to make this analysis it is necessary to create a model for the encoding that describes all the possible morphological forms of each lemma. In the following this description is tentatively built from the basic categories with sub-categories to provide a full description of the Old Norse-Icelandic grammar.

For the noun hestr 'horse' in dative plural this can be described as follows:

<w lemma="hestr" pos="NCMPD">

<facs>hestu&bar;</facs>

<dipl>hestu<expan>m</expan></dipl>

<norm>hestum</norm>

</w>

The lemmata can primarily be divided into word classes. The first character of the value given for the attribute pos above represents the word class Nouns (N). The following character defines the noun as a nomen appellativum or a Common Noun (C). The remaining characters give gender (masculine, M), number (plural, P) and case (dative, D). For every word class the categories are given in a fixed order. In cases where a category is not in use, the position is marked with a # (cf. ch. 8.4.2 below).

8.4 General problems

8.4.1 Form variation: internal and external homography

The manuscript texts of the medieval Nordic languages display a wide range of graphematic and orthographic variation, the Old Norse-Icelandic texts to a higher degree than the East Nordic texts. Further, the medieval languages of the Nordic countries are highly inflectional, which causes problems as soon as we try to relate the graphical forms, what we call graphic words, to the lemmatic forms, i.e. the grammatical forms available for the analysed lemma. In the following we propose a model for the lemmatisation and analysis into lemmatic forms.

The variation we find in the manuscript text, what we can call form variation, is an initial problem in the first phase of the lemmatisation. We need to be able to identify all possible graphic forms that can represent a lemma in the manuscript text. A good example is some of the graphic variation for the pronoun hann 'he' in different cases.

Form	Lemma
hann	hann
han&bar;	hann
h&bar;	hann
h&bar;n	hann
ha&scap;	hann
hans	hann
han&slong;	hann
h&bar;s	hann
h&bar;&slong;	hann
honum	hann
honom	hann
h&bar;m	hann

The inflectional diversity of the medieval Nordic languages provides many cases of homography between lemmatic forms of the same lemma, what we call internal homography. This can be seen in the following example, where the feminine noun hetja has the form hetja in both nominative singular and genitive plural (NCFSNI | NCFPGI), the form hetju in all oblique cases in singular (NCFSGI | NCFSDI | NCFSAI), and the form hetjur in both nominative and accusative plural (NCFPNI | NCFPAI). The homography is indicated by a vertical line, |, between each possible form:

Form	Lemma	Tag
hetja	hetja	NCFSNI | NCFPGI
hetju	hetja	NCFSGI | NCFSDI | NCFSAI
hetjur	hetja	NCFPNI | NCFPAI

In the initial markup of lemmatic forms it is suggested that all possible tags are given in the attribute pos. This is, however, not satisfactory if we wish to have a consistent markup of the morphosyntactic analysis. In cases where the morphosyntactic analysis can be made consistently this should of course be done.
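
A minimal Python sketch of how such a pos value with several alternatives might be handled is given below. The function names split_alternatives and is_ambiguous are hypothetical and chosen only for this illustration.

```python
def split_alternatives(pos):
    """Split a pos value such as 'NCFSNI | NCFPGI' into its alternative tags."""
    return [tag.strip() for tag in pos.split("|") if tag.strip()]


def is_ambiguous(pos):
    """True if the pos value still lists more than one possible tag."""
    return len(split_alternatives(pos)) > 1


hetju = split_alternatives("NCFSGI | NCFSDI | NCFSAI")
```

A word whose pos value is still ambiguous in this sense has not yet received a final morphosyntactic analysis.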

Further, we must take into account the possibility that graphic forms of different lemmata coincide on the level of lemmatic form, what we call external homography. An example is the neuter noun vár 'spring' (NCNSNI) and the possessive determinative várr 'our' in feminine singular nominative and in neuter plural nominative and accusative (DPosFSN | DPosNPN | DPosNPA).

Form	Lemma	Tag
vár	vár	NCNSNI
vár	várr	DPosFSN | DPosNPN | DPosNPA

The graphic forms can also be homographic for different lemmata as in the feminine noun þýða 'friendship' in nominative singular and the verb þýða 'interpret' in infinitive.

Form	Lemma	Tag
þýða	þýða	NCFSNWI | VPresInfA####Wk

In these cases the morphosyntactic analysis has to be made manually. An alternative is to give all possible lemmatic forms in the attribute pos as in the above example.
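
The forms that need manual analysis could be found along the following lines. This is only a sketch: the list ANALYSES is a hypothetical sample built from the examples above, and the function name external_homographs is an assumption. The word-class code is compared together with the lemma, since þýða has the same lemma form as a noun and as a verb.

```python
from collections import defaultdict

# Hypothetical analyses based on the examples above:
# (graphic form, lemma, word-class code).
ANALYSES = [
    ("vár", "vár", "N"),
    ("vár", "várr", "D"),
    ("þýða", "þýða", "N"),
    ("þýða", "þýða", "V"),
    ("hetju", "hetja", "N"),
]


def external_homographs(analyses):
    """Return the forms shared by more than one (lemma, word class) pair."""
    readings = defaultdict(set)
    for form, lemma, word_class in analyses:
        readings[form].add((lemma, word_class))
    return {
        form: sorted(found)
        for form, found in readings.items()
        if len(found) > 1
    }


ambiguous = external_homographs(ANALYSES)
```

Here vár and þýða are reported, while hetju, which is only internally homographic, is not.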

8.4.2 Zero values: # and @

In the last example above, several positions are marked with the "#" sign. This is to indicate that some of the possible morphological categories are not relevant for this particular word. Thus, for an infinitive like þýða the positions Gender, Number, Case and Species are not relevant - no infinitives are inflected for these categories. Cf. ch. 8.5.8.2 below. For simplicity, we suggest that "#" is read as "irrelevant".

In other cases, a word is inflected for a certain category, but the encoder is not able to specify a value. This may be the case with some proper nouns, for which no gender can be specified. This is a different type of "zero" value, and we therefore suggest indicating these positions with the "@" sign. For simplicity, we suggest that "@" is read as "unknown".
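
The two readings can be kept apart in software as well. The sketch below assumes that a tag has already been split into a list of field values; the names ZERO_VALUES, describe_value and needs_review are hypothetical, and the proper-noun field list is an invented example.

```python
# The two zero values and the readings suggested above.
ZERO_VALUES = {"#": "irrelevant", "@": "unknown"}


def describe_value(value):
    """Return the suggested reading of a zero value, else the value itself."""
    return ZERO_VALUES.get(value, value)


def needs_review(fields):
    """True if some field is '@', i.e. the encoder could not specify a value."""
    return "@" in fields


# Field values of the infinitive tag VPresInfA####Wk, already split.
infinitive = ["V", "Pres", "Inf", "A", "#", "#", "#", "#", "Wk"]
# Invented example: a proper noun whose gender could not be specified.
proper_noun = ["N", "P", "@", "S", "N", "I"]
```

Only the "@" positions call for further work by the encoder; the "#" positions are complete as they stand.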

8.5 Word classes

8.5.1 Nouns (N)

Nouns can be divided into two categories, appellatives and propria. They are all marked with an N for noun. In the second field the markup defines the two categories C, appellatives (Common Nouns), and P, propria (Proper Nouns). In marginal cases, it may be difficult to decide whether a noun is a common or a proper noun; in that case this field may be marked with an @.

Nouns should also be marked for gender. In the medieval Nordic languages we define three gender categories masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. Some proper nouns are indeterminate with respect to gender and should be marked with an #.

There are two categories for numerus. Singular and plural should be marked in the fourth field as S and P respectively. Most personal names are not inflected for number and should be marked with an #.

There are four categories for case in the medieval Nordic languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively. Due to the high degree of internal homography (syncretism) in the declension of weak nouns, we suggest that it should be possible to refer to all oblique cases (i.e. genitive, dative and accusative) with a single code, Obl.

A noun occurs either in an indefinite or a definite form, e.g. "hestr" or "hestrinn". This is marked in the sixth field as I and D respectively. Of personal names and place names, only the latter can occur in definite form.

Example: Encoding of the noun "ymr" in the phrase "þá heyrðu þeir ym mikinn ok gny":

<w lemma="ymr" pos="NCMSAI">ym</w>

Noun	Subcategory	Gender	Number	Case	Species
N	C P @	M F N #	S P #	N G D A Obl	I D
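
The table can be read as a positional scheme: each field takes exactly one of the listed values, in the given order. A minimal Python sketch of a reader for such tags follows. NOUN_FIELDS is transcribed from the table, while the function name parse_tag and the field labels are assumptions made for this illustration.

```python
# Fields and permitted values for nouns, transcribed from the table above.
NOUN_FIELDS = [
    ("word class", ["N"]),
    ("subcategory", ["C", "P", "@"]),
    ("gender", ["M", "F", "N", "#"]),
    ("number", ["S", "P", "#"]),
    ("case", ["N", "G", "D", "A", "Obl"]),
    ("species", ["I", "D"]),
]


def parse_tag(tag, fields):
    """Read a positional tag field by field, trying longer codes first."""
    result, rest = {}, tag
    for name, values in fields:
        for value in sorted(values, key=len, reverse=True):
            if rest.startswith(value):
                result[name] = value
                rest = rest[len(value):]
                break
        else:
            raise ValueError(f"{tag!r}: no valid {name} at {rest!r}")
    if rest:
        raise ValueError(f"{tag!r}: unexpected trailing {rest!r}")
    return result


ymr = parse_tag("NCMSAI", NOUN_FIELDS)
```

Because codes such as Obl are longer than one character, the reader matches values rather than slicing the tag at fixed character positions.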

8.5.2 Adjectives (AJ)

Adjectives (AJ) are inflected for grade in three levels, positive, comparative and superlative, which are marked in the second field as P, C and S respectively.

Adjectives should also be marked for gender. In the medieval Nordic languages there are three gender categories, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

There are two categories for numerus. Singular and plural should be marked in the fourth field as S and P respectively.

There are four categories for case in medieval Nordic languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively. Due to the high degree of internal homography (syncretism) in the declension of weak adjectives, we suggest that it should be possible to refer to all oblique cases (i.e. genitive, dative and accusative) with a single code, Obl.

Finally, an adjective occurs either in an indefinite (strong) form, e.g. "hvítr hestr", or a definite (weak) form, "inn hvíti hestr". This is shown in the sixth field as I and D respectively.

Example: Encoding of the adjective "langr" in the phrase "seint er um langan veg at spyrja tíðenda":

<w lemma="langr" pos="AJPMSAI">langan</w>

Adjective	Grade	Gender	Number	Case	Species
AJ	P C S	M F N #	S P	N G D A Obl	I D

Note that in the comparative form, adjectives only have weak (definite) inflection.
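
A rule of this kind can be checked mechanically. The sketch below reads "weak" as the value D in the species field, following the paragraph on species above. The pattern is transcribed from the adjective table; the names ADJ_TAG and check_adjective_tag are hypothetical.

```python
import re

# Adjective tags as laid out in the table above.
ADJ_TAG = re.compile(
    r"AJ(?P<grade>[PCS])(?P<gender>[MFN#])(?P<number>[SP])"
    r"(?P<case>Obl|[NGDA])(?P<species>[ID])"
)


def check_adjective_tag(tag):
    """Return a list of problems found in an adjective tag (empty if none)."""
    m = ADJ_TAG.fullmatch(tag)
    if m is None:
        return [f"{tag}: does not fit the adjective table"]
    if m["grade"] == "C" and m["species"] != "D":
        return [f"{tag}: comparative not tagged as definite (weak)"]
    return []


langan = check_adjective_tag("AJPMSAI")
```

An empty list means that the tag fits the table and does not conflict with the note on comparatives.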

8.5.3 Pronouns (P)

In recent grammars the traditional category pronoun is usually divided into pronouns in a strict sense (words replacing a noun) and determinatives (adjunct words), and that is our recommendation as well, cf. ch. 8.5.3 and 8.5.4. However, in some projects (e.g. the Old Norwegian lemmatised corpus) there is only a single category pronoun, and we have therefore added a combined category, pronouns and determiners, in ch. 8.5.5 (cf. EAGLES, major categories).

Although pronouns in the strict sense of "words replacing a noun" are a smaller category than the traditional one, there are nonetheless three distinct sub-categories. In the following these are treated separately to provide an overview. All pronouns are marked with P and then a field for subcategory, Per for personal pronouns, Int for interrogative pronouns and Ind for indefinite pronouns.

8.5.3.1 Personal pronouns (PPer)

The personal pronouns (PPer) are declined in first, second and third person. This is marked in the third field as 1, 2 and 3 respectively.

The inflection in gender varies for the personal pronouns, but we can generally account for three categories masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively. In some categories there is no grammatical markup for gender (see the list of tags below). In these cases the fourth field has an #.

Personal pronouns in the first and second person have three categories for number, singular, plural and dual, which are marked in the fifth field as S, P and D respectively. Personal pronouns in the third person have no inflection for number. The fifth field in this case has an #.

The personal pronouns are inflected in four cases, nominative, genitive, dative and accusative, which are marked in the sixth and final field as N, G, D and A respectively.

Example: Encoding of the personal pronoun "vit" in the phrase "vit erum fegnir":

<w lemma="vit" pos="PPer1#DN">vit</w>

Pronoun	Subcategory	Person	Gender	Number	Case
P	Per	1 2 3	M F N #	S D P #	N G D A

8.5.3.2 Interrogative pronouns (PInt)

The interrogative pronouns (PInt) have no inflection in person. This field should therefore be marked with an #. They are declined in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Interrogative pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Finally, the interrogative pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth and final field as N, G, D and A respectively.

Example: Encoding of the interrogative pronoun "hverr" in the phrase "Frigg spurði hverr sá væri með ásum":

<w lemma="hverr" pos="PInt#MSN">hverr</w>

Pronoun	Subcategory	Person	Gender	Number	Case
P	Int	#	M F N #	S P	N G D A

8.5.3.3 Indefinite pronouns (PInd)

The indefinite pronouns (PInd) have no inflection for person. This field should therefore be marked with the # sign. They are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Indefinite pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Finally, the indefinite pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth field as N, G, D and A respectively.

Example: Encoding of the indefinite pronoun "einnhverr" in the phrase "vill hann taka til at þreyta drykkju við einhvern mann":

<w lemma="einnhverr" pos="PInd#MSA">einhvern</w>

Pronoun	Subcategory	Person	Gender	Number	Case
P	Ind	#	M F N #	S D P #	N G D A

8.5.4 Determinatives (D)

There are two sub-categories for the determinatives. In the following these are treated separately to provide an overview. All determinatives are marked with D in the first field. In the second field the sub-category is given as described below.

8.5.4.1 Possessive determinatives (DPos)

The possessive determinatives (DPos) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Possessive determinatives are inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Finally, the possessive determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the possessive "sinn" in the phrase "hann hugðisk þá at reyna afl sitt":

<w lemma="sinn" pos="DPosNSA">sitt</w>

Determinative	Subcategory	Gender	Number	Case
D	Pos	M F N	S P	N G D A

8.5.4.2 Demonstrative determinatives (DDet)

The demonstrative determinatives (DDet) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Demonstrative determinatives are furthermore inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Finally, the demonstrative determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the demonstrative "hinn" in the phrase "hitt fjall er hátt":

<w lemma="hinn" pos="DDetNSN">hitt</w>

Determinative	Subcategory	Gender	Number	Case
D	Det	M F N	S P	N G D A

8.5.5 Pronouns/determiners (PD)

This is the traditional category of pronoun, as defined in the grammars of e.g. Noreen 1923 and Iversen 1973. From an inflectional point of view this is a heterogeneous category.

The personal pronouns are inflected in first, second and third person. This is marked in the second field as 1, 2 and 3 respectively. Other pronouns are not inflected for person and are therefore marked with an #.

Many pronouns and determiners are inflected for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. Once more, some are not, and are marked with an #.

Most pronouns and determiners are inflected for number, singular and plural, marked in the fourth field as S and P respectively, some also in dual, D. Those which are not inflected for number are marked with an #.

Finally, most pronouns and determiners are inflected for case, nominative, genitive, dative and accusative, marked in the fifth field as N, G, D and A respectively.

Example: Encoding of the pronoun "engi" in the phrase "ormrinn er slœgari en ekki annat kvikendi":

<w lemma="engi" pos="PD#NSN">ekki</w>

The categories for the combined category of pronouns and determiners can be given as follows:

Pronoun/determiners	Person	Gender	Number	Case
PD	1 2 3 #	M F N #	S D P #	N G D A #

8.5.6 Numerals (NU)

The numerals are divided into two subcategories, cardinals (NUC) and ordinals (NUO).

Ordinals and the cardinals 1-4 are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively. The rest of the cardinals are not inflected for gender and are therefore marked with an #.

In addition to the inflection for gender, ordinals can also be inflected for number and are therefore marked for singular or plural in the fourth field. Cardinals are not inflected for number and are therefore marked with an # in this field.

Ordinals and the cardinals 1-4 are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

The ordinal fyrstr occurs either in an indefinite (strong) form, "fyrstr", or a definite (weak) form, "fyrsti". This is shown in the sixth field as I and D respectively. Other numerals are marked with an # in this field.

The numerals hundrað 'one hundred (and twenty)' and þúsund 'one thousand (two hundred)' are marked as nouns.

Example: Encoding of the numeral "sjaundi" in the phrase "in sjaunda borg":

<w lemma="sjaundi" pos="NUOFSN#">sjaunda</w>

Numerals	Subcategory	Gender	Number	Case	Species
NU	C O #	M F N #	S P #	N G D A #	I D #

8.5.7 Articles (AT)

In recent grammars the traditional word class articles is usually classified as part of the word class determinatives. However, in some projects (e.g. the Old Norwegian lemmatised corpus) articles are treated as a separate class, and we suggest that as an alternative they may be classified as such. Cf. also the EAGLES guidelines, which recognise articles as a major category.

Articles are inflected as adjectives, except for grade. There are thus four categories - gender, number, case and species.

Example: Encoding of the article "einn" in the phrase "ein kona":

<w lemma="einn" pos="ATFSND">ein</w>

Article	Gender	Number	Case	Species
AT	M F N #	S P	N G D A	I D

8.5.8 Verbs (V)

Verbs are either finite or non-finite. In the former category, they are inflected for tense, mood, person, number and voice. In the latter category, participles are basically inflected as adjectives, while infinitives have a very restricted inflection. For practical reasons, we recommend treating finite and non-finite forms separately.

8.5.8.1 Finite forms

Finite verbs are inflected in two categories for tense, present and preterite, which are marked in the second field as Pres and Pret respectively.

Next, verbs are inflected in three categories for mood, indicative, subjunctive and imperative, which are marked in the third field as Ind, Sub and Imp respectively.

In the personal inflection there are three categories, first, second and third person, which are marked in the fourth field as 1, 2 and 3 respectively.

The verbs are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Verbs are also inflected for voice, active and middle. This is marked in the sixth field as A and R respectively.

Optionally, verbs may be marked for morphological class. This is particularly useful for distinguishing verbs which appear in both weak and strong forms, such as brenna, svelg(j)a etc. We suggest that four main classes may be recognised, strong verbs (St), weak verbs (Wk), reduplicating verbs (Rd) and preterite-present verbs (Pp).

Finally, verbs with enclitic pronouns may be marked with the value Enc.

Example: Encoding of the verb "telja" in the phrase "hon taldi":

<w lemma="telja" pos="VPretInd3SA">taldi</w>

Verb	Tense	Mood	Person	Number	Voice	Class (optional)	Enclitics
V	Pres Pret	Ind Sub Imp	1 2 3	S P	A R	St Wk Rd Pp	Enc
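
Since several of these fields use codes of more than one character, and the last two fields are optional, a regular expression is one convenient way of reading finite verb tags. The Python sketch below is transcribed from the table; the names FINITE_VERB and parse_finite_verb are hypothetical, and it is an assumption that Enc, when present, follows the optional class code at the end of the tag.

```python
import re

# Finite verb tags as laid out in the table above. Assumption: the
# optional class code comes before the optional Enc marker.
FINITE_VERB = re.compile(
    r"V(?P<tense>Pres|Pret)"
    r"(?P<mood>Ind|Sub|Imp)"
    r"(?P<person>[123])"
    r"(?P<number>[SP])"
    r"(?P<voice>[AR])"
    r"(?P<verb_class>St|Wk|Rd|Pp)?"
    r"(?P<enclitic>Enc)?"
)


def parse_finite_verb(tag):
    """Split a finite verb tag into named fields, omitting absent options."""
    m = FINITE_VERB.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a finite verb tag in this scheme: {tag!r}")
    return {name: value for name, value in m.groupdict().items()
            if value is not None}


taldi = parse_finite_verb("VPretInd3SA")
```

The same function accepts tags with a class code, such as VPretInd3SAWk, and returns the class as an additional field.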

8.5.8.2 Non-finite forms

Non-finite forms are either participles or infinitives. Both categories are inflected for tense, so we recommend that the first two fields are identical with the markup for finite forms.

As a next field, we recommend form, with the values participle (Part) and infinitive (Inf).

Since infinitives can have active as well as reflexive forms, the next field should indicate voice.

The next fields should be identical with the scheme for adjectives, i.e. gender, number, case and species. Infinitives are not inflected for these categories and are therefore marked with the # sign.

Example: Encoding of the verb "koma" in the phrase "hann er kominn":

<w lemma="koma" pos="VPretPartMSND">kominn</w>

Example: Encoding of the verb "fara" in the phrase "hann mun fara":

<w lemma="fara" pos="VPresInfA####St">fara</w>

Verb	Tense	Form	Voice	Gender	Number	Case	Species	Class (optional)
V	Pres Pret	Part Inf	A R #	M F N #	S P #	N G D A #	I D #	St Wk Rd Pp
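
For infinitives the four adjectival fields are always "#", so a tag can be assembled from tense, voice and the optional class alone. The following Python sketch shows one way of doing this; the function name infinitive_tag and its parameters are assumptions made for the illustration.

```python
def infinitive_tag(voice, verb_class=None, tense="Pres"):
    """Assemble an infinitive tag; gender, number, case and species are '#'."""
    if tense not in ("Pres", "Pret"):
        raise ValueError(f"unknown tense: {tense!r}")
    if voice not in ("A", "R"):
        raise ValueError(f"unknown voice: {voice!r}")
    if verb_class not in (None, "St", "Wk", "Rd", "Pp"):
        raise ValueError(f"unknown class: {verb_class!r}")
    return f"V{tense}Inf{voice}" + "#" * 4 + (verb_class or "")


fara = infinitive_tag("A", "St")
```

With these arguments the function yields the tag used for "fara" in the example above.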

8.5.9 Adverbs (AV)

Adverbs (AV) are only inflected for grade, i.e. positive (P), comparative (C) and superlative (S).

Example: Encoding of the adverb "sterkliga" in the phrase "hann svaf ok hraut sterkliga":

<w lemma="sterkliga" pos="AVP">sterkliga</w>

Adverbs	Grade
AV	P C S

8.5.10 Prepositions (AP)

Prepositions are not inflected and only marked for word class, AP. AP is an abbreviation for "adposition", the superordinate term covering both "preposition" and "postposition" (postpositions are found in e.g. Japanese, but not in the Nordic languages).

Example: Encoding of the preposition "at" in the phrase "koma þeir at kveldi til eins búanda":

<w lemma="at" pos="AP">at</w>

Prepositions
AP

8.5.11 Conjunctions and subjunctions (CC and CS)

In recent grammars, the traditional word class conjunctions is usually divided into two separate classes, conjunctions (e.g. "ok", "en") and subjunctions (e.g. "at", "ef"). The former category connects phrases on the same syntactical level, while the latter category typically introduces clauses. In traditional terminology, this is reflected in the subdivision of conjunctions into coordinating and subordinating. We recommend making a distinction between conjunctions proper = coordinating conjunctions (CC) and subjunctions = subordinating conjunctions (CS).

However, in some schemes (e.g. the Old Norwegian lemmatised corpus) only a single word class conjunctions is recognised. In that case, the second field may be given as an #.

Example: Encoding of the conjunction "ok" in the phrase "Logi hafði etit slátr allt ok beinin með":

<w lemma="ok" pos="CC">ok</w>

Example: Encoding of the subjunction "at" in the phrase "hon sagði at Baldr hafði þar riðit":

<w lemma="at" pos="CS">at</w>

Conjunctions	Subcategory
C	C S #

8.5.12 Interjections (IT)

Interjections are not inflected and only marked for word class, IT.

Interjections
IT

8.5.13 Infinitive marker (IM)

The infinitive marker is undeclined and only marked as IM. In Old Norse it usually has the form at.

Infinitive marker
IM

8.5.14 Relative particle (RP)

The relative particle is undeclined and only marked as RP. In Old Norse it usually has the form er or sem. Some grammarians would classify the relative particle as a subjunction, while others tend to look upon it as a pronoun.

Relative particle
RP

8.5.15 Unassigned (U)

Some words are corrupt, difficult to analyse, belong to another language, or are indeterminate for other reasons. These words are marked as unassigned, U.

Version 1.0 published 20 May 2003. Version 1.1 published 5 May 2004.