
Introduction: What is Menota?

1. Electronic editing of medieval texts
2. How to use these guidelines
3. Basic content of the edition

Version 2.0 (16 May 2008)

1. Electronic editing of medieval texts

The purpose of these guidelines is to define a framework for machine-readable editions of medieval Nordic texts. These guidelines are recommended for any scholar who wishes to produce detailed, machine-readable editions of primary works, that is, medieval Nordic manuscripts.

1.1. Menota and traditional editing practice

Editions may include a great deal of information in addition to the basic text of the manuscript: introductory material, including textual and literary contexts; the textual content, including diplomatic and/or normalised text; a variant apparatus or various manuscript versions; notes and other forms of critical apparatus; glossaries and/or indices of names.

The present guidelines address all of these parts of an edition. The one exception is the textual or variant apparatus: as the approach of these guidelines is to encode different manuscript versions, the textual apparatus develops as each manuscript is encoded and aligned.

The approach taken here, however, differs from traditional editions in the way in which this additional information is included, and consequently in the possibilities for presentation. Traditional print editions rely on a very large amount of referencing between the text and the apparatus: note references point the reader to the notes section; glossaries and indices refer the reader back to the main text; the textual apparatus usually refers to line and/or page numbers; and aligned texts usually rely on visual parallels, such as facing pages. The approach taken here allows all of this information to be encoded without complex referencing, so that information about a section of text can be checked, or presented, together, depending on the capabilities of the display medium. The complexity of referencing, however, is replaced by a certain amount of complexity in the encoding.

1.2. Machine-readable editions

The approach of Menota differs from the production of electronic texts using word-processing or desktop publishing software in that the texts are machine-readable, that is, they are marked up in such a way that meaningful entities within a text can be read and manipulated by a computer.

The approach taken here can be used to distinguish between different types of information in the text, and consequently makes it possible to extract and present the information of most interest to particular users, for example students, literary scholars, linguists and palaeographers. A student may wish to read the normalised text; a linguist might only be interested in the word distribution; and so on.

Using this method, one can also produce editions for different media: printed and electronic books, interactive web applications, portable devices, CD-ROMs and so on.

1.3. Menota and other encoding schemes

Menota is based on the scheme defined by the Text Encoding Initiative (TEI). It defines further extensions based primarily on two major differences between medieval Nordic texts and most other comparable corpora:

1. A very large degree of orthographical variation. This makes linguistic analysis difficult, since words cannot reliably be searched for on the basis of a lemma, and glossaries, for example, cannot be compiled in any systematic way.

2. A very large degree of abbreviation of letters, groups of letters, words and so on.

These two problems are dealt with by dividing the text into three prototypical levels, on which the text is encoded in its abbreviated form, in its expanded form and in a normalised orthography. These textual levels constitute the primary difference between Menota and standard TEI. A text can be encoded on only one of these levels (typically the diplomatic), but can easily be extended to two or more levels, making this approach more versatile than traditional editions, which are restricted to representing the text in only one way.
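As a sketch of what this looks like in practice, a single word might be encoded on all three levels inside the TEI <w> (word) element, roughly as below. The element names in the me: namespace are of the kind defined later in these guidelines, and the word and its forms are invented for the illustration: the facsimile level keeps the letter forms of the manuscript (here a long s), the diplomatic level regularises the letter forms, and the normalised level supplies standard orthography.

  <w>
    <choice>
      <me:facs>ſva</me:facs>
      <me:dipl>sva</me:dipl>
      <me:norm>svá</me:norm>
    </choice>
  </w>

A program can then select whichever level a given reader or application needs, for example extracting only the <me:norm> forms to produce a normalised reading text.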


2. How to use these guidelines

These guidelines provide a way of representing a text in a machine-readable and platform-independent way. They do not in themselves provide a way of publishing the text, but rather a way of encoding it so that it can be published and analysed by other means in a variety of ways. In short, you can use these guidelines to represent characters, words and other meaningful units of text in a way that is consistent and unambiguous. The approach is set out in the following chapters:

1. An introduction to XML.

XML is the markup language used to represent the features of the edition. It differs from the formats used by, for example, word processors and typesetting engines, in that it is used to represent types of content rather than ways of displaying the text. XML is currently the most common way of encoding textual content. Learning how XML works is perhaps the most difficult aspect of these guidelines, but once a few fundamental concepts are grasped, it is a useful tool which can be applied to a range of other areas, such as web publishing.
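As a minimal illustration of this difference (the wording and attribute values are invented), a chapter heading is tagged as a heading rather than as, say, '16 point bold type'; how it is eventually displayed is decided by whatever software publishes the text:

  <div type="chapter" n="1">
    <head>The chapter heading as it stands in the source</head>
    <p>The opening paragraph of the chapter ...</p>
  </div>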

2. How to encode the basic units of characters and words.

The encoding of characters in an unambiguous way is the most basic step towards producing an exchangeable and machine-readable edition. It is in fact a fairly simple procedure which requires almost no knowledge of XML, but only a basic grasp of abstraction. The first thing to understand is that the way characters are represented here is independent of individual fonts. One of the problems with many early electronic editions is that they used non-standard fonts, and combinations of fonts, in word-processing programs. Once the font or the software becomes obsolete, the electronic text is no longer of much use. The approach taken here overcomes these problems by representing characters either with a standard encoding or with references to defined character types.
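For example (a sketch only; the full character inventory and entity names are given in later chapters), the letter ð can be entered either directly as a Unicode character or as an entity reference declared once at the top of the file. In both cases the character is identified unambiguously, independently of any particular font:

  <!-- directly as a Unicode character -->
  <w>maðr</w>

  <!-- or as an entity reference, resolved by a declaration such as <!ENTITY eth "&#xF0;"> -->
  <w>ma&eth;r</w>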

3. Levels of textual representation.

The guidelines first discuss a simple and straightforward way of encoding text in a single-level transcription. However, in order to deal with the problems described in section 1.3 above (frequent abbreviation and orthographic variation), the guidelines recommend a multi-level transcription. The text is divided into three 'levels': the first attempts to represent the text as it appears in the manuscript, including abbreviations and significant letter forms (reduced to a somewhat limited set, however); the second represents the text in the orthographical form of the manuscript but expands the abbreviations, giving a diplomatic representation of the text; and the third involves normalisation to a standardised set of letters, based on the actual orthographical system but reflecting the phonological system as it is believed to have been when the text was composed.

Editors using these guidelines may wish to use any combination of the levels to encode the text.

Each level is encoded on a word-by-word basis.

4. Representing the structure of the document.

The guidelines go on to explain how to encode larger units and features, including textual structures such as chapters, paragraphs and headings; physical features such as pages and lines in the manuscript; verse material; and punctuation. Such information is fairly straightforward to represent, and this chapter should not present many difficulties to editors who have grasped the earlier material.
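A short sketch of how such features can be combined (the names and numbers are invented for the illustration): textual structure is marked with nested elements such as <div>, <head> and <p>, while the physical layout of the manuscript is recorded with empty milestone elements such as <pb/> (page break) and <lb/> (line break):

  <div type="chapter" n="4">
    <head>...</head>
    <p>
      <pb n="17v"/>
      <lb n="1"/>text of the first line on the page
      <lb n="2"/>text of the second line ...
    </p>
  </div>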

5-6. Encoding characters and abbreviations.

These chapters describe in detail how different letter forms and abbreviations - and their expansions - are to be represented in a Menota-compliant edition.
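As a rough sketch (the entity name &bar; for the abbreviation stroke is only an assumption here; the actual entities and elements are listed in these chapters), an abbreviated word can be recorded together with its expansion, marking the abbreviation sign and the letters supplied by the editor explicitly:

  <w>
    <choice>
      <abbr>h<am>&bar;</am></abbr>
      <expan>h<ex>ann</ex></expan>
    </choice>
  </w>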

7. Representing altered, corrected and unreadable text.

This chapter explains how to encode characters and words which fall outside the normal flow of text because they have been altered by scribes, or are incorrect or illegible. Such features are frequent in primary sources, but need to be encoded unambiguously so that the edition represents the status of all the text in relation to the primary source, its scribes and subsequent editors.
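A sketch of the kind of encoding involved, using standard TEI elements (the words and attribute values are illustrative only): additions and deletions by the scribe, unclear readings and wholly illegible stretches can all be marked, so that nothing is silently normalised away:

  <p>
    ... ok <add place="supralinear">hann</add> fór ...
    ... <del rend="overstrike">ok</del> ...
    ... <unclear reason="damage">maðr</unclear> ...
    ... <gap reason="illegible" quantity="3" unit="chars"/> ...
  </p>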

8. Lemmatisation.

This chapter provides an approach to the linguistic encoding of a text for editors who are interested in producing searchable texts and glossaries. Basically, lemmatisation is the process by which every word form in the text is linked to a single form without grammatical variation - the equivalent of a dictionary head-word or lemma. Once this has been done, the text can be searched for words regardless of morphological or orthographical variation, and each word can be linked to a glossary, and vice versa.
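In TEI terms this can be as simple as recording the lemma as an attribute on each word (a sketch; a further attribute may also record the grammatical analysis, as described in the chapter):

  <w lemma="maðr">manna</w>
  <w lemma="taka">tóku</w>

A search for the lemma maðr will then find manna and every other inflected or variantly spelled form of the word.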

9. Encoding additional features.

This chapter describes how editors may make references between the encoded text and other types of information. For example, it may be useful for an editor to treat a name occurring in the text as both a linguistic entity (a word) and a potential reference to information about an individual. This is roughly the equivalent of indexing, where named entities - people, places and so on - are linked to the text.
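One way to do this with standard TEI elements (a sketch; the identifiers are invented): each occurrence of a name in the text points to an entry which gathers the information about the person, and an index can be generated from these links:

  <persName ref="#olafr-h">Olafr</persName> konungr ...

  <listPerson>
    <person xml:id="olafr-h">
      <persName>Óláfr Haraldsson</persName>
    </person>
  </listPerson>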

10. Encoding front-matter and other meta-information.

The final chapter describes how additional 'meta' information about the edition is to be encoded in the document header. Such information makes it much easier to understand the relationship between the file and other types of documents by describing and categorising it. The header also contains information about the process of creating the edition and the responsibility for it.
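In outline (a skeleton only; chapter 10 gives the full recommendations), the header records the title of the edition, those responsible for it, the publication details and a description of the manuscript source:

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>...</title>
        <respStmt>
          <resp>Transcription and encoding</resp>
          <name>...</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>...</publisher>
        <date>...</date>
      </publicationStmt>
      <sourceDesc>
        <msDesc>
          <msIdentifier>...</msIdentifier>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>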


3. Basic content of the edition

The content of the edition is basically the same as that of a print edition of a primary work. It contains:

1. Front matter, including a title for the work, publication information, simple information about the editor(s) responsible, a description of the editorial approach taken, detailed acknowledgements of contributions and so on. All this information is encoded in the TEI header, described in chapter 10.

2. A table of contents: because the encoded document represents the parts of the text, including headings, in a machine-readable way, the table of contents can be generated automatically. (For comparison, the document you are reading is encoded in XML, with each section heading marked as such; the contents at the top of the document are generated automatically from this information.)

3. The text itself, including representation of: the orthography (described in chapters 2 and 5) and abbreviations (chapter 6); linguistic information, including division into words (chapter 2), textual levels for each word (chapter 3) and lemmata for words (chapter 8); higher-level structures such as paragraphs and chapters, and physical pages and lines (chapter 4); alterations made to the text by scribes and editors (chapter 7); and references to people, places and so on (chapter 9).

4. Back matter, potentially generated automatically, including a glossary generated from word lemmata and indices generated from other encoded information such as names.


First published 14 August 2006. Last updated 16 May 2008.