Chapter 3. Document structure

3.1 Introduction: The structure of the manuscript vs. the structure of the work
3.2 Main divisions of a TEI document
3.3 Chapters: <div>
3.4 Paragraph text: <p>
3.5 Sentences: <s>
3.6 Metrical text: <lg> and <l>
3.7 Headings: <head>
3.8 Page, column and line breaks: <pb/>, <cb/>, <lb/>
3.9 Overlapping structures

Version 3.0 beta

This is a preliminary version which can be changed or updated at any time.
The revision and updating of this chapter has been assigned to Alex Speed Kjeldsen.
Beeke Stegmann will look into the stylesheet needed for correct display of conflicting document structures, such as those discussed in ch. 6.6.

 

3.1 Introduction: The structure of the manuscript vs. the structure of the work

Viewed as physical objects, rather than as vehicles for texts, manuscripts have a certain structural hierarchy. What is regarded as a single manuscript may in fact comprise more than one volume; Flateyjarbók, for example, is bound in two volumes, and the large rímur codex Acc. 22 in three. A manuscript book is made up of quires or gatherings, each of which contains a number of leaves, normally eight. Each leaf has a recto side and a verso side, and each side may be further divided into columns. The text is then written in lines across the page or column. In order to be able to locate a word quickly and easily, all, or at least most, of these structural divisions must be registered. We need to know that a given word appears in the fifth line of the right-hand or b column on the recto side of folio 34. As it is customary to foliate manuscripts without regard to their quire division, the quires will not normally need to be included in the hierarchical structure, but since the quiring can have implications for the text itself this division should be indicated, and will also generally form part of the <msDesc> element, found in the document header.

At the same time, manuscripts do of course contain texts, which is why most of us are interested in them in the first place. A single manuscript will often contain more than one work, each of which may, in the case of lengthy prose works such as sagas, be divided into chapters or sections. In the case of poetry, rímur for example, a single work (rímnaflokkur) will usually consist of several cantos or fits, each containing a number of stanzas, each made up of a number of lines. It may be necessary to group these lines in other ways as well. The stanzas comprising the mansöngur should be distinguished from the main body of the fit, for example, while to facilitate certain types of metrical analysis it might be desirable to divide the individual stanzas into couplets. Some types of poetry, such as the vikivakakvæði, will have a refrain or burden, which should ideally also be distinguished from the narrative section(s) of the stanza.

XML has at its foundation the notion of a text as a single hierarchical structure, which means that it does not work well where there are several concurrent hierarchies, as is obviously the case when one wishes for example to indicate the line divisions both in a poem and in the manuscript in which the poem is contained. The TEI Guidelines offer various solutions to this problem, enabling both the structure of the document and the structure of the text to be encoded.

3.1.1 Hierarchical divisions

The principal means of representing hierarchy is the <div> (i.e. “division”) element. <div> elements may freely nest within each other. In addition to the universally available @xml:id and @n attributes, the <div> element has a @type attribute, which specifies the name conventionally given to the level of division: e.g. 'chapter', 'stanza' or 'couplet' if the structure of the text is being represented, or 'page', 'column' or 'line' if the physical structure of the manuscript is preferred. It is convenient to specify a value for the @type attribute at least each time a change of level occurs. The software, however, will keep count of the levels of nesting even if the @type attribute is not used.

The complex structure of a work such as a set of rímur could be represented by using four levels of <div> elements: <div type="canto"> for the cantos or fits, <div type="part"> for the parts (for example the mansöngvar), <div type="stanza"> for the stanzas, and <div type="line"> for the lines. If the manuscript being encoded contains more than one set of rímur, as is frequently the case, it might be sensible to add a further, outer <div> for each set. A simpler form of mark-up is possible, however. Instead of <div> elements, the tags <l> (for “line”) and <lg> (for “line-group”, i.e. a group of lines functioning as a formal unit) can be used, reserving the <div> element for larger structural units. The @type attribute is then used to identify the type of unit, e.g. 'stanza' or 'couplet', as in <lg type="stanza">. Here again the type need only be defined once. Lines and line-groups can also be numbered and identified using the @n and @xml:id attributes.
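The simpler mark-up described above might look roughly as follows for a single canto. This is only a sketch: the @n values, the split into two 'part' divisions and the four-line stanzas are assumptions made for illustration, and the ellipses stand for the transcribed text.

```xml
<div type="canto" n="1">
  <!-- first part of the canto, e.g. the mansöngur -->
  <div type="part" n="1">
    <lg type="stanza" n="1">
      <l n="1"> ... </l>
      <l n="2"> ... </l>
      <l n="3"> ... </l>
      <l n="4"> ... </l>
    </lg>
    <!-- the type has been given once and is not repeated here -->
    <lg n="2">
      <l n="1"> ... </l>
      <l n="2"> ... </l>
      <l n="3"> ... </l>
      <l n="4"> ... </l>
    </lg>
  </div>
  <!-- second part: the main body of the fit -->
  <div type="part" n="2">
    <lg n="3">
      <l n="1"> ... </l>
      <l n="2"> ... </l>
      <l n="3"> ... </l>
      <l n="4"> ... </l>
    </lg>
  </div>
</div>
```

Here <div> carries the larger units (canto and part), while <lg> and <l> carry the stanzas and lines.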

This type of markup focusses on the hierarchical structure of the text. The actual physical realisation of the text is considered of secondary importance – if of importance at all – when dealing with modern printed literary works: little significance is attached to the page and line breaks in the various editions of, say, Orwell's Nineteen Eighty-Four. In some cases, however, such as the early editions of Joyce's works that were supervised by the author himself, the physical make-up of the text can be of great consequence. It may also be necessary to maintain the pagination and lineation of standard editions of major works, as these are frequently used in citations in scholarly works. In the case of chirographically transmitted material, the physical organisation of the text is more likely to be recognised as being of importance and in need of encoding. This can be done hierarchically, as above, using <div> elements, which are then given the appropriate @type attributes, e.g. 'page', 'column' or 'line', but it seems more appropriate to reserve these elements for structural divisions in the text, while indicating the physical structure of the document through the use of so-called “milestone” tags, i.e. <pb/>, <cb/> and <lb/>. These tags make up a separate hierarchy in the file and help to overcome the problem of overlapping structures in the mark-up; see also the discussion of overlapping structures in ch. 3.9 below.
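A minimal sketch of how the two hierarchies can coexist: the verse structure is carried by <lg> and <l>, while the milestone tags record where a manuscript page, column or line begins. The folio, column and line numbers are invented for illustration, and the ellipses stand for transcribed text.

```xml
<lg type="stanza" n="1">
  <!-- the first verse line fits within manuscript line 27 -->
  <l n="1"><lb n="27"/> ... </l>
  <!-- the second verse line runs over onto a new page: the milestones
       mark the break without interrupting the <l> element -->
  <l n="2"> ... <pb n="34v"/><cb n="A"/><lb n="1"/> ... </l>
  <l n="3"> ... </l>
  <!-- a verse line that crosses an ordinary manuscript line break -->
  <l n="4"> ... <lb n="2"/> ... </l>
</lg>
```

Because the milestones are empty, they never need to be closed, so a manuscript line can begin in one verse line and end in another.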

The rest of this chapter presents how the text may be encoded at higher structural levels than characters and words. Important elements here are the larger divisions of the text, like chapters, paragraphs (with headings), and stanzas. This chapter also presents how pagination and foliation, together with column-breaks and line-breaks, may be encoded. The following TEI elements are presented:

Elements                Contents
<text>, <body>          Main divisions of the text,
<div>                   division into chapters (multiple levels are encoded by nesting elements),
<p>                     prose paragraphs,
<lg>, <l>               line groups and lines,
<head>                  headings,
<pb/>, <cb/>, <lb/>     page, column and line breaks.

3.2 Main divisions of a TEI document

The following presentation is based on ch. 4 “Default Text Structure” of the TEI P5 Guidelines.

A TEI document is always at its highest level enclosed by the start tag <TEI> and the end tag </TEI>. Within the <TEI> element, two other elements appear in a fixed order, namely the <teiHeader> and the <text> elements. Within the <text> element, the body text may appear, enclosed in the element <body>. If the text has front matter, there will be an element <front>, placed before <body> and containing the front matter. Similarly, there may be an element <back>, placed after <body> and containing back matter. The elements <teiHeader>, <text> and <body> are required in any TEI-conformant document, while <front> and <back> are optional. This, then, is the basic structure of a TEI document:

Elements                          Contents
<TEI>                             The TEI document begins here,
  <teiHeader> ... </teiHeader>    the header goes here,
  <text>                          the text itself begins here,
    <front> ... </front>          any front matter goes here,
    <body> ... </body>            the main body of the text goes here,
    <back> ... </back>            any back matter goes here,
  </text>                         the text ends here,
</TEI>                            the TEI document ends here.
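Written out as XML, the skeleton might look like the sketch below. The namespace declaration and the three components of <fileDesc> follow generic TEI P5 practice as we understand it; the header contents are placeholders, and a header for an actual transcription will need to contain considerably more.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication statement</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <!-- optional front matter would go here, inside a front element -->
    <body>
      <p> ... </p>
    </body>
    <!-- optional back matter would go here, inside a back element -->
  </text>
</TEI>
```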

3.2.1 Another possible first division of the text: More than one <text> element

The transcriber may want to divide a document into more than one text. This can be done with the <group> element, which should be contained in the top level <text> element taking the place of <body> in the simpler scheme illustrated above. The following structure appears:

    <text>
      <front> ... </front>
      <group>
        <text>
          <front> ... </front>
          <body> ... </body>
          <back> ... </back>
        </text>
        <text>
          <front> ... </front>
          <body> ... </body>
          <back> ... </back>
        </text>
      </group>
      <back> ... </back>
    </text>
               

The main structure of the text, at the levels of work, first main division, second main division, first chapter of first main division, second chapter of first main division and so on, can be encoded in different ways. If the electronic document consists of more than one work, the <group> structure illustrated above is the natural choice. In that case, one would get multiple sets of further structural divisions, one set within each of the <body> elements. If the electronic document is considered as a single work, and placed in one <text> element, there will only be a single <body> element that needs further divisions.


3.3 Chapters: <div>

Further division of the <body> block is achieved through <div> elements, with one level nesting inside the other as the transcriber moves down through the hierarchical structure of the text.

3.3.1 Type- and level-specified <div> elements

In a complex document, <div> elements may be specified by @type and @n attributes. In this example, the first three chapters of a work are contained in <div> elements at the same hierarchical level (siblings):

Elements                                 Contents
<div type="chapter" n="1"> ... </div>    Chapter one goes here,
<div type="chapter" n="2"> ... </div>    chapter two goes here,
<div type="chapter" n="3"> ... </div>    chapter three goes here (and so on).

3.3.2 Unspecified <div> elements

It is also possible to use <div> elements without specifying their type:

Elements            Contents
<div> ... </div>    Chapter one goes here,
<div> ... </div>    chapter two goes here,
<div> ... </div>    chapter three goes here (and so on).

3.3.3 Nesting <div> elements

Note that <div> elements may nest inside each other. For example, the levels of work, chapter and then paragraph can be encoded in the following manner:

Elements                   Contents
<div type="work">          The whole work starts here,
  <div type="chapter">     the first subdivision starts here (nested),
    <p> ... </p>           one paragraph of the subdivision goes here,
  </div>                   end of the subdivision,
</div>                     end of the work.

While <div> elements may nest as shown here, <p> elements may not. They must be encoded sequentially, i.e. as siblings.


3.4 Paragraph text: <p>

The basic-level element for prose text is the paragraph, <p>. Typically, the deepest level <div> element will contain one or more <p> elements:

Elements                Contents
<div>                   A new chapter starts here,
  <head> ... </head>    this contains the heading,
  <p> ... </p>          first paragraph,
  <p> ... </p>          second paragraph,
  <p> ... </p>          third paragraph,
</div>                  the chapter ends here.

The <p> element may appear in other contexts, such as in the <teiHeader> element. It may also contain a number of other elements, but – as underlined above – it may not contain other <p> elements, i.e. it is not allowed to nest.


3.5 Sentences: <s>

The text within a paragraph may be divided into sentences, <s>:

Elements          Contents
<p>               A new paragraph starts here,
  <s> ... </s>    first sentence,
  <s> ... </s>    second sentence,
  <s> ... </s>    third sentence,
</p>              the paragraph ends here.

Paragraphs are usually not divided into sentences, but will simply list the words in sequential order. However, texts that are going to be annotated for syntax should preferably be divided into sentences according to a suitable definition of what constitutes a sentence. In the case of sentences linked by conjunctions (such as ok, en and eða) one might decide to open a new sentence whenever there is a subject as well as a predicate. See Haugen and Øverland (2014: 65–67) for a discussion of sentence boundaries in Old Norwegian.

The encoding of words, <w>, will be discussed in ch. 5.3 below.


3.6 Metrical text: <lg> and <l>

The elements discussed here are defined and explained in ch. 6 “Verse” of the TEI P5 Guidelines.

Texts in verse should be encoded using <lg> (line group), which in turn contains one or more <l> elements (lines). As with <div>, <lg> elements can nest. According to the TEI Guidelines <lg> is a sibling of, i.e. at the same level as, <p>, and cannot be contained within it (unless it appears within a <q> element). Example:

Elements          Contents
... </p>          A paragraph ends here,
<lg>              a line group starts here,
  <l> ... </l>    first line,
  <l> ... </l>    second line,
  <l> ... </l>    third line,
</lg>             the line group ends here,
<p> ...           and a new paragraph starts here.
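The exception mentioned above, a line group placed within a <q> element inside a paragraph, could be sketched as follows. This only illustrates what the content model permits, under the assumption that the verse is part of direct speech; the ellipses stand for transcribed text.

```xml
<p> ...
  <!-- the quotation wraps the verse, so <lg> is not a direct child of <p> -->
  <q>
    <lg type="stanza">
      <l> ... </l>
      <l> ... </l>
    </lg>
  </q>
  ...
</p>
```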

Nesting of <lg> elements is useful for marking up longer poems. When a poem consists of two levels of line groups one may encode its structure as shown here:

Elements                  Contents
<lg type="stanza">        Here a line group on level one begins, a stanza,
  <lg type="couplet">     here a subgroup starts, a couplet,
    <l> ... </l>          the first line,
    <l> ... </l>          second line,
  </lg>                   and here the subgroup ends, the first of the couplets.
  <lg>                    Here a new subgroup starts,
    <l> ... </l>          line,
    <l> ... </l>          line,
  </lg>                   here the second subgroup ends,
</lg>                     and here the level one line group ends.

The <lg> and <l> elements may have several attributes, among other things for encoding information about rhyme or other metrical phenomena. See ch. 9.2 of this handbook for a more detailed presentation of metrical encoding.

Having <p> and <lg> as siblings can create problems for the encoding of prosimetrum texts, where lines of verse or even whole poems can appear within prose text, often as part of direct speech. However, rather than including <lg> directly within the <p> element, we recommend inserting the <p> and <lg> elements within <div> elements, using one <div> for each of them:

Elements                      Contents
<div type="chapter" n="1">    A chapter opens here,
  <div type="text">           beginning with some prose text, indicated by a <div> element.
    <p> ... </p>              The text goes here,
  </div>                      and ends here, indicated by the <div> element.
  <div type="stanza">         Then a poem begins, indicated by a new <div> element
    <lg>                      with a linegroup (a stanza)
      <l> ... </l>            containing some lines.
    </lg>                     The linegroup ends here,
  </div>                      ... and the poem (i.e. the <div> element) also ends here.
  <div type="text">           A new piece of prose text begins, indicated by a new <div> element.
    <p> ... </p>              The text goes here,
  </div>                      and ends here, indicated by the <div> element.
</div>                        The chapter ends here.

3.7 Headings: <head>

The element <head> is used for containing headings on all levels of the document. If <head> is placed at the start of a <div> element, it typically contains a chapter heading:

Elements                Contents
<div>                   Here a chapter begins,
  <head> ... </head>    its heading,
  <p> ... </p>          the first paragraph of the chapter,
  <p> ... </p>          the second paragraph,
</div>                  and here the chapter ends.

Elements / attributes   Contents
<lb/>                   The opening of each line in a <p> element will be indicated by the <lb/> element
   @n                   specifies the line number with numerical values like '1', '2', '3', etc.
   @rend                specifies the physical position of the following part of a line with numerical values like '1', '2', '3', etc.; this attribute is only to be used in the case of discontinuous headings

We recommend that a text be encoded according to its logical chapter structure, i.e. that a chapter is closed by a </div> and the next chapter opened by a new <div>, containing a <head> element and then a number of <p> elements. This advice also applies to any manuscript in which headings appear in seemingly illogical positions.

Fig. 3.1. Holm Perg 34 4to, fol. 10v, lines 4–7.

Fig. 3.1 is a straightforward example. The only thing to note is that the first <div> ends in the middle of line 5 and the second <div> begins immediately afterwards, being opened with a <head> element. One should not, however, encode the <lb/> element more than once here, and that should be at the beginning of the line.

<div>
    <p> ....... 
      <lb n="4"/>ok þeir er með honom samþyckia nema kononge með hinna vitrazto manna
      <lb n="5"/>raðe litizt annat loglegare</p>
</div>
<div>
  <head>Um grið a frosta þingj Capitulus</head>
    <p>
      <lb n="6"/><c type="initial">A</c>ller þeir menn sem J frosto þings logum 
             ok for ero skulu J griðum 
      <lb n="7"/>vera hvær við annan þar til er þeir koma heim til sins hæimi
      .......
    </p>
</div> 
            

A suitable XSLT stylesheet will be able to display this piece of text according to its physical order,

3 ...............................
4 ok þeir er með honom samþyckia nema kononge með hinna vitrazto manna
5 raðe litizt annat loglegare Um grið a frosta þingj Capitulus
6 Aller þeir menn sem J frosto þings logum ok for ero skulu J griðum
7 vera hvær við annan þar til er þeir koma heim til sins hæimi
8 ..............................

as well as its logical structure, by displaying the rubric on a separate line, but otherwise in the same order as above:

3 ...............................
4 ok þeir er með honom samþyckia nema kononge með hinna vitrazto manna
5 raðe litizt annat loglegare
5 Um grið a frosta þingj Capitulus
6 Aller þeir menn sem J frosto þings logum ok for ero skulu J griðum
7 vera hvær við annan þar til er þeir koma heim til sins hæimi
8 ..............................

Note that in the latter display, the line number 5 appears twice, since the last part of this line, the rubric, has been displayed on a separate line.

Fig. 3.2. AM 619 4to, fol. 47r, lines 14–17.

The chapter division between lines 15 and 16 in Fig. 3.2 is also a fairly simple example. The rubric in the second half of line 16 should be encoded as a <head> in the second <div> of the example. This is straightforward, but the complicating factor is that we would like to encode (and be able to display) the fact that this rubric is located on the same line as the opening of the <div> which it is heading. In other words, if the manuscript is read line by line, the <head> is intercalated in the first sentence, “Salomon konungr gerðe fyrst || Rubric || mysteri guði...”. Our recommended encoding would be the following:

<div>
    <p> ....... 
      <lb n="14"/>ge gefen utan enda við miscunn drotens várs Iesu Crist þes er lifir
      <lb n="15"/>ok rikir með fæðr ok hælgum anda æin guð utan enda ameɴ.</p>
</div>
<div>
  <head><lb n="16" rend="2"/>Jn dedicatione templi. sermo.</head>
    <p>
      <lb n="16" rend="1"/><c type="initial">S</c>alomon konungr gerðe fyrst 
      <lb n="17"/>mysteri guði. ok bauð lyð ſinum at halda hotið þa er
      .......
    </p>
</div> 
            

In this encoding, the rubric is located correctly within the second <div> of the example, and the fact that it is physically located to the right of the beginning of the chapter text is indicated by the @rend attribute of the <lb/> element. Since the first part of line 16 contains the beginning of the text in the second <div> and the second part of the line contains the rubric which logically precedes the text, both parts should be specified with an <lb/> element with attributes for line number, 'n="16"', and for their physical order, 'rend="1"' and 'rend="2"'. The @rend attributes are needed to indicate that, although both parts are located on the same line, “Salomon konungr gerðe fyrst” is rendered first and “Jn dedicatione templi. sermo.” is rendered second. These @rend attributes were not needed in the previous example, since there was no conflict between the logical and physical order.

Note that the values for the @rend attribute always begin with 1 and continue upwards, but usually not to a higher number than 3. Since our recommended encoding gives priority to the logical order of the text, the numbering of the @rend attribute is needed to specify the physical order of the parts when there is a conflict between the logical and physical order – in this case that the <head> is located in the second part of line 16, not in the first part of the line, which one might expect from a logical point of view.

A suitable XSLT stylesheet will be able to display this piece of text according to its physical order:

13 ...............................
14 ge gefen utan enda við miscunn drotens várs Iesu Crist þes er lifir
15 ok rikir með fæðr ok hælgum anda æin guð utan enda ameɴ.
16 Salomon konungr gerðe fyrst Jn dedicatione templi. sermo.
17 mysteri guði. ok bauð lyð ſinum at halda hotið þa er
18 ..............................

Alternatively, a suitable stylesheet will be able to display this piece of text according to its logical order:

13 ...............................
14 ge gefen utan enda við miscunn drotens várs Iesu Crist þes er lifir
15 ok rikir með fæðr ok hælgum anda æin guð utan enda ameɴ.
16 Jn dedicatione templi. sermo.
16 Salomon konungr gerðe fyrst
17 mysteri guði. ok bauð lyð ſinum at halda hotið þa er
18 ..............................

Fig. 3.3. Holm Perg 34 4to, fol. 52r, lines 2–5.

The next example, Fig. 3.3, is somewhat more complicated. Here, the rubric extends over two lines, beginning at the end of line 3 and continuing at the end of line 4. (Note that the first line in the illustration is line number 2.) Furthermore, the word “skips|brotzmanna” is divided between lines 3 and 4. As in the previous example, we use the @rend attributes to specify the physical order of the parts when they are at odds with the logical order of the text. In this case, the conflict applies to the two parts of line 4:

<div>
    <p> ....... 
      <lb n="2"/>nema suo sem aðr var sagt En ef meira høggr bøte markar
      <lb n="3"/>spell ok landnam landz drottne nema hann lofe</p>
</div>
<div>
  <head><w>Um</w> <w>dugnad</w> <w>skips
  <lb n="4" rend="2"/>brotz manna</w></head>
    <p>
      <lb n="4" rend="1"/><c type="initial">N</c>v þarf skip upp at setia skere
             styri maðr <add>boð upp</add>
      <lb n="5"/>suo viða at þeir verði full aflla til upp at setia ok vt 
      .......
    </p>
</div> 

In this simplified example, the element <w> has only been used in the <head>, so as to illustrate the fact that the word “skips|brotzmanna” is divided over two lines. Since there is no hyphen in the manuscript, our recommended encoding of this particular word would be:

<w>skips<supplied><c type="hyphen">-</c></supplied><lb n="4" rend="2"/>brotz manna</w>

See ch. 4.5 above for advice on supplied hyphenation.

Once again, our stylesheet should be able to display the text according to its physical order, like here,

1 ...............................
2 nema suo sem aðr var sagt En ef meira høggr bøte markar
3 spell ok landnam landz drottne nema hann lofe Um dugnad skips-
4 Nv þarf skip upp at setia skere styri maðr \boð upp/ brotz manna
5 suo viða at þeir verði full aflla til upp at setia ok vt
6 ..............................

as well as according to its logical order:

1 ...............................
2 nema suo sem aðr var sagt En ef meira høggr bøte markar
3 spell ok landnam landz drottne nema hann lofe
3 Um dugnad skips-
4 brotz manna
4 Nv þarf skip upp at setia skere styri maðr \boð upp/
5 suo viða at þeir verði full aflla til upp at setia ok vt
6 ..............................

When double (or even triple) numbering of line breaks is used in a transcription, one should make sure that any automatic numbering program that is run on the <lb/> elements is set up not to override manually given numbers.


3.8 Page, column and line breaks: <pb/>, <cb/>, <lb/>

3.8.1 Page breaks and column breaks

TEI uses the empty element <pb/> to indicate page breaks. This element has an attribute @n which can be used for the page numbers. As it is customary to refer to the manuscript leaves, rather than pages, the value of the @n attribute should indicate front or back pages (recto, verso). Column breaks, <cb/>, should also be indicated in manuscripts with two or more columns. Recommended values for the @n attribute of the <cb/> element are “A”, “B” and so on. Example:

Elements        Contents
<pb n="1r"/>    Folio one, recto page, begins here,
<cb n="A"/>     the first column begins here,
<cb n="B"/>     and the second column begins here.
---
<pb n="1v"/>    Folio one, verso page, begins here,
<cb n="A"/>     the first column of the verso page begins here,
<cb n="B"/>     and the second column begins here.

Page break information from another source, for example a printed standard edition, can be encoded in addition to the <pb/> tagging that refers to the manuscript itself. For this purpose we recommend using the @ed attribute:

<pb ed="Standard Edition" n="1"/>
               

3.8.2 Line breaks

Line breaks are also indicated with an empty element, the <lb/>, which is placed at the beginning of a new line and may be numbered by using the @n attribute:

<lb n="1"/>Line number one begins here.
               

We recommend that each page, column and line be identified with an element at the very beginning. So for a manuscript with two columns, the three first lines in the first column on the back of the third leaf (folio) would be encoded in this manner:

<pb n="3v"/><cb n="A"/><lb n="1"/>This is the first line.
<lb n="2"/>This is the second line.
<lb n="3"/>This is the third line.
etc.
               

In other words, there should be as many <pb/> elements as there are pages, as many <cb/> elements as there are columns, and as many <lb/> elements as there are lines. We strongly discourage using the <lb/> element in the same way as the <br> element in HTML, where there is typically one <br> element fewer than there are lines (as the <br> element is inserted between the lines).

We recommend that <lb/> be used consistently for indicating the line breaks of the manuscript itself. One may include more than one layer of line break encoding, distinguishing them from one another with the @ed attribute, as shown for page breaks in ch. 3.8.1 above.
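Assuming that the @ed attribute is used on <lb/> in the same way as on <pb/>, two layers of line breaks might be kept apart as in the sketch below. The edition label and all line numbers are placeholders.

```xml
<!-- <lb/> without @ed: a line break in the manuscript itself -->
<!-- <lb/> with @ed: a line break in the named edition -->
<lb n="1"/>The first manuscript line begins here, <lb ed="Standard Edition" n="12"/>and line 12 of the edition begins in the middle of it.
<lb n="2"/>The second manuscript line begins here.
```

A stylesheet can then select one layer and ignore the other by testing for the presence and value of @ed.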


3.9 Overlapping structures

There are no simple ways of encoding overlapping structures in XML, since XML is a strict tree structure in which every element must be part of a single 'parent' element. For example, a word or sentence may be written over two manuscript pages. If we represent the manuscript page as an element, the words will not belong to a single page and a parser error will occur.

This problem is dealt with in the current chapter by using empty elements to represent page breaks in the manuscript, rather than a page of text (cf. ch. 3.8 above). The same is true for columns and lines, where words, sentences and paragraphs routinely overlap with the physical features of the manuscript. These elements, <pb/>, <cb/> and <lb/>, are empty in the sense that they are inserted at a specific point in the structure without any extension. For this reason, they are often referred to as milestones. Note the position of the slash in these elements.

In ch. 11 “Representation of Primary Sources” in the TEI P5 Guidelines the elements <addSpan/>, <delSpan/> and <damageSpan/> are defined. These elements are counterparts to the elements <add>, <del> and <damage>, but are all empty, and should be used when the feature to be encoded crosses structural divisions. There are in fact many more elements which can cross structural divisions, e.g. <sic>, <corr>, <unclear> and <supplied>, but there are no corresponding <sicSpan>, <corrSpan>, <unclearSpan> and <suppliedSpan>. Rather than adding these and several other elements we recommend using one generic empty element to cover all cases of overlapping structures. We have called this new element <me:textSpan/> and given it attributes from the classes “att.spanning”, “att.transcriptional”, “att.typed” and “att.global”, and the attribute @me:category:

Elements / attributes   Contents
<me:textSpan/>          A generic element to handle overlapping text structures
   @category            Specifies the type of span, restricted to this list of values:
      'add'             for contents that would otherwise be contained by the <add> element, cf. ch. 7.2.1
      'corr'            for contents that would otherwise be contained by the <corr> element, cf. ch. 7.4.3
      'del'             for contents that would otherwise be contained by the <del> element, cf. ch. 7.2.2
      'damage'          for contents that would otherwise be contained by the <damage> element, cf. ch. 7.5.1
      'gap'             for contents that would otherwise be contained by the <gap/> element, cf. ch. 7.3.1
      'me:expunged'     for contents that would otherwise be contained by the <me:expunged> element, cf. ch. 7.4.2
      'sic'             for contents that would otherwise be contained by the <sic> element, cf. ch. 7.4.3
      'supplied'        for contents that would otherwise be contained by the <supplied> element, cf. ch. 7.4.1
      'unclear'         for contents that would otherwise be contained by the <unclear> element, cf. ch. 7.3.2
      'other'           for any other contents
   @spanTo              Specifies the end point of the text span, using values like:
      'an1'             anchor 1
      'an2'             anchor 2, etc.
<anchor/>               An empty element (milestone) which attaches an identifier to a point within a text
   @xml:id              Specifies the identifier corresponding to the one used in the @spanTo attribute of the preceding <me:textSpan> element, using values like:
      'an1'             anchor 1
      'an2'             anchor 2, etc.

We will discuss an example of an overlapping structure in AM 673 b 4to (Plácitusdrápa 1):

Fig. 3.4. AM 673 b 4to, fol. 1r, ll. 1-4

The first three lines read approximately:

genget fiornes ualdr [quaþ........fr]egr nu | mun er lægiasc miuks scalldu manra[un sli] | ca morlins boþe finna uestu i frægre f[rest]

The letters in brackets were read by earlier editors, especially Finnur Jónsson in 1889. For this section, we will discuss the text at the end of the second line and at the start of the third. It is clear that part of each word is missing, but the damaged manuscript forms a single feature. Text can be supplied from Finnur Jónsson’s transcription, but we want to represent both the damage and the supplied text as a single feature, which overlaps with the middle of the two words. The simple encoding, without the unclear text marked or the supplied text, would be:

<w>manra<gap/></w>
<w><gap/><lb n="3"/>ca</w>

With the supplied text encoded in the conventional way, the following would produce an error:

<!-- WRONG: -->
<w>manra<supplied resp="FJ">un</w>
<!-- the processor stops here because this is not well-formed XML --> 
<w>sli</supplied><lb n="3"/>ca</w>

The <unclear> and <supplied> elements, if used in their conventional way, would overlap with the <w> elements, meaning that the word tag would close before an element inside it had closed. That would stop an XML processor from proceeding any further with the document.

In these guidelines, we offer two solutions to the problem of overlapping structures. The first is more complex, but more robust. The second is simpler, but is less machine-readable and may affect the validation of the document structure in other respects. Even so, we recommend the latter solution.

3.9.1 Linked segments

The following approach is more sound from the point of view of an XML document, but creates extra tagging. The feature is encoded in a series of separate elements, linked together.

In order to encode linked segments, the encoder should break the overlapping feature into parts which fit within the XML structure (usually within the word or dipl/facs/norm elements). Each part is identified using the @xml:id attribute, and they are linked together using the following attributes:

Elements / attributes   Contents
@xml:id                 provides a unique identifier for the element bearing the attribute
@next                   used at the start and in the middle: an IDREF pointing to the element which marks the next tag of the same feature
@prev                   used in the middle and at the end: an IDREF pointing to the element which marks the previous tag of the same feature

The two-word example above is encoded thus:

<w>man<supplied source="FJ" xml:id="sup1.1" next="sup1.2">raun</supplied></w>
<w><supplied xml:id="sup1.2" prev="sup1.1">ſli</supplied><lb n="3"/>ca</w>

Adding all three textual levels, including the unclear text encoded at the facs level, we would have:

<w>
  <choice>
    <me:facs>man<unclear xml:id="unc1.1" next="unc1.2">
      <gap extent="8"/></unclear></me:facs>
    <me:dipl>man<supplied source="FJ" xml:id="sup1.1" next="sup1.2">raun
      </supplied></me:dipl>
    <me:norm>manraun</me:norm>
  </choice>
</w>
<w>
  <choice>
    <me:facs><unclear xml:id="unc1.2" prev="unc1.1">ſli</unclear>
      <lb n="3"/>ca</me:facs>
    <me:dipl><supplied xml:id="sup1.2" prev="sup1.1">ſli</supplied>
      <lb />ca</me:dipl>
    <me:norm>slíka</me:norm>
  </choice>
</w>

It is recommended that the additional information for the feature (such as the editor responsible, type, etc.) be only included in the first element, but editors may wish to include the attributes in all elements.

For the purposes of display, the start of a feature can be marked by selecting the element with the 'next' attribute set, but not the 'prev'; and the end can be marked by selecting the element with the 'prev' attribute set but not the 'next'.
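The selection rule just described might be expressed in XSLT roughly as follows. This is a sketch under several assumptions: that the transcription is in the TEI namespace, that only <supplied> needs handling, and that square brackets are the desired way of marking the start and end of the feature. None of this is prescribed by the handbook.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Start of a linked feature: @next is present, @prev is not. -->
  <xsl:template match="tei:supplied[@next and not(@prev)]">
    <xsl:text>[</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- End of a linked feature: @prev is present, @next is not. -->
  <xsl:template match="tei:supplied[@prev and not(@next)]">
    <xsl:apply-templates/>
    <xsl:text>]</xsl:text>
  </xsl:template>

  <!-- Middle segment: both attributes present, so no bracket is written. -->
  <xsl:template match="tei:supplied[@prev and @next]">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Ordinary, unlinked <supplied>: brackets on both sides. -->
  <xsl:template match="tei:supplied[not(@prev) and not(@next)]">
    <xsl:text>[</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>]</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The four match patterns are mutually exclusive, so every <supplied> element is handled by exactly one template and a feature split over several words receives a single opening and a single closing bracket.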

3.9.2 Boundary marking with empty elements

Another solution is to encode the beginning and end of a text span with empty elements. This method has been described in ch. 20 “Non-hierarchical Structures” of the TEI P5 Guidelines and will be applied here in a slightly modified version. As outlined above, we have introduced a generic element <me:textSpan/> which is specified by way of a @category attribute. If, for example, the overlapping structure to be encoded is a piece of supplied text, this fact is expressed through the value of the @category attribute:

<me:textSpan category="supplied"/> 

Thus, all instances of supplied text in the file will either be contained in <supplied> elements (in non-overlapping contexts) or in <me:textSpan category="supplied"> elements (in overlapping contexts).

In addition to inserting the empty <me:textSpan/> element at the beginning of the textual span, an attribute @spanTo is added with a suitable index, e.g.

<me:textSpan category="supplied" spanTo="an1"/> 

It now remains to mark the end of the span, i.e. the extent of the supplied text, with another empty element, the TEI <anchor/> element. This must be specified with an @xml:id attribute having the same index as the @spanTo attribute at the beginning of the span:

<anchor xml:id="an1"/> 

The full encoding will be like this:

<w>man<me:textSpan category="supplied" spanTo="an1"/>raun</w>
<w>ſli<anchor xml:id="an1"/><lb n="3"/>ca</w> 

Note that the value of the @xml:id attribute must be unique within the whole document.

There is no simple answer to the problem of non-hierarchical structures in XML encoding. However, we believe that using empty elements as boundary markers may prove to be the simplest and most general encoding, and it is therefore the solution we recommend. Whichever technique is chosen, only one method should be used in each document.


First published 28 August 2016. Last updated 30 July 2017.