Chapter 1. Text encoding using XML

Version 1.0 (20 May 2003)

1.1 What is XML?
1.2 Appearance vs. structure
1.3 Elements
1.4 Attributes
1.5 Entities
1.6 Putting the pieces together
1.7 The Text Encoding Initiative
1.8 The TEI DTD (Document Type Definition)
1.9 Displaying the text

1.1 What is XML?

XML, Extensible Markup Language, is a standard, endorsed by the World Wide Web Consortium, which defines a simple yet flexible generic syntax for document markup. XML, like its predecessor SGML, Standard Generalised Markup Language, developed by IBM in the 1970s, allows for the definition of system-independent methods of representing texts of any kind in electronic form.

The term 'markup', originally used for the (hand-written) instructions added to a manuscript or typescript to indicate to the compositor how the printed text was to look in terms of spacing, font size, use of italics and so on, has been carried over into electronic word-processing to describe the codes used to indicate these same features and other aspects of processing. A 'markup language' is therefore at its most simple a set of codes which are used to indicate or 'tag' certain features in the text, normally for formatting purposes. In most modern software packages the markup is generated with little or no conscious effort on the part of the user - in many modern word processing programs, such as the ubiquitous Microsoft Word, the user is not even given the option of viewing the codes. But they are there: and to see just how many one need only open a document produced in, say, Word or WordPerfect in a plain text editor such as Notepad. A text of even a few short lines will be prefaced by several dozen lines - possibly even pages - of code.

The problem is that every program has its own set of codes, and it is only rarely possible to convert files from one to another without at least some loss of formatting. And it isn't just the formatting that goes haywire - any exotic (read non-English) characters are also likely to mutate. SGML was originally developed in order to avoid these problems by being entirely platform independent - hence G for generalised. It achieves this by identifying the logical elements of the document rather than specifying the processing to be performed on it: the markup is descriptive, in other words, rather than procedural. With descriptive markup, the same document can be processed by many different pieces of software, each of which can apply different processing instructions to those parts of it which are considered relevant.

SGML's greatest success has been HTML, Hyper-Text Markup Language, the language of the World Wide Web. HTML restricts document authors to a finite set of tags, however, most of which are presentationally oriented, and is thus inappropriate for most things other than web design. XML is essentially 'trimmed down' SGML. It is not, in other words, a single, predefined markup language like HTML: like SGML it is a metalanguage - a language for describing other languages. The syntax is essentially the same as SGML, but some of the more complex and lesser used options have been removed.

The great advantage of XML is that it brings the power and flexibility of SGML to the Web; an XML document can be marked-up entirely in accordance with the needs of the user and the result displayed in a standard web browser (see section 1.9 below). The implications for philologists are staggering.

In what follows, most of the more relevant areas of XML markup are touched upon. For a more thorough grounding, one of the many printed handbooks or websites devoted to XML should be consulted. A good place to start would be the World Wide Web Consortium's own XML pages: http://www.w3c.org/XML/.

1.2 Appearance vs. structure

It is customary, in English and most other western European languages, to use italic type in texts printed otherwise in plain roman to set certain things off from the rest. Hart's rules for compositors and readers at the University Press, Oxford (39th ed.), for example, stipulates that the titles of books, films, plays, works of art and periodicals (but not chapters, shorter poems, articles) should be printed in italic, as should the names of ships (but not public houses), words and short phrases in foreign languages (other than those, such as quiche and blitzkrieg, that have been sufficiently anglicised so as to render this unnecessary), stage directions in plays, theorems in mathematical works and biological and zoological nomenclature. Although Hart's doesn't mention it, italic is also regularly used to indicate emphasis, for example in novels: 'I most certainly didn't ask him to come.' With ordinary word-processing software, all these things would be marked-up in the same way, i.e. with the relevant codes for 'italic-on' and 'italic-off'. If you think of the computer as a glorified typewriter and are only interested in producing copy with the correct formatting, fine. If you wish to take advantage of the possibilities offered by sophisticated information retrieval systems, however, you're in trouble, since a search engine will not be able to distinguish foreign words from book titles or the names of ships, for the simple reason that procedural markup such as that produced by ordinary word-processing software only indicates how something is to be displayed, but not why it is to be displayed that way. With descriptive markup, on the other hand, elements in the text are tagged according to their function - titles as titles, foreign words as foreign words, stage directions as stage directions and so on. These can then be processed in whatever way one desires, for example displayed in italics.

By concentrating on the structure of the document rather than its appearance a great many possibilities are opened up. Elements in the text can be marked-up even where one has no desire to format them in any special way. One might wish, for example, to tag the names of persons, so that a search for 'King George' would turn up only persons of that name rather than vessels or public houses.

1.3 Elements

The key concept in SGML/XML markup is the element. An element is essentially a textual unit, the idea being that texts, like houses, are made up of repeated occurrences of basic units arranged in a hierarchical structure; longer works in prose will be divided into chapters or sections, and these into sub-sections and then further into paragraphs, and there also may be lists and tables. Works of poetry may be divided into cantos or fits, and these into stanzas, and the stanzas into couplets, the couplets into lines, the lines into feet etc. The individual sections, whether chapters or cantos, will often have headings, which are not strictly speaking part of the main text, but nevertheless belong with it. Moreover, these elements will only combine in certain ways. A chapter will not begin in the middle of a paragraph, for example, or in a footnote. In SGML/XML pairs of tags are used to mark off these units, a start tag and an end tag, with the text in between being referred to as the element's content. Tags are placed within angle brackets, with a solidus to indicate an end tag. Chapters in a book, for example, could be demarcated by placing a <chapter> tag at the beginning of each one and a corresponding </chapter> tag at the end, while within each chapter there would be any number of paragraphs, tagged, say, <paragraph>. The way these two elements relate to each other hierarchically is determined by the DTD, or Document Type Definition, which in this case would stipulate that a <chapter> must contain one or more <paragraph> elements. SGML/XML syntax is really quite simple: for each element there is a declaration enumerating what other elements it may or must contain, how many of each, and if there are any constraints on the order. The more elements one has in one's system the more complicated, and subtle, that system becomes.

Let us take a concrete example, the poem 'Upon Julia's Clothes' by Robert Herrick:

When as in silks my Julia goes,
Then, then (me thinks) how sweetly flowes
That liquefaction of her clothes.

Next, when I cast mine eyes, and see
That brave Vibration each way free;
O how that glittering taketh me!

The structure of the poem is clear enough: it is made up of two stanzas each of which contains three lines. This structure could be marked-up in the following way:

<poem>
<stanza>
<line>When as in silks my Julia goes,</line>
<line>Then, then (me thinks) how sweetly flowes</line>
<line>That liquefaction of her clothes.</line>
</stanza>
<stanza>
<line>Next, when I cast mine eyes, and see</line>
<line>That brave Vibration each way free;</line>
<line>O how that glittering taketh me!</line>
</stanza>
</poem>
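The hierarchy expressed by this markup can be inspected by a program. What follows is only a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser and the element names poem, stanza and line used above; the function name lines_per_stanza is invented for the illustration.

```python
import xml.etree.ElementTree as ET

POEM = """<poem>
<stanza>
<line>When as in silks my Julia goes,</line>
<line>Then, then (me thinks) how sweetly flowes</line>
<line>That liquefaction of her clothes.</line>
</stanza>
<stanza>
<line>Next, when I cast mine eyes, and see</line>
<line>That brave Vibration each way free;</line>
<line>O how that glittering taketh me!</line>
</stanza>
</poem>"""


def lines_per_stanza(xml_text):
    """Return the number of <line> children found in each <stanza>."""
    poem = ET.fromstring(xml_text)
    return [len(stanza.findall("line")) for stanza in poem.findall("stanza")]


print(lines_per_stanza(POEM))  # one count per stanza
```

Because the stanzas and lines are tagged as such, the program does not have to guess at the structure from blank lines or line breaks.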

If we abstract from this and attempt to describe the structure of poems in general we could say that a poem consists of one or more stanzas each of which is made up of one or more lines. This structure could be expressed in a Document Type Definition as follows:

<!ELEMENT poem        (stanza+) >
<!ELEMENT stanza      (line+) >
<!ELEMENT line        (#PCDATA) >

The + sign after stanza and line means they are required and repeatable, i.e. can occur one or more times (a question mark would indicate an optional element, i.e. one which can occur zero or one time, while an asterisk would indicate that the element was optional and repeatable, i.e. can occur zero or more times; if there is no occurrence indicator, the element must occur once and only once). #PCDATA is 'parsed character data', which essentially means any number of valid characters. There is one obvious problem with this model, which is that it requires that all poems consist of at least one stanza, which is somewhat counter-intuitive, since it could be argued that a poem of only one stanza is made up only of lines. To remedy this, the content model for the poem element could be given as (line+ | stanza+); the two are separated by a vertical bar, the 'or' connector, which shows that either can be used but not both, and each is marked with a plus sign to show that whichever is used, it can be used more than once. A poem, in other words, consists either of one or more lines or of one or more stanzas (each comprising one or more lines).

A poem will also normally have a title and be attributable to an author (even if that author is the highly prolific 'Anon.'). The name of the author will obviously not always appear with the poem, however, for example in a series or collection, where there are several poems by the same author. These could be added by redefining the content model of poem as:

<!ELEMENT poem (title, author?, (stanza+ | line+))>

Here, the pair of elements stanza and line are grouped together in round brackets (parentheses) to show they are to be treated together; the comma acts as 'sequence connector', indicating that the elements/groups must occur in this order. In plain prose this means that a poem must have a title, may have an author, and then will have either one or more stanzas or one or more lines. This is still not entirely satisfactory, however, since there will also be poems without titles (haikus for example), which would not be allowed with this content model. Nor would it be possible to reverse the order of author and title. The content model must therefore ideally allow for optional and non-repeatable title and author elements which can appear in either order, but only preceding the text of the poem itself. In order to allow for this kind of flexibility, the DTD needs to become slightly more complex:

<!ELEMENT poem (((title, author?) | (author, title?))?, (stanza+ | line+))>

Here, the two possibilities are grouped together in round brackets with an or connector in between - title with an optional author or author with an optional title - and both possibilities are marked as optional.
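A content model of this kind is in effect a regular expression over the names of an element's children. The following Python sketch is an illustration only, not how a validating parser is implemented: it assumes the model just described (an optional group of title with optional author, or author with optional title, followed by one or more stanzas or one or more lines), and the names POEM_MODEL and matches_poem_model are invented for the example.

```python
import re

# Each child element name is written as the name followed by a space, so
# that the sequence of children becomes a string the pattern can match.
POEM_MODEL = re.compile(
    r"(?:(?:title (?:author )?)|(?:author (?:title )?))?"  # optional heading group
    r"(?:(?:stanza )+|(?:line )+)"                         # stanzas or lines, not both
)


def matches_poem_model(child_names):
    """Check a sequence of child element names against the poem model."""
    sequence = "".join(name + " " for name in child_names)
    return POEM_MODEL.fullmatch(sequence) is not None


print(matches_poem_model(["title", "author", "stanza", "stanza"]))
print(matches_poem_model(["author", "title", "line"]))
print(matches_poem_model(["line", "line"]))
print(matches_poem_model(["title", "title", "stanza"]))
print(matches_poem_model(["title", "stanza", "line"]))
```

The first three sequences satisfy the model; the fourth fails because title is not repeatable, and the fifth because stanzas and lines may not be mixed as direct children of the poem.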

A simpler way of dealing with this might be to define an element called, say, head, with the following content model:

<!ELEMENT head (#PCDATA | author | title)*>

This would allow one to preface the poem with plain text, author and title elements in any combination. This allows much greater flexibility, but also reduces greatly the amount of control one has over the document. With this content model, for example, there would be nothing to prevent one from having more than one title.

It is obvious too that a larger unit is required in order to accommodate more poems, for example <collection>. If one envisaged this collection as an anthology, one would probably wish to divide it into sections, in which poems by a particular poet were grouped together. Each of these sections would have a heading and possibly some prefatory matter, giving information on the author. Given the complex structure of all but the simplest documents, it is easy to see how a DTD can quickly become very complicated indeed.

Other elements used in markup have less to do with the overall structural hierarchy of the document and are more free-floating, i.e. can appear in a variety of contexts. The principal use of tagging such as this is to enable searches: one marks things so as to be able to find them later. One might, for example, wish to indicate that 'Julia' in line one of our poem is the name of a person:

<line>When as in silks my <name>Julia</name> goes,</line>

1.4 Attributes

Without further information, the usefulness of such tagging is sometimes limited. More specific information about a particular element instance can be given as an attribute. Looking at the poem just cited, one might, for example, want to add attributes to the elements <poem> and <stanza>, using convenient typologies to indicate genre, form, metre or rhyme-scheme, and a number attribute to the stanza and line elements. It might also be an advantage to indicate the type of name, in order to distinguish personal names from the names of places, ships, public houses etc., or perhaps even have a separate element <person>, with attributes such as gender and role:

<line number="1">When as in silks my <person gender="female" role="object of desire">Julia</person> goes,</line>

One might want to identify the author in some more precise or uniform way, for example to distinguish this Robert Herrick, the 17th-century English poet and divine, from the early 20th-century American novelist of the same name (useful too for indexing); this could be done with an attribute called reg, for 'regularised', as follows:

<author reg="Herrick, Robert (1591-1674)">Robert Herrick</author>

Like elements, attributes are declared in the DTD, a list of possible attributes (ATTLIST) being given for each element; it is also possible to specify what kind of value is acceptable for each attribute, and if necessary a default value.

<!ELEMENT person        (#PCDATA) >
<!ATTLIST person    gender   (male | female | unknown) "unknown"
    role     CDATA    #IMPLIED   >

In the original 1648 edition of Herrick's Hesperides, which included the poem cited here, all names of persons were printed in italics. Reproducing this would be an easy matter if all personal names were tagged as such, but the real advantage of this kind of tagging is for search purposes. One could, for example, search for all the women who were the objects of Herrick's desire - and there were many (the fact that he probably made them all up needn't concern us).
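A search of this kind might be sketched as follows in Python, assuming the standard-library xml.etree.ElementTree parser and the person element with its gender and role attributes as used above; the function find_persons is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

LINE = (
    '<line number="1">When as in silks my '
    '<person gender="female" role="object of desire">Julia</person>'
    " goes,</line>"
)


def find_persons(xml_text, **attributes):
    """Return the text of every <person> whose attributes all match."""
    root = ET.fromstring(xml_text)
    return [
        person.text
        for person in root.iter("person")
        if all(person.get(key) == value for key, value in attributes.items())
    ]


print(find_persons(LINE, gender="female", role="object of desire"))
print(find_persons(LINE, gender="male"))
```

The first search finds Julia; the second finds nothing, since no person in the line is tagged as male.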

Attributes such as type and subtype are useful precisely because they allow for searches at varying degrees of abstraction. Markup such as the following:

I have a <dog>dog</dog>.

is all but pointless, since a free-text search for the word 'dog' would yield the same result. If, on the other hand, one moves to a greater level of abstraction, for example to:

I have a <animal class="mammalia" order="carnivora" family="canidae" genus="canis" species="familiaris">dog</animal>.

one could search for all animals, all mammals, all carnivores, all dog-like creatures and finally all dogs (wild and domestic), depending on how widely one wished to cast one's net. The use of standard international typologies such as this also allows for cross-linguistic searches.
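The idea of casting the net more or less widely can be illustrated with a small Python sketch using the standard-library xml.etree.ElementTree parser. Only the first sentence of the sample is taken from the text above; the other sentences, the wrapper elements text and s, and the function find_animals are assumptions made purely for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the 'dog' sentence comes from the text above.
SAMPLE = (
    "<text>"
    '<s>I have a <animal class="mammalia" order="carnivora" family="canidae" '
    'genus="canis" species="familiaris">dog</animal>.</s>'
    '<s>I saw a <animal class="mammalia" order="carnivora" family="canidae" '
    'genus="canis" species="lupus">wolf</animal>.</s>'
    '<s>I fed a <animal class="mammalia" order="carnivora" family="felidae" '
    'genus="felis" species="catus">cat</animal>.</s>'
    '<s>I rode a <animal class="mammalia" order="perissodactyla" '
    'family="equidae" genus="equus" species="caballus">horse</animal>.</s>'
    "</text>"
)


def find_animals(xml_text, criteria):
    """Return the text of every <animal> matching all attribute criteria."""
    root = ET.fromstring(xml_text)
    return [
        animal.text
        for animal in root.iter("animal")
        if all(animal.get(key) == value for key, value in criteria.items())
    ]


print(find_animals(SAMPLE, {"class": "mammalia"}))      # widest net
print(find_animals(SAMPLE, {"order": "carnivora"}))
print(find_animals(SAMPLE, {"family": "canidae"}))
print(find_animals(SAMPLE, {"species": "familiaris"}))  # narrowest net
```

The criteria are passed as a dictionary rather than as keyword arguments because class is a reserved word in Python.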

1.5 Entities

The aspects of SGML/XML discussed so far are all concerned with the markup of structural elements within the document. SGML/XML also provides a mechanism for encoding and naming parts of the document's content, through entities. An entity is a kind of shorthand, a way of stating that a particular string of characters in the document should be replaced when the document is processed by some other string; this other string can be of any length, from a single character to a separate file containing millions of bytes, for example a text file or digital image. The name of the entity (entity reference) is placed between an ampersand and a semicolon: &entityname;. Such entities, which are known as General Entities, are defined in the DTD, in the Document Type Declaration subset or in an external entity set which is itself referenced in the Document Type Declaration subset. A simple declaration for a general entity looks like this:

<!ENTITY rh "Robert Herrick">

Here, whenever the processing software (a parser or browser) encounters the entity &rh; it replaces it with the text 'Robert Herrick'. In the case of our single poem, there is obviously no real advantage to treating the name of the author in this way, but in longer documents or collections of documents it can be an extremely efficient way of dealing with repeated content. Entities may also contain XML markup (provided it is well-formed) as well as text:

<!ENTITY rh "<author>Robert Herrick</author>">
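How a parser resolves a general entity can be seen in the following Python sketch, which assumes the standard-library xml.etree.ElementTree parser and uses the simpler, text-only declaration of rh given first; the entity is declared in the internal subset of the Document Type Declaration so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE poem [
<!ENTITY rh "Robert Herrick">
]>
<poem>
<title>Upon Julia's Clothes</title>
<author>&rh;</author>
</poem>"""

# The parser replaces the reference &rh; with the declared replacement text.
poem = ET.fromstring(DOCUMENT)
author = poem.find("author").text
print(author)
```

By the time the program sees the content of author the reference has already been replaced, so nothing downstream needs to know that an entity was used.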

An entity can also refer to an external file, as in the following example:

<!ENTITY chapter1 SYSTEM "chapter1.xml">

Such entities are called system entities: instead of the replacement text, the SYSTEM keyword and a relative or absolute URL are given. The processing software will then replace the entity with the document found at the address given, i.e. insert that document into the existing document. The resulting document must be well-formed XML, so one must ensure that the document to be inserted is itself well-formed (although it need not have a single root element) and does not for example contain a prologue (i.e. XML and/or DOCTYPE declaration).

A third type of entity is the parameter entity; parameter entities are used inside markup declarations and need not concern us here.

Entities are particularly useful for providing descriptive mappings for non-standard (i.e. non-English) characters, such as the accented vowels (í, ê, ä etc.) used in many European languages, the German ß, Icelandic Þ or Danish and Norwegian ø, which are notoriously non-portable. There are standard entity sets for characters used in the western-European languages (ISOlat1 and ISOlat2), as well as character sets for Greek (ISOgrk1), Cyrillic (ISOcyrl1) and other alphabets. The Unicode standard covers all the world's languages, living and dead, and also allows for user-defined characters; the current version contains over 96,000 characters. Each of these characters is assigned a unique code point, which can be encoded in a variety of ways. The most common is UTF-8. Numerical character references are either decimal or hexadecimal; decimal references begin with an ampersand and the number sign or hash mark (#), to which hexadecimal references add an x. Thus the hexadecimal character reference for the letter þ is &#xFE;, while the decimal reference is &#254;. These numerical character references are supported by standard browsers and do not need to be defined specially in the DTD. One may, however, prefer to use entities which are more immediately intelligible to humans, for reasons of proof-reading or whatever, for example &thorn; for þ. It is a simple matter to define characters as entities, giving as the replacement text the numerical character reference.

<!ENTITY thorn "&#xFE;">
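The relationship between the code point, the two kinds of numerical character reference and the UTF-8 encoding can be checked with a few lines of Python. This is a sketch only, assuming the standard-library xml.etree.ElementTree parser; the element name c is arbitrary.

```python
import xml.etree.ElementTree as ET

THORN = "\u00fe"  # the letter thorn, Unicode code point U+00FE

# A hexadecimal and a decimal reference to the same character.
element = ET.fromstring("<c>&#xFE;&#254;</c>")

print(element.text == THORN + THORN)  # both references yield the same letter
print(hex(ord(THORN)), ord(THORN))    # the code point in both notations
print(THORN.encode("utf-8"))          # the two bytes used for it in UTF-8
```

The code point is a single number, however it is written in the reference; UTF-8 is merely one way of storing that number as bytes.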

Entity references can also be used for characters required for specific kinds of texts; the producer of a diplomatic text edition might want to distinguish between single and two-storey a, for example, by using separate entities. These entities could have the same replacement text, and thus appear identical when displayed, but still be available for search-purposes.

Entities for Old Norse special characters are defined, with their Unicode values, in ch. 5 and ch. 6, and a complete list is given in the character list.

1.6 Putting the pieces together

These, then, are the basic parts of an SGML/XML document. The key is the DTD, in which the elements, with their attributes, are defined in terms both of their content and their relationship to other elements, and entities or entity sets are defined or referenced. Anyone can, if he or she so wishes, devise a DTD to meet whatever encoding needs he or she may have. A host of XML authoring tools, parsers, browsers and search engines are available, many for free over the web. If all you want to do is make a searchable list of your CD collection, for instance, it is a relatively simple matter to create your own DTD. Most people will prefer to use an existing application, however. As was said, the most successful implementation of SGML to date is HTML (the XML version is known as XHTML). But HTML is not flexible enough to deal with any but the most basic of texts: there simply aren't enough elements. Another fundamental weakness of HTML is that it has, despite its origins in SGML, decidedly procedural (rather than descriptive) tendencies, in that many elements are used for the effect they will produce when the document is displayed rather than to mark up structural features. <p>, for paragraph, for example, is used to give white space, rather than necessarily indicating the beginning of a paragraph, and <ul>, for unordered (i.e. unnumbered) list, is frequently used to produce indentation rather than for lists.

The Herrick poem, marked-up in HTML, would look like this:

<html>
<body>
<h2>Upon Julia's Clothes</h2>
<h3>Robert Herrick</h3>
<p>
When as in silks my <i>Julia</i> goes,<br>
Then, then (me thinks) how sweetly flowes<br>
That liquefaction of her clothes.</p>
<p>
Next, when I cast mine eyes, and see<br>
That brave Vibration each way free;<br>
O how that glittering taketh me!</p>
</body>
</html>

Marked-up in XML, the poem looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE poem SYSTEM "poem.dtd" [
<!ENTITY rh "Robert Herrick">
]>
<poem type="lyric" subtype="erotic">
<title>Upon Julia's Clothes</title>
<author reg="Herrick, Robert (1591-1674)">&rh;</author>
<stanza number="1" form="tercet" rhyme="AAA" metre="iambic tetrameter">
<line number="1">When as in silks my <person gender="female" role="object of desire">Julia</person> goes,</line>
<line number="2">Then, then (me thinks) how sweetly flowes</line>
<line number="3">That liquefaction of her clothes.</line>
</stanza>
<stanza number="2" form="tercet" rhyme="AAA" metre="iambic tetrameter">
<line number="1">Next, when I cast mine eyes, and see</line>
<line number="2">That brave Vibration each way free;</line>
<line number="3">O how that glittering taketh me!</line>
</stanza>
</poem>

Displayed in a browser, there would not necessarily be any difference between this and the same text marked-up in HTML, but the underlying information is far greater. From the point of view of a search engine, all that can be said about the HTML text is that it consists of two paragraphs each of which contains three line breaks, but there is no indication as to why this should be, i.e. there is nothing that says 'this is a poem', 'this is a stanza', 'this is a line of verse'. The name Julia, in the same way, will be rendered in italics, but there is nothing which indicates that it is the name of a person.

The first line of our XML document is the XML declaration, which tells any processing software that the document is in XML; it is not strictly speaking necessary to have an XML declaration, since any XML-aware software can work out for itself whether a document is in XML or not (the *.xml file extension also does this), but every XML document should ideally begin with one. The value of the version attribute is always "1.0"; it is possible, but unlikely, that there will be further versions. The other two attributes are optional: encoding, which specifies which encoding is to be used (the variable length encoding of the Unicode character set, UTF-8, is assumed by all the standard browsers), and standalone, the possible values for which are "yes" and "no", which indicates whether the document makes use of an external DTD. XML documents do not in fact require a DTD, provided they are 'well formed', i.e. do not contain any errors in syntax (elements which overlap, or are opened but not closed etc.); in cases where there is no DTD the value of the standalone attribute should be "yes". If the attribute is omitted, the value "no" is assumed. The second line, following the XML declaration, is the Document Type Declaration, which gives the root element (also called the document element), i.e. the 'outermost' element inside which all other elements in the document are contained, in our case poem, and states which DTD is to be used, given either as a relative or absolute URL, i.e. a local file name or a web address. The DTD referred to in the Document Type Declaration, 'poem.dtd', now looks like this:

<!ELEMENT poem        (((title, author?) | (author, title?))?, (stanza+ | line+))>
<!ATTLIST poem
    type      CDATA    #IMPLIED
    subtype   CDATA    #IMPLIED   >
<!ELEMENT author      (#PCDATA) >
<!ATTLIST author
    reg       CDATA    #IMPLIED   >
<!ELEMENT title       (#PCDATA) >
<!ELEMENT stanza      (line+)   >
<!ATTLIST stanza
    number    CDATA    #IMPLIED
    form      CDATA    #IMPLIED
    rhyme     CDATA    #IMPLIED
    metre     CDATA    #IMPLIED   >
<!ELEMENT line        (#PCDATA | person)* >
<!ATTLIST line
    number    CDATA    #IMPLIED   >
<!ELEMENT person      (#PCDATA) >
<!ATTLIST person
    gender    (male | female | unknown) "unknown"
    role      CDATA    #IMPLIED   >

The DTD, as was said, defines the structure of the document, and is thus like a grammar, detailing the elements which can appear in the text and their hierarchical relationship to each other. In order to ensure that this is done correctly, every encoded text needs to be checked against the DTD, if there is one, or checked for 'well-formedness' if there isn't. A computer program called a parser runs through the encoding and gives an error message if there are errors or inconsistencies in the markup, e.g. if elements are not opened and closed correctly, or used in the wrong place, or if elements overlap. If the elements in a document are correctly opened and closed, and non-overlapping, the document is called well-formed; a parser can determine 'well-formedness', as was mentioned above, without recourse to a DTD. If, in addition, the content types of elements, the nesting of elements and the use of attributes are all done according to the specification of the DTD, then the document is not only well-formed, but also valid.
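The check for well-formedness can be illustrated with a short Python sketch, assuming the standard-library xml.etree.ElementTree parser; the function name is_well_formed is invented for the example. This parser does not validate against a DTD, so the sketch covers only the first of the two levels of correctness described above; validation proper requires a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser objects."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Correctly nested elements.
print(is_well_formed("<stanza><line>When as in silks</line></stanza>"))
# Overlapping elements: <line> is closed after <stanza>.
print(is_well_formed("<stanza><line>When as in silks</stanza></line>"))
# An element that is opened but never closed.
print(is_well_formed("<stanza><line>When as in silks</line>"))
```

Only the first of the three fragments is accepted; the other two cause the parser to report an error.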

Validation only checks the markup - not the content - of the document. A document can consist entirely of gibberish and still be valid - as, indeed, can a document with no content at all. The correctness or otherwise of the content is still the responsibility of the transcriber.

XML-aware software, such as SoftQuad's XMetaL, generally comes with a built-in validator, and separate validator programs are also available; the one probably most commonly used is James Clark's SP. It is also possible to validate documents online, for example with the validator at Brown University.

1.7 The Text Encoding Initiative

The most widely used SGML/XML implementation for more sophisticated text encoding is that devised by the Text Encoding Initiative (TEI), an international and interdisciplinary standard for the preparation and interchange of electronic texts. The TEI began with a planning conference which took place at Vassar College in New York on 12-13 November 1987. The participants agreed on both the desirability and feasibility of creating a common encoding scheme for use both in creating new documents and in exchanging existing documents among text and data archives, and the TEI began the task of developing a draft set of Guidelines for Electronic Text Encoding and Interchange, with working committees comprising scholars from all over North America and Europe drafting recommendations on various aspects of the problem. These were integrated into a first public draft, TEI P1 (P for 'Proposal'), published in June 1990. A second draft (TEI P2) followed in 1992 and 1993, and the first official version of the guidelines (TEI P3) was published in May 1994. A fully XML-compliant version, P4, is available in electronic form at the TEI Guidelines web site; a print edition is also available from the University of Virginia Press.

The TEI began as a research effort cooperatively organised by three scholarly societies (the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing), and funded solely by research grants from the US National Endowment for the Humanities, the European Union, the Canadian Social Science Research Council, the Mellon Foundation and others. In December 2000, after a year's negotiation, a new non-profit corporation called the TEI Consortium was set up to maintain and develop the TEI standard. Four universities serve as hosts for this consortium: Brown University (Scholarly Technology Group) and the University of Virginia (Electronic Text Center and Institute for Advanced Technology in the Humanities) in the United States, Oxford University (Humanities Computing Unit) and the University of Bergen (Humanities Information Technologies Research Programme) in Europe. Executive offices for the consortium are in Bergen. The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council.

There are hundreds if not thousands of projects currently using the TEI encoding scheme; Menota is one of them.

1.8 The TEI DTD

One of the great strengths of the TEI DTD is that it actually consists of a number of different tag sets which can be used in a variety of combinations, according to the needs of the encoder and nature of the material being encoded. First there are the core and header tag sets, which must always be included. The core tag set defines elements which may be said to be universal, i.e. not specific to particular types of texts. The header tag set defines elements which allow for the provision of documentary and bibliographic information about the electronic text itself. In addition to these there are eight possible base tag sets. Six of these are intended to be used with texts of a specific type (i.e. prose, verse, drama, transcriptions of spoken material, dictionaries and terminological data), while the other two ('general' and 'mixed') are for use on more heterogeneous materials and contain elements from the other sets. Finally there are a number of additional tag sets which are designed for use with particular types of processing or research, for example for the transcription of primary sources and for textual criticism. In this way the encoder can tailor the DTD to his or her individual needs, selecting from the very large number of elements available those which are most relevant to the material to be encoded. Both base and additional tag sets are specified in the Document Type Declaration at the start of the document, as in the following example:

<!DOCTYPE tei.2 SYSTEM "tei2.dtd" [
    <!ENTITY % TEI.prose "INCLUDE">
    <!ENTITY % TEI.transcr "INCLUDE">
    <!ENTITY % TEI.textcrit "INCLUDE">
]>

All TEI conformant documents must contain two elements, a header, tagged <teiHeader>, in which, as was mentioned, meta-data, information about the electronic document, is provided, and the text itself, tagged <text>. What elements go into the <text> is to a great extent determined by which base and additional tag sets have been chosen.

The TEI tag set for verse has, not surprisingly, elements corresponding to line, stanza and so on in the DTD presented above. The Herrick poem, tagged in TEI-conformant XML, would look like this:

<?xml version="1.0"?>
<!DOCTYPE text SYSTEM "-/-/tei2.dtd" [
<!ENTITY % TEI.verse "INCLUDE">
]>
<text>
<body>
<lg type="lyric" met="iambic tetrameter">
<head>Upon <name type="person">Julia's</name> Clothes</head>
<lg n="1" type="triplet">
<l>When as in silks my <name type="person">Julia</name> goes,</l>
<l>Then, then (me thinks) how sweetly flowes</l>
<l>That liquefaction of her clothes.</l></lg>
<lg n="2" type="triplet">
<l>Next, when I cast mine eyes, and see</l>
<l>That brave Vibration each way free;</l>
<l>O how that glittering taketh me!</l></lg>
</lg>
</body>
</text>

The element <l> is used for a line of verse (as opposed to line breaks in written or printed texts, which are dealt with in another way; see ch. 4), while <lg>, for 'line group', is used for a group of lines functioning as a formal unit - here both the poem as a whole and the individual stanzas - with a type attribute to indicate what sort of unit. The advantage of <lg> over our <stanza> is that, being more abstract, it is also more flexible; <lg> elements can also nest, i.e. appear within each other, which allows quite sophisticated markup of complex verse forms.
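As a rough illustration of such nesting, a sonnet might be divided into an octave (itself made up of two quatrains) and a sestet. The values given for the type attribute here are merely illustrative, and the lines themselves are left empty:

```xml
<lg type="sonnet">
  <lg type="octave">
    <lg type="quatrain">
      <l>...</l>
      <l>...</l>
      <l>...</l>
      <l>...</l>
    </lg>
    <lg type="quatrain">
      <!-- four further lines, each tagged l -->
    </lg>
  </lg>
  <lg type="sestet">
    <!-- six further lines, each tagged l -->
  </lg>
</lg>
```

Each level of grouping can then be addressed separately, whether for display or for analysis.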

In addition to these structural elements the TEI also makes available a host of elements for indicating features of typography and layout; although these were originally intended for use in the description of printed materials, most if not all are equally applicable to manuscripts. There are also tags which can be used for normalisation, grammatical information and so on. The other chapters in this handbook explain in detail how they can be used.

1.9 Displaying the text

We have mentioned several times the possibility of displaying XML documents in standard web browsers. In order to do so, one final piece is necessary: a stylesheet. As has been said, XML elements describe, ideally at least, the semantic structure of the text, rather than its appearance (although there is obviously a degree of overlap). Web browsers have built-in stylesheets for displaying HTML, and know that in an HTML document anything tagged <i> is to be displayed in italic, because HTML markup is essentially presentational: <i> means 'display in italic'. XML markup is semantic (and the elements user-defined), and in order for a browser to display an XML document, it needs to know what formatting to apply to what elements. It needs to be told, for example, that things within <title> tags should be displayed in italic. A stylesheet does precisely that.

There are essentially two options, Cascading Style Sheets (CSS) and Extensible Stylesheet Language Transformations (XSLT). CSS is a simple, non-XML syntax used to describe the appearance of any element in a document. XSLT, on the other hand, is itself an XML application which specifies rules by which the XML document is transformed into another document, either another XML document or something else; its most obvious use is to take XML and turn it into something more browser-friendly, i.e. HTML (or XHTML). The original document retains its complexity, but for viewing purposes it is changed into something even older browsers can deal with. This transformation can be done at the browser end, by the web server when the XML document is called up by the user, or by the creator of the document, who may not wish to make it available in its original state.
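A transformation of this kind might be sketched as follows. The stylesheet assumes the element names of the simple poem markup used in this chapter (poem, title, author, stanza, line and person); the HTML elements chosen for the output are only one possibility among many.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- The root element becomes the skeleton of an HTML page -->
  <xsl:template match="/poem">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <!-- Title and author become headings -->
  <xsl:template match="title">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>

  <xsl:template match="author">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

  <!-- Each stanza becomes a paragraph, each line ends with a break -->
  <xsl:template match="stanza">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="line">
    <xsl:apply-templates/><br/>
  </xsl:template>

  <!-- Personal names are shown in italic -->
  <xsl:template match="person">
    <i><xsl:apply-templates/></i>
  </xsl:template>

</xsl:stylesheet>
```

Each template rule says what is to be output when an element of the given name is encountered; xsl:apply-templates then processes whatever that element contains.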

The stylesheet to be associated with the document is indicated by an xml-stylesheet processing instruction (or stylesheet link), which comes after the XML declaration, either before or after the Document Type Declaration, if there is one, but before the root element.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE poem SYSTEM "poem.dtd">
<?xml-stylesheet href="poem.css" type="text/css"?>

The value of href is the URL (absolute or relative) where the stylesheet can be found, while the value of type will be either "text/css" for cascading stylesheets or "text/xml" for XSL transformations. There are other (pseudo-)attributes, such as media, but they need not concern us here. The stylesheet referred to here is a CSS stylesheet, which indicates how each of the elements is to be displayed:

poem {
   display: block;
   font-family: "Book Antiqua";
   margin: 25pt 15pt 15pt 45pt;
   font-size: 13pt;
   line-height: 15pt;
}
title {
   display: block;
   font-size: 18pt;
   padding: 5pt;
}
author {
   display: block;
   font-size: 14pt;
   padding: 5pt;
}
stanza {
   display: block;
   padding: 5pt;
}
line { display: block; }
person { font-style: italic; }

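Displayed by an XML-aware browser, such as Internet Explorer (version 5 and higher), Netscape (version 6 and higher) or Opera, the poem appears with this formatting applied: title, author, stanzas and lines each as a separate block, and personal names in italic.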

XSLT is far more powerful than CSS. With CSS one can determine exactly how the content of an element is to be displayed, in terms of font, colour etc., or whether it is to be displayed at all (one might not, for example, wish to display some of the administrative information contained in the TEI header). CSS (CSS 2, which is not yet supported by all browsers) will also allow you to insert text before and/or after an element (using the before and after pseudo-element selectors). But that's about it. With XSLT, on the other hand, one can, for example, re-arrange the order of the elements or display the value of an element's attribute instead of its actual content (very useful, for example, if one wishes to produce normalised and unnormalised texts from a single marked-up file). An XML document can also be transformed into a non-XML format, for example, a PDF, RTF or PostScript file. And the same XML file can be transformed again and again into dozens of different formats, without any effect on the content itself.
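Displaying an attribute value in place of an element's content can be sketched roughly as follows. The element name word and the attribute name norm are invented for the purpose of illustration and are not part of the TEI scheme; the idea is simply that the text carries the original spelling as content and the normalised spelling as an attribute value, e.g. <word norm="flows">flowes</word>.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Normalised view: output the value of the (hypothetical) norm
       attribute in place of the element's own content -->
  <xsl:template match="word[@norm]">
    <xsl:value-of select="@norm"/>
  </xsl:template>

</xsl:stylesheet>
```

Words without a norm attribute, and all other text, fall through to XSLT's built-in rules and are output unchanged. Leaving this template out of an otherwise identical stylesheet would give the unnormalised text, so that both versions can be produced from the one marked-up file.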


Preliminary version created 4 March 2002. Version 1.0 published 20 May 2003.