Text Encoding Initiative
Syd Bauman
Women Writers Project
Brown University
syd_bauman@brown.edu
Terry Catapano
Library Studies
Rutgers University
thc@eden.rutgers.edu
The Text Encoding Initiative Guidelines "do not address the encoding of physical description of textual witnesses: the materials of the carrier, the medium of the inscribing implement, . . . the organisation of the carrier materials themselves (as quiring, collation, etc.), authorial instructions or scribal markup, etc." (P3, section 18.4). It might therefore be assumed that one cannot use TEI P3 for such purposes. The point, however, is not that such features cannot be encoded using TEI, only that specific guidelines are currently unavailable. We discuss why one might wish to encode this information, demonstrate two TEI-conformant methods for the encoding of the physical structure of a codex, and discuss possible advantages and disadvantages of each.
The arrangement of the text by bound page sequence allows a user to effectively reconstruct the experience of reading the text as it appears in a particular document. However, if one is interested in reconstructing the process by which the book was printed, the page order encoding offers little help. The re-arrangement of the pages of text as imposed for printing may make apparent places where the text was affected by typographical exigencies. Such a re-arrangement is also useful in electronic bibliographic analysis.
The relationships between and among gatherings, sheets, formes, leaves, and pages can be somewhat confusing if not considered carefully, as each page has a number of relationships to encode. In a folio in fours, the book is made up of sheets folded once, then sewn into two-sheet gatherings. Each sheet has two adjacent pages of printed text on each side. See figures 1 and 2.
A gathering of a folio in fours has the following order and relationships:
GATHERING A
  LEAF A1 [on sheet A1, conjugate to LEAF A4]
    PAGE A1r [adjacent to PAGE A4v on outer forme of sheet A1]
    PAGE A1v [adjacent to PAGE A4r on inner forme of sheet A1]
  LEAF A2 [on sheet A2, conjugate to LEAF A3]
    PAGE A2r [adjacent to PAGE A3v on outer forme of sheet A2]
    PAGE A2v [adjacent to PAGE A3r on inner forme of sheet A2]
  LEAF A3 [on sheet A2, conjugate to LEAF A2]
    PAGE A3r [adjacent to PAGE A2v on inner forme of sheet A2]
    PAGE A3v [adjacent to PAGE A2r on outer forme of sheet A2]
  LEAF A4 [on sheet A1, conjugate to LEAF A1]
    PAGE A4r [adjacent to PAGE A1v on inner forme of sheet A1]
    PAGE A4v [adjacent to PAGE A1r on outer forme of sheet A1]
The same pages ordered by sheet/forme would have the following order:
GATHERING A
  SHEET A1
    OUTER FORME
      PAGE A1r
      PAGE A4v
    INNER FORME
      PAGE A1v
      PAGE A4r
  SHEET A2
    OUTER FORME
      PAGE A2r
      PAGE A3v
    INNER FORME
      PAGE A2v
      PAGE A3r
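The two orderings above can be derived mechanically from the leaf count. The following is a minimal Python sketch, not part of the encoding itself; the function name `folio_gathering` and its dictionary layout are our own illustrative assumptions, and it covers only folio gatherings, in which sheet s carries leaf s and its conjugate.

```python
def folio_gathering(signature, leaves=4):
    """Return (bound_order, sheet_order) for one folio gathering.

    Hypothetical helper for illustration only. Assumes a folio format:
    each sheet is folded once, so sheet s carries leaf s and its
    conjugate leaf (leaves + 1 - s).
    """
    if leaves < 2 or leaves % 2:
        raise ValueError("a folio gathering needs an even number of leaves")

    # Bound order: each leaf in sequence, recto before verso.
    bound = [f"{signature}{n}{side}"
             for n in range(1, leaves + 1)
             for side in "rv"]

    # Sheet/forme order: outermost sheet first.
    sheets = []
    for s in range(1, leaves // 2 + 1):
        conj = leaves + 1 - s  # number of the conjugate leaf
        sheets.append({
            "sheet": f"{signature}{s}",
            "outer": [f"{signature}{s}r", f"{signature}{conj}v"],
            "inner": [f"{signature}{s}v", f"{signature}{conj}r"],
        })
    return bound, sheets


if __name__ == "__main__":
    bound, sheets = folio_gathering("A")
    print(bound)
    for sheet in sheets:
        print(sheet["sheet"], "outer:", sheet["outer"], "inner:", sheet["inner"])
```

For a folio in fours this reproduces the listing above: sheet A1 carries A1r and A4v on its outer forme, and A1v and A4r on its inner forme.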
For most audiences, the logical division of a work into acts, chapters, poems, etc., is the most important cognitive structure (although division into pages is usually the most important navigational tool). Thus, TEI texts usually allocate the basic <div> structure to these logical divisions using <div> (or <div0>-<div7>) elements. Given the importance of the physical structure to certain audiences (analytic bibliographers jump to mind), it makes sense to use it as the source for the <div> hierarchy of a TEI documentary transcription intended for use by these audiences. For the purposes of our examples, we will assume that the physical structure is the only structure being encoded by the <div> structure of the TEI file.
Because there is no one correct arrangement for pages as printed, but there is one correct order for pages as bound, it makes sense to retain the bound order of pages in a TEI encoding. The <div> hierarchy is thus used to nest pages as parts of leaves. The forme and sheet <div>s, however, do not directly nest their constituent parts. They rather rely on next and prev attributes to indicate their components.
Reading a page of SGML in which every tag is a div tag with a type attribute whose value is structurally important can be quite tiring (for humans). In this example, we have used <gathering>, <sheet>, <leaf>, <formeOuter>, <formeInner>, and <page> as "syntactic sugar" for <div>s of the corresponding types in order to make it more readable. All of the elements used (except for the <seg> elements, which are merely being used as placeholders to indicate where page contents go) are really stand-ins for <div>. The TEI file for one gathering of a folio in fours would have the following basic structure:
<gathering id="G6A">
  <sheet part="Y" id="S6A1" next="S6A4">
    <leaf id="L6A1">
      <formeOuter part="F" id="F6A1R" next="F6A4V">
        <page id="P6A1R"><seg>page 1 data</seg></page>
      </formeOuter>
      <formeInner part="I" id="F6A1V" next="F6A4R">
        <page id="P6A1V"><seg>page 2 data</seg></page>
      </formeInner>
    </leaf>
  </sheet>
  <sheet part="Y" id="S6A2" next="S6A3">
    <leaf id="L6A2">
      <formeOuter part="F" id="F6A2R" next="F6A3V">
        <page id="P6A2R"><seg>page 3 data</seg></page>
      </formeOuter>
      <formeInner part="I" id="F6A2V" next="F6A3R">
        <page id="P6A2V"><seg>page 4 data</seg></page>
      </formeInner>
    </leaf>
  </sheet>
  <sheet part="Y" id="S6A3" prev="S6A2">
    <leaf id="L6A3">
      <formeInner part="F" id="F6A3R" prev="F6A2V">
        <page id="P6A3R"><seg>page 5 data</seg></page>
      </formeInner>
      <formeOuter part="I" id="F6A3V" prev="F6A2R">
        <page id="P6A3V"><seg>page 6 data</seg></page>
      </formeOuter>
    </leaf>
  </sheet>
  <sheet part="Y" id="S6A4" prev="S6A1">
    <leaf id="L6A4">
      <formeInner part="F" id="F6A4R" prev="F6A1V">
        <page id="P6A4R"><seg>page 7 data</seg></page>
      </formeInner>
      <formeOuter part="I" id="F6A4V" prev="F6A1R">
        <page id="P6A4V"><seg>page 8 data</seg></page>
      </formeOuter>
    </leaf>
  </sheet>
</gathering>
In order to extract leaves or pages, a processor merely has to select the correct <div> (or syntactic variant). However, in order to extract a sheet or forme, a processor must gather the appropriate partial elements into a single aggregate element by chaining them through the id/idref mechanism made available via the next and prev attributes.
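The chaining step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an existing TEI tool: the sample is a reduced, well-formed XML rendering of the structure above (sheet wrappers and part attributes omitted, leaves 1 and 4 only) so that the standard library parser can read it, and the helper names `aggregate_chains` and `pages_of` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced sample (our assumption): leaves 1 and 4 only, written as XML.
SAMPLE = """
<gathering id="G6A">
  <leaf id="L6A1">
    <formeOuter id="F6A1R" next="F6A4V"><page id="P6A1R"/></formeOuter>
    <formeInner id="F6A1V" next="F6A4R"><page id="P6A1V"/></formeInner>
  </leaf>
  <leaf id="L6A4">
    <formeInner id="F6A4R" prev="F6A1V"><page id="P6A4R"/></formeInner>
    <formeOuter id="F6A4V" prev="F6A1R"><page id="P6A4V"/></formeOuter>
  </leaf>
</gathering>
"""


def aggregate_chains(root):
    """Follow 'next' pointers from each chain head (an element that has
    a 'next' but no 'prev') and return the chains as lists of elements."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    chains = []
    for el in root.iter():
        if el.get("next") and not el.get("prev"):
            chain, seen, cur = [], set(), el
            # The 'seen' set guards against circular pointers.
            while cur is not None and cur.get("id") not in seen:
                seen.add(cur.get("id"))
                chain.append(cur)
                nxt = cur.get("next")
                cur = by_id.get(nxt) if nxt else None
            chains.append(chain)
    return chains


def pages_of(chain):
    """List the ids of the pages contained in a chain of partial elements."""
    return [page.get("id") for part in chain for page in part.iter("page")]


root = ET.fromstring(SAMPLE)
formes = {chain[0].get("id"): pages_of(chain) for chain in aggregate_chains(root)}
print(formes)
```

Each resulting list gives the pages of one forme in the order in which the next pointers are followed, rather than in document order.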
Although, in some sense, encoding the physical structure of the pages of a book as the <div> structure appears to be the most appropriate encoding, it is without doubt cumbersome for humans. Humans have trouble following all that deep nesting, and performing the "hand-pointing" needed to aggregate the various <div>s. Another possibility is to encode only the structure of the pages themselves using <div>, and then create virtual aggregations of the various other <div>s needed using <join>. For example:
<div type="page" id="P6A1R"><seg>page 1 data</seg></div>
<div type="page" id="P6A1V"><seg>page 2 data</seg></div>
<div type="page" id="P6A2R"><seg>page 3 data</seg></div>
<div type="page" id="P6A2V"><seg>page 4 data</seg></div>
<div type="page" id="P6A3R"><seg>page 5 data</seg></div>
<div type="page" id="P6A3V"><seg>page 6 data</seg></div>
<div type="page" id="P6A4R"><seg>page 7 data</seg></div>
<div type="page" id="P6A4V"><seg>page 8 data</seg></div>
<!-- ... -->
<joingrp targtype="div" targorder="y" type="leaf" result="div"
         desc="each JOIN joins two pages into a leaf">
  <join id="L6A1" targets="P6A1R P6A1V">
  <join id="L6A2" targets="P6A2R P6A2V">
  <join id="L6A3" targets="P6A3R P6A3V">
  <join id="L6A4" targets="P6A4R P6A4V">
</joingrp>
<joingrp targtype="div" targorder="y" type="forme" result="div"
         desc="each JOIN joins two pages into a forme">
  <join id="O6A1" type="outer" targets="P6A4V P6A1R">
  <join id="I6A1" type="inner" targets="P6A1V P6A4R">
  <join id="O6A2" type="outer" targets="P6A3V P6A2R">
  <join id="I6A2" type="inner" targets="P6A2V P6A3R">
</joingrp>
If desired, formes or leaves can be <join>ed into sheets, and sheets can be <join>ed into gatherings.
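How a processor might resolve such joins can be sketched as follows. This is a hedged illustration only: the sample rewrites the SGML empty <join> elements as XML empty-element tags and wraps everything in a <body> element so that Python's standard parser accepts it, it is cut down to the pages of sheet 1, and the function name `resolve_joins` is our own invention.

```python
import xml.etree.ElementTree as ET

# Reduced sample (our assumption): pages of sheet 1 only, written as XML.
SAMPLE = """
<body>
  <div type="page" id="P6A1R"><seg>page 1 data</seg></div>
  <div type="page" id="P6A1V"><seg>page 2 data</seg></div>
  <div type="page" id="P6A4R"><seg>page 7 data</seg></div>
  <div type="page" id="P6A4V"><seg>page 8 data</seg></div>
  <joingrp targtype="div" targorder="y" type="forme" result="div">
    <join id="O6A1" type="outer" targets="P6A4V P6A1R"/>
    <join id="I6A1" type="inner" targets="P6A1V P6A4R"/>
  </joingrp>
</body>
"""


def resolve_joins(root, group_type):
    """Map each join id in groups of the given type to the list of
    target elements, kept in the order given by 'targets'."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved = {}
    for group in root.iter("joingrp"):
        if group.get("type") != group_type:
            continue
        for join in group.iter("join"):
            targets = join.get("targets", "").split()
            missing = [t for t in targets if t not in by_id]
            if missing:
                raise KeyError(f"join {join.get('id')} points at unknown ids: {missing}")
            resolved[join.get("id")] = [by_id[t] for t in targets]
    return resolved


root = ET.fromstring(SAMPLE)
formes = resolve_joins(root, "forme")
# Collect the text content of each virtual forme, in join order.
text = {jid: ["".join(div.itertext()) for div in divs]
        for jid, divs in formes.items()}
print(text)
```

Because every aggregation lives in one <joinGrp>, the same function can be pointed at the leaf group, or at sheet and gathering groups if they are added.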
Two confusing points in the Guidelines are worth pointing out. First, the Guidelines state that a <join> element needs to be in "a position where the element indicated by its result attribute would be contextually legal" (P3, page 443). This is potentially problematic, as a <div> is not valid inside a <joinGrp>. However, since "a <joinGrp> may appear only where the elements represented by its contents are legal" (page 445), we may conclude that the individual <join> elements are relieved of their "valid position per result" restriction by virtue of being in a <joinGrp> that is itself so restricted.
Second, it is not clear whether the targType of a <join> whose evaluate attribute has the value one or all should be the GI of the element to which the <join> points or that of the element which is the final result of the pointing process. However, it is clear that if targOrder is N, then the GI of the elements pointed to by targets may be either of the two specified by targType (or presumably any one of the multiple GIs specified on targType). See the top of page 401.
The authors are unable to express a strong preference for one encoding methodology over the other. The <div> method at first glance seems more intuitive and, in the simplest cases, easier to follow. However, as soon as multiple chaining processes are required, the encoding becomes difficult to follow and maintain. The "syntactic sugar" variant may allow the human reader to feel less overwhelmed on initial examination of the text, but more importantly would allow a capture DTD that could make creating the complicated structures a bit easier.
The <join> method, on the other hand, although not much easier to create (and perhaps harder to create than a "syntactic sugar" version), is dramatically easier to proofread: it is easier to examine all of the formes at one time, then proceed to the leaves, etc.
We are reluctant to admit it, but in the end we would be strongly inclined to use whichever method had stronger software support. As far as we know, no matter which of the TEI methods is used for aggregating partial elements, there currently exists no software that will process them in the order indicated by their attributes (rather than sequentially).
Our discussion deals with the relatively simple case of a folio in fours. In smaller book formats, the structure and relationships among the various bibliographical elements become more complicated. In quartos and octavos (4 and 8 pages to the forme), there are more than two pages imposed per forme, some imposed upside down; furthermore, some folds are cut to enable the turning of pages. Moreover, the structure of a book is frequently less consistent than in our example. One common complication is the presence of "cancels". However, we believe the scheme discussed above is extensible enough to serve as the basis for the encoding of smaller formats or cancels.