For our target annotation format, we are aiming for verse-level annotation of structure conforming to the Corpus Encoding Standard (CES) subset of the TEI [Ide1996]. In many respects, creating a parallel corpus from multiple versions of the Bible is an excellent match for the CES. Since the corpus is being created primarily for use in corpus-based computational linguistics research, the restrictions imposed by the CES, in comparison to the full generality of the TEI, are suited to the task (CES Sec. 0.2.3). Moreover, the CES contains useful and explicit guidelines for the encoding, and the consistent structure and content of Bible text should make it straightforward to achieve not only Level 1 but also Level 2 conformance to those guidelines, using fully automatic conversion of the original files. (Level 2 conformance goes beyond the minimum by requiring both correct paragraph-level markup and consistent marking of some sub-paragraph elements; Level 3 conformance is not a goal, since reliable identification of all the specified sub-paragraph elements, particularly names, would require significant manual effort.)
Finally, the cesAlign encoding conventions for parallel corpora are
ideally suited for the present task, since they permit an arbitrary
degree of parallelism, and because the recording of alignment
information in an external document makes it trivial to work with a
monolingual subset or any n-way parallel
subset.
The cesAlign conventions specify the form of a separate alignment
document linking existing documents; for a pair of Bible versions,
the alignment document can be created simply by encoding one-to-one
links between book/chapter/verse labels, as illustrated here:
<link xtargets="GEN:1:1 ; GEN:1:1">
Alignment at the sub-verse level would be considerably trickier, of course, since different translations reflect different decisions about how verses are broken into clauses, etc. We leave this as a potential problem for future work.
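The one-to-one linking of verse labels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name verse_links and the sample label lists are hypothetical, and the output simply mimics the form of the link example above rather than a complete, validated cesAlign document.

```python
from xml.sax.saxutils import quoteattr


def verse_links(ids_a, ids_b):
    """Return one <link> string per verse label present in both versions.

    Hypothetical helper: it pairs identical book/chapter/verse labels and
    copies the form of the link example in the text.
    """
    shared = set(ids_b)
    # Keep the order of the first version; skip verses missing from the second.
    return [
        "<link xtargets=" + quoteattr(f"{vid} ; {vid}") + ">"
        for vid in ids_a
        if vid in shared
    ]


# Made-up label lists standing in for two Bible versions.
version_a = ["GEN:1:1", "GEN:1:2", "GEN:1:3"]
version_b = ["GEN:1:1", "GEN:1:2"]

links = verse_links(version_a, version_b)
print("\n".join(links))
```

Because the links live in a separate list rather than inside either text, dropping one version or adding a third leaves the primary documents untouched, which is the property the external alignment document is meant to provide.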
Although the CES is well suited to our task in many ways, we see two ways in which the current CES draft may be problematic for our purposes. The first is merely a question of its scope, and may be remedied as the standard develops: one goal of our corpus-based research using the Bible is to investigate word sense and semantic issues, and these are explicitly outside the purview of the current CES draft (CES Sec. 0.2.4).
Second, and more important, we observe that the verse structure of the Bible does not respect the linguistic subdivisions chosen in the CES, at least with regard to the encoding standard for primary data (cesDoc). Built into the standard are the basic elements of paragraph, sentence, and token; however, verses can contain material above the sentence level, as in (1), as well as sub-sentential units, as in the two verses in (2).
(1) <v id="GEN:1:31">And God saw every thing that he had made, and,
behold, it was very good. And the evening and the morning were the
sixth day. </v>
(2) <v id="GEN:10:13">And Mizraim begat Ludim, and Anamim, and Lehabim,
and Naphtuhim, </v>
<v id="GEN:10:14">And Pathrusim, and Casluhim,
(out of whom came Philistim,) and Caphtorim. </v>
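The mismatch shown in (1) and (2) can be made concrete with a rough punctuation heuristic. The sketch below is an illustration only, not a sentence segmenter: the function name classify and the three labels are hypothetical, and it assumes that sentences end in a period, question mark, or exclamation point.

```python
TERMINALS = (".", "?", "!")


def classify(verses):
    """Label each verse by how it relates to sentence boundaries.

    verses is a list of (verse_id, text) pairs in reading order.
    Punctuation-only heuristic; the labels are illustrative.
    """
    labels = {}
    prev_closed = True  # assume the text starts at a sentence boundary
    for vid, text in verses:
        stripped = text.strip()
        closes = stripped.endswith(TERMINALS)
        # Sentence-final marks before the last character of the verse.
        internal = sum(stripped[:-1].count(t) for t in TERMINALS)
        if not prev_closed or not closes:
            # Verse starts or stops in mid-sentence.
            labels[vid] = "sub-sentential"
        elif internal > 0:
            labels[vid] = "multi-sentence"
        else:
            labels[vid] = "single sentence"
        prev_closed = closes
    return labels


sample = [
    ("GEN:1:31", "And God saw every thing that he had made, and, behold, "
                 "it was very good. And the evening and the morning were "
                 "the sixth day."),
    ("GEN:10:13", "And Mizraim begat Ludim, and Anamim, and Lehabim, "
                  "and Naphtuhim,"),
    ("GEN:10:14", "And Pathrusim, and Casluhim, (out of whom came "
                  "Philistim,) and Caphtorim."),
]
labels = classify(sample)
```

On these three verses the heuristic marks GEN:1:31 as multi-sentence and both GEN:10:13 and GEN:10:14 as sub-sentential, matching the description of (1) and (2).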
One alternative would be to use the cesDoc DTD, with the div element
identifying chapters and verses treated as paragraph-level
elements -- this would produce a ``Level 1 CES-conformant''
encoding. However, identifying Bible verses with either paragraph-
or sentence-level elements would sacrifice standardization at the
semantic level (CES Sec. 1.3.3), since ``sentence'' or ``paragraph''
for this corpus would mean something different from the conventional
meaning of those terms for other corpora.
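To make the first alternative concrete, the following Python sketch groups verses into one div per chapter and emits each verse as a paragraph-level element. It is a sketch under assumptions rather than a validated cesDoc encoding: the element name p, the type and id attributes, and the helper names chapter_of and to_divs are chosen for illustration and have not been checked against the cesDoc DTD.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def chapter_of(verse_id):
    """Map a label such as 'GEN:10:13' to its chapter label 'GEN:10'."""
    book, chapter, _verse = verse_id.split(":")
    return f"{book}:{chapter}"


def to_divs(verses):
    """Build a body element with one div per chapter and one p per verse.

    verses is a list of (verse_id, text) pairs in canonical order.
    Element and attribute names are illustrative assumptions.
    """
    body = ET.Element("body")
    for chap, group in groupby(verses, key=lambda v: chapter_of(v[0])):
        div = ET.SubElement(body, "div", {"type": "chapter", "id": chap})
        for vid, text in group:
            p = ET.SubElement(div, "p", {"id": vid})
            p.text = text
    return body


verses = [
    ("GEN:1:31", "And God saw every thing that he had made, and, behold, "
                 "it was very good. And the evening and the morning were "
                 "the sixth day."),
    ("GEN:10:13", "And Mizraim begat Ludim, and Anamim, and Lehabim, "
                  "and Naphtuhim,"),
    ("GEN:10:14", "And Pathrusim, and Casluhim, (out of whom came "
                  "Philistim,) and Caphtorim."),
]
body = to_divs(verses)
print(ET.tostring(body, encoding="unicode"))
```

The output makes the semantic cost visible: the p element for GEN:10:13 holds only part of a sentence, while the one for GEN:1:31 holds two sentences.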
Another alternative would be to utilize the notion of a chunk in the cesAna DTD for encoding linguistic annotation, annotating each verse as a chunk comprising the series of tokens within that verse. This would preserve adherence to the standard at the semantic level, but would sacrifice the notion of verse as a meaningful structural element at the level of the primary encoding -- instead shifting the burden to the linguistic level of encoding (following the cesAna DTD). This seems fairly unnatural.
We observe that this problem is potentially more general than the specific application of the CES to annotating Bible text. For example, we expect that encoding speech data will present similar problems. The basic structural element in conversational speech is the turn (see, e.g., the Child Language Data Exchange System database [MacWhinney1991], which contains transcripts of conversations), and, like verses, turns may comprise material both below and above the level of the sentence.