Within a particular electronic version of the Bible, we have observed that data formats are fairly consistent. Once low-level character set issues are dealt with -- some pertaining to non-Latin character sets, and some involving the transition from a PC to a Unix platform -- the input formats seem to group according to a reasonably small set of dimensions. These include:
An on-line Swahili version of the New Testament, for example, illustrates embedded formatting, with separate marking for chapters and verses (Matthew 2:1-2):
\c 2 \s Wageni kutoka mashariki \p \v 1 Yesu alizaliwa mjini Bethlehemu, mkoani Yudea, wakati Herode alipokuwa mfalme. Punde tu baada ya kuzaliwa kwake, wataalamu wa nyota kutoka mashariki walifika Yerusalemu, \v 2 wakauliza, <<Yuko wapi mtoto, Mfalme wa Wayahudi, aliyezaliwa? Tumeiona nyota yake ilipotokea mashariki, tukaja kumwabudu.>> \p
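As an illustration of how such embedded markers might be processed, the following is a minimal sketch in Python. It assumes, based only on the sample above, that `\c` introduces a chapter number, `\v` a verse number followed by the verse text, `\s` a section heading, and `\p` a paragraph break. The function name `parse_embedded` is hypothetical and is not part of any tool described here.

```python
import re

# Marker meanings are assumptions read off the sample:
# \c chapter, \v verse, \s section heading, \p paragraph break.
MARKER = re.compile(r"\\(c|v|s|p)\b")


def parse_embedded(text):
    """Return a dict mapping (chapter, verse) to the verse text."""
    verses = {}
    chapter = None
    verse = None
    # re.split with one capture group yields:
    # [prefix, tag1, content1, tag2, content2, ...]
    pieces = MARKER.split(text)
    for tag, content in zip(pieces[1::2], pieces[2::2]):
        content = content.strip()
        if tag == "c":
            # Assumes the chapter marker is always followed by a number.
            chapter = int(content.split()[0])
            verse = None
        elif tag == "v":
            number, _, body = content.partition(" ")
            verse = int(number)
            verses[(chapter, verse)] = body.strip()
        elif tag == "p" and content and verse is not None:
            # Text after a paragraph break still belongs to the open verse.
            verses[(chapter, verse)] += " " + content
        # Section headings (\s) are not verse text and are skipped.
    return verses


sample = r"\c 2 \s Wageni kutoka mashariki \p \v 1 Yesu alizaliwa mjini Bethlehemu, \v 2 wakauliza, <<Yuko wapi mtoto?>> \p"
verses = parse_embedded(sample)
```

Keying the result by (chapter, verse) rather than by position keeps the output independent of where section headings and paragraph breaks happen to fall in a given version.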
A French version illustrates plain text with one verse per line, with the name of the book repeated in each chapter heading (Matthew 2:1-2):
Matthieu 2
1. J\'esus \'etant n\'e \`a Bethl\'ehem en Jud\'ee, au temps du \
roi H\'erode, voici des mages d'Orient arriv\`erent \`a \
J\'erusalem,
2 et dirent: O\'u est le roi des Juifs qui vient de na\^itre? car \
nous avons vu son \'etoile en Orient, et nous sommes venus pour \
l'adorer.
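A plain-text format of this kind can be read with a few regular expressions. The sketch below is illustrative only. It assumes that each verse begins on its own line with a verse number (optionally followed by a period), that a trailing backslash marks a line continuation, and that a chapter heading consists of the book name followed by the chapter number. The names `join_continuations` and `parse_plain` are hypothetical.

```python
import re

# Assumed heading shape: book name, whitespace, chapter number ("Matthieu 2").
CHAPTER_HEADING = re.compile(r"^(?P<book>\D+?)\s+(?P<chapter>\d+)$")
# Verse number with an optional period, then the verse text.
VERSE_LINE = re.compile(r"^(?P<verse>\d+)\.?\s+(?P<text>.*)$")


def join_continuations(lines):
    """Merge physical lines that end in a backslash into one logical line."""
    logical, buffer = [], ""
    for line in lines:
        line = line.rstrip()
        if line.endswith("\\"):
            buffer += line[:-1].rstrip() + " "
        else:
            logical.append(buffer + line)
            buffer = ""
    if buffer:
        logical.append(buffer.rstrip())
    return logical


def parse_plain(lines):
    """Return a dict mapping (book, chapter, verse) to the verse text."""
    verses = {}
    book = chapter = None
    for line in join_continuations(lines):
        line = line.strip()
        if not line:
            continue
        verse_match = VERSE_LINE.match(line)
        if verse_match:
            # Limitation of this sketch: a book name that starts with a
            # digit would be misread as a verse line.
            key = (book, chapter, int(verse_match["verse"]))
            verses[key] = verse_match["text"]
            continue
        heading = CHAPTER_HEADING.match(line)
        if heading:
            book = heading["book"].strip()
            chapter = int(heading["chapter"])
    return verses
```

Because the period after the verse number is optional in `VERSE_LINE`, both `1.` and `2` in the sample are accepted without any special casing.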
The simple, uniform structure of the source text appears to greatly reduce the variation in document encoding among the on-line source documents. Minor variation within a version does occur (for example, verse numbers are sometimes followed by a period and sometimes not), but such cases are easily handled. By organizing the annotated versions book by book, we eliminate potential reordering problems -- for example, the book of Hebrews is the 58th book in the English Bible and the 63rd in the German Bible, although the relative order of every other book is identical.
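To make the benefit of book-by-book organization concrete, the following sketch pairs verses from two versions by book, chapter, and verse rather than by position in the file. It is an assumption-laden illustration, not our implementation: the book identifiers (`"HEB"`, `"JAS"`), the nested-dictionary layout, and the function name `align_by_verse` are all hypothetical.

```python
def align_by_verse(version_a, version_b):
    """Pair verses from two versions by (book id, chapter, verse).

    Each version is a dict mapping a language-independent book id to a
    dict of {(chapter, verse): text}. The order in which books appear in
    either version plays no role in the result.
    """
    aligned = []
    for book_id in sorted(version_a.keys() & version_b.keys()):
        verses_a = version_a[book_id]
        verses_b = version_b[book_id]
        for ref in sorted(verses_a.keys() & verses_b.keys()):
            aligned.append((book_id, ref, verses_a[ref], verses_b[ref]))
    return aligned


# Placeholder strings stand in for real verse text.
english = {"HEB": {(1, 1): "en HEB 1:1"}, "JAS": {(1, 1): "en JAS 1:1"}}
german = {"JAS": {(1, 1): "de JAS 1:1"}, "HEB": {(1, 1): "de HEB 1:1"}}
pairs = align_by_verse(english, german)
```

Since lookup is by book identifier, a book that occupies a different position in two versions is still matched correctly.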