The Bible is a widely available, carefully translated text that spans a variety of styles and exists in a wide range of languages. These properties make it uniquely suited to our research purposes, which include the construction of translation lexicons and the evaluation of semantic tagging for multilingual machine translation and other natural language processing applications. The text is a single cohesive document comprising 66 books by 30-40 authors, and it provides a representative sample of the language styles in the source texts, including narrative, poetry, and correspondence. The New Testament corpus alone ``compares favourably in size to other major collections analysed by scholars ... approximately as large as if not larger than the corpus of Homer's Iliad, of Homer's Odyssey, of Sophocles, of Aeschylus, of Herodotus ... [with] individual books ... comparable in size to other well-known classical texts: e.g. Plato's Apology approximates the size of Paul's Romans or 1 Corinthians'' [Porter1989].
As a resource for research using corpus-based statistical methods in computational linguistics, the Bible is small by current standards (see, e.g., [Church and Mercer1993]); with some variation across languages and translations, it typically runs to about 800,000 words and 4-5 megabytes. However, this is comparable in size to some monolingual corpora widely used for corpus-based research, such as the Brown Corpus of American English [Kucera and Francis1967], and its breadth across multiple languages offers an opportunity for research not generally available with the larger corpora in use today.