Typographic Regularization in the WWP Textbase
| STG lead(s): | Jacque Russom |
In materials printed before 1630, the letters v, u, j, and i did not have the values that they have today: u and v could each stand for either the vowel or the consonant, as could i and j. The word that we spell ivory was written iuory, the word that we spell jury was written iury, and so on. This difference in typographical conventions makes early texts more difficult to read and compromises the matching of forms in information retrieval tasks.
The Women Writers Project uses SGML tagging to encode a regularized spelling for such typographical variants, so that a text can be displayed and searched in either its original form or its regularized form. Encoding this information by hand was time-consuming and inefficient, since a great many high-frequency words require such tagging (e.g., haue, loue, iudge).
STG undertook to develop a system that automatically tags words subject to this typographic convention with their regularized forms. To a great extent, pattern-matching rules based on the linguistic principles governing English consonants and vowels can identify and tag such words appropriately. These rules are supplemented by programs that recognize SGML markup indicating such things as word division across a line break, errors or abbreviations within a word, or structural elements to be excluded from regularization.
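As a rough illustration of the kind of pattern-matching rule described above, the following Python sketch regularizes a few common cases and wraps each changed word in markup. It is not the STG system: the three rules, the function names regularize and tag_text, and the <orig reg="..."> element (modelled loosely on TEI-style markup) are all assumptions made for this example. The rules are heuristics and will mis-tag some words (for instance, a modern word beginning with i followed by a vowel), and the sketch ignores capitalized words, words divided across a line break, and elements that should be excluded from regularization.

```python
import re

VOWEL = "[aeiou]"
CONSONANT = "[b-df-hj-np-tv-z]"


def regularize(word):
    """Return a modern-style spelling for one lowercase word (heuristic only)."""
    # Medial u between two vowels is read as the consonant v: haue -> have.
    word = re.sub(rf"(?<={VOWEL})u(?={VOWEL})", "v", word)
    # Initial v before a consonant is read as the vowel u: vnto -> unto.
    word = re.sub(rf"^v(?={CONSONANT})", "u", word)
    # Initial i before a vowel is read as the consonant j: iudge -> judge.
    # This runs after the u/v rule so that iuory becomes ivory, not juory.
    word = re.sub(rf"^i(?={VOWEL})", "j", word)
    return word


def tag_text(text):
    """Wrap regularizable words in a hypothetical <orig reg="..."> element."""

    def tag_word(match):
        word = match.group(0)
        # Only all-lowercase words are handled in this sketch.
        reg = regularize(word) if word.islower() else word
        return word if reg == word else f'<orig reg="{reg}">{word}</orig>'

    # Split on markup so that tag names and attribute values are never altered.
    parts = re.split(r"(<[^>]*>)", text)
    return "".join(
        part if part.startswith("<") else re.sub(r"[A-Za-z]+", tag_word, part)
        for part in parts
    )


if __name__ == "__main__":
    print(tag_text("<p>they haue a iury</p>"))
```

Applying the u/v rule before the i/j rule is a deliberate choice in this sketch: once iuory has become ivory, the initial i is followed by a consonant and is correctly left alone.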
The Brown University Women Writers Project's main undertaking is an SGML-encoded full-text database of pre-Victorian women's writing in English. This collection currently includes nearly 200 texts representing a broad cross-section of the literate culture of pre-Victorian Britain. The WWP supports teaching and research in a wide variety of disciplines such as English, history, women's studies, comparative literature, and religious studies.
| Principal Investigator(s) or Parent Project Lead(s): | Allen Renear, Scholarly Technology Group, Brown University |
| Research domains: | humanities computing |
Record last modified: 13-Sep-2005