Text Encoding Initiative
DO DIGITAL LIBRARIES NEED THE TEI? A VIEW FROM THE TRENCHES.
LeeEllen Friedland (National Digital Library Program, Library of Congress)
As we mark the tenth anniversary of the Text Encoding Initiative (TEI) it seems especially
appropriate to consider how digital libraries and electronic text projects have evolved in that time
and whether that evolution has impact on the success of--or future prospects for--the application
of the TEI in the humanities. Some relevant questions include: are there significant differences
between digital library projects and scholarly electronic text projects and, if so, does the TEI
serve the resulting range of needs adequately? Are the core principles of the TEI still appropriate
and relevant? And finally, how easy
is it to actually employ the TEI in a workaday text conversion and encoding program? In this
paper, the issues cited above, and related matters, will be discussed from the perspective of the
National Digital Library Program at the Library of Congress, which implemented a TEI-based
DTD in 1993.
The National Digital Library Program at the Library of Congress is on an ambitious course to
digitize millions of primary source items from a broad range of historical collections. The
Library has been digitizing historical materials since 1990, when the American Memory pilot
program developed collections, initially on CD-ROM, to explore the potential audiences, uses,
and enthusiasm for digital resources on American history and culture. An end-user evaluation
was conducted in 1992-93 in forty-four test sites around the United States including school,
college, university, state, and public libraries. Students, teachers, library staff, and the general
public were surveyed about their experiences using the digitized materials, their interest in
different types of content, and various issues relating to the delivery systems. Since the
completion of the pilot program in 1995, the digitizing activity of LC's National Digital Library
Program has expanded to include an ever broader mix of historical source materials, to provide
on-line access via the World Wide Web, and generally to incorporate the digitizing process into the
larger work and mission of the institution.
The nature of the digital library program at LC has been shaped, from the inception of the
American Memory pilot, by the ideas and priorities of the Librarian of Congress, James H.
Billington. It was he who established the focus on American materials (including both materials
about the United States and American imprints), and this remains the general subject orientation
of the program today. Though this is a broad subject, the Library collects materials in more than
four hundred languages, and "American" materials certainly do not represent the institution's
holdings in a comprehensive way. However, the multiple formats of LC collection materials are
well-represented in the digital library program. Though the state of relevant technologies (and
related factors) greatly influences the quantity and pace of digitizing work that is done with
materials in different formats, the NDLP digitizes all types of printed matter, manuscript
materials, prints, photographs, sound recordings, and motion pictures.
Another important characteristic of LC's program, also defined by the Librarian's
priorities, is that the digitizing effort facilitate greater access to the Library's treasures for a larger
audience than has traditionally used the collections. Indeed, the Librarian has often
used phrases like "getting the champagne out of the bottle" when describing program goals. As
part of this philosophy of providing greater access, the Librarian has emphasized kindergarten
through high school (K-12) students and educators, and the general public; post-secondary
students and educators, and the scholarly community, have not been discounted, but neither have
they been made priorities.
These issues have guided the general development of the American Memory pilot and current
NDLP digital production. They have also had an impact on the design and implementation of the
NDLP text-conversion program. From the beginning, certain priorities have been clear. Perhaps
most fundamental was the idea that machine-readable texts were an important and new, or at
least not commonly available, type of access to be provided for these historical materials. The
ability to search the full text of a document and thus preserve its intellectual content and, often,
context (represented by the document structure, formatting, and other presentational features),
was seen as highly desirable. The emphasis on historical materials, of course, guaranteed a wild
range of text sources. But in the context of digitizing materials in multiple formats, including
printed matter, typed and handwritten manuscripts, printed illustrations of many types,
photographs, sound recordings, and motion pictures, printed matter was--and is--just another
type of stuff. We were forced to seek certain economies, and not only the financial and work-effort kind. It was clear from the beginning that we needed to seek conceptual economies that
would allow us to make progress on all fronts and do the best job possible with each one,
considering the current state of technology. Ours is a fully integrated digital library project, not
just an electronic text project. And yet, being a library, no less a library with 525 miles of
bookshelves, we took our responsibility to adequately represent text materials quite seriously.
In developing our SGML document type definition (DTD), we confronted some hard choices and
made what might be characterized as some very library-like decisions. Our reasoning followed
these lines: We are the custodians of these text materials and we want to provide improved
access to them. We don't presume to know what all Library users might want to do with these
texts, and, even if we could anticipate every user's needs, we couldn't possibly accommodate all
of them.
We didn't want to have to force different document types into a single content model. Nor did
we want to have a baker's dozen of DTDs and match up every document with the best suited
DTD, or require that kind of sophisticated decision making from data-entry technicians who were
unlikely to possess the appropriate training. We knew that we would provide digital images of
the original pages of text materials and that we wanted the texts to faithfully retain original
errors. We had neither the staffing nor time (nor mission) to aspire to the standards of a typical
documentary editing project, and, therefore, had to be prepared to allow all texts to be converted
and marked up without benefit of expert examination to, for example, interpret handwriting,
identify subtle boundaries of text subsections, or distinguish between different types of names or
dates.
In response to these issues, we sought to cultivate some creative and flexible middle ground.
Little did we realize that we would find that middle ground in the TEI. There was an uncanny
congruence between the encoding principles derived during the American Memory document
analysis and the TEI guidelines. This should not be so surprising, however, since both projects
were firmly rooted in the careful analysis of a broad range of humanities texts. Though LC staff
did not expect to become TEI converts, we knew what we wanted and what types of capabilities
we had to have. It could be argued that the unexpected result--great compatibility and
congruence between the American Memory DTD and the TEI--underscores the appropriateness
of the TEI gestalt for use in the humanities. The descriptive flexibility afforded by the TEI is
profoundly important and, this author would argue, developing a digital library of historical
materials in the humanities would be impossible without it. One might say, in summary, that the
American Memory DTD is the TEI writ small.
After more than four years of a reasonably successful implementation and continuing evolution of LC's text-conversion program, observations on a number of matters are timely. This paper will discuss in detail such issues as the contrasts between digital libraries and electronic text projects (using the NDLP as the main point of reference), including the former's goal to provide encoded texts to a broad, unspecialized audience; the compromises required by integrating text files into complex systems architecture and multi-media Web presentations; and the everyday realities of balancing text encoding within a wide array of traditional library work necessary for preparing historical materials for digitization. In addition, discussion will focus on several of the founding principles articulated in the TEI, including the intention to (1) "provide a standard format for data interchange in humanities research" (is there any such thing--outside tiny tribes of specialists?), (2) "suggest principles for encoding of texts in the same format" (encoding principles are invaluable, but, one might argue, the concepts of formats are too canonical), and (3) "include a minimal set of conventions for encoding new texts in the format" (but is the minimal set simple enough for a broad program?).