Text Encoding Initiative
Tenth Anniversary User Conference


TITLE: Taking Snapshots of the Web with a TEI camera


D Walker
Queen's University,
Kingston, Ontario
Canada
walker@qucis.queensu.ca

(Student Submission)

The goal of the Snapshot project is to observe and capture the linguistic development of a technological culture over the short period of several months. The culture in question is the World Wide Web and the linguistic variations in this culture are obtained by taking snapshots, pseudo-random samples of web documents. The documents are captured at regular intervals and are added to an open corpus of previously retrieved Web documents. The corpus then serves as the raw material for qualitative and quantitative linguistic analysis of the arguably unique nature of this form of electronic communication.

To support this kind of analysis, documents must be both encoded in a standardized way and amenable to data retrieval by the researcher. The TEI serves as the encoding standard for this document system. The details of the encoding method are introduced here as a novel method for storing documents in an open and growing corpus. Some of the problems inherent in automatic document retrieval and encoding are also explored.

There has been significant discussion in academic circles on the nature of new forms of electronic textuality. Simone notes that the whole nature of textuality may be changing as a result of electronic texts; text can no longer be considered 'closed' when it is so easy to manipulate and republish electronically (Simone, 1996). Other purported properties of electronic communication include both its editorial and physical temporality. Hypertextuality also creates a new participatory role for the reader of electronic texts (Landow, 1993). Many of these attributes of electronic texts would presumably be coupled with observable linguistic phenomena.

The initial components of Snapshot were taken from its predecessor, NeoloSearch. The purpose of this original system was to generate an open corpus of web documents for searching and retrieving neologisms (Janicijevic and Walker, 1997). But it became clear that a more broadly scoped project, one which would potentially be able to show short-term language variations and other properties of Web communication, would be a useful next step. It is supposed that these short-term changes, especially the appearance of slang and new terminology, can be observed by taking samples of the Web at regular intervals. At each monthly interval, then, the corpus is updated with a fixed amount of language data, approximately 7 million tokens. The documents are retrieved using a web-crawler system developed for NeoloSearch.

The initial system tokenized new documents and indexed the tokens by frequency. When a potential neologism was discovered, it was extremely difficult to determine the context of the token, since no contextual information about the source documents was stored. To achieve a more thorough analysis and to enable the study of other phenomena, it became clear that documents should be stored in their entirety and indexed in some database system. While many encoding schemes exist (such as the source format HTML), it seemed practical to use the robust TEI encoding scheme to better reflect the semantics of the storage system. TEI also provides what is hopefully a more permanent and more regular representation for the documents.

At its root level, the corpus is encoded using as many as possible of the standards set out by the British National Corpus, as summarized in (Dunlop, 1995). Because of both file management and automated document retrieval issues, it is more practical to store each document in a separate file linked using <xref> references from the main corpus file. Documents from a retrieval run are logically grouped together, and therefore represent an encoded grouping within the main corpus file. The main file, however, only contains stub references to the actual documents.

The documents themselves are only lightly encoded with the default TEI-Lite tagset as abbreviated from the full TEI P3 specifications. This is due primarily to the overhead and ambiguity involved in syntactically parsing text to extract linguistic facts. Additionally, it is not automatically discernible what kind of text is represented in the document (e.g. prose, drama, etc.), so none of the specific tagsets can be used. The TEI-Lite DTD provides a clear and simple framework which is also used widely in other encoding projects (Burnard & Sperberg-McQueen, 1995). Since conversions between HTML and a skeleton TEI representation are relatively easy to accomplish with a simple context-free parser, some TEI encodings can replace the basic HTML codes. For instance, the <H...> headers can be hypothesized to represent the various divisions of a text, and paragraph breaks, <P>, can be anchored as TEI segments. Basic document information, such as source URL and date/time of retrieval, can be stored in the document header. Handling of non-Latin character sets has not yet been addressed in this framework.
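
The HTML-to-skeleton conversion described above might be sketched as follows. This is an illustrative sketch only, not the Snapshot encoder: the class and function names are hypothetical, divisions are kept flat rather than nested by heading level, and the choice of <div>, <head> and <p> as target elements is an assumption made for the example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"title", "script", "style"}  # content not treated as body text


class SkeletonTEI(HTMLParser):
    """Hypothetical converter: headings open divisions, <P> marks paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # finished pieces of TEI markup
        self.buf = []          # text collected for the current head or paragraph
        self.mode = None       # "head", "p", or None for stray text
        self.div_open = False
        self.skip = 0

    def _flush(self):
        # Emit whatever text has been collected, normalizing whitespace.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            element = self.mode or "p"   # stray text is treated as a paragraph
            self.out.append(f"<{element}>{escape(text)}</{element}>")
        self.mode = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip += 1
        elif tag in HEADINGS:
            self._flush()
            if self.div_open:
                self.out.append("</div>")
            self.out.append("<div>")
            self.div_open = True
            self.mode = "head"
        elif tag == "p":
            self._flush()      # an unclosed <P> is ended by the next <P>
            self.mode = "p"

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip:
            self.skip -= 1
        elif tag in HEADINGS or tag == "p":
            self._flush()

    def handle_data(self, data):
        if not self.skip:
            self.buf.append(data)

    def close(self):
        super().close()        # lets the parser deliver any pending text
        self._flush()
        if self.div_open:
            self.out.append("</div>")
            self.div_open = False


def html_to_tei_body(html):
    parser = SkeletonTEI()
    parser.feed(html)
    parser.close()
    return "<body>" + "".join(parser.out) + "</body>"
```

Inline markup such as <B> or <A> is simply dropped in this sketch; a fuller encoder would have to decide which of those codes deserve a TEI counterpart.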

The options for each run must be specified in the file header. These include: sample size, minimum document size, search behavior, language restrictions, and so forth. Each text division and segment is identified using a hash function which combines the retrieval group, document name, division (if applicable), and segment, so that each identifier is unique within the system. This is useful for locating contexts for a user query or statistical analysis.
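
One way such an identifier could be derived is sketched below. The abstract does not name the hash function, so the use of SHA-1, the truncation to 16 hexadecimal digits, and the function name segment_id are all assumptions made for illustration.

```python
import hashlib


def segment_id(group, document, division, segment):
    """Combine the four components into one identifier (hypothetical scheme)."""
    # A separator that cannot occur in ordinary names keeps ("ab", "c")
    # distinct from ("a", "bc").
    parts = [group, document, "" if division is None else str(division), str(segment)]
    key = "\x1f".join(parts)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # SGML ID values must begin with a letter, hence the "s" prefix.
    return "s" + digest[:16]
```

The same four components always yield the same identifier, so an identifier stored alongside a token can be recomputed or looked up later. A truncated digest makes collisions very unlikely rather than impossible.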

Despite their seeming simplicity, document parsing and encoding are not necessarily trivial. It is expected that a large number of documents on the Web are malformed. This means that the encoder must be capable of at least basic error recovery and that a simple search and replace of codes will not suffice. When an expected code is missing or an unexpected code is found, a <corr> tag can be used to indicate that a correction was made. Unknown elements can be represented using <unclear> tags. The encoding is in the spirit of Lavagnino (1996). Without burdening the corpus with excessive and arbitrary interpretations, it is still both necessary and potentially useful to capture original encoding errors.

One major problem with automatic document retrieval is the identification of duplicate documents and document versions. Many documents are duplicated on `mirror-sites' on the Web. Therefore it is more than likely (and in practice virtually inevitable) that some document duplication occurs. To identify this phenomenon, and also to avoid the cumbersome task of complete document comparison, special frequency information can be stored within the document header. Comparing the frequencies of the top n tokens in the document with those of all other documents is a sufficient means of finding approximate document matches. At worst, this method will match too many documents to each other. This is acceptable since more rigorous text alignment techniques can be employed if a match is suspected.
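
The top-n comparison can be illustrated with a short sketch. The tokenization rule, the overlap score and the 0.8 threshold are assumptions chosen for the example; the abstract does not specify how the frequencies are compared.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters; an assumed tokenization


def top_tokens(text, n=10):
    """Return the n most frequent tokens with their counts."""
    counts = Counter(t.lower() for t in TOKEN.findall(text))
    # Sort by descending frequency, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def top_n_similarity(a, b, n=10):
    """Share of top-n token mass common to both texts, between 0.0 and 1.0."""
    ta, tb = top_tokens(a, n), top_tokens(b, n)
    if not ta or not tb:
        return 0.0
    shared = sum(min(ta[t], tb[t]) for t in ta.keys() & tb.keys())
    return shared / max(sum(ta.values()), sum(tb.values()))


def probable_match(a, b, n=10, threshold=0.8):
    """Flag a pair for closer inspection; false positives are expected."""
    return top_n_similarity(a, b, n) >= threshold
```

A pair flagged by probable_match is only a candidate; confirming it is left to a more rigorous alignment step.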

While various algorithms purport to solve the rigorous matching problem (see Crochemore and Rytter, 1994), the answer to the document version problem is more often found in context. If a document shares its URL with a previously retrieved document, it is likely to be a revision of it. If the authors of the two documents could be shown to be the same, this would also suggest the relation. Unfortunately, neither these nor any of the other possible contextual criteria are sufficient for deducing the revision relationship between two documents. Documents are mirrored at dozens of disparate sites on the Web, with authorship and other significant indicators altered (Heery, 1996).

Fortunately, the exact descent of one document from another does not need to be absolutely determined. Instead a `sameness' attribute can be introduced into the pointer definition which describes how `similar' one document is to another. While this method does not guarantee accuracy, it serves as a springboard for more complex document comparison and alignment.

There are various heuristics that can be applied to the frequency information to derive a compact measure of the text. A simple one is to sum the frequencies of the top n tokens modulo some appropriate m. The object is to have one number which 'describes' the document and can then be encoded in the text's identifier. Then the challenge of comparing documents is at least initially reduced to comparing two integers, a very fast operation. If the texts are identical, then a reference to the original can simply be placed in the stub of the new retrieval section of the main corpus file, depending on the retrieval preference settings. But if the match is not exact, then it must be determined whether the document is `similar' but different, or is a newer version of the older one.
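
The tally heuristic can be written down in a few lines. This is a minimal sketch: the tokenization rule and the default values of n and m are assumptions, and the function name signature is hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters; an assumed tokenization


def signature(text, n=10, m=65521):
    """Sum the frequencies of the n most frequent tokens, modulo m."""
    counts = Counter(t.lower() for t in TOKEN.findall(text))
    # Tokens tied at the cut-off have equal counts, so the sum does not
    # depend on how ties are ordered.
    return sum(freq for _, freq in counts.most_common(n)) % m
```

Two documents with different signatures cannot share the same top-n frequencies, whereas equal signatures only mark the pair as a candidate: unrelated texts can collide, so a full comparison is still needed before a document is recorded as a duplicate.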

Associations between documents can then be encoded in the document referent in the main corpus file. It is then possible to show the relationships between a given document, its predecessor and successor versions, as well as documents which may share stylistic similarities (for instance technical reports).

For Snapshot, two parallel databases are maintained. Tokens are stored in a database for easy access by Unix tools like grep and awk. Each token has associated with it a reference to the source identifier in the TEI-encoded corpus. Thus a user who queries the database for specific frequency occurrences can look back to the contexts of any curious token that results. This may prove unnecessary in the future, as tools like the XML toolset allow for an easier 'command-line' interface to the corpus, provided that the encoding schemes are compatible (Flynn et al., 1997).
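
A token database of this kind could be as simple as a tab-separated text file. The sketch below assumes one line per token holding its total frequency and the identifiers of the segments where it occurs; the layout, the function names and the identifier format are illustrative assumptions, not the Snapshot file format.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters; an assumed tokenization


def build_token_index(segments):
    """Map each token to {segment identifier: occurrences in that segment}."""
    index = defaultdict(dict)
    for seg_id, text in segments.items():
        for token in TOKEN.findall(text.lower()):
            index[token][seg_id] = index[token].get(seg_id, 0) + 1
    return index


def index_lines(index):
    """Render the index as 'token<TAB>frequency<TAB>id,id,...' lines."""
    lines = []
    for token in sorted(index):
        refs = index[token]
        total = sum(refs.values())
        lines.append(f"{token}\t{total}\t{','.join(sorted(refs))}")
    return lines
```

Written to a file (say tokens.tsv, a hypothetical name), such lines can be filtered with awk -F'\t' '$2 == 1' tokens.tsv to list tokens that occur only once, and the identifiers in the third field lead back to the contexts in the encoded corpus.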

The implementation combines the features of document collection and corpus analysis into one interface. The still limited front end is programmed in Java-augmented CGI to enable maximum portability. Users can browse through documents, list tokens and observe their contexts. Some simple querying capabilities are also available. Unfortunately, the project has been limited by the still nascent state of accessible TEI and SGML resources.

The Snapshot project is an ongoing development effort. More research is required to evaluate the current design and implementation and to explore alternative data encodings. In addition, more work can be done to explore the encoding of non-textual data and a more distributed system in which documents are not actually stored on the host machine. It may also prove fruitful to examine methods for the automated extraction and encoding of more linguistic facts from the source documents. Further experimentation is also required to find a method of truly 'sampling' the Web, as web-crawling may prove far from pseudo-random.

The rapidly changing world of Web publishing deserves closer scrutiny by linguists. By sampling language usage on the Web, this novel form of communication can be recorded in a less temporal form. By using the TEI as the encoding standard for this repository, it should be both a flexible and transportable resource.

References

Burnard, L. and Sperberg-McQueen, C.M., (1995) TEI Lite: An Introduction to Text Encoding for Interchange at ftp://www.uic.edu/orgs/tei/intros/teiu5.tei

Crochemore, M. and Rytter, Wojciech (1994) Text Algorithms Oxford University Press, New York.

DeRose, S. J., and Durant, D. G., (1995) "The TEI Hypertext Guidelines" Computers and the Humanities, Vol 29: 181-190.

Dunlop, D., (1995) "Practical Considerations in the Use of TEI Headers in a Large Corpus" Computers and the Humanities, Vol 29: 85-98.

Flynn, Peter et al, (1997) Frequently Asked Questions about the Extensible Markup Language at http://www.ucc.ie/xml/faq.sgml

Heery, R., (1996) "Review of Metadata Formats" Program, Vol 30, No 4.

Janicijevic & Walker, (1997) "NeoloSearch: Automatic Detection of Neologisms in French Internet Documents" Proceedings of ACH/ALLC'97: 93-94.

Langendoen, T. D. and Simons, G. F., (1995) "A Rationale for the TEI Recommendation for Feature-Structure Markup" Computers and the Humanities, Vol 29: 191-209.

Lavagnino, John, (1996) "Completeness and Adequacy in Text Encoding" in The Literary Text in the Digital Age University of Michigan Press, Ann Arbor: 63-76.

Simone, Raffaele, (1996) "The body of the text" in The Future of the Book University of California Press, Berkeley: 239-252.

Sperberg-McQueen, C.M. and Burnard, L. The Text Encoding Initiative Guidelines (TEI P3) at ftp://info.ox.ac.uk/pub/ota/TEI/doc/teij31.sgml

Willett, P., (1996) "The Victorian Women Writers Project: The Library as a Creator and Publisher of Electronic Texts" The Public-Access Computer Systems Review, Vol 7, No 6.

