STG's XML 1.0 Reference Validator
This report examines why validation, and readily available validation facilities, are critical to the rapid dissemination and success of XML; it also introduces a new, public reference validator intended to help fill this niche.
Note: This report was written originally in October of 1998 - at a time when there were no complete, working, web-available XML validators. Since then, in addition to STG's validator, one other web-available validator has appeared (author: Richard Tobin). Others will doubtless follow.
With all the hubbub surrounding XML lately - all the conferences, debates, books, papers, and articles - it is a surprising fact that only a small fraction of the XML available on the net is actually valid; i.e., only a small fraction of it follows the full February 1998 W3C XML 1.0 spec. The reason for this is simple: There isn't much XML software (as yet) to adequately generate and check it. Nor are there any full, working, web-based XML validation services analogous to those we see in the HTML world.
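The distinction matters because most non-validating parsers check only well-formedness (matched tags, proper nesting), not validity against a DTD. As a rough illustration - using Python's standard-library xml.dom.minidom, a non-validating parser unrelated to STG's system - a document can pass a well-formedness check while a validating parser would still reject it for lacking, or violating, a DTD:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML.

    Note: minidom does NOT validate against a DTD; a True result
    here says nothing about validity in the XML 1.0 sense.
    """
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        return False

# Well-formed (a validating parser would additionally demand a DTD).
print(is_well_formed("<note><to>Ada</to></note>"))   # True
# Mismatched tags: not even well-formed.
print(is_well_formed("<note><to>Ada</note>"))        # False
```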
Access to validation services, however, is critical to the success of XML because without it we end up back where we started, i.e., back to the very same chaos that prompted the development of XML in the first place.
In efforts to help reduce the chaos, and make validation facilities more broadly available, Brown University's Scholarly Technology Group (STG) has placed on its website a public reference XML 1.0 validator. This report examines the rationale behind that validator, and offers a brief semi-technical overview of its design.
The ubiquity of invalid XML documents (or, more broadly, our inability to detect them easily as such) presents a serious obstacle to the rapid dissemination and success of XML because it perpetuates the same interoperability problems that have hampered the development of XML's cousin, HTML.
As most Web designers and programmers are well aware, nonconformant HTML (i.e., HTML that fails to validate against an IETF or W3C standard) is, in many quarters, more the rule than the exception. Nonconformant HTML, though, often works out in practice because browser manufacturers, in addition to creating their own HTML extensions, have managed to work around most of the mistakes that programmers and authors typically make. But the manufacturers can't anticipate every possible mistake; and neither can every piece of software we use with our HTML. As a result HTML software is something of a free-for-all. Some software works fine with some HTML. Other software breaks on the same material.
The fundamental reason why HTML software has become such a free-for-all is that HTML began its life with no formal specification. Worse yet, when formal specifications finally did begin to appear, they came too slowly to be of much use to Web designers and programmers. As a result, every browser manufacturer felt obligated to define its own version of HTML. Microsoft and Netscape also felt it necessary to hire armies of programmers to figure out what their competitors were doing.
The result has been a dramatic increase in the cost and complexity of HTML processors - and an interoperability nightmare.
With XML ("Extensible Markup Language"), the situation is potentially quite different from what we have seen with HTML. With XML we don't have to worry as much about browser manufacturers arbitrarily redefining the specs. Nor do we have to wait for standards bodies to reach consensus. With XML, each of us has the power to take matters into our own hands; to define our own markup language, or to extend an existing one - and to decide what is, and isn't, a valid construct in that language. What is more, we can do all this in a way that conforming XML processors will understand. In other words, we can do it without creating the same interoperability problems that have dogged HTML.
The mechanism through which XML grants us these powers is the document type definition (DTD) - a document that specifies what elements, attributes, and entities an XML document instance may consist of, and in what order and combination. With a DTD (and a stylesheet) users have close to total control over the language and presentation of their documents.
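A minimal example may make this concrete (the element, attribute, and entity names here are invented for illustration). The internal DTD subset declares which elements may appear, in what order, what values an attribute may take, and a reusable entity; a validating parser checks the instance below it against those declarations:

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, from, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST memo priority (high | low) "low">
  <!ENTITY stg "Scholarly Technology Group">
]>
<memo priority="high">
  <to>Readers</to>
  <from>&stg;</from>
  <body>A validating parser checks this instance against the DTD above.</body>
</memo>
```

Reordering the children of memo, or setting priority to an undeclared value, would leave the document well-formed but invalid.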
(Although HTML has official DTDs, they are controlled by standards organizations, are rarely used, and often do not reflect actual practice.)
Despite the freedom that XML DTDs can give us, there is, as yet, little software that allows anyone to take advantage of them. Most XML processors available now essentially ignore the DTD. And of those that do full (DTD-aware) validation, only one, as of this writing (Oct 98), is available freely over the Internet (I have not yet managed to get that validator, based in Korea, to work). See Robin Cover's definitive XML testing and validation resource list.
The absence of a full, working, publicly available XML reference validator creates a critical gap, especially now that consortiums have begun popping up everywhere, defining their own XML-based formats and laying claim to XML's platform independence and interoperability. Without widely available validation facilities these claims are empty, because there is no way to verify, or enforce, actual conformance.
Perhaps not surprisingly, even an informal check of actual and proposed XML interchange formats reveals that most do not reflect valid XML 1.0 constructions. Some are so far from the spec that one wonders how anyone could call them XML. Until there is a publicly available reference XML validator people can point to, it will be difficult to stem the tide of this faux XML, and to get down to the business of creating genuinely interoperable formats, and field testing the XML processors that are to operate on them.
It is in an effort to fill this need for an XML reference validator that the Brown University Scholarly Technology Group (STG) has placed on its website a simple form-based XML 1.0 validation system.
Using STG's XML validator is easy. Just go to the Web form, and either type in a local filename, or paste some actual XML into its text field; then click on the validate button. The validator will then either respond with a "validates OK" message, or else output a list of error and warning messages.
The overall design of STG's system is tripartite. It is a familiar design common to many "traditional" web-based interfaces. It consists of:

1. a static HTML entry form;
2. a PERL CGI script that mediates between the form and the validator; and
3. a back-end validating parser.
The back end (component 3 above) is written specifically for legacy computer systems that lack intrinsic library support for Unicode and that may even have old-style SGML catalogs around. It validates at a rate of about ten seconds per megabyte on an old dual 125MHz HyperSPARC 20 server, and about four seconds per megabyte on a Pentium Pro 200 desktop. For more information on the back end, see its Unix man page.
The PERL script (component 2 above) is something of a bottleneck, but it uses the now nearly universal CGI interface, and has the advantage of being portable and easy to maintain. The same might be said of the static HTML form (1 above), which provides a simple, effective, maintainable entry point into the system. Obviously it would be nice to have an XML-based entry point, but the software is not yet available to support this.
The reference validator's back end has just finished a brief round of in-house alpha testing, and the system as a whole is now ready for public access on STG's main website:
We consider the system to be in beta testing now, and we invite bug reports. (Doubtless there will be more than a few of these.)
The source code for the parser is available at STG's website, as are binaries for a few platforms.
Please direct questions or comments on the system, or on any of the issues surrounding its release, to the STG staff (address below).