Xmlparse Unix Manual Page
xmlparse - a validating XML parser
xmlparse [-c <config filename>] [-C <SGML catalog filename>]
[-d <dirname>] [-E <max errors>] [-f] [-h] [-l <debug
level>] [-m <message catalog>] [-n] [-s] [-v] <filenames or
Xmlparse is a full validating XML parser for use as a back-
end to Web-based XML validation systems, or as a general-
purpose XML validation tool. It is particularly well-suited
to legacy SGML documents that are in the process of being
converted, along with their associated DTDs, to XML.
Xmlparse knows the difference between SGML and XML, and can
often elucidate mistakes that stem from SGML/XML incompati-
bilities (e.g., it reminds users that SDATA entities don't
exist in XML; it warns users about nondeterministic content
models, which are illegal in SGML; it also flags general
problems in DTDs, such as elements that are declared but
never used, or used but never declared).
Xmlparse may be invoked with several command-line options
that tell it where to send error output, and where to look
for catalog, message, and other auxiliary files.
Normally environment variables and/or compile-time defaults
should provide reasonable fallbacks for all of these
command-line run-time options.
-c filename
Use filename as the configuration file. See also
option -d below. Do not leave this file world-
-C filenames
Use filenames as the SGML catalog files (if more than
one is given, separate them with a colon). Note that
if -C is not supplied on the command line, the value of
the SGML_CATALOG_FILES environment variable is used
-d directory
Use directory as the default location for data,
library, and configuration files.
-E max errors
Print no more than max errors errors and/or warnings
for every file parsed.
-f Force undefined attributes and element names in
namespaces to validate successfully.
-h Print a brief help message, then exit. See also -v
-l level
Set debugging level to level (must be an integer from 0
to 7; higher = more information). Debugging messages
go to syslog(3) (facility DAEMON, priority DEBUG). Cf.
system messages, which go to syslog only when specified
(see -s below). This switch only works if the system
administrator left debugging enabled at compile time.
-m filename
Use filename as the message file name. This file con-
tains all error, warning, and parsing messages emitted
by xmlparse at run-time. Do not leave this file
-n Resolve only remote http:, urn:, and ftp: system ids
(be certain to use this option if you are running
xmlparse as a back end to a web-based validator). Note
that, even with the -n option, xmlparse will still
resolve local files if supplied on the command-line.
It will not, however, resolve URIs given on the
command-line unless they begin with http:, urn:, or
-s Output system error and warning messages to syslog(3)
(facility DAEMON, priority ERR or WARNING). These
error messages cover things like malformed SGML catalog
files, missing system files, and so on. Debugging mes-
sages (see -l above) always go to syslog (DAEMON,
DEBUG). Parsing errors always go to stderr.
-v Print version number, then exit. See also -h above.
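The options above can be combined on one command line. The following Python sketch shows one way a wrapper script might assemble such a command line. The function name and its keyword parameters are illustrative assumptions; only the switches themselves (-C, -E, -n, -s) come from this page.

```python
def build_xmlparse_argv(files, catalogs=None, max_errors=None,
                        remote_only=False, use_syslog=False,
                        program="xmlparse"):
    """Assemble an argument vector from the options documented above.

    This helper is hypothetical; it only arranges documented switches.
    """
    argv = [program]
    if catalogs:
        # -C takes a single argument; several catalogs are joined with colons.
        argv += ["-C", ":".join(catalogs)]
    if max_errors is not None:
        # -E caps the number of errors/warnings printed per file.
        argv += ["-E", str(int(max_errors))]
    if remote_only:
        # -n is the switch recommended for web-based validators.
        argv.append("-n")
    if use_syslog:
        # -s sends system error and warning messages to syslog(3).
        argv.append("-s")
    argv += list(files)
    return argv
```

The resulting list can be handed to any process-spawning routine; building a list rather than a shell string avoids quoting problems with unusual file names.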
Run-time settings may be supplied not only through
command-line options but also through a system-wide
configuration file (usually installed as
/usr/local/lib/xmlparse/xmlparse.cfg). Where they coincide,
directives supplied in the configuration file override
command-line options and compile-time defaults.
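The precedence just described can be modelled as a layered lookup. The Python sketch below is an illustration only, not xmlparse's own code: the setting names are hypothetical, and placing environment variables above compile-time defaults is an assumption (this page says only that both serve as fallbacks).

```python
from collections import ChainMap

def effective_settings(config_file, command_line, environment, compiled_defaults):
    """Merge setting layers; earlier layers win where keys coincide.

    Assumed order: configuration file, then command line, then
    environment variables, then compile-time defaults.
    """
    # ChainMap searches its mappings left to right and returns the first hit.
    return dict(ChainMap(config_file, command_line, environment, compiled_defaults))
```

With this ordering, a value given in the configuration file hides the same value given on the command line, which matches the behavior stated above.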
Normally the configuration file is used only to set the
external FPI and/or URI resolution commands (used by
xmlparse to resolve PUBLIC and SYSTEM identifiers). It may
also be used, however, to override the command-line options
-C, -E, -l, -m, -n, and -s. All configuration file direc-
tives are fully documented in the sample configuration file,
xmlparse.cfg, included with the base xmlparse source distri-
If no validation errors are detected, xmlparse exits with
status 0. Warnings may be issued to stderr. If actual
errors are detected, xmlparse exits with status 4, and emits
a list of parsing errors/warnings to stderr. Fatal system
errors resulting in early program termination produce other
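A calling script can branch on the exit status. The Python sketch below assumes only what is stated here: 0 means no validation errors and 4 means validation errors. Any other status is treated as a system-level failure. The function names and the returned labels are illustrative.

```python
import subprocess

def classify_exit_status(status):
    """Map an xmlparse exit status to a coarse, hypothetical label."""
    if status == 0:
        return "valid"          # no validation errors (warnings may still appear)
    if status == 4:
        return "invalid"        # validation errors were reported on stderr
    return "system-error"       # anything else: assumed fatal system error

def validate(path, program="xmlparse"):
    """Run the parser on one file; return (label, stderr text)."""
    result = subprocess.run([program, path], capture_output=True,
                            encoding="utf-8", errors="replace")
    return classify_exit_status(result.returncode), result.stderr
```

Because warnings may accompany status 0, a caller that cares about warnings should inspect the stderr text as well as the label.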
Xmlparse may emit various diagnostic messages at run-time
about missing files or arguments. By default, these go to
stderr. They may, however, be redirected to syslog(3)
through the -s command-line switch (on which, see above).
Xmlparse is aggressive in reporting ambiguous content
models, elements that are declared but not used in any con-
tent model, unresolvable public and system identifiers, and
Xmlparse also issues warning messages that encourage DTD
writers to declare things before using them. For example,
it reports cases where ATTLIST declarations name as-yet
undeclared elements; it also flags unparsed entity declara-
tions that point to as-yet undeclared NOTATIONs.
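The declare-before-use check for ATTLIST declarations can be approximated with a single ordered scan of the DTD text. The Python sketch below is a rough illustration and not xmlparse's implementation: it ignores comments, parameter entities, and conditional sections, and the function name is hypothetical.

```python
import re

# Matches the keyword and the element name of ELEMENT/ATTLIST declarations.
_MARKUP_DECL = re.compile(r"<!(ELEMENT|ATTLIST)\s+([^\s>]+)")

def attlists_before_elements(dtd_text):
    """Return element names whose ATTLIST appears before their ELEMENT."""
    declared = set()
    early = []
    for kind, name in _MARKUP_DECL.findall(dtd_text):
        if kind == "ELEMENT":
            declared.add(name)
        elif name not in declared:
            # ATTLIST names an element that has not been declared yet.
            early.append(name)
    return early
```

The same ordered-scan idea would apply to unparsed entity declarations and NOTATIONs.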
Xmlparse implements the published (February 1998) XML 1.0
standard. It will also check namespaces (see, however, the
-f option above).
Xmlparse deviates from the 1.0 spec in one notable way: it
ignores syntactically meaningless whitespace inside
declarations and markup. The rationale here is that this
practice not only follows SGML (e.g., Handbook, 65
[371:16]), but also simplifies processing - and renders XML
more easily manageable using programming tools like flex(1).
Note that this deviation from the spec has nothing to do
with the hotly debated issue of whitespace in actual charac-
ter data (which the validator maintains internally, as per
Xmlparse also deviates from the strict 1.0 standard in its
early reporting of malformed entity replacement text (if an
entity's replacement text would be malformed, xmlparse flags
it, whether or not you actually use the entity). The
rationale here is that early reporting of malformed entity
replacement text prevents users from declaring entities that
are at best useless, and at worst harmful in that they
trigger DTD-based errors in documents whose DTDs were
thought to be correct.
Xmlparse does not prohibit '<' in attribute values. The
rationale in this instance is that excluding '<' actually
complicates processing for validating parsers. Also, with
all its intricate entity replacement rules and constraints,
XML is already such a pain to process that this so-called
DPH restriction is just plain silly.
A final area in which xmlparse deviates from the XML 1.0
spec is that it ignores the encoding types specified by
external transfer protocols, such as HTTP. Experience
reveals that these protocols very often provide incorrect
encoding information (e.g., UTF-8 usually gets sent as ISO-
8859-1 or plain-text ASCII). As a practical necessity,
therefore, xmlparse relies for encoding information on its
own internal charset detection facilities and on the encod-
ing declaration, if the text provides one.
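The policy of trusting the document over the transport can be sketched as follows. This Python example is an assumption about how such detection might work (byte-order mark first, then the encoding declaration, then the UTF-8 default); it is not xmlparse's internal charset detection, and the function name is hypothetical.

```python
import codecs
import re

# UTF-32 marks are tested first because the UTF-32-LE mark begins
# with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Only works for ASCII-compatible encodings (e.g. UTF-8, ISO-8859-x).
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def guess_encoding(data, transport_charset=None):
    """Guess the encoding of raw XML bytes.

    transport_charset (e.g. from an HTTP header) is accepted but
    deliberately ignored, mirroring the policy described above.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # default when nothing else is known
```

A document served with a wrong transport charset is therefore still decoded according to its own declaration.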
To set up xmlparse, follow the instructions in the INSTALL
file that came with the source distribution. These instruc-
tions cover source code configuration and building, as well
as the actual installing.
Xmlparse has been coded specifically for platforms that
still lack support for UCS-2/4, UTF-16, and Unicode (i.e.,
nearly all stock Unix systems). It can also make limited
use of legacy SGML catalog files (basically it ignores com-
ments and lines that don't start with PUBLIC).
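The limited catalog handling described above amounts to keeping only PUBLIC entries. The Python sketch below illustrates that filtering under simplifying assumptions: each entry sits on one line, identifiers are quoted, and the function name and sample identifiers are hypothetical. It is not xmlparse's own catalog reader.

```python
import shlex

def read_public_entries(catalog_text):
    """Map public identifiers to system identifiers from PUBLIC lines."""
    entries = {}
    for line in catalog_text.splitlines():
        line = line.strip()
        if not line.startswith("PUBLIC"):
            continue  # comments and all other entry types are ignored
        try:
            tokens = shlex.split(line)  # honours quoted identifiers
        except ValueError:
            continue  # unbalanced quotes: skip the malformed line
        if len(tokens) >= 3 and tokens[0] == "PUBLIC":
            entries[tokens[1]] = tokens[2]
    return entries
```

For example, a catalog containing a comment line, a SYSTEM entry, and two PUBLIC entries yields a dictionary with just the two PUBLIC mappings.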
Xmlparse compiles using stock GNU tools available for nearly
all POSIX systems (e.g., (G)CC, Bison, and Flex [patched for
Xmlparse is an ugly, inelegant piece of software built to
run on legacy POSIX systems with C libraries and compilers
that don't understand Unicode (i.e., nearly all Unix systems
out there today).
Xmlparse assumes that all auxiliary files, other than the
XML source files and DTDs, are encoded using straight ASCII
or UTF-8. This includes the message catalog, the system-
wide configuration file, and any SGML catalogs used.
Xmlparse will parse XML source files and DTDs that use UTF-
8, UTF-16, UCS-2/4 (big or little-endian), or any of the
ISO-8859 standards, although all messages it emits are con-
verted to UTF-8. Naturally, documents that don't use UTF-8
should provide an encoding declaration, since xmlparse will
otherwise assume the default, UTF-8 (as per the spec).
Documents using ISO 8859-x should include an encoding
declaration as well.
Xmlparse handles memory inefficiently. This inefficiency is
compounded by its internal use of the wchar_t data type (if
available) for character and string operations.
Xmlparse also emits geekly line-numbered error messages that
XML/SGML neophytes may find inscrutable. These messages are
kept in a simple sprintf catalog that hard codes argument
orderings, and will therefore be a pain to port to some
Xmlparse was written by Richard Goerwitz for the Brown
University Scholarly Technology Group.
Send bug reports to <STG_info@Brown.EDU>.
Copyright 1998 by Richard Goerwitz and Brown University
Xmlparse is free software. Use it if you like (with
appropriate acknowledgments) and modify it to suit your
needs. But don't blame us if it doesn't do what you want or
expect. Make sure to check the COPYRIGHT file that came
with the xmlparse source distribution for a full statement
of copyright and usage conditions.