Making Valid XHTML Documents from Microsoft Word 2004 Using BBEdit and HTMLTidy
by Kerri A. Hicks
Introduction
While Microsoft Word is the most widely used document editor on campus, it is not a particularly good tool for creating documents for distribution on the web. Web developers have published Microsoft Word files to the web in various ways over the years:
- Web-page maintainers painstakingly convert Word documents into web pages by hand, copying and pasting text. The drawbacks are that style information is easily lost, footnotes are not easily preserved, and various other important parts of a document may have to be recoded from scratch.
- Publishers may choose to simply put the Word document on the web as a downloadable file. However, the web user who wishes to view the document must have Microsoft Word (or some other software that can read Word documents) installed on his or her computer, and the configuration of that software can alter the formatting, layout, or other functionality enough to render the document useless.
- Some publishers create PDF documents from their Word files and link to them. While the PDF format is more readily accessed by web browsers than the Word format is, it still requires additional software or plugins. Also, depending on how the PDF is made, the text may not be selectable for copying and pasting into other documents, and unless the PDF is specifically made accessible, it is likely to be unusable by people with disabilities who rely on particular assistive technologies.
- Other creative, but time-consuming, ways.
Since our work at STG often includes receiving Word documents from faculty that must be published to the web, I've come up with a simple process that leverages the power of Microsoft Word, BBEdit, and HTMLTidy to create XHTML files that not only work properly in all modern web browsers, but also meet current web standards for validity and well-formedness.
NB: This technique will preserve most structural formatting for most Word documents. If your document has a great deal of customization or an intricate layout, you may have to take more advanced steps.
Why Worry about Validity?
Ensuring that a document is valid is not simply an exercise for the sanctimonious. Documents that meet current web standards are far more likely to live a long life, through changes in operating systems, browsers, and any other new technologies. We've seen the importance of this in recent years: developers who built pages with 'hacks' or 'cheats', or folks who wrote pages for one specific browser, have had to update, change, or discard much of their code, costing hours of work. Most standards bodies realize that they need to support older standards as they develop new ones. So if you do it right all along, you can be pretty sure that your documents will still work years down the road, and will be properly searched, indexed, and archived.
