Making Valid XHTML Documents from Microsoft Word 2004 Using BBEdit and HTMLTidy
by Kerri A. Hicks
First things first
Make sure you have access to an Apple Macintosh computer running OS X, a copy of Microsoft Word 2004, and a copy of BBEdit 8, all properly installed.
The Steps in Word 2004
Screen shot of the Microsoft Word 'Save as Web Page' dialog box, with appropriate settings.
- Open your Word document.
- Go to the 'File' menu at the top of the Word window, and choose 'Save as Web Page...'
- On the sheet that pops up, decide where to save your file, and what to name it. (Word defaults to saving your file with a '.htm' extension. You may wish to change that to '.html', which is more common.) The 'Format' field should show 'Web Page (HTML)', and the 'Save entire file into HTML' option should be selected. (Don't click 'Save' yet.)
- Select the 'Web Options...' button, and in the 'General' tab of the dialog box, type the title of your web page into the 'Web page title' field. Select the 'OK' button.
- Select the 'Save' button to save your document.
Word has now created a web page for you. It has also extracted any images in your document into a folder named after your web page, so that the document can link to them automatically. The HTML that Word has created, though, is full of unnecessary code that makes the page bloated and non-standard. The next step is to strip out all that non-standard code, and replace it with XHTML-compliant code.
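If you're curious how much Word-specific markup a saved file contains, a short script can count it. The following is only a rough sketch in Python: the three patterns are a few markers commonly seen in Word's HTML output, not a complete list, and the names `WORD_MARKERS` and `count_word_markup` are made up for this illustration.

```python
import re

# A few kinds of markup that Word's HTML export tends to contain.
# Illustrative only; this is not an exhaustive list.
WORD_MARKERS = {
    "office namespace tags": re.compile(r"</?o:[A-Za-z]+[^>]*>"),
    "mso- style properties": re.compile(r"mso-[a-z-]+\s*:"),
    "Mso class names": re.compile(r'class="?Mso[A-Za-z]+'),
}


def count_word_markup(html: str) -> dict:
    """Return how many times each Word-specific pattern occurs in html."""
    return {name: len(pattern.findall(html)) for name, pattern in WORD_MARKERS.items()}


# A tiny hand-written fragment in the style of Word's output.
sample = (
    '<p class=MsoNormal style="mso-pagination:none">'
    "Hello<o:p></o:p></p>"
)
print(count_word_markup(sample))
```

Running the same count before and after the BBEdit steps gives a quick sense of how much was stripped out.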
The Steps in BBEdit
- Use BBEdit to open the HTML document that you just created.
- From the 'Markup' menu, choose 'Tidy', and choose 'Clean Document'. (1) Check the first three items in the box, and only those. This step will remove much of the non-standard code from the top of your document.
- Go to the 'Search' menu, and choose 'Find...'.
  - In the 'Search for' box at the top, type <o:p>
  - Be sure there's a checkmark in the 'Start at top' option.
  - In the 'Replace with' field, there should be nothing -- not even a space.
  - Select the 'Replace All' button. (2)
- Repeat the above step exactly, but in the 'Search for' box, type </o:p>
- From the 'Text' menu, choose 'Straighten Quotes'. This will turn any leftover MSWord 'Smart Quotes' -- sometimes called 'curly quotes' -- into web-friendly quotation marks and apostrophes. (3) You won't see a report after this step; you can simply trust that it's been done.
Screen shot of BBEdit's 'Translate' dialog box, with appropriate settings.
- From the 'Markup' menu, choose 'Utilities' and then 'Translate...'. This step will turn any non-standard characters, including many foreign-language characters, into HTML entities. In the box that appears:
  - Select 'Text to HTML'
  - Select 'HTML Entities'
  - In the 'Use' field, select the 'Name' radio button
  - Select 'Ignore < and >'
  - Select 'Encode all Unicode characters'
  - Use the 'Translate' button.
- Lastly, from the 'Markup' menu, choose 'Tidy' again, but this time select 'Convert to XHTML'.
- If there are any errors, a box will pop up, showing you what the error is, and what line it is on. Usually, the errors are simple things such as "Table needs a summary attribute". Remedy any of these errors as best you can. If there are problems that you cannot solve yourself, feel free to send email to Brown's Webpublishers' email list (for Brown users only) with your questions.
- You're done. Now save the document, and move it to your web server. (Be sure to move any images that Word saved along with it.)
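The three text-level steps above (removing the <o:p> tags, straightening quotes, and translating characters to entities) can be roughly approximated outside BBEdit. The sketch below is written in Python and is not what BBEdit does internally. The function names are invented for this example, and it assumes that only non-ASCII characters should be encoded, leaving <, > and & untouched.

```python
import re
from html.entities import codepoint2name

# Matches opening and closing <o:p> tags, with or without attributes.
O_P_TAG = re.compile(r"</?o:p\b[^>]*>")

# Curly quotes and apostrophes mapped to their plain ASCII equivalents.
STRAIGHT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def strip_o_p(html: str) -> str:
    """Delete every <o:p> and </o:p> tag, keeping the text between them."""
    return O_P_TAG.sub("", html)


def straighten_quotes(html: str) -> str:
    """Replace curly quotes with straight quotation marks and apostrophes."""
    return html.translate(STRAIGHT)


def encode_entities(html: str) -> str:
    """Replace each non-ASCII character with a named entity.

    Falls back to a numeric entity when HTML defines no name for it.
    ASCII characters, including <, > and &, are left as they are.
    """
    out = []
    for ch in html:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


def clean(html: str) -> str:
    """Apply the three steps in the same order as the BBEdit instructions."""
    return encode_entities(straighten_quotes(strip_o_p(html)))


sample = "<p>\u201cCaf\u00e9\u201d isn\u2019t open<o:p></o:p></p>"
print(clean(sample))
```

For the sample line this prints `<p>"Caf&eacute;" isn't open</p>`. The order matters: if the quotes were not straightened first, the entity step would turn them into named entities instead.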
I welcome any feedback about this process -- what worked, and what didn't. Also, if you can write AppleScript and would be willing to write an AppleScript to automate this process, I'd be delighted, and will distribute it here.
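As one possible starting point for automation, the two Tidy steps could be run from a script instead of from BBEdit's menus. The sketch below is Python rather than AppleScript, and it assumes a standalone HTML Tidy binary named `tidy` is installed on your path, which BBEdit itself does not require. Which Tidy options correspond to the checkboxes in BBEdit's 'Clean Document' box is a guess on my part; `word-2000`, `clean` and `bare` are real Tidy options, but they may not be the same three. The function name and file names are placeholders.

```python
import shlex


def tidy_command(source: str, output: str) -> list:
    """Build an argument list for a standalone HTML Tidy binary.

    Nothing is executed here; the list can be passed to subprocess.run().
    """
    return [
        "tidy",
        "-q",                  # suppress non-essential output
        "-asxhtml",            # convert the document to XHTML
        "--word-2000", "yes",  # strip markup added by Word's HTML export
        "--clean", "yes",      # replace presentational tags with style rules
        "--bare", "yes",       # strip Microsoft-specific HTML
        "-o", output,          # write the result to this file
        source,
    ]


# Placeholder file names for illustration.
cmd = tidy_command("report.html", "report-clean.html")
print(shlex.join(cmd))
```

The script only prints the command, so you can inspect it before running anything. The find-and-replace, 'Straighten Quotes' and 'Translate' steps would still need to be handled separately.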
(1) The 'Tidy' command invokes HTMLTidy, a powerful open-source application that does a great deal of markup conversion. BBEdit has built HTMLTidy in, so you won't have to install the application separately.
(2) The <o:p> elements, for some reason, are left behind in the current version of HTMLTidy. However, a bug was filed with the HTMLTidy development team, and they have fixed Tidy such that it will, in future versions, remove the <o:p> elements, which will some day allow you to skip this step.
(3) This isn't usually necessary, but I often cut-and-paste text from Word into BBEdit, and in those cases, it is absolutely necessary. So I do it as a habit, because it's a good idea.
