Word 2003 XML

Better than I expected, but good enough for a production system?

After styling some headers in a sample Word document as Heading 1, Heading 2, Heading 3, and so forth, I was pleased to see that when I saved the document as Word 2003 XML, sub-section container elements were wrapped around the appropriate elements, grouping a Heading 3 title and all block elements up to the next Heading 3 (or higher) block together, nested within the group that began with a Heading 2, etc. (Although Open Office 2.0 offers Word 2003 XML as a Save As choice, it does not add these containers.)

I did this with a pretty simple example, so I don't know how well it would work with more serious documents. Has anyone incorporated the use of this XML into a production system? How robust is it? I understand that getting users to use the styles consistently is a classic problem; for now, I'm more interested in how well the Word 2003 XML itself holds up to the demands of a production XML system.


Yes, I've used WordML 2003 in several production systems. It largely works; what causes problems is the occasional unexpected exception.

For example, hyperlinks can be represented as a w:hlink, or as a set of (w:fldChar begin, w:instrText HYPERLINK, w:fldChar end) items. If you click on a hyperlink in a document and then save it, the XML changes from the first representation to the second!
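A converter has to accept both forms and treat them as the same thing. Here is a minimal Python sketch of that idea, not the code from any of my systems. It assumes the field markers are w:fldChar elements carrying a w:fldCharType attribute, that w:hlink keeps its target in a w:dest attribute, and that fields are not nested; extract_hyperlinks and the other helper names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# WordML 2003 main namespace.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def q(name):
    """Qualify a local name with the WordML namespace for ElementTree."""
    return f"{{{W}}}{name}"


def run_text(elem):
    """Concatenate the text of every w:t below elem."""
    return "".join(t.text or "" for t in elem.iter(q("t")))


def extract_hyperlinks(paragraph):
    """Return (target, text) pairs for both hyperlink representations.

    Assumption: fields are not nested inside one another.
    """
    links = []

    # Form 1: <w:hlink w:dest="..."> wrapping the visible runs.
    for hlink in paragraph.iter(q("hlink")):
        links.append((hlink.get(q("dest")), run_text(hlink)))

    # Form 2: begin marker, instruction text, separate marker, result, end marker.
    state, instr, text = None, [], []
    for node in paragraph.iter():
        if node.tag == q("fldChar"):
            kind = node.get(q("fldCharType"))
            if kind == "begin":
                state, instr, text = "instr", [], []
            elif kind == "separate":
                state = "result"
            elif kind == "end":
                # Join the pieces first: the instruction may span several runs.
                match = re.match(r'\s*HYPERLINK\s+"([^"]*)"', "".join(instr))
                if match:
                    links.append((match.group(1), "".join(text)))
                state = None
        elif node.tag == q("instrText") and state == "instr":
            instr.append(node.text or "")
        elif node.tag == q("t") and state == "result":
            text.append(node.text or "")
    return links
```

Note that the w:hlink links are all returned before the field-based ones, so the result is not in document order.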

Word automatically puts in wx:subsection elements around headings (actually, it's around blocks delimited by styles marked with an outlineLvl). This is great, and really useful for processing. However, if you're using WordML's capability of including XML in your own namespace in the document, then it doesn't put the wx:subsections in at all, and all your code depending on them breaks!
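Because the containers can silently disappear, it is worth checking for them up front rather than letting downstream code fail in odd ways. A rough Python sketch follows, under some assumptions: the container is spelled wx:sub-section in the auxHint namespace (change the constant if your files differ), and the first w:p child of each container is its heading. The names outline and require_subsections are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"

# Assumption: the spelling Word uses for the container element.
SUBSECTION = f"{{{WX}}}sub-section"


def require_subsections(root):
    """Fail early if Word did not emit any section containers."""
    if next(root.iter(SUBSECTION), None) is None:
        raise ValueError("no sub-section containers found; grouping code cannot run")


def outline(node, depth=0):
    """Return (depth, heading text) pairs by walking nested containers."""
    entries = []
    for child in node:
        if child.tag == SUBSECTION:
            # Assumption: the heading is the first paragraph directly inside.
            first_p = child.find(f"{{{W}}}p")
            title = ""
            if first_p is not None:
                title = "".join(t.text or "" for t in first_p.iter(f"{{{W}}}t"))
            entries.append((depth, title))
            entries.extend(outline(child, depth + 1))
        else:
            # Containers may sit below wrappers such as w:body, so keep descending.
            entries.extend(outline(child, depth))
    return entries
```

Calling require_subsections on the parsed root before anything else turns the missing-container case into one clear error.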

And, as Andrew of Griffin Brown discovered, Word Service Pack 2 changes the format slightly. The example he describes is not too bad, I don't think - the real problem is that elements like

[w:instrText]HYPERLINK blahblahblah[/w:instrText]

can (seemingly randomly) change into

[w:instrText]INK blahblahblah[/w:instrText]

which is much harder to process.
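One defensive measure is to check each field instruction against the keywords you expect and send anything else to manual review instead of guessing. A small Python sketch, where the keyword set is only an illustrative subset of Word's field names and classify_instruction is a hypothetical helper:

```python
import re

# Illustrative subset of Word field keywords; extend it for your own documents.
KNOWN_FIELD_KEYWORDS = {"HYPERLINK", "REF", "PAGEREF", "TOC", "PAGE", "SEQ"}


def classify_instruction(instr):
    """Return (keyword, recognised) for one field instruction string."""
    match = re.match(r"\s*([A-Za-z]+)", instr)
    keyword = match.group(1).upper() if match else ""
    return keyword, keyword in KNOWN_FIELD_KEYWORDS
```

With this, an instruction beginning "INK" comes back as unrecognised and can be logged along with its position in the document.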

So, in summary: yes, it's usable. Yes, I've used it in production systems. But the documentation is poor, I've had to discover gotchas like these by experience, and it's not as easy to process as ODF.

The Word 2007 OOXML documentation is a useful reference, actually: the XML format hasn't changed very much between the two, and the docs are much better.

I ended up rewriting most of the conversion tools used for incoming manuscripts around Word 2003 XML. It was a relatively easy decision, not because it was such a great format but because the non-XML-based tools were all horribly broken. I was thrilled at the thought of being able to attack the problem as one of document translation through XSLT2 rather than massaging the input to a (binary) Black Box.

The end result was reasonably robust, although we have the luxury of authors who use a well-designed Word template with remarkable fidelity. Even after a few months of tweaking, it was never at the point where it could be run without supervision. "Fairly high quality"--that's about as good as I'd expect to get out of it.

My experiments with translating Word 2003 XML into DocBook were much easier. However, that path was extremely fragile (often producing invalid DocBook that would have to be hand-validated).

All that said, I wouldn't build _anything_ against Word 2003 XML now that Word 2007 OOXML is around the corner (with the added promise of export plugins for older Word versions).

Thanks Keith! I have a question, and I'm going to guess at the answer: why did you need XSLT2? Was it because Word's algorithm for determining wx:subsection tag placement wasn't good enough and you needed XSLT2's grouping ability?

(If you're in Boston next week, I'd love to talk about this more.)