After I wrote recently about the awful markup used to identify index entries when you save a Word 2003 file as XML, Jon Udell wrote to me to relay MS Office Program Manager Brian Jones' query about whether I felt similarly about other markup in the XML version of a Word document. I haven't had the time to do a comprehensive review of the XML, and I've written before about a pleasant surprise I found in it (and I was annoyed at the fuss over Microsoft paying Rick Jelliffe to add some perspective to the ODF/OOXML Wikipedia entries—it's Rick Jelliffe, for chrissake) but a bit more investigation let me generalize from my earlier negative comments, and after writing it out to Jon I thought I'd expand on it a bit and post it.
The project I'm working on doesn't need the hyperlinks or table of contents markers in the Word XML, but from what I've seen of them, the XML representation of most of the Insert Field features seems to be the same XML-ized version of the RTF: <w:fldChar w:fldCharType="begin"/>, then a w:instrText element with some cryptic string such as ' TOC \o "1-2" \n \p " " \h \z ' for a table of contents marker, 'HYPERLINK \l "_Toc135558539"' for a hyperlink, ' XE "' for an index entry, and <w:fldChar w:fldCharType="end"/> to finish it.
To test this theory, I created a sample document with about a dozen things added with different Insert Field selections and exported the result as an XML document. The XML version of most of the field constructs begins and ends with w:r elements containing w:fldChar elements with w:fldCharType attribute values of "begin" and "end". Some store their information in a w:r child of a w:fldSimple element instead. The w:fldSimple element's w:instr attribute seems to be the equivalent of the w:instrText cousins of the w:fldChar "begin" and "end" elements, with cryptic strings of uppercase keywords, punctuation, and quotation marks like the TOC one shown above to say something about their purpose. (To be fair, the "Hyperlink" field had an actual w:hlink element to represent it.)
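To make the two forms concrete, here is a minimal Python sketch that pulls the instruction string out of both a w:instrText element and a w:fldSimple element's w:instr attribute. The sample fragment is hand-written for illustration, not copied from my exported document; the AUTHOR field and the field_instructions function name are my own assumptions, and the namespace URI is the Word 2003 one as I understand it.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespace (assumed; check it against your own exported file).
W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hand-written illustrative fragment showing both field forms.
SAMPLE = r"""<w:body xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:p>
    <w:r><w:fldChar w:fldCharType="begin"/></w:r>
    <w:r><w:instrText> XE "markup" </w:instrText></w:r>
    <w:r><w:fldChar w:fldCharType="end"/></w:r>
    <w:r><w:t>Some indexed text.</w:t></w:r>
  </w:p>
  <w:p>
    <w:fldSimple w:instr=" AUTHOR  \* MERGEFORMAT ">
      <w:r><w:t>A. Writer</w:t></w:r>
    </w:fldSimple>
  </w:p>
</w:body>"""


def field_instructions(root):
    """Return (form, keyword, instruction) for every field instruction found."""
    found = []
    for el in root.iter():
        if el.tag == f"{{{W}}}instrText":
            # Instruction stored as element content (PCDATA).
            form, instr = "fldChar", (el.text or "").strip()
        elif el.tag == f"{{{W}}}fldSimple":
            # Instruction stored in an attribute.
            form, instr = "fldSimple", el.get(f"{{{W}}}instr", "").strip()
        else:
            continue
        # The first whitespace-delimited token is the field keyword.
        keyword = instr.split(None, 1)[0] if instr else ""
        found.append((form, keyword, instr))
    return found


root = ET.fromstring(SAMPLE)
for form, keyword, _ in field_instructions(root):
    print(f"{form}: {keyword}")
# fldChar: XE
# fldSimple: AUTHOR
```

Either way, everything after the keyword is still an opaque string of switches that the consumer has to parse separately.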
Indicating where the constructs begin and end with two separate, generic empty elements that have a fldCharType attribute value of "begin" and "end" is much more difficult to work with than a matched pair of start- and end-tags. XML isn't simply the representation of data with tags enclosed in angle brackets in such a way that Xerces doesn't complain about it; much of the point of XML is to clearly indicate where things (and sub-things) begin and end using a matching pair of start- and end-tags. I suppose that an XML representation of a Word file must address the possibility of overlap—what if the document has bold text, then bold italic, then just italic?—but if the OpenOffice coders can parse the original Word file and turn it into good markup, we know it can be done.
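Here is a rough Python sketch of the extra bookkeeping this forces on anyone processing the file. Because the "begin" and "end" markers are sibling runs rather than a containing element, you have to walk the runs in order and keep state yourself. The paragraph is a hand-written sample, group_runs is a hypothetical helper name, and the depth counter assumes that fields may nest, which I haven't verified against my test document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"


def q(name):
    """Qualify a local name with the (assumed) Word 2003 namespace."""
    return f"{{{W}}}{name}"


# Hand-written illustrative paragraph: text, an index-entry field, more text.
PARA = ET.fromstring(r"""<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:r><w:t>See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> XE "markup" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t>the index.</w:t></w:r>
</w:p>""")


def group_runs(paragraph):
    """Split a paragraph's runs into ('text', runs) and ('field', runs) groups."""
    groups, current, depth = [], [], 0
    for run in paragraph.findall(q("r")):
        marker = run.find(q("fldChar"))
        kind = marker.get(q("fldCharType")) if marker is not None else None
        if kind == "begin":
            if depth == 0 and current:
                groups.append(("text", current))  # close the pending text group
                current = []
            depth += 1
            current.append(run)
        elif kind == "end" and depth > 0:
            current.append(run)
            depth -= 1
            if depth == 0:
                groups.append(("field", current))  # outermost field is complete
                current = []
        else:
            current.append(run)  # ordinary run, or part of an open field
    if current:
        # An unterminated field is still reported as a field group.
        groups.append(("field" if depth else "text", current))
    return groups


print([(kind, len(runs)) for kind, runs in group_runs(PARA)])
# [('text', 1), ('field', 3), ('text', 1)]
```

With a real container element around each field, all of this would collapse into a single child lookup.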
A new annoyance revealed by my further research is the fact that those w:instrText elements store their cryptic strings of information such as ' TOC \o "1-2" \n \p " " \h \z ' as PCDATA. Using XSLT, it's usually easy to check whether an element has no text content (regardless of the number of descendant elements it has) by checking whether normalize-space(.) = "". When processing XML versions of Word documents there are often empty paragraphs and maybe even empty sections that you want to throw out, but these w:instrText elements prevent this test from working, because the field codes count as text. I know that storing content in PCDATA and metadata in attributes is only a convention, but it's a convention of document-oriented XML going back to SGML days, and an XML version of a Word file is certainly document-oriented XML. (More on this in the comments to my earlier entry on the topic.)
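The same problem shows up outside XSLT. The Python sketch below, using a hand-written sample paragraph and hypothetical helper names, compares a naive emptiness test (roughly what normalize-space on the element's string value gives you) with one that has to know about w:instrText and skip it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
INSTR = f"{{{W}}}instrText"

# Hand-written sample: a paragraph holding only an index-entry field.
PARA = ET.fromstring(r"""<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> XE "markup" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>""")


def naive_is_empty(el):
    """True if all descendant text is whitespace; field codes count as text."""
    return "".join(el.itertext()).strip() == ""


def visible_text(el):
    """Concatenate descendant text, skipping anything inside w:instrText."""
    if el.tag == INSTR:
        return ""
    parts = [el.text or ""]
    for child in el:
        parts.append(visible_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def is_empty(el):
    """True if the element has no text apart from field instructions."""
    return visible_text(el).strip() == ""


print(naive_is_empty(PARA))  # False
print(is_empty(PARA))        # True
```

In XSLT 1.0 the analogous workaround would be a test along the lines of not(.//text()[not(parent::w:instrText)][normalize-space()]), which works but means every emptiness check has to carry knowledge of this one element. Whether a paragraph holding nothing but a field should really be discarded depends on whether you need that field.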
The kinds of things that a Word user picks "Insert Field" to add are often very important to what makes a Word or XML document richer than plain ASCII text with no markup, and it's a shame that whoever designed the MS XML to represent these didn't do a little more modeling of the data necessary to represent each field type and instead just mapped the RTF (or whatever internal structures that I'm sure the RTF reflects) to pointy brackets and strings full of internal codes. I'm sure it made their design work go more quickly, but the result is something that offers few good arguments for advocacy as a standard.
Jones' blog has been talking up an open source API for processing the Office XML, and while it's good that such a tool exists and is open source, it doesn't address the issues I describe above. The "don't worry about the data complexity, we have a tool that takes care of it" argument often presented in such cases leads to a software dependency, and the reason we use open data standards is to avoid dependency on specific tools. (A dirty little secret of the SGML world was that while we all preached the gospel of an open ISO data standard as a way to avoid dependency on specific software tools, most serious production work relied on Omnimark, a company that at the time was run by a man who would rather tell developers what they needed than listen to what they needed. One former employer of mine converted their SGML system to use XML purely to eliminate their dependency on Omnimark.) When a data format depends on a specific tool to be usable, that weakens the case for making the format a standard.
The things that a Word doc file or an XML version of that doc file must represent can be complex, and I'm sure that further investigation of the XML, if I had the time, would reveal further pleasant surprises and further annoyances. So far the score, on balance, is pretty low.