I've written before about using OpenOffice to convert Microsoft Office files to OpenOffice files (and hence XML) with a shell prompt command that starts up OpenOffice with the MS Office file, does a Save As, and then quits OpenOffice. Because it can be done from the command line, this makes conversion of multiple files with a batch file or shell script much easier.
I recently had to do the same thing with Word to convert Word files to MS XML, and it turned out to be similar: you write a macro that does the SaveAs and then quits, and you start up Word from the command line naming the file to convert and the macro to do the conversion.
The macro I wrote yesterday could use some refinement, but it works:
Sub SaveAsXML()
    NewFilename = (Replace(ActiveDocument.FullName, ".doc", ".xml"))
    ActiveDocument.SaveAs FileName:=NewFilename, FileFormat:=wdFormatXML
    Application.Quit
End Sub
(It seems like I have to write a bit of VB code about every three years, so with any luck that's it until 2010. I was sorry to hear that in my nephew's first year at the University of Kansas, the "Intro to Programming" course uses VB. As I said to my sister, "But you're not living in a Seattle suburb anymore!")
If you want this to save as something other than XML, see the other options for the FileFormat parameter (the WdSaveFormat constants, such as wdFormatRTF or wdFormatText).
My word2xml.bat batch file to tell Word to start up with a given file and run the macro looks like this:
"C:\Program Files\Microsoft Office\OFFICE11\winword" %1 /mSaveAsXML
There are other command line options for winword.exe besides /m, but none looked very interesting to me.
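For a whole folder of files, the same command line can be driven from a short script. Here is a minimal Python sketch of that idea, not something I've run against every Word version. The install path is copied from the batch file above; the helper names build_command and convert_folder are made up for illustration, and the sketch assumes the SaveAsXML macro lives somewhere Word loads at startup (such as the Normal template) and that Word really exits after each document, so each call returns.

```python
import subprocess
from pathlib import Path

# Assumptions: install path taken from the batch file; macro name from the post.
WINWORD = r"C:\Program Files\Microsoft Office\OFFICE11\winword.exe"
MACRO = "SaveAsXML"


def build_command(doc_path, winword=WINWORD, macro=MACRO):
    """Return the argument list: winword <file> /m<macro>."""
    return [winword, str(doc_path), "/m" + macro]


def convert_folder(folder):
    """Start Word once per .doc file in folder, one after another."""
    for doc in sorted(Path(folder).glob("*.doc")):
        # Passing a list lets subprocess handle quoting of paths with spaces.
        # run() waits for Word to quit before moving on to the next file.
        subprocess.run(build_command(doc.resolve()), check=True)


if __name__ == "__main__":
    import sys
    convert_folder(sys.argv[1])
```

Saved as, say, word2xml_all.py (a hypothetical name), it would be run as python word2xml_all.py C:\some\folder. It still pays the full Word startup cost for every file.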
As with my command line trick for converting MS Office files to OpenOffice files, this technique can get filed with quick and dirty Perl scripts. If you have a batch of files that need a one-time conversion some afternoon, it's great. It's not really fast, though, so if you're building a production system that needs to perform this conversion every day, there are other options that are more complex to set up but run more quickly, because they don't start up and shut down the word processor for every document.
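One way to avoid restarting the word processor for every document is to start Word once through COM automation and reuse that instance for each file. The Python sketch below is one possible shape for this, not a tested production setup. It assumes the third-party pywin32 package is installed; the function names are hypothetical, and 11 is the numeric value of the wdFormatXML constant used in the macro.

```python
import os

WD_FORMAT_XML = 11  # numeric value of Word's wdFormatXML constant


def xml_path_for(doc_path):
    """Swap only the final extension for .xml, leaving other dots alone."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".xml"


def convert_all(word, doc_paths):
    """Convert each document using one already-running Word instance."""
    converted = []
    for doc_path in doc_paths:
        doc = word.Documents.Open(doc_path)
        try:
            target = xml_path_for(doc_path)
            doc.SaveAs(target, FileFormat=WD_FORMAT_XML)
            converted.append(target)
        finally:
            doc.Close(False)  # False: close without saving further changes
    return converted


def main(folder):
    # Assumption: pywin32 is installed. Imported here so the helpers above
    # can be used without it.
    import win32com.client
    word = win32com.client.Dispatch("Word.Application")
    try:
        paths = [os.path.abspath(os.path.join(folder, name))
                 for name in sorted(os.listdir(folder))
                 if name.lower().endswith(".doc")]
        return convert_all(word, paths)
    finally:
        word.Quit()  # shut Word down once, after the whole batch


if __name__ == "__main__":
    import sys
    main(sys.argv[1])
```

Unlike the macro's Replace call, xml_path_for changes only the trailing extension, so a ".doc" elsewhere in the path is left alone.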
As far as what to do with the Word XML files once I have them, well, don't get me started...