Converting SGML DTDs to XML

Not quite to XML DTDs, but close enough to be useful.

I recently had to analyze a large batch of SGML DTDs for a client who planned to convert their publishing system to XML. I was mostly looking for redundant declarations across multiple DTDs that could be pulled into shared modules. I also wanted lists of elements and attributes to compare against statistics compiled from sample data, so that I could see which elements and attributes were actually being used. There's not much point in converting SGML declarations into XML declarations for elements that never appear in the data.
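To make that comparison concrete, here is a minimal Python sketch of the idea rather than what I actually ran. It assumes the declared element names have already been pulled out of the DTDs into a list, and that the sample documents are available as well-formed XML (the real sample data was SGML, so that is a simplification). The element names and the `unused_elements` function are invented for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def unused_elements(declared, sample_docs):
    """Return (declared names never seen in the samples, usage counts)."""
    used = Counter()
    for doc in sample_docs:
        # Count every element occurrence in each sample document.
        for node in ET.fromstring(doc).iter():
            used[node.tag] += 1
    return sorted(set(declared) - set(used)), used


# Hypothetical element names standing in for a list extracted from the DTDs.
declared = ["article", "title", "para", "footnote", "sidebar"]

# Hypothetical sample documents, assumed here to be well-formed XML.
samples = [
    "<article><title>One</title><para>Text</para></article>",
    "<article><title>Two</title><para>A<footnote>n</footnote></para></article>",
]

unused, counts = unused_elements(declared, samples)
print("Declared but never used:", unused)
print("Usage counts:", dict(counts))
```

With these made-up samples, `sidebar` is the only declared element that never shows up, which makes it a candidate to leave out of the converted DTD.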

When I want to analyze a collection of information that doesn't fit neatly into one or more tables, I want it in XML so that I can write little XSLT stylesheets to churn through it and count and compare things. I found a surprisingly easy way to make an XML version of all of this SGML DTD information. I didn't quite turn it into an XML DTD or schema (there was enough refactoring planned for the DTD conversion that we didn't bother), but a few more steps and a minimal amount of manual work would have made that pretty straightforward.

The key was Earl Hood's perlSGML DTD analysis tools. I wrote about these in my 1998 book SGML CD. (The book was originally going to be called "SGML for Free", because it documented all the best free SGML tools, but Prentice Hall decided that including a CD of the software itself would make the book more appealing, and they changed the title to make it clearer that a CD came with it.) Earl's tools are a collection of Perl scripts that read an SGML DTD and give you various ways to explore it.

One script, called dtd2html, creates a directory full of HTML reports about various aspects of the DTD: which subelements and attributes each element has, which parent elements it can have, what type each attribute is, and which attributes are required. My original idea was to run these HTML files through TagSoup, which turns HTML into well-formed XML, so that I could use XSLT stylesheets to pull out the information that I wanted. It turned out to be harder than I had hoped for the stylesheets to find what I needed in the dtd2html output. This was easy enough to fix: I added a few lines to the dtd2html Perl script to wrap div elements around the parts that I was interested in. These div elements had class attributes whose values served as hooks for the XSLT stylesheet, so once I ran the modified dtd2html and TagSoup again, the stylesheet was pretty simple to write.
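Here is a rough sketch of how the class-attribute hooks make extraction simple. I used XSLT for this; the sketch uses Python's standard library instead, just to show the idea in a self-contained way. The page content and the class names `content-model` and `attributes` are invented for illustration, not the ones in my modified script, and the sketch assumes the cleaned-up output puts its elements in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the cleaned-up report uses the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical report page after cleanup; the div class names are invented hooks.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1>para</h1>
<div class="content-model"><a href="emph.html">emph</a> <a href="footnote.html">footnote</a></div>
<div class="attributes"><span class="attr-name">id</span> <span class="attr-name">lang</span></div>
</body></html>"""


def hooked_text(root, css_class):
    """Collect the non-blank text chunks inside divs with the given class."""
    values = []
    for div in root.iter(XHTML + "div"):
        if div.get("class") == css_class:
            values.extend(t.strip() for t in div.itertext() if t.strip())
    return values


root = ET.fromstring(PAGE)
children = hooked_text(root, "content-model")
attributes = hooked_text(root, "attributes")
print("Subelements:", children)
print("Attributes:", attributes)
```

Because the hook is just a class value on a wrapper div, the extraction code doesn't need to know anything about the rest of the report's layout, which is why so few lines were needed on either side.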

I'm not going to post my revised version of dtd2html because I wrote it for client work, and getting the right permissions would be more work than it took to add the few lines that write out the div tags. If you need something like this, though, you can make your own customized additions to dtd2html with very little trouble.

I'll be discussing this and related techniques in my XML 2008 talk, "Automating Content Analysis with Trang and Simple XSLT Scripts", on December 9th.
