Picking XML schemas and tools?

Then first think about your content and users.

At last week's XML in Practice 2008 conference, I joined Micah Dubinko, Evan Lenz, and Frank Miller for the panel on working with authoring tools and schemas. (Lisa Bos of Really Strategies did a fine job hosting the panel; she should consider doing one of those interview podcast shows.) The panel's full title mentioned both DITA and DocBook, and while Mark Shellenberger predicted a "cage match," several people later seemed disappointed that there weren't more DITA/DocBook partisan sparks flying. I prefer not to take identity politics to the point of identifying myself with only one technical content schema, and I think that Micah, Evan, and Frank felt the same way. (I'd love to be the moderator of a Norm Walsh/Eliot Kimber discussion on DocBook/DITA issues, though.)

When it was my turn to introduce myself and my background, I wanted to draw a connection from the panel topic to the services of my employer, Innodata Isogen, so I mentioned that we had a lot of experience helping publishers find a good fit between their content, schemas (sometimes DocBook, and sometimes DITA!), tools, and users. This was off the top of my head, but I thought about it more as Evan introduced himself and jotted in my notebook: "content/schemas/tools/users".

People often want to know what the best schema is, or the best editing tool. I got to thinking about how the best way to determine those two parts of the content/schemas/tools/users lineup is to take a good hard look at the other two.

Content analysis is underrated. People discuss the virtues of one schema or another as if the schema by itself will do something for them, but a schema is metadata whose job is to add value to data. If you're wondering how well each of three schemas fits your content, then type or paste some of your content into documents that conform to those schemas and see for yourself. Once, while helping the PRISM standard group think through a content DTD to go with their metadata spec, I typed up Entertainment Weekly interviews with Will Smith and Tommy Lee Jones from the week that "Men in Black II" came out, using DocBook and one or two other DTDs that I can't remember right now. (Entertainment Weekly is a Time Inc. publication; so is Mad Magazine, as I found out during a PRISM meeting in their building.) Doing this makes it much clearer whether the schema has the data and metadata elements and attributes you need and whether its required structures fit your content's structure.
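As a rough illustration of that kind of trial fitting, here is a minimal Python sketch. It marks up a scrap of content the way the content seems to want to be marked up and then lists the element names that a candidate DTD never declares. Everything in it is an assumption made up for the example: the toy DTD, the trial document, and the helper names `declared_elements`, `used_elements`, and `undeclared`. It is not real validation, because it ignores content models and attributes, and the simple regular expression would miss declarations built from parameter entities, which large DTDs tend to use heavily. A validating parser is the right tool once you have narrowed the candidates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy DTD fragment; a real DocBook or DITA DTD is far larger.
TOY_DTD = """
<!ELEMENT article (title, para+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA | emphasis)*>
<!ELEMENT emphasis (#PCDATA)>
"""

# Hypothetical trial document: an interview tagged the way the content suggests.
TRIAL_DOC = """
<article>
  <title>Interview</title>
  <question>What drew you to the sequel?</question>
  <answer>The script.</answer>
  <para>Closing note with <emphasis>stress</emphasis>.</para>
</article>
"""


def declared_elements(dtd_text):
    """Return the set of element names that appear in <!ELEMENT ...> declarations."""
    return set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))


def used_elements(xml_text):
    """Return the set of element names that occur anywhere in the document."""
    root = ET.fromstring(xml_text)
    return {el.tag for el in root.iter()}


def undeclared(xml_text, dtd_text):
    """Return a sorted list of element names the document uses but the DTD never declares."""
    return sorted(used_elements(xml_text) - declared_elements(dtd_text))


if __name__ == "__main__":
    # Names listed here are structures your content wants that this schema lacks.
    print(undeclared(TRIAL_DOC, TOY_DTD))
```

With the toy inputs above, the script reports `question` and `answer` as missing, which is the sort of gap that is easy to overlook when comparing schemas in the abstract.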

To frame any thoughts about users of the authoring tools and schemas, consider the two extremes. At one extreme, especially if you're in aerospace or some other heavy industry, you might have a staff of users who use powerful, higher-priced editing tools because that's the job specialty you need from them. If it's not their job specialty, and you need it to be, you arrange training. At the other extreme, you might be a legal publisher whose authors include your country's leading expert on bankruptcy, and you're happy enough to publish this author's treatise on bankruptcy law that if he or she turns it in on floppy disks with WordPerfect 4.2 files, you'll do what you must in order to convert that content into the XML that you use to drive your publishing system. If you gave this author an $800 XML authoring tool and a week of training, you'd probably annoy this valued author more than anything else.

Most content creators in XML publishing scenarios fall between these two extremes. Many of them are comfortable with Word and can be convinced to use something similarly WYSIWYGgy that imposes the structure you need, but you might not have $800 plus training costs to spend on each of them. Don't lose heart; there are alternatives.

Once you have a better idea of what your content needs and what your users and budget can handle, it's easier to think about the best schema and tools for your system. You can remove the tools question from consideration if you contract with a business partner to create the XML for you; you specify the schema you want (or work with them to determine the best one) and the quality levels you want, and then they do it for you. One of Innodata Isogen's newer services, and one that we've had increasing success with, is content origination and authoring. For people wondering about how to work with another firm to have them take on these tasks, innodata-isogen.com's "Knowledge Center" has a new section titled Outsourcing Content Origination and Authoring: Closing the Publishing Loop, which includes a white paper covering the issues, upcoming webinars where you can hear from people with long experience in this area, and more.


You must admit the panel had the potential for sparks.

It is a testament to the professionalism and skill of the panelists that it didn't devolve.

I particularly agree with your statement "content analysis is underrated". Too often people say "I want to use X schema/DTD" without having looked at their data to see if that makes any sense. And it is very difficult to convince them otherwise, even after doing some of that analysis.

Thanks for fleshing out some of the things you said during the session.