XML: too flexible?

Some biologists really like relational databases.

This week I gave a talk to some biology researchers at the University of Virginia. The basic thesis was that large databases typically need to fit into the neat rows and columns of relational tables, but that new XQuery/XML databases let you store and retrieve huge amounts of data with potentially much more complex structure. This has obvious applications in the publishing world (the world that begat XML), and it could have useful applications in other domains as well. I never got past ninth-grade biology, but I'd read a little on bioinformatics recently, and these people are accumulating and combing through a lot of complex data.

Going into the talk, I assumed that the listeners either were or weren't familiar with XML, and for those who weren't I'd explain the basics and we'd go from there. I didn't count on people who considered themselves familiar with it but had some misconceptions based on their own use of it. XML is popular in the sciences as an interchange format, and one professor in particular had a difficult time believing that applications could be built with rigorously structured XML.

He said several times (based on one of my slides showing sample XML) that if someone can put a title element anywhere they want, then his application won't know where to look for it. I kept going back to my slide showing the declaration for the chapter element, which said that one and only one title element had to go at the very beginning of each chapter. I'm probably misrepresenting some of what he said, because we went around in circles a few times without completely understanding each other. One of his key points, though, was that he likes how the normalization process forces someone (in his case, his graduate students) to really think through the relationships between the pieces of information they're storing. This sounded to me like claiming that a well-designed relational database was better than a badly designed XML database.
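To make the "one and only one title, at the very beginning" rule concrete, here is a minimal sketch in Python. The slide's actual declaration isn't reproduced in this post, so assume something along the lines of a DTD rule like `<!ELEMENT chapter (title, para+)>`. Python's standard library doesn't validate against DTDs, so this hand-codes just the title rule; the function name and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def chapter_title_errors(xml_text):
    """Return a list of problems; an empty list means every chapter passes."""
    root = ET.fromstring(xml_text)
    errors = []
    for i, chapter in enumerate(root.iter("chapter"), start=1):
        children = list(chapter)  # direct child elements, in document order
        titles = [c for c in children if c.tag == "title"]
        if len(titles) != 1:
            errors.append(
                f"chapter {i}: expected exactly 1 title, found {len(titles)}"
            )
        elif children[0].tag != "title":
            errors.append(f"chapter {i}: title is not the first child element")
    return errors


# Hypothetical sample documents, not the ones from the talk.
GOOD = "<book><chapter><title>Cells</title><para>Text.</para></chapter></book>"
BAD = "<book><chapter><para>Text.</para><title>Cells</title></chapter></book>"

if __name__ == "__main__":
    print(chapter_title_errors(GOOD))
    print(chapter_title_errors(BAD))
```

With a rule like this enforced before data gets in, an application always knows where to look for a chapter's title: it's the first child element. A real system would more likely hand this job to a DTD, RELAX NG, or W3C Schema validator than to hand-written checks.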

Thinking back on it, I realize that while XML is common outside of the publishing world, highly structured XML is not as common, except in cases where it's an interchange format that maps directly to some pre-existing relational tables or Java classes, as is often the case with more transactional uses of XML. The biologists had heard of DTDs and schemas, but hadn't bothered with them much, because looking at a handful of XML documents for a given data class showed them the structure they needed to know. Validation technologies such as Schematron and RELAX NG were understandably way off their radar.

I did have a slide saying that an advantage of XML databases over object-oriented databases (the other technology that tried to take large databases beyond rows and columns) was that prototyping was a lot easier. You can just throw together some XML and start querying it, while the analysis and design of object-oriented systems can mean a lot of up-front work before you can actually do anything with your data. Note how many big fat books there are just on OO analysis and just on OO design. While discussing this slide, I mentioned that for a serious production XML application you should create checkpoints for design review and analysis and so forth before you build too many application dependencies on your thrown-together data. It looks like I need more slides to make it clearer that while XML can be as flexible as you want, the developer can have a lot of control over the degree of that flexibility. Large, carefully controlled systems have been built that never would have worked with relational databases, in the print and online publishing worlds at least.
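As a rough illustration of the "throw together some XML and start querying it" style of prototyping, here is a small sketch. It uses Python's standard `xml.etree.ElementTree` module rather than XQuery or an actual XML database, and the element names, sample IDs, gene names, and lengths are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Thrown-together sample data; every name and number here is invented.
SAMPLE = """
<samples>
  <sample id="s1">
    <organism>organism-one</organism>
    <gene name="geneA"><length>3000</length></gene>
    <gene name="geneB"><length>1200</length></gene>
  </sample>
  <sample id="s2">
    <organism>organism-two</organism>
    <gene name="geneC"><length>2600</length></gene>
  </sample>
</samples>
"""


def genes_longer_than(xml_text, min_length):
    """Return (sample id, gene name) pairs for genes longer than min_length."""
    root = ET.fromstring(xml_text)
    hits = []
    for sample in root.findall("sample"):
        for gene in sample.findall("gene"):
            if int(gene.findtext("length")) > min_length:
                hits.append((sample.get("id"), gene.get("name")))
    return hits


if __name__ == "__main__":
    print(genes_longer_than(SAMPLE, 2000))
```

No schema or table design was needed before the first query ran. That's the convenience, and it's also why a production system would want a design review before other code starts depending on this ad hoc structure.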