Documents vs. Data, Schemas vs. Schemas

Paper presented at the XML 2004 conference in Washington, D.C. Don't miss the illustration accompanying the presentation of data-head and doc-head.

Bob DuCharme

Consulting Software Engineer
LexisNexis http://www.lexisnexis.com


Charlottesville
Virginia
United States

Abstract

One class of XML use, often called "data-oriented," uses XML to relay database or transaction information. The second class of XML use, usually known as "document-oriented," uses XML to store information destined for publication in one medium or another. Despite the gray area between the two categories and their unfortunate names, the distinction provides a useful context for discussing XML processing models.

In particular, the distinction helps determine which of the two dominant schema languages can contribute more to your applications. The W3C's Schema Language was influenced by the DBMS vendors on its Working Group and the B2B e-commerce concerns in the air at the time, giving it more features to benefit transaction-oriented XML. RELAX NG gives finer-grained control over the flexibility allowed in element content models, making it more useful in the processing of content destined for publication. A review of where these two schema languages have gained traction confirms the greater suitability of RELAX NG for content publishing applications and W3C schemas for transactional applications.


Distinguishing between "document-oriented" XML and "data-oriented" XML has long been a popular way to describe the two basic classes of XML applications. An August 2004 Google search for "xml data-oriented document-oriented" found over 7000 hits. What's the difference between data-oriented XML and document-oriented XML?

Technically, there is no difference. The first sentence of the first section of the XML Recommendation[XML] tells us that "Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them." All XML is data in documents, so the choice of terminology for distinguishing between application classes is bad. The concepts are still important.

While most of the characteristics that distinguish between document- and data-oriented XML are heuristics, Sean McGrath[SM1] proposed a more quantitative approach on the xml-dev mailing list. He defined document-oriented XML as "XML in which corpora conforming to schema X exhibit power law distributions of the element types in X" and data-oriented XML as "XML in which corpora conforming to schema X exhibit uniform distributions of the element types in X." He created some graphs[SM2] to demonstrate the power law case, in which some elements appear far more often than others. For example, in document-oriented XML, a typical document has many more instances of the element that represents a paragraph than it does of an element that represents a fourth-level header or a foreign term.

All of McGrath's graphs demonstrate the document-oriented case, and he offers little data to demonstrate the uniform distribution case. It does make sense intuitively: XML representing a series of transactions is not likely to have a thousand instances of one element and one instance of another.
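
McGrath's measure can be approximated by counting how often each element type occurs. The following is a minimal sketch, not McGrath's own method or data: the two sample documents and the function name element_counts are hypothetical, and two tiny samples can only hint at the difference between a skewed and a flat distribution.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_counts(xml_text):
    """Count how often each element type occurs in one XML document."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


# Hypothetical narrative sample: the paragraph element dominates.
NARRATIVE = """<chapter><title>A title</title>
<para>First paragraph with a <foreign>term</foreign> in it.</para>
<para>Second paragraph.</para>
<para>Third paragraph.</para>
<para>Fourth paragraph.</para>
</chapter>"""

# Hypothetical transactional sample: every field occurs once per record.
TRANSACTION = """<invoices>
<invoice><id>1</id><date>2004-08-01</date><total>10.00</total></invoice>
<invoice><id>2</id><date>2004-08-02</date><total>25.50</total></invoice>
</invoices>"""

for label, doc in (("narrative", NARRATIVE), ("transaction", TRANSACTION)):
    # most_common() lists element types from most to least frequent.
    print(label, element_counts(doc).most_common())
```

Run over a real corpus rather than one document, the same counts could be sorted and plotted to see whether a few element types account for most of the markup.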

Another somewhat deterministic approach to categorizing an XML document type in one of these application classes is to look for the presence of text nodes that have element nodes as siblings. (Popular usage often refers to this as mixed content, which is correct, but according to the XML Recommendation any element with character data is considered to have mixed content, even those with no child elements.) One example of an element that can store a mix of text nodes and child elements is the XHTML p element. It can contain em, b, tt, and other inline elements within the sentences making up the paragraph stored in a given p element. The structure of such an element is notoriously difficult to map to the normalized tables of a relational database, and if you have paragraphs of sentences, you probably have content being published in a format readable by human eyes, putting XHTML squarely in the document-oriented category.
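
The test described above, text nodes with element siblings, is easy to automate. This is a minimal sketch using Python's standard ElementTree module; the function names and sample documents are assumptions made for illustration, and whitespace-only text between elements is deliberately ignored.

```python
import xml.etree.ElementTree as ET


def has_text_with_element_siblings(el):
    """True if el has child elements and non-whitespace text alongside them."""
    children = list(el)
    if not children:
        return False
    # ElementTree stores text before the first child in el.text and
    # text after each child in that child's tail.
    pieces = [el.text] + [child.tail for child in children]
    return any(piece and piece.strip() for piece in pieces)


def mixed_elements(xml_text):
    """Return the sorted element types that mix text with child elements."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter()
                   if has_text_with_element_siblings(el)})


# Hypothetical samples: one narrative, one transactional.
NARRATIVE = "<body><p>This is <em>very</em> important.</p><p>Plain text.</p></body>"
TRANSACTION = "<invoice><id>1</id><total>10.00</total></invoice>"

print(mixed_elements(NARRATIVE))
print(mixed_elements(TRANSACTION))
```

A document type in which this list is routinely non-empty is a strong candidate for the document-oriented category.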

One response to McGrath's posting about element usage distribution came from Bob Glushko, no stranger to either document- or data-oriented XML development.[BG] He replied, "I don't buy into this data-centric vs doc-centric view of the world. It is obviously a continuum (called the "Document Type Spectrum" in the Document Engineering book I'm writing with Tim McGrath [just about done, MIT Press early 2005]). On one end are pure narrative things and on the other end are purely transactional ones: Moby Dick to invoices. In the middle are hybrid types like catalogs and reference books that have lots of structured content mixed in with narrative content."

Glushko's use of the terms "narrative" and "transactional" to describe the two classes of XML data is an improvement over "document-oriented" and "data-oriented." Rick Jelliffe used another good pair of terms in a March 2004 weblog posting, in which he predicted that XML Schemas 1.1 would be "slightly more DBMS-oriented and even less publishing-oriented." I myself have used the terms "transaction-oriented" and "content-oriented" to contrast the two classes of documents, and will use these terms here.

As mentioned above, a key factor in XML becoming more popular than the inventors of "SGML on the web" ever dreamed was its suitability for transmitting e-commerce transaction data. When the first working draft of the W3C Schema specification[XSD1] came out in May of 1999, and for the two following years that led to the specification becoming a Recommendation, the dot com boom was in full flower. People all over the IT industry were thinking big ideas about creating new e-commerce systems and making big money. Many saw the new features of W3C Schema playing a role in the data architecture of these systems; the Recommendation lists 43 representatives of 32 organizations as Working Group members and 11 other organizations that had sent representatives as the specification moved toward Recommendation status.

Step one of creating any schema language--and many were created before the W3C's, and several since--was designing an XML-based representation of the information that XML 1.0 DTDs could store. While the DDML (Data Definition Markup Language)[DDML] developed on the xml-dev mailing list stopped there, all the others moved on to step two, the wish list: the additional constraints that their schema language could express that DTDs couldn't.

When the W3C Schema language reached Recommendation status, the Working Group included two representatives each from Oracle, IBM, and Microsoft. These are large companies with diverse interests, but they're also responsible for the vast majority of relational databases used today. Other companies on the committee, such as Commerce One, Progress Software, and webMethods, were staking the majority of their business models on the e-commerce dream. These companies didn't send multiple representatives to the Schema working group because they saw a lucrative future in document publishing, but because they saw a big future in building the infrastructure that would make the "new economy" transactions happen more quickly and easily (or, to use the buzzword of the time, "seamlessly.")

What kind of features did the W3C Schema Working Group add to the XML version of DTDs to make such systems easier to build?

  • Data typing. Most serious schema languages provide a way to assign data types to both elements and attributes. Choices typically include string, integer, boolean, date, and other traditional data types beyond the strange types XML inherited from SGML such as CDATA and NMTOKENS. Data typing benefits many kinds of XML applications, not just those that exchange data with databases, but the number 2 requirement listed for the XML Schema Part 2 Recommendation ("define a type system that is adequate for import/export from database systems (e.g., relational, object, OLAP)") made it clear that database exchange was a high priority. The ability to round-trip data between XML and relational databases with no loss of information would become a cornerstone of much e-commerce development.

  • nil values. The boolean attribute xsi:nil was added to provide an equivalent to the null values of relational databases.

  • More sophisticated identity constraints. An XML 1.0 attribute declared to be of type ID must have a value unique among all ID attribute values in its document, and an attribute of type IDREF must have a value that is present in one of the attributes of type ID. The W3C Schema language expanded on this minor ability to provide referential integrity by letting you specify that particular attribute, element, and even composite values must be unique. In addition to declaring such key values, you can declare that elements or attributes must be populated with values present in one of the key values.
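
To make the database flavor of the nil and identity-constraint features concrete, here is a small sketch of what they mean to a receiving program. It is hand-written Python, not a schema validator; the orders document, its element and attribute names, and the helper functions are all hypothetical. Only the xsi namespace URI and the xsi:nil attribute come from the W3C Schema Recommendation.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical transaction document: customer ids act as keys and each
# order's customer attribute acts as a reference to one of those keys.
ORDERS = f"""<orders xmlns:xsi="{XSI}">
  <customer id="c1"/>
  <customer id="c2"/>
  <order customer="c1"><shipDate>2004-08-01</shipDate></order>
  <order customer="c2"><shipDate xsi:nil="true"/></order>
</orders>"""


def field_value(el):
    """Map an element marked xsi:nil="true" to None, like a database NULL."""
    if el.get(f"{{{XSI}}}nil") in ("true", "1"):
        return None
    return el.text


def check_keys(root):
    """Report duplicate customer keys and order references with no target."""
    problems = []
    ids = [c.get("id") for c in root.iter("customer")]
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        problems.append(f"duplicate key: {dup}")
    for order in root.iter("order"):
        if order.get("customer") not in ids:
            problems.append(f"dangling reference: {order.get('customer')}")
    return problems


root = ET.fromstring(ORDERS)
ship_dates = [field_value(o.find("shipDate")) for o in root.iter("order")]
print(ship_dates)
print(check_keys(root))
```

In a real system a W3C Schema processor would enforce the uniqueness and reference rules declaratively; the point of the sketch is only that both features line up with NULL columns and foreign keys.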

Two more significant changes from XML 1.0 DTDs to W3C Schemas were the separation of type definitions (which often specify content model structure) from element declarations and the ability to derive new data types from existing data types. These were not done specifically to increase compatibility of XML data structures with databases, but with popular object-oriented development languages, particularly Java.

Many hoped that the potential new compatibility with object-oriented development systems and tools such as UML would make the development of all classes of XML-based systems easier. In practice, this advantage has been limited to developing XML structures for use with systems where UML and other object-oriented methodologies were already in place. These were far more likely to be systems using XML to store transactional information than publishing systems.

What did W3C Schemas offer to content-oriented XML systems? Of the features listed above, everyone was happy with data typing. Publishing workflows have metadata that can make use of integer, date-time, and boolean data types. One can picture use cases in which nils and type derivation might be useful in XML systems whose main purpose is the publication of content, but from a practical standpoint, they haven't come up much. Identity constraints have always helped electronic publishing because they can help ensure that links will point to one location, not zero or two, but they only help a little--the limitation of ID uniqueness to the scope of a single document, and the fact that a "document" can be a fairly artificial, system-dependent construct (for example, in the XML for a set of books, a single file's document element might be a book, chapter, or section element), have limited their usefulness. W3C Schema's greater sophistication at specifying and referencing unique keys has done nothing to alleviate this limitation.

Substitution groups, which let you identify elements as part of a group that can be substituted for another element, could be useful in content-oriented publishing. If a content model includes a para element and we want the p elements in one document and the par elements in another to be treated like para elements, substitution groups can add flexibility to a system aggregating multiple feeds of input for publication.

The W3C Schema language's all group element lets us specify unordered content models while still requiring the existence of the model's subelements. For example, we can declare that a shirt element must consist of a single style element, a single color element, and a single size element, in any order. This returns to XML a feature taken out of SGML in the simplification process that created XML, but only to a degree: an element using this in a content model must use it at the top level. (In SGML or RELAX NG Compact terms, this would be like prohibiting the use of the & connector inside of parenthesized expressions within your content model.) For example, there is no way in a W3C Schema to specify that a shirt element must begin with a stockNumber element and then be followed by a style element, a color element, and a size element in any order.
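
The two shirt content models above can be stated precisely as hand-written checks. This is an illustrative Python sketch, not a schema: the function names are hypothetical, and the RELAX NG compact pattern in the comment is a rough rendering of the second model given only for comparison.

```python
import xml.etree.ElementTree as ET

REQUIRED = {"style", "color", "size"}


def valid_all_group(shirt):
    """style, color, and size exactly once each, in any order.

    This is the top-level unordered model that an all group can express.
    """
    tags = [child.tag for child in shirt]
    return len(tags) == len(REQUIRED) and set(tags) == REQUIRED


def valid_stock_then_unordered(shirt):
    """stockNumber first, then style, color, and size in any order.

    Roughly, in RELAX NG compact syntax:
      element shirt {
        element stockNumber { text },
        (element style { text } & element color { text } & element size { text })
      }
    """
    tags = [child.tag for child in shirt]
    return (tags[:1] == ["stockNumber"]
            and len(tags) == 4
            and set(tags[1:]) == REQUIRED)


plain = ET.fromstring(
    "<shirt><size>L</size><style>polo</style><color>red</color></shirt>")
stocked = ET.fromstring(
    "<shirt><stockNumber>42</stockNumber><color>red</color>"
    "<size>L</size><style>polo</style></shirt>")

print(valid_all_group(plain), valid_stock_then_unordered(stocked))
```

The second check nests an unordered group inside a sequence, which is exactly the combination the text says the all group cannot express.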

An any element in a W3C Schema lets you specify that any element may appear at a certain point in a content model. You can restrict this to any element from a particular namespace, any element that isn't assigned to a namespace, or any element from outside of the namespace being defined. The processContents attribute for this element lets you specify whether the element that appears in that position must be validated, shouldn't be, or should be if possible. The W3C Schema anyAttribute element lets you create a similar slot for a wildcard attribute in your element's declaration.

Substitution groups, unordered content models, and the any and anyAttribute elements can be useful for content-oriented XML because they allow flexibility in specifying the relationship between known XML and unknown XML. This is not a big issue with transaction-oriented XML, in which a system receiving a "document" usually has a rigid understanding of what's being sent to it, but publishing applications often aggregate data from multiple sources, so flexibility in describing the relationship of elements in a content model is valuable.

Many of RELAX NG's differences from W3C Schemas have no particular focus on specific classes of XML applications. RELAX NG's formal mathematical basis, its alternative compact syntax, and its more readable spec can appeal to anyone. Its lack of a specific data typing system, offering instead hooks in which to insert a typing system of the developer's choice, can appeal to developers of highly specialized applications. (In practice, most RELAX NG implementations support the W3C typing system, which seems to serve the needs of most publishing applications.)

In RELAX NG, the schema itself is a pattern that a document must match, and patterns with a wide variety of structures can be specified as components of the main schema pattern. RELAX NG's equivalent of the W3C Schema language's any and anyAttribute elements provides a good example of the structured flexibility that this use of patterns provides; you can name classes of elements or attributes that may be included at a certain point and then, by using pattern names with RELAX NG's except element, specify customized classes of elements or attributes to exclude from the allowable elements and attributes.

RELAX NG lets you specify unordered lists of elements without the restrictions on their use that the W3C Schema language imposes. An unordered list can be part of a pattern, and patterns can be built from other patterns, and there's no need to track whether an unordered list is at the top level of an element's definition or not.

Another significant difference between RELAX NG and other schema languages is its treatment of an element's attribute list as part of its content model. This lets you specify dependencies between element and attribute presence, and even between element and attribute values. For example, you can specify that a link element must have either a URL attribute or a URL child element, so that a validator considers the presence of both or of neither to be an error. (With XML 1.0 DTDs and W3C Schemas, this kind of flexibility means specifying both the URL attribute and the URL child element as optional, so that a validating parser has no way of knowing that the presence of both or of neither is an error.)

Specific values can be named in RELAX NG content models. For example, a schema can specify that if an element's media attribute has a value of "web", then the URL child element is required, but if media equals "book", then an ISBN child element is required.
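
The two dependencies just described, "attribute or child element, but not both" and "attribute value decides which child is required," can be spelled out as explicit checks. The sketch below is hand-written Python rather than RELAX NG validation. The element name ref, the function names, and the decision to reject unknown media values are assumptions made for illustration, and the compact-syntax patterns in the comments are rough equivalents only.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a media attribute value to the required child.
REQUIRED_CHILD = {"web": "URL", "book": "ISBN"}


def link_problem(link):
    """Require exactly one of: a URL attribute or a URL child element.

    Roughly: element link { attribute URL { text } | element URL { text } }
    """
    has_attr = link.get("URL") is not None
    has_child = link.find("URL") is not None
    if has_attr and has_child:
        return "both URL attribute and URL child element present"
    if not (has_attr or has_child):
        return "neither URL attribute nor URL child element present"
    return None


def media_problem(el):
    """Require the child element that matches the media attribute's value.

    Roughly: element ref {
      (attribute media { "web" }, element URL { text })
      | (attribute media { "book" }, element ISBN { text }) }
    """
    media = el.get("media")
    required = REQUIRED_CHILD.get(media)
    if required is None:
        return f"unexpected media value: {media!r}"
    if el.find(required) is None:
        return f"media={media!r} requires child element {required}"
    return None


print(link_problem(ET.fromstring('<link URL="http://example.com/"/>')))
print(media_problem(ET.fromstring('<ref media="book"><URL>x</URL></ref>')))
```

With a DTD or W3C Schema, checks like these would have to live in application code such as this; RELAX NG lets the schema itself carry them.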

RELAX NG also allows much finer-grained control over the role of text nodes (or, in XML 1.0 terms, PCDATA) in content models. If your name elements might be simple text or might have element content consisting of a firstName element followed by a lastName element, both XML 1.0 DTDs and W3C Schemas force you to specify a content model that allows text to be mixed with your firstName and lastName elements. RELAX NG lets you specify that the name element must be either straight text or a single firstName element followed by a single lastName element.
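
The name example amounts to a choice between two content models with no mixing allowed. A minimal Python sketch of that rule follows; the function name and sample documents are hypothetical, and the compact-syntax pattern in the comment is a rough equivalent shown for comparison.

```python
import xml.etree.ElementTree as ET


def valid_name(name):
    """Accept either straight text or firstName followed by lastName.

    Roughly, in RELAX NG compact syntax:
      element name { text | (element firstName { text }, element lastName { text }) }
    """
    children = list(name)
    if not children:
        return True  # straight text, with no child elements
    tags = [child.tag for child in children]
    # Any non-whitespace text next to the child elements makes this the
    # mixed model that the rule is meant to exclude.
    stray = [name.text] + [child.tail for child in children]
    has_stray_text = any(s and s.strip() for s in stray)
    return tags == ["firstName", "lastName"] and not has_stray_text


SAMPLES = [
    "<name>Bob DuCharme</name>",
    "<name><firstName>Bob</firstName><lastName>DuCharme</lastName></name>",
    "<name>Mr. <firstName>Bob</firstName><lastName>DuCharme</lastName></name>",
]
results = [valid_name(ET.fromstring(s)) for s in SAMPLES]
print(results)
```

The third sample is the case a (#PCDATA|firstName|lastName)* model would have to accept and the RELAX NG pattern rejects.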

XML 1.0's over-reliance on content models of the form (#PCDATA|x|y|z)* was not helped much by the W3C Schema language, which--apart from its limited offering for the specification of unordered content models--often forces you to specify the same model using an XML-based syntax. RELAX NG's ability to specify more complex relationships between text nodes and potential siblings of those text nodes is particularly handy for specifying narrative text with optional inline elements, as with elements representing paragraphs. This kind of narrative text is often what distinguishes content-oriented XML from transaction-oriented XML; the latter is more likely to list the values of fields from one or more records in a relational database than to be a paragraph of sentences with inline elements between the text nodes.

The very existence of two schema languages as alternatives to DTDs has slowed the adoption of both in the content publishing world. Many assume that one will win out over the other, and they postpone a move beyond DTDs rather than pick a schema language that may not win.

Many developers in the transaction-oriented development world are not even aware of RELAX NG's existence. When developer tools such as IBM's WebSphere or tools from Oracle or Microsoft mention XML schemas, they mean W3C schemas. Important applications that specify structures using W3C Schemas include SOAP, XForms, UBL (Universal Business Language), ebXML, and UDDI (Universal Description, Discovery, and Integration). In many cases, the developers who use these markup languages never see the schemas, because their tools automate any necessary interaction with them.

Where has RELAX NG gained ground? The OpenOffice alternative to Microsoft Office, a desktop application that uses zipped XML as its native format, uses RELAX NG to specify the structure of the documents, spreadsheets, and slide presentations that it creates. New work on the DocBook DTD uses RELAX NG; the TEI (Text Encoding Initiative) has converted from using DTDs to using RELAX NG to specify the content models for the historical literary and linguistic texts that the TEI has encoded since long before XML was invented. The W3C's XHTML 2.0 specification is using RELAX NG to develop content models, with plans to make DTD and W3C Schema versions available once the content models are more settled. [XHTML]

These are all document types for narrative content. The content varies from the software documentation that has made DocBook popular to the ancient epic poems that can be found in TEI work, but it predominantly consists of paragraphs of text with specialized inline elements in the middle of sentences and block structures grouping the paragraphs with headers, subheads, lists, and other block structures commonly found in such works.

The XHTML case is particularly interesting because the W3C Working Group developing it chose not to use the W3C's own schema language. W3C Recommendations are supposed to serve as building blocks for other W3C Recommendations, but the W3C HTML working group apparently felt that RELAX NG would make their job easier than W3C Schema. The availability of James Clark's trang utility to convert RELAX NG schemas to both DTDs and W3C Schemas doubtless made the decision easier because those who want W3C schemas or DTDs of the XHTML schemas can ultimately get them. No such converter exists for W3C Schema, so that schemas developed in this language stay in this language.

RELAX NG isn't good only for content-oriented XML. James Clark has included schemas for XSLT 1.0 and RDF/XML with the distribution for his nXML Emacs mode for schema-driven editing of XML documents, and neither of these formats would be considered publishable narrative content. (RDF often accompanies narrative content, but RDF/XML's uneasy relationship with elements that mix text nodes with element siblings--one of the key indicators that an XML document is content-oriented and not transaction-oriented--is one of the stumbling blocks to RDF/XML's adoption.) These two schemas are excellent demonstrations of RELAX NG's power, and useful for much more than just driving Emacs in nXML mode.

If W3C Schemas are your best choice, the choice may have already been made for you by the tools or specifications that you're using. Many well-known XML development tools assume the existence of W3C Schemas to specify the structures and data types of your document's components. Then again, this doesn't necessarily require you to use W3C Schemas; like the W3C HTML Working Group, you can develop and store schemas using RELAX NG and then use trang to create W3C Schema versions for the steps in your workflow that require it. On the other hand, if you're developing a SOAP-based web service, then much of the schema work has already been done for you in W3C Schemas (SOAP itself, and WSDL and UDDI if you need them) so adding RELAX NG to the mix may not buy you much.

If your application really is transaction-oriented, the key benefits that RELAX NG offers to content-oriented document modeling won't help you much. If your application requires two systems to exchange small amounts of information as one of those systems engages in a particular task, the kinds of content model flexibility described above can add unnecessary complications to an application. A system that provides a service in response to requests that bundle parameter values has a right to expect those parameters in a fairly rigid format. If those parameters began or will end up in a relational database, their structure won't be too complex anyway. When Rick Jelliffe predicted that XML Schemas 1.1 would be "slightly more DBMS-oriented and even less publishing-oriented," he saw that the W3C Schema language already had more to offer to transaction-oriented XML application development than it did to content-oriented development.

If your document types are content-oriented, which is more likely if you are in any branch of the publishing industry or if you're responsible for the documentation of large, complex systems, note the pattern followed by other large projects for publishing narrative content: they chose RELAX NG. While many of the more complex content models of narrative content can be specified with W3C Schemas, complexity that strays from the kind of data that fits in a relational database can lead to problems. Even the leading W3C Schema parsers don't behave consistently when handling complex edge cases, and the confusing and abstract language of the W3C XML Schema Structures Recommendation is often little help in determining which parser acted properly. RELAX NG makes it easier to specify the kinds of structures that can appear in narrative content, and when in doubt about RELAX NG application behavior, its specification [RNG] is relatively easy to comprehend.

If your application falls somewhere between the obviously content-oriented and the obviously transaction-oriented, then you need to review and prioritize the degree to which the issues above apply to you: do your tools impose any requirements? What kind of dependencies can you afford to build on those tools? What standards play a role in your project? How complex are your structures, and how much flexibility do you want to allow in their creation? What are your business partners' expectations? Do you understand all of the necessary processes enough that you'll never need clarification of expected behavior from a language specification?

How badly do you need any schema language? The choice isn't only between W3C Schemas and RELAX NG; if you have a DTD-based system running perfectly well, then compare the benefits that schemas will bring with all the costs of making the transition. Remember, many large, complex publishing applications still run just fine using DTDs. The pressure to appear up-to-date should not be a factor in your decision; don't try to fix something that isn't broken unless you've got solid requirements that you've analyzed from every angle.

And finally, remember that you need not commit all of the XML processing in your organization to one schema language or another. If internal or external web services are delivering documentation or news stories to other systems, the delivered content can be specified using RELAX NG, and the delivery envelope can be specified using a W3C Schema. Both schema languages' ability to specify slots for "any" elements or attributes means that they can even play together.