Maintaining Schemas for Pipelined Stages

Bob DuCharme
June 6, 2002
Comments and criticisms encouraged: bob@snee.com

The Problem

The idea of sending a stream of XML documents through a pipeline of small stages, instead of through one monolithic process that does all the necessary processing, has been getting more popular lately, as shown by projects such as XPipe and DSDL. In a production environment, where altering one process's behavior could spell trouble for the processes that read its output as their own input, some sort of contract or defined interface between stages makes the relationship between those processes easier to manage. In a project that I'm working on, schemas will provide those contracts. (In fact, the ease of doing this with schemas compared with DTDs is one of the reasons to switch to schemas.)

But how could I maintain a set of related schemas used for different stages in the processing of the same document set? For example:

A document received from outside of the system must pass through three processes, which we'll call floob, zatz, and glikk. At each stage, the document conforms to a slightly different schema. These schemas are important, because they serve as a contract between the implementers of each process; the zatz developers use the post-floob.xsd schema as part of their requirements that specify input, and the glikk designers do the same with post-zatz.xsd. (Before you read too much into the file extensions, note that the problem and solution are the same for both W3C Schemas and RELAX NG. I'll use W3C Schemas in describing my problem and solution and then say a word about how I tested it with RELAX NG.)

[Figure: document pipeline]

These are not different schemas. They're variations on the same schema. The problem is how to track the variations. The first two options that come to mind are these:

In a complex enough environment, the first option is unacceptable. If the floob process adds a checkIn attribute value to a document and the zatz process needs to use that value, then the checkIn attribute must be a mandatory attribute in the post-floob.xsd schema, but it can't be in the public.xsd schema.

Anyone who came to XML from an electronic publishing background knows that the second option is also unacceptable, because it's too prone to error. As with the documents themselves, the best way to create multiple related ones reliably and repeatably is to create a master one and generate the others from it.

Below I've outlined and demoed an approach to doing just that: creating a schema that stores information about which components go into which schemas, as well as a short stylesheet that generates the schemas.

First, a note on terminology: I decided to call the schemas "stage schemas," because each one is the schema for documents at a different stage of processing. I avoided the term "version" because two different versions of a schema sound like two different releases not intended to be used together.

The Master Schema

Below is the beginning of the master schema (all files mentioned are available in this zip file) from which the XSLT stylesheet generates schemas for the individual stages. Non-standard parts of the schema from the http://www.snee.com/ns/schemas namespace are bolded. There are two kinds of additions: an sn:stages element inside an xs:appinfo element to list the names assigned to the various stages, and an sn:stages attribute added to some schema components to identify which stages that component should be in. Any schema components without this attribute, like the element declaration for the title element, are assumed to be meant for all the stages.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sn="http://www.snee.com/ns/schemas">

  <xs:element name="article">

    <xs:annotation>
      <xs:appinfo>
        <sn:stages>
          <sn:stage name="public"/>
          <sn:stage name="post-floob"/>
          <sn:stage name="post-zatz"/>
          <sn:stage name="final"/>
        </sn:stages>
      </xs:appinfo>
    </xs:annotation>

    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title" maxOccurs="1"/>
        <xs:element ref="par"   maxOccurs="unbounded"/>
      </xs:sequence>

      <xs:attribute name="dateline" type="xs:string" use="required"
                    sn:stages="post-zatz final"/>

      <xs:attributeGroup ref="stamps"
                         sn:stages="post-floob post-zatz final"/>

    </xs:complexType>
  </xs:element>


  <xs:element name="title" type="xs:string"/>

  <!-- schema continued; see complete one in zip file -->

The elements and attribute don't have to be from the http://www.snee.com/ns/schemas namespace. As long as they're not from the http://www.w3.org/2001/XMLSchema namespace, schema-processing software is supposed to ignore them. Besides, this schema isn't meant for use with documents anyway; its purpose is to provide a base from which to generate the various production schemas.
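To make the two kinds of additions concrete, here is a minimal Python sketch (standard library only) that reads the stage declarations and the per-component stage lists back out of the fragment above. It is only an illustration, not part of the article's tooling: the function names declared_stages and tagged_components are hypothetical, and the fragment is closed with </xs:schema> so that it parses, where the listing above continues.

```python
import xml.etree.ElementTree as ET

SN = "http://www.snee.com/ns/schemas"  # URI as declared in the listing above

# The master schema fragment, closed off so that it is well-formed.
MASTER = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sn="http://www.snee.com/ns/schemas">
  <xs:element name="article">
    <xs:annotation>
      <xs:appinfo>
        <sn:stages>
          <sn:stage name="public"/>
          <sn:stage name="post-floob"/>
          <sn:stage name="post-zatz"/>
          <sn:stage name="final"/>
        </sn:stages>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title" maxOccurs="1"/>
        <xs:element ref="par" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="dateline" type="xs:string" use="required"
                    sn:stages="post-zatz final"/>
      <xs:attributeGroup ref="stamps"
                         sn:stages="post-floob post-zatz final"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
</xs:schema>
"""


def declared_stages(root):
    """Return the names listed under the sn:stages element, in document order."""
    return [stage.get("name") for stage in root.iter(f"{{{SN}}}stage")]


def tagged_components(root):
    """Map each component that carries an sn:stages attribute to its stage list."""
    tagged = {}
    for elem in root.iter():
        stages = elem.get(f"{{{SN}}}stages")
        if stages is not None:
            label = elem.get("name") or elem.get("ref")
            tagged[label] = stages.split()
    return tagged


root = ET.fromstring(MASTER)
print(declared_stages(root))
print(tagged_components(root))
```

Components that do not show up in the second result, such as the title declaration, carry no sn:stages attribute and so belong to every stage.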

Extracting Stage Schemas

The getStage.xsl stylesheet takes a parameter naming the stage that you want and creates a schema for that stage from the source schema. It uses the Xalan Java 2 tokenize() extension function, so to be run as written it requires that particular XSLT processor. (Saxon includes an equivalent extension function, and the need for a standardized one is being considered for XSLT 2.0.)
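The filtering rule itself is small enough to sketch outside of XSLT. The Python below is a rough equivalent under stated assumptions, not a port of getStage.xsl: it drops every component whose sn:stages list omits the requested stage and keeps everything else. The names extract_stage and component_labels are hypothetical, the str.split() call stands in for the tokenize() extension function, and this sketch leaves the sn: markup in the output, where schema processors ignore it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
SN = "http://www.snee.com/ns/schemas"  # URI from the master schema listing
STAGES_ATTR = f"{{{SN}}}stages"

MASTER = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sn="http://www.snee.com/ns/schemas">
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title" maxOccurs="1"/>
        <xs:element ref="par" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="dateline" type="xs:string" use="required"
                    sn:stages="post-zatz final"/>
      <xs:attributeGroup ref="stamps"
                         sn:stages="post-floob post-zatz final"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
</xs:schema>
"""


def extract_stage(master_xml, stage_name):
    """Parse the master schema and prune components not meant for stage_name."""
    root = ET.fromstring(master_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            stages = child.get(STAGES_ATTR)
            # A component with no sn:stages attribute belongs to every stage.
            if stages is not None and stage_name not in stages.split():
                parent.remove(child)
    return root


def component_labels(root):
    """List the name or ref of each remaining xs: component, in document order."""
    return [
        elem.get("name") or elem.get("ref")
        for elem in root.iter()
        if elem.tag.startswith(f"{{{XS}}}") and (elem.get("name") or elem.get("ref"))
    ]


for stage in ("public", "post-floob", "post-zatz"):
    print(stage, component_labels(extract_stage(MASTER, stage)))
```

Run against this fragment, the public stage loses both the dateline attribute and the stamps attribute group, while post-floob loses only dateline.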

For example, if the master schema is stored in master.xsd and you want to pull a post-floob one and store it in post-floob.xsd, you would enter the following command line to apply the stylesheet to master.xsd using Xalan Java 2:

java org.apache.xalan.xslt.Process -in master.xsd -xsl getStage.xsl -param stageName post-floob > post-floob.xsd

I found it easier to create a batch file called getStage.bat that could be run with a simpler command line, like this:

getStage post-floob xsd

(The xsd part shows that it's being run with W3C Schemas.) In addition to extracting a schema for the designated stage, getStage.xsl performs a bit of error checking:

Both of these constraints could have been enforced by declaring the sn:stage element's name attribute to be of type ID and the sn:stages attribute, which holds a space-separated list of stage names, to be of type IDREFS, but validating the schema against a DTD would have been an extra implementation step. The stylesheet had to be written and run anyway, so in the spirit of Schematron I let the XSLT stylesheet do the constraint checking.

For further details on the implementation of the stylesheet, see the comments in the getStage.xsl file.
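As a rough illustration of what ID-style and IDREF-style checking amounts to here, the Python sketch below tests two things: that no stage name is declared twice, and that every name used in an sn:stages attribute matches a declared stage. These two checks are inferred from the ID/IDREF analogy rather than copied from getStage.xsl, so treat them as an assumption about the kind of checking involved. The check_stage_markup function and the deliberately broken schema are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SN = "http://www.snee.com/ns/schemas"  # URI from the master schema listing

# A deliberately broken master schema: one duplicate stage declaration and
# one reference to a stage ("final") that is never declared.
BROKEN_MASTER = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sn="http://www.snee.com/ns/schemas">
  <xs:annotation>
    <xs:appinfo>
      <sn:stages>
        <sn:stage name="public"/>
        <sn:stage name="post-floob"/>
        <sn:stage name="post-floob"/>
      </sn:stages>
    </xs:appinfo>
  </xs:annotation>
  <xs:attribute name="dateline" type="xs:string"
                sn:stages="post-floob final"/>
</xs:schema>
"""


def check_stage_markup(root):
    """Return a list of error messages; an empty list means the markup is consistent."""
    errors = []
    names = [stage.get("name") for stage in root.iter(f"{{{SN}}}stage")]

    # ID-like constraint: each stage name is declared only once.
    for name, count in Counter(names).items():
        if count > 1:
            errors.append(f"stage '{name}' is declared {count} times")

    # IDREFS-like constraint: every referenced stage name is declared.
    declared = set(names)
    for elem in root.iter():
        for ref in (elem.get(f"{{{SN}}}stages") or "").split():
            if ref not in declared:
                label = elem.get("name") or elem.get("ref")
                errors.append(f"'{label}' refers to undeclared stage '{ref}'")
    return errors


errors = check_stage_markup(ET.fromstring(BROKEN_MASTER))
for message in errors:
    print(message)
```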

What about RELAX NG?

I first worked this out with W3C Schemas. When I decided to try it with RELAX NG schemas, I didn't have to change a byte of the getStage.xsl stylesheet; it worked just fine as it was. All I had to do was change the getStage.bat driver file to allow for the possibility of reading from and outputting to files with an extension of rng.

To test it with RELAX NG schemas, I used Sun's free rngconv utility to convert master.xsd to master.rng. Then, I added the declarations for the xsi:noNamespaceSchemaLocation attribute and namespace so that I could use the same XML documents (public.xml, post-floob.xml, and so forth) as a test, and I added the snee stage elements and attribute described above to the RELAX NG schema. After making my modifications to the getStage.bat driver file, I used it to create RELAX NG schemas from master.rng for the four stages. Sun's multi-schema validator showed that the same test documents were as valid against the extracted RELAX NG schemas as Xerces Java found those documents to be against the W3C stage schemas extracted from master.xsd.
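One plausible reason the same logic carries over unchanged is that the stage filter only ever needs to look at the sn:stages attribute, never at the vocabulary of the schema language around it. The Python sketch below illustrates that point under assumptions: the RELAX NG fragment is hand-written for this example (it is not rngconv output), and surviving_names is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SN = "http://www.snee.com/ns/schemas"  # URI from the master schema listing
STAGES_ATTR = f"{{{SN}}}stages"

# Hand-written RELAX NG fragment carrying the same stage markup.
RNG_MASTER = """\
<element name="article"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:sn="http://www.snee.com/ns/schemas">
  <element name="title"><text/></element>
  <attribute name="dateline" sn:stages="post-zatz final"/>
</element>
"""


def surviving_names(schema_xml, stage_name):
    """Prune patterns not meant for stage_name and list the names that remain."""
    root = ET.fromstring(schema_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            stages = child.get(STAGES_ATTR)
            # Only the sn:stages attribute is consulted; no RELAX NG or
            # W3C Schema element names appear anywhere in this function.
            if stages is not None and stage_name not in stages.split():
                parent.remove(child)
    return [elem.get("name") for elem in root.iter() if elem.get("name")]


print(surviving_names(RNG_MASTER, "public"))
print(surviving_names(RNG_MASTER, "final"))
```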

Conclusion

The example that I made up to test this was quite simple; download schemaStages.zip for the stylesheet, the master schemas, the batch file, the eight extracted schemas, and four sample document files that conform to each W3C/RNG pair of schemas. The test.bat file does all the extractions and schema validations of the sample documents against the various extracted schemas.

I would love to hear any suggestions for things to add to the sample master schema to stress test the whole concept a little harder. In fact, the reason that I put this on my own web site instead of publishing it in a more accredited forum is that I want to give other people a chance to poke holes in it before I make too many claims for its value. So please, e-mail me to tell me about holes!