A brief, opinionated history of XML

From someone who had a front row seat.

There are a few histories of XML out there, but I still find myself explaining certain points to people surprisingly often, so I thought I'd write them down. If you don't want to read this whole thing, I'll put the moral of the story right at the top:

XML was designed as a simplified subset of SGML to make electronic publishing in multiple media easier. People found it useful for other things. When some people working on those other things found that XML wasn't perfect for their needs, they complained and complained about how badly designed XML was. They didn't understand that it wasn't designed to meet their needs. It was designed to make electronic publishing in multiple media easier.

Automated typesetting and page layout...

In the 1970s, computerized typesetting made automated page layout much easier, but three guys at IBM named Goldfarb, Mosher, and Lorie got tired of the proprietary nature of the typesetting codes used in these systems, so they came up with a nonproprietary, generic way to store content for automated publishing that would make it easier to convert this content for publication on multiple systems. This became the ISO standard SGML, and the standardized nonproprietary part made it popular among U.S. defense contractors, legal publishers, and other organizations that did large-scale automated publishing.

When I first got involved, SGML was gaining popularity among publishers creating CD-ROMs and bound books from the same content, because they could create and edit an SGML version and then run scripts to publish that content in the various media. The structure of an SGML document type (for example, the available text elements and element relationships in a set of legal court cases, or the elements and element relationships that you could use in a set of aircraft repair manuals) was specified in something called a DTD, which had its own syntax and was part of the SGML standard. The scripts to convert SGML documents were usually written using a language and engine called Omnimark, which was a proprietary product, but a Perl-based alternative was also available.

When Tim Berners-Lee was wondering how exactly to specify that one of his new hypertext documents had a title here, a subtitle there, and a link in the middle of a paragraph that led to another document, SGML was a logical choice—it was a text-based, flexible, non-proprietary, standardized way to specify document structure with various tools available to help you work with those documents. That's why HTML tags are delimited with angle brackets: because SGML elements were (nearly always) delimited with angle brackets. Dan Connolly sketched out the first HTML DTD in 1992.

SGML's designers couldn't see into the future, so they deliberately made it very flexible. For example, you could use other delimiters for element tags besides angle brackets, but everyone used angle brackets. SGML parsing programs were still required to account for the possibility that a document used other delimiters, and the possibility that many other options had been reset, so these parsers were large and complex, and few were available to choose from. By the mid-90s, enough best practices had developed that Sun Microsystems' Jon Bosak had the idea for a simplified, slimmer version of SGML that assumed a lot of default settings and could be parsed by a smaller program—maybe even a program written in Sun's new Java language—and that could be transmitted over the web when necessary. The documents themselves would be easier to share over the web than typical SGML documents, following the example of HTML documents.

Around this time SGML was considered a niche technology in the electronic publishing industry, and I worked at several jobs where I wrote and modified DTDs and Omnimark scripts to create and maintain document conversion systems. I also went to the relevant SGML conferences, where I got to know several of the people who eventually joined Jon to create the simplified version of SGML. (Many are still friends.) At first this group called their new spec WebSGML, but eventually they named it XML.

You could still process XML with Omnimark and other SGML tools. Many people would fail to appreciate the value of this design decision: as a valid subset of SGML, XML documents could be processed with existing SGML technology. This meant that on that day in 1998 when XML became an official W3C standard, we already had plenty of software out there, including programs like Adobe's special SGML edition of FrameMaker, that could process XML documents right away. This gave the new standard a running start, and XML may not have gotten anywhere without this running start, because those of us using the existing tools didn't have to wait around for new tools for the new standard and then work out how to incorporate these tools into our publishing workflows. We already had tools and workflows that could take advantage of the new standard.

I've heard some people describe certain things that SGML specialists didn't like about XML, but these people don't understand that XML was invented by and for SGML specialists, and it made SGML people's lives much easier. For one thing, we weren't so dependent on Omnimark anymore; at least one of my former employers switched from SGML to XML just so they could ditch Omnimark. XML's companion standard XSLT let us convert XML to a variety of formats using robust, free, standardized software, and as the web became a bigger publishing medium we found ourselves writing XSLT stylesheets to convert the same XML documents to print, CD-ROM, and HTML. Electronic publishing had never been so easy.

...and beyond...

Then along came the dot com boom. People got excited about how "seamless e-commerce" would change everything. People would save money as obsolete middlemen were removed from old-fashioned transactions, and people would make lots of money by taking part in this streamlining (selling pick axes during a gold rush) or by automating the buying and selling of products.

Orders would be transmitted over this fabulous free network known as The Internet instead of over the expensive, proprietary EDI networks. But when my computer sent an order to yours, how exactly would this order be represented? XML provided a good syntax: it was plain text, easy to transmit and parse, and could group labeled pieces of information in fairly arbitrary structures while remaining an open, straightforward standard. (When I say "straightforward", I'm talking about the original spec here, not the collection of related specs that most people are referring to when they complain about the complexity of XML. More on this below.) This let people send any combination of information back and forth, regardless of the potential lack of compatibility between the back end systems that the different parties were using.

So, as an important technology of the dot com boom, XML became trendy, and it was a heady feeling to suddenly be an expert in a trendy technology. I'll never forget hearing it mentioned in a Microsoft ad on a prime time network TV show; sure, it was spoken by the character of a geek who normal people weren't supposed to understand, but still, this subset of a niche technology that my friends helped to invent was mentioned on prime time network TV. Three different XML conference series were running, and they were much better attended than the single one that's left now. The best part was that there was enough money behind some of those conferences to fly most speakers in and put them up in hotels, which got me my first trips to London and Silicon Valley.

XML wasn't really a perfect fit for ecommerce systems, though. The elements vs. attributes distinction, which publishing systems used to distinguish between content to publish and metadata about that content, didn't have a clear role when describing transactions that weren't content for publishing. XML had some odd data types (NMTOKEN? CDATA?) that only applied to attribute values, instead of traditional data types like integer, string, and boolean that could be applied to content as well as attributes.
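To make that mismatch concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The order document and its element and attribute names are hypothetical, invented for this illustration. It shows that a plain XML parser hands back both attribute values and element content as strings, so the application has to apply integer and decimal types itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document, invented for illustration only.
ORDER = """
<order id="A-1001" currency="USD">
  <item sku="X42">
    <quantity>3</quantity>
    <unitPrice>19.99</unitPrice>
  </item>
</order>
"""

order = ET.fromstring(ORDER)
item = order.find("item")

# Attribute values and element content both come back as plain strings;
# nothing in the document itself says which should be a number.
order_id = order.get("id")
quantity_text = item.findtext("quantity")
price_text = item.findtext("unitPrice")

# The application applies the "traditional" data types on its own.
quantity = int(quantity_text)
unit_price = float(price_text)
total = round(quantity * unit_price, 2)
```

Whether sku belongs in an attribute and quantity in an element is exactly the kind of choice that had an obvious answer for published content and no obvious answer for an order.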

And then there was that strange DTD syntax: if XML was so good at describing structure, why wasn't XML used to describe the structure of a set of documents? The answer is above, but it didn't get publicized very well, so many people complained about DTD syntax. Everyone agreed that an XML-based schema syntax that provided for traditional data types would be a Good Thing, so various groups came up with proposals and the W3C convened a Working Group to review these proposals and come up with a single standard.
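As a rough illustration of the complaint, the sketch below puts a few DTD declarations next to a simplified XML Schema fragment for the same hypothetical elements (the names are invented here, and the schema fragment leaves out the item structure to stay short). The DTD declarations are not themselves an XML document, so an XML parser rejects them, while the schema fragment parses like any other XML and can be queried for its declared data types. Note that ElementTree only parses; it does not validate against either kind of schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarations, invented for illustration only.
DTD_DECLARATIONS = """
<!ELEMENT item (quantity, unitPrice)>
<!ATTLIST item sku NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT unitPrice (#PCDATA)>
"""

XSD_DECLARATIONS = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="quantity" type="xs:positiveInteger"/>
  <xs:element name="unitPrice" type="xs:decimal"/>
</xs:schema>
"""

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}


def parses_as_xml(text):
    """Return True if the text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# DTD syntax is its own language; the schema is ordinary XML.
dtd_is_xml = parses_as_xml(DTD_DECLARATIONS)
xsd_is_xml = parses_as_xml(XSD_DECLARATIONS)

# Because the schema is XML, ordinary XML tools can read the declared types.
schema = ET.fromstring(XSD_DECLARATIONS)
declared_types = {
    el.get("name"): el.get("type") for el in schema.findall("xs:element", XS)
}
```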

But, in the words of Cyndi Lauper, money changes everything. XML itself was assembled by eleven specialists in a niche technology, SGML, who wanted to make standardized electronic publishing simpler, and they managed to stay under most radar systems and come out with something simple and lean. However, when the XML Schema Working Group convened, many big and small companies were smelling lots of money and wanted to influence the results. Of the 31 companies that sent representatives to this Working Group (31!), many had little or nothing to do with publishing, electronic or otherwise. There were database vendors such as Microsoft, Informix, Software AG, IBM and Oracle (to be fair, large software companies have always been up there with legal publishers and defense contractors as believers in automated publishing technology; note where SGML got its start). There were successful or aspiring B2B ecommerce vendors such as CommerceOne, Progress Software, and webMethods. Microsoft, Xerox, CommerceOne, IBM, Oracle, Progress Software, and Sun were each interested enough to send two representatives to the committee, so there were a lot of cooks working on this broth.

The result was a three-part specification: Part 0 was a primer, Part 1 specified how to define document structures, and Part 2 described basic data types and how to extend them. Part 2 is pretty good, and also provides the basis for RDF data typing. Part 1, in my opinion, ended up being an ugly, complicated mess in its attempt to serve so many powerful masters.

Two members of the original eleven-member XML team, James Clark and Makoto Murata, developed RELAX NG, an alternative to Part 1 that was both simpler and more powerful. Clark had written the only open source SGML parser and the first XSLT processor, and came up with the name "XML," among his many other achievements; he's also written some great software to implement RELAX NG and convert between schema formats. RELAX NG never became as popular as XML Schema, because it didn't have the big industry names behind it, and because it was optimized around the original XML use case: describing content for publication.

Despite a complex syntax, incompatibilities among parsers, an often inscrutable spec, and less expressive power than RELAX NG, the W3C XML Schema specification has become popular because it's a W3C standard that addresses the original main problems of XML for ecommerce: it specifies document structures using XML, it lets you use traditional datatypes, and it has the added bonus for many developers of making it easier to round-trip XML elements to Java data structures. (After railing against the influence of this last part for years, I learned that it was primarily the work of Matthew Fuchs, an old friend I've known since he was finishing up his Ph.D. in computer science at NYU's Courant Institute when I was doing my master's there in the mid-nineties. He was the only other person there who even knew what SGML was.) So, XML Schema continues to be used by many large organizations to store data that doesn't fit neatly into relational tables. In fact, TopQuadrant has been adding more and more features to the TopBraid platform to make it easier to incorporate such data into a system that uses semantic web standards.

...and back.

Getting back to the topic of leaner, simpler alternatives for representing information of potentially arbitrary structure, the JavaScript-based JSON format started getting popular around 2006. The third paragraph of its Wikipedia page flatly states that "it is used primarily to transmit data between a server and web application, serving as an alternative to XML."

A Google search for "json replace xml" gets over 5,000 hits. (That's with the quotes around the search terms, to make Google search for the exact phrase. Without the quotes, it gets almost five million hits.) I like JSON, and see how it can replace many of the uses of XML that have been around since the dot com boom days, but anyone who thinks it can completely replace XML doesn't understand what XML was designed for. Documents with inline markup (or, in XML geekspeak, "mixed content"—for example, the way the HTML a element can be in the middle of a sentence within a p element) would theoretically work fine in JSON, but in practice, it would be too easy to screw it up when editing it with a text editor by accidentally adding or removing a single curly brace. Tools to hide the syntax behind a more intuitive interface may address the issue, but dependence on such tools was something that the original XML designers wanted to avoid. And frankly, when I picture a complex prose document stored in JSON, I hear the ghost of Microsoft's RTF dragging chains through the attic.
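Here is a small sketch of what mixed content looks like on each side, using Python's standard xml.etree.ElementTree and json modules. The paragraph, the example.com URL, and especially the JSON encoding are assumptions made for this illustration; JSON has no standard way to represent inline markup, so this is just one possible convention. The last few lines drop a single closing curly brace from the JSON version to show the kind of editing slip described above.

```python
import json
import xml.etree.ElementTree as ET

# A paragraph with an inline link: text, then an element, then more text.
XML_PARAGRAPH = (
    '<p>See the <a href="http://example.com/spec">XML spec</a> for details.</p>'
)

p = ET.fromstring(XML_PARAGRAPH)
a = p.find("a")

# ElementTree exposes the text before, inside, and after the inline element.
pieces = (p.text, a.text, a.tail)
plain_text = "".join(p.itertext())

# One possible JSON encoding (an assumption, not a standard): a list that
# alternates between strings and objects standing in for inline elements.
json_paragraph = {
    "p": [
        "See the ",
        {"a": {"href": "http://example.com/spec", "text": "XML spec"}},
        " for details.",
    ]
}
as_json = json.dumps(json_paragraph)

# Simulate a hand-editing slip: remove the first closing curly brace.
broken = as_json.replace("}", "", 1)
try:
    json.loads(broken)
    broken_parses = True
except json.JSONDecodeError:
    broken_parses = False
```

To be fair, a missing angle bracket breaks an XML document too; the point of the sketch is only how much nested punctuation one short sentence with one link needs in the JSON form.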

Between JSON's growing role as an inter-computer data format and RELAX NG's foothold in schemas like DocBook and companies like LexisNexis, I see the XML infrastructure getting back to its original use cases, which makes good sense to me. Each year at the XML Summer School in Oxford, it's been very interesting to see the new things people are doing with XML, especially as XQuery-based XML databases like MarkLogic and eXist grow in power. I've been chairing the semantic web track at the summer school for the past few years and hardly been involved in XML at all, but it's always great to hear what my old friends are up to. Especially when there's great beer available.

bobdc.blog

Bob DuCharme's weblog, mostly on technology for representing and linking information.