25 January 2012

A brief, opinionated history of XML

From someone who had a front row seat.

There are a few histories of XML out there, but I still find myself explaining certain points to people surprisingly often, so I thought I'd write them down. If you don't want to read this whole thing, I'll put the moral of the story right at the top:


XML was designed as a simplified subset of SGML to make electronic publishing in multiple media easier. People found it useful for other things. When some people working on those other things found that XML wasn't perfect for their needs, they complained and complained about how badly designed XML was. They didn't understand that it wasn't designed to meet their needs. It was designed to make electronic publishing in multiple media easier.

Automated typesetting and page layout...

In the 1970s, computerized typesetting made automated page layout much easier, but three guys at IBM named Goldfarb, Mosher, and Lorie got tired of the proprietary nature of the typesetting codes used in these systems, so they came up with a nonproprietary, generic way to store content for automated publishing that would make it easier to convert this content for publication on multiple systems. This became the ISO standard SGML, and the standardized nonproprietary part made it popular among U.S. defense contractors, legal publishers, and other organizations that did large-scale automated publishing.

When I first got involved, SGML was gaining popularity among publishers creating CD-ROMs and bound books from the same content, because they could create and edit an SGML version and then run scripts to publish that content in the various media. The structure of an SGML document type (for example, the available text elements and element relationships in a set of legal court cases, or the elements and element relationships that you could use in a set of aircraft repair manuals) was specified in something called a DTD, which had its own syntax and was part of the SGML standard. The scripts to convert SGML documents were usually written using a language and engine called Omnimark, which was a proprietary product, but a Perl-based alternative was also available.

When Tim Berners-Lee was wondering how exactly to specify that one of his new hypertext documents had a title here, a subtitle there, and a link in the middle of a paragraph that led to another document, SGML was a logical choice—it was a text-based, flexible, non-proprietary, standardized way to specify document structure with various tools available to help you work with those documents. That's why HTML tags are delimited with angle brackets: because SGML elements were (nearly always) delimited with angle brackets. Dan Connolly sketched out the first HTML DTD in 1992.

SGML's designers couldn't see into the future, so they deliberately made it very flexible. For example, you could use other delimiters for element tags besides angle brackets, but everyone used angle brackets. SGML parsing programs were still required to account for the possibility that a document used other delimiters, and the possibility that many other options had been reset, so these parsers were large and complex, and few were available to choose from. By the mid-90s, enough best practices had developed that Sun Microsystems' Jon Bosak had the idea for a simplified, slimmer version of SGML that assumed a lot of default settings and could be parsed by a smaller program—maybe even a program written in Sun's new Java language—and that could be transmitted over the web when necessary. The documents themselves would be easier to share over the web than typical SGML documents, following the example of HTML documents.

Around this time SGML was considered a niche technology in the electronic publishing industry, and I worked at several jobs where I wrote and modified DTDs and Omnimark scripts to create and maintain document conversion systems. I also went to the relevant SGML conferences, where I got to know several of the people who eventually joined Jon to create the simplified version of SGML. (Many are still friends.) At first this group called their new spec WebSGML, but eventually they named it XML.

You could still process XML with Omnimark and other SGML tools. Many people would fail to appreciate the value of this design decision: as a valid subset of SGML, XML documents could be processed with existing SGML technology. This meant that on that day in 1998 when XML became an official W3C standard, we already had plenty of software out there, including programs like Adobe's special SGML edition of FrameMaker, that could process XML documents right away. This gave the new standard a running start, and XML might not have gotten anywhere without it, because those of us using the existing tools didn't have to wait around for new tools for the new standard and then work out how to incorporate these tools into our publishing workflows. We already had tools and workflows that could take advantage of the new standard.

I've heard some people describe certain things that SGML specialists didn't like about XML, but these people don't understand that XML was invented by and for SGML specialists, and it made SGML people's lives much easier. For one thing, we weren't so dependent on Omnimark anymore; at least one of my former employers switched from SGML to XML just so they could ditch Omnimark. XML's companion standard XSLT let us convert XML to a variety of formats using robust, free, standardized software, and as the web became a bigger publishing medium we found ourselves writing XSLT stylesheets to convert the same XML documents to print, CD-ROM, and HTML. Electronic publishing had never been so easy.

...and beyond...

Then along came the dot com boom. People got excited about how "seamless e-commerce" would change everything. People would save money as obsolete middlemen were removed from old-fashioned transactions, and people would make lots of money by taking part in this streamlining (selling pick axes during a gold rush) or by automating the buying and selling of products.

Orders would be transmitted over this fabulous free network known as The Internet instead of over the expensive, proprietary EDI networks. But when my computer sent an order to yours, how exactly would this order be represented? XML provided a good syntax: it was plain text, easy to transmit and parse, and could group labeled pieces of information in fairly arbitrary structures while remaining an open, straightforward standard. (When I say "straightforward", I'm talking about the original spec here, not the collection of related specs that most people are referring to when they complain about the complexity of XML. More on this below.) This let people send any combination of information back and forth, regardless of the potential lack of compatibility between the back end systems that the different parties were using.

So, as an important technology of the dot com boom, XML became trendy, and it was a heady feeling to suddenly be an expert in a trendy technology. I'll never forget hearing it mentioned in a Microsoft ad on a prime time network TV show; sure, it was spoken by the character of a geek who normal people weren't supposed to understand, but still, this subset of a niche technology that my friends helped to invent was mentioned on prime time network TV. Three different XML conference series were running, and they were much better attended than the single one that's left now. The best part was that there was enough money behind some of those conferences to fly most speakers in and put them up in hotels, which got me my first trips to London and Silicon Valley.

XML wasn't really a perfect fit for ecommerce systems, though. The elements vs. attributes distinction, which publishing systems used to distinguish between content to publish and metadata about that content, didn't have a clear role when describing transactions that weren't content for publishing. XML had some odd data types (NMTOKEN? CDATA?) that only applied to attribute values, instead of traditional data types like integer, string, and boolean that could be applied to content as well as attributes.

And then there was that strange DTD syntax: if XML was so good at describing structure, why wasn't XML used to describe the structure of a set of documents? The answer is above, but it didn't get publicized very well, so many people complained about DTD syntax. Everyone agreed that an XML-based schema syntax that provided for traditional data types would be a Good Thing, so various groups came up with proposals and the W3C convened a Working Group to review these proposals and come up with a single standard.

But, in the words of Cyndi Lauper, money changes everything. XML itself was assembled by eleven specialists in a niche technology, SGML, who wanted to make standardized electronic publishing simpler, and they managed to stay under most radar systems and come out with something simple and lean. However, when the XML Schema Working Group convened, many big and small companies were smelling lots of money and wanted to influence the results. Of the 31 companies that sent representatives to this Working Group (31!), many had little or nothing to do with publishing, electronic or otherwise. There were database vendors such as Microsoft, Informix, Software AG, IBM and Oracle (to be fair, large software companies have always been up there with legal publishers and defense contractors as believers in automated publishing technology; note where SGML got its start). There were successful or aspiring B2B ecommerce vendors such as CommerceOne, Progress Software, and webMethods. Microsoft, Xerox, CommerceOne, IBM, Oracle, Progress Software, and Sun were each interested enough to send two representatives to the committee, so there were a lot of cooks working on this broth.

The result was a three-part specification: Part 0 was a primer, Part 1 specified how to define document structures, and Part 2 described basic data types and how to extend them. Part 2 is pretty good, and also provides the basis for RDF data typing. Part 1, in my opinion, ended up being an ugly, complicated mess in its attempt to serve so many powerful masters.

Two members of the original eleven-member XML team, James Clark and Makoto Murata, developed RELAX NG, an alternative to Part 1 that was both simpler and more powerful. Clark had written the only open source SGML parser and the first XSLT processor, and came up with the name "XML," among his many other achievements; he's also written some great software to implement RELAX NG and convert between schema formats. RELAX NG never became as popular as XML Schema, because it didn't have the big industry names behind it, and because it was optimized around the original XML use case: describing content for publication.

Despite a complex syntax, incompatibilities among parsers, an often inscrutable spec, and less expressive power than RELAX NG, the W3C XML Schema specification has become popular because it's a W3C standard that addresses the original main problems of XML for ecommerce: it specifies document structures using XML, it lets you use traditional datatypes, and it has the added bonus for many developers of making it easier to round-trip XML elements to Java data structures. (After railing against the influence of this last part for years, I learned that it was primarily the work of Matthew Fuchs, an old friend I've known since he was finishing up his Ph.D. in computer science at NYU's Courant Institute when I was doing my master's there in the mid-nineties. He was the only other person there who even knew what SGML was.) So, XML Schema continues to be used by many large organizations to store data that doesn't fit neatly into relational tables. In fact, TopQuadrant has been adding more and more features to the TopBraid platform to make it easier to incorporate such data into a system that uses semantic web standards.

...and back.

Getting back to the topic of leaner, simpler alternatives for representing information of potentially arbitrary structure, the JavaScript-based JSON format started getting popular around 2006. The third paragraph of its Wikipedia page flatly states that "it is used primarily to transmit data between a server and web application, serving as an alternative to XML."

A Google search for "json replace xml" gets over 5,000 hits. (That's with the quotes around the search terms, to make Google search for the exact phrase. Without the quotes, it gets almost five million hits.) I like JSON, and see how it can replace many of the uses of XML that have been around since the dot com boom days, but anyone who thinks it can completely replace XML doesn't understand what XML was designed for. Documents with inline markup (or, in XML geekspeak, "mixed content"—for example, the way the HTML a element can be in the middle of a sentence within a p element) would theoretically work fine in JSON, but in practice, it would be too easy to screw it up when editing it with a text editor by accidentally adding or removing a single curly brace. Tools to hide the syntax behind a more intuitive interface may address the issue, but dependence on such tools was something that the original XML designers wanted to avoid. And frankly, when I picture a complex prose document stored in JSON, I hear the ghost of Microsoft's RTF dragging chains through the attic.
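To make the mixed-content point concrete, here is a minimal sketch of one way a sentence with an inline link could be stored in JSON-style arrays and turned back into markup. The array layout, the function names escapeXml and toMarkup, and the example.com URL are all assumptions made up for this illustration; they are not part of any standard.

```javascript
// Hypothetical encoding of mixed content: an element is
//   [name, attributes, child, child, ...]
// and each child is either a string or another element.
// The markup equivalent of `para` is:
//   <p>See the <a href="http://example.com/spec">XML spec</a> for details.</p>
var para = ["p", {},
  "See the ",
  ["a", {"href": "http://example.com/spec"}, "XML spec"],
  " for details."
];

// Escape the characters that are special in XML text and attribute values.
function escapeXml(s) {
  return String(s).replace(/&/g, "&amp;").replace(/</g, "&lt;")
                  .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Recursively serialize the array structure back to markup.
function toMarkup(node) {
  if (typeof node === "string") { return escapeXml(node); }
  var name = node[0], attrs = node[1], out = "<" + name;
  for (var key in attrs) {
    if (Object.prototype.hasOwnProperty.call(attrs, key)) {
      out += " " + key + "=\"" + escapeXml(attrs[key]) + "\"";
    }
  }
  out += ">";
  for (var i = 2; i < node.length; i++) { out += toMarkup(node[i]); }
  return out + "</" + name + ">";
}
```

Even in this tiny case, the prose is chopped into quoted fragments separated by commas and brackets, which hints at how fragile a long document in this form would be under hand editing.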

Between JSON's growing role as an inter-computer data format and RELAX NG's foothold in schemas like DocBook and companies like LexisNexis, I see the XML infrastructure getting back to its original use cases, which makes good sense to me. Each year at the XML Summer School in Oxford, it's been very interesting to see the new things people are doing with XML, especially as XQuery-based XML databases like MarkLogic and eXist grow in power. I've been chairing the semantic web track at the summer school for the past few years and have hardly been involved in XML at all, but it's always great to hear what my old friends are up to. Especially when there's great beer available.


16 December 2011

Having a Blue Ridge Christmas

They're playing my song!

A few months ago I saw a call for contributions of recordings of original holiday songs for a CD to be called "A Charlottesville Songwriters Christmas" to benefit a local charity. Around here there seems to be a law that when you name a business you have to name it either Jefferson (whatever), Piedmont (whatever), or Blue Ridge (whatever), so I decided to write a song whose name is a variation on "Blue Christmas" called "Blue Ridge Christmas." I thought about trying to put together a band to record it, but some friends who I've played jazz with are also in a local soul band with a really great singer (note his day job), so I offered it to them, and they made a great recording of it.

For the holiday season, the Charlottesville Downtown Business Association made a video to encourage people to shop on the downtown mall and they chose this recording as the music. It was fun for me to see it, and it's nice to know that letting my friends hear the song won't mean ripping it from a charity CD and putting it where people can download it. This doesn't quite compare with my brother's work for VW or Wendy's, but it's fun to know that it came out well and that lots of people can see the video—and that the song has had a bit of airplay on WNRN!



21 November 2011

JavaScript from the command line

In Linux and Windows. (Goodbye Cscript!)


A few years ago I wrote about Windows command line text processing with JavaScript using Microsoft's Cscript utility. I was surprised to find no Linux equivalent, and while I'd heard of Mozilla Rhino, I had the vague idea that using it meant integrating it into other applications.

After some hunting, I learned that Rhino includes a jar file that makes it easy to run a script from the command line. Once you have it, running a script named myscript.js is as simple as this:

java -jar js.jar myscript.js

If you're really interested in text processing, you can pipe and redirect the output.

After I downloaded Rhino and got this to work, I searched my hard disk and found that js.jar was already there in several places: with OpenOffice, with Swoop, and with Eclipse (and therefore with TopBraid Composer), so I've had it right under my nose for years. My brother checked his Mac and found that js.jar came with an open source speech recognizer that he had installed.

One neat part was that some fairly complex JavaScript scripts that I had run with Cscript ran with js.jar after one minor change that actually improved the scripts. Instead of a print() function for basic text output, Cscript has this WScript.Echo() thing (WScript is a more Windows-oriented version of Cscript), so I had put the following function in my command-line JavaScript scripts:

function print(OutString) {
  WScript.Echo(OutString);
};

Because js.jar supports a native print() function, the only change necessary to any of my scripts was to comment out the three lines above, and js.jar then happily ran my existing scripts.
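An alternative to commenting the function out would be to let a script detect its host, so that one file runs unchanged in both places. What follows is only a sketch under assumptions: the name emit is made up for illustration, the typeof checks are assumed to behave under Cscript the way they do in other JavaScript engines, and the console.log fallback is an addition for hosts such as Node.js that the post does not discuss.

```javascript
// Choose an output function once, based on what the host provides.
// "emit" is a hypothetical name used only in this sketch.
var emit;
if (typeof print === "function") {
  // Rhino's js.jar shell supplies a global print().
  emit = function (s) { print(s); };
} else if (typeof WScript !== "undefined") {
  // Windows Script Host (Cscript) supplies WScript.Echo().
  emit = function (s) { WScript.Echo(s); };
} else {
  // Assumed fallback for hosts with a console object, such as Node.js.
  emit = function (s) { console.log(s); };
}

emit("Hello from the command line");
```

The rest of a script would then call emit() everywhere instead of print() or WScript.Echo().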

If you start up js.jar without providing a script name as an argument, you get a js command line. Enter help() there to see some interesting commands that you can add to your scripts—for example, readUrl(). (Note that these commands are case-sensitive.)

I mostly tested this on a Windows machine, but it all worked fine on a machine running the latest Ubuntu.

The reason I got interested in this recently was that I had just pulled a ton of menu definition JavaScript off a website, with the majority of it being JSON definitions of the website's menu structure. I wanted to store all these definitions in SKOS RDF. Once I added and redefined a few functions in the JavaScript code that I had downloaded, I ran it all and redirected the output to RDF files all pretty easily. I'm definitely going to have some more fun with this.
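As a rough idea of what that kind of conversion can look like, here is a minimal sketch that walks a nested menu definition and builds SKOS statements in Turtle syntax. Everything specific in it is an assumption for illustration: the menu object and its id, label, and children properties, the function names, and the m: prefix with its example.com namespace. It is not the code or the data from the site described above.

```javascript
// Hypothetical menu definition; the real site's property names are not known.
var menu = {
  id: "products", label: "Products",
  children: [
    { id: "books", label: "Books", children: [] },
    { id: "music", label: "Music", children: [] }
  ]
};

// Quote a label as a Turtle string literal, escaping backslashes and quotes.
function turtleString(s) {
  return "\"" + String(s).replace(/\\/g, "\\\\").replace(/"/g, "\\\"") + "\"";
}

// Collect Turtle lines describing each menu entry as a skos:Concept,
// linking each child to its parent with skos:broader.
function menuToSkos(node, parentId, lines) {
  lines.push("m:" + node.id + " a skos:Concept ;");
  lines.push("    skos:prefLabel " + turtleString(node.label) +
             (parentId ? " ;" : " ."));
  if (parentId) {
    lines.push("    skos:broader m:" + parentId + " .");
  }
  for (var i = 0; i < node.children.length; i++) {
    menuToSkos(node.children[i], node.id, lines);
  }
  return lines;
}

// The m: namespace URI is a placeholder.
var turtle = [
  "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
  "@prefix m: <http://example.com/menu/> ."
].concat(menuToSkos(menu, null, []));
```

Passing turtle.join("\n") to print() in a script run with js.jar would write the statements to standard output, where they can be redirected to a file.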

