Generating RDFa from Movable Type

Easy to generate, easy to use.

Did you know that the default template for the Movable Type weblog publishing software adds metadata in commented-out RDF/XML to permalink pages? (By "permalink pages," I mean the pages that store permanent versions of each weblog entry, as opposed to the versions on the main index page, which are only there for a few weeks.) For example, this weblog is one I picked at random after doing a Google search for "Movable Type"; doing a View Source on it will show the commented-out RDF. (For all I know, the weblog's author has never heard of RDF.) Being commented out, this RDF is not particularly useful, but Movable Type's default habit inspired me to try to get it to put RDFa versions of the same metadata into this weblog's permalink pages. It works, and you can now pull RDF/XML out of any permalink version of one of my weblog entries with a single URL, like this. It's much easier than trying to do something with commented-out RDF/XML.

That link tells the W3C online version of Saxon to process my last weblog entry with a stylesheet that converts the RDFa to RDF/XML; Fabien Gandon of INRIA put the stylesheet on their server. If you follow the link, your browser will display only the PCDATA in the RDF/XML, which won't look like much. Do a View Source on it to see how the INRIA stylesheet pulled the RDFa out of that weblog entry and converted it to RDF/XML. Better yet, follow this link to see the extracted RDF/XML document validated and converted into more readable triples. (Scroll to the right of that output to see the predicate and object of each triple.) Both literals and URIs appear as objects in these triples, which is nice.

The first five triples for each document are based on meta tags already inserted by Movable Type. The remaining triples are there because of the following tags that I added to my Movable Type "Individual Entry Archive" template just before the script element at the end of the head element:

<meta about="<$MTEntryPermalink$>">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="<$MTEntryPermalink$>"/>
  <meta property="dc:creator" content="Bob DuCharme"/>
  <meta property="dc:title" content="<$MTEntryTitle encode_html="1"$>"/>
  <meta property="dc:date" content="<$MTEntryDate format="%Y-%m-%dT%H:%M:%S"$>"/>
  <meta property="dc:description" content="<$MTEntryExcerpt encode_html="1"$>"/>
  <link rel="dc:subject" href="http://www.snee.com/ns/blogcat/<$MTCategoryLabel$>"/>
</meta>

(I have a version with fewer tags and hard-coded values on the weblog's main index page, so if you're reading this entry from there, a View Source won't show as much metadata as the permalink pages have.) In addition to the new tags above, the html start-tag in the Movable Type template needs declarations for any referenced namespaces—in this case, dc and trackback.
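To see why those declarations matter, here is a minimal Python sketch of what a consumer has to do with a prefixed name such as dc:title. It is not part of my Movable Type setup. The sample page is hypothetical, and the two namespace URIs are my assumption about what the dc and trackback prefixes should map to (the usual Dublin Core elements namespace and the TrackBack module namespace).

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down permalink page. The namespace URIs are
# assumptions: the usual Dublin Core and TrackBack module namespaces.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <head><title>Example entry</title></head>
  <body/>
</html>"""


def declared_prefixes(xml_text):
    """Return a {prefix: namespace URI} map for every xmlns declaration."""
    prefixes = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield (prefix, uri) pairs as declarations are read.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        prefixes[prefix] = uri
    return prefixes


def expand(name, prefixes):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, local = name.split(":", 1)
    if prefix not in prefixes:
        raise KeyError(f"prefix {prefix!r} is not declared on the page")
    return prefixes[prefix] + local


prefixes = declared_prefixes(PAGE)
print(expand("dc:title", prefixes))
print(expand("trackback:ping", prefixes))
```

If the html start-tag lacked the dc declaration, expand() would have nothing to map the prefix to, which is roughly the position an RDFa extractor is in when a namespace is referenced but never declared.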

To make it all nice and well-formed, I also replaced the template's &laquo; and &raquo; entity references that put the « and » characters near the top with the numeric character references &#171; and &#187;. I tried commenting out the DOCTYPE declaration in the Movable Type template, because these new meta elements make the document invalid XHTML 1.0 and I wouldn't be parsing it against a DTD anyway, but some odd interactions with the CSS made certain parts of the page disappear, so I left the DOCTYPE declaration alone. TagSoup had no problem with the extra RDFa metadata, and didn't include the DOCTYPE declaration in its output, so you might want to use that if you're processing HTML files that contain RDFa metadata. (If you're processing HTML from the wild, you'll want to use TagSoup anyway.)
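I made the entity replacement by hand, but for a template with many named entities something like the following sketch could do it. It assumes Python's standard html.entities table is an acceptable source of code points; the to_numeric_refs name and the sample string are made up for illustration. It leaves the five entities that XML predefines alone, since those are fine in a well-formed document.

```python
import re
from html.entities import name2codepoint

# Entities that every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_refs(text):
    """Replace named HTML entity references with numeric character references."""

    def replace(match):
        name = match.group(1)
        # Keep XML's own entities and anything we don't recognize.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


print(to_numeric_refs("&laquo; Previous | Main | Next &raquo;"))
```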

Why is it worth all this trouble? Why is RDFa cool? Because document metadata that looks like the following is easy to read, easy to write, and easy to convert to RDF/XML:

<meta about="http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html"/>
  <meta property="dc:creator" content="Bob DuCharme"/>
  <meta property="dc:title" content="Great survey of RDF/web development tools"/>
  <meta property="dc:date" content="2007-01-17T08:33:38"/>
  <meta property="dc:description" content="For both reading and writing RDF...."/>
  <link rel="dc:subject" href="http://www.snee.com/ns/blogcat/RDF/OWL"/>
</meta>

While everyone, including me, loves to beat RDF/XML, it's not quite a dead horse: as an exchange format for moving metadata between applications, it's just fine. And because a stylesheet such as the INRIA one can convert RDFa to RDF/XML, you can easily use RDFa metadata in a wide variety of applications.
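To give a feel for how little is involved, here is a rough Python sketch that turns a block shaped like the one above into triples. It is not the INRIA stylesheet and it is nowhere near a conforming RDFa processor: it only handles a flat meta element with an about attribute and link/meta children, and the PREFIXES map is an assumption standing in for the namespace declarations on the html start-tag.

```python
import xml.etree.ElementTree as ET

# Assumed prefix map; on a real page these come from the xmlns
# declarations on the html start-tag.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "trackback": "http://madskills.com/public/xml/rss/module/trackback/",
}

SNIPPET = """<meta about="http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html">
  <link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/01/great_survey_of_rdfweb_develop.html"/>
  <meta property="dc:creator" content="Bob DuCharme"/>
  <meta property="dc:title" content="Great survey of RDF/web development tools"/>
  <meta property="dc:date" content="2007-01-17T08:33:38"/>
  <link rel="dc:subject" href="http://www.snee.com/ns/blogcat/RDF/OWL"/>
</meta>"""


def expand(name):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local


def triples(xml_text):
    """Return (subject, predicate, (kind, value)) tuples for one meta block."""
    root = ET.fromstring(xml_text)
    subject = root.get("about")
    result = []
    for child in root:
        if child.tag == "link" and child.get("rel"):
            # link elements point at a resource, so the object is a URI.
            result.append((subject, expand(child.get("rel")), ("uri", child.get("href"))))
        elif child.tag == "meta" and child.get("property"):
            # meta elements carry a string, so the object is a literal.
            result.append((subject, expand(child.get("property")), ("literal", child.get("content"))))
    return result


for s, p, (kind, o) in triples(SNIPPET):
    obj = f"<{o}>" if kind == "uri" else f'"{o}"'
    print(f"<{s}> <{p}> {obj} .")
```

The link elements become triples with URI objects and the meta elements become triples with literal objects, the same split that shows up in the validator output mentioned earlier.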

A new data format isn't useful until there's enough data in that format to drive some applications. I think that RDFa will be very useful, and making this modification to a Movable Type template automates the generation of useful RDFa metadata. Because Movable Type regenerated all of the weblog's permalink entries after I changed the template, I now have lots of RDFa to play with, and I'll have more each time I write a new weblog entry. And thanks to Fabien Gandon, I didn't have to do any coding to make it happen!

Comments

(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

Very interesting! Is this compatible with the Dublin Core recommended encoding guidelines (http://www.dublincore.org/documents/dcq-html/)?

For example, on my blog today, you can view source and see:
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC" />
<link href="http://purl.org/dc/terms/" rel="schema.DCTERMS" />

<meta name="DC.language" content="en-US" />
<meta name="DC.type" content="blog" />
<meta name="DC.publisher" content="Scott C. Hudson" />
etc.

With your approach, would I add a separate section, or add to my existing entry:

<meta name="DC.language" property="dc:language" content="en-US" />

Hi Scott,

I hadn't thought about that. It looks like the link elements are similar, having href and rel attributes, but the meta elements aren't, with their name attributes. And, the approach for qualifying the names is obviously different--RDFa uses a namespace prefix and a colon, which many people don't like in an attribute value, but I've worked with enough XSLT to be used to it. And, I know of software that can understand those qualifiers, which doesn't apply to the Dublin Core approach.

Why would you put name qualifiers (DC and DCTERMS) after the period in the @rel values? It looks like http://www.dublincore.org/documents/dcq-html/ has them before.

Check out section 2.7 on that link. That's where I got the example from. One other component I didn't add to my posted example is that the head has a profile attribute:
<head profile="http://dublincore.org/documents/dcq-html/">

So for completeness' sake, should I add the property attribute to each of my meta elements?

Does your RDFa extraction handle this type of meta info, or will I have to have a meta about wrapper?

--Scott

Hi Scott,

If you're looking for something that's more in line with the DC encoding guidelines, you might want to check out eRDF (just google for "embedded RDF"), which also doesn't invalidate your HTML or XHTML 1.0.