
Generating RDFa from Movable Type, Part 2

Why generate metadata that's redundant with data?

After I wrote recently about tweaking a Movable Type template so that RDFa metadata would be automatically generated with the individual archive versions of each weblog posting, Ben Adida suggested that it would be better if I had added the markup inline with the weblog entry instead of grouping it into a single block in the web page's head element.

I put the metadata in the head element because I've been pushing the RDFa development effort to consider the use case of metadata that describes content without being content, such as workflow information about a document or about a component of a document. (For an example, see the Content Management Metadata example that I submitted to the W3C's RDFa Use Cases document.) I realized, though, that for most of the RDFa metadata that I had added to my weblog template, Ben was right, because so much of that metadata was redundant with the web page's data. The following meta element uses the MovableType tag $MTEntryTitle to plug the document's title into the content attribute,

<meta property="dc:title" content="<$MTEntryTitle encode_html="1"$>"/>

so that for a document like this one I'd end up with this meta element:

<meta property="dc:title" content="Generating RDFa from Movable Type, Part 2"/>
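To make that substitution concrete, here's a rough Python sketch (not Movable Type code) of what the template line does: it drops the entry title into the content attribute after escaping any characters that would break the attribute value. The function name dc_title_meta is made up for illustration, and treating encode_html="1" as roughly equivalent to Python's html.escape is an assumption.

```python
from html import escape

def dc_title_meta(entry_title):
    """Build the head-level meta element for a title (hypothetical helper).

    Assumption: escaping with html.escape approximates what the
    encode_html="1" attribute does in the template.
    """
    return '<meta property="dc:title" content="{}"/>'.format(
        escape(entry_title, quote=True)
    )

plain = dc_title_meta("Generating RDFa from Movable Type, Part 2")
tricky = dc_title_meta('Q&A: "RDFa"')  # made-up title with characters that need escaping
```

Without the escaping step, a title containing a double quote would end the content attribute early and leave the meta element malformed.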

Since that title already appears in the document as the title that you see when you read the web page, this second copy of it is redundant. (It's actually the third copy: I've always hated how a typical HTML document needs to have its title specified in both the title child of the head element and in an h1 element near the top of the body.)

The Movable Type template that I use doesn't store the main title in an h1 element, but I'm not going to screw around with its structure. I went with the RDFa philosophy of adding a few tags here and there to provide machine-readable clues about the meaning of the existing content. The meta start- and end-tags wrapped around the title here show how the template now generates the same dc:title triple that it generated before, without adding another copy of the title string to the web page.

<h3 class="entry-header"><meta property="dc:title"><$MTEntryTitle$></meta></h3>

The description was already in the document as well, so adding a few tags also made it serve double-duty as data and metadata.

<p><b><meta property="dc:description"><$MTEntryBody$></meta></b></p>
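As a rough illustration of how markup like this can be read back as triples, here is a minimal Python sketch built on the standard library's html.parser. It is not a conforming RDFa processor: the class and function names are hypothetical, the subject is simply passed in (defaulting to the empty string for "this document"), and it assumes that an element carrying a property attribute never contains a nested element with the same tag name. The object of each triple is the content attribute when there is one, and otherwise the element's text.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collect (subject, property, object) triples from property attributes.

    Simplifying assumption: a property-bearing element never contains a
    nested element with the same tag name.
    """

    def __init__(self, subject=""):
        super().__init__()
        self.subject = subject   # "" stands for the containing document
        self.triples = []
        self._open = []          # entries: [tag, property, content, text_parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            self._open.append([tag, attrs["property"], attrs.get("content"), []])

    def handle_data(self, data):
        # Text belongs to every property-bearing element that is still open.
        for entry in self._open:
            entry[3].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, prop, content, parts = self._open.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            obj = content if content is not None else text
            self.triples.append((self.subject, prop, obj))

def extract_triples(markup, subject=""):
    """Parse a markup string and return the triples found in it."""
    parser = PropertyExtractor(subject)
    parser.feed(markup)
    parser.close()
    return parser.triples

inline = ('<h3 class="entry-header"><meta property="dc:title">'
          'Generating RDFa from Movable Type, Part 2</meta></h3>')
dated = ("<meta property=\"dc:date\" content='2007-01-31T03:45:05'>"
         "January 31, 2007 03:45 AM</meta>")

title_triples = extract_triples(inline)
date_triples = extract_triples(dated)
```

With the first string the displayed title becomes the object of the triple; with the second, the content attribute takes precedence over the displayed text.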

The other places where I added tags to the content are the author name, which is assigned as the dc:creator, and another more interesting case. The meta element for the dc:date predicate isn't re-using data as metadata: it wraps the "Posted by" date with tags that include a content attribute, which provides the object of the triple in place of the PCDATA content of the meta element.

<span class="post-footers">Posted by <meta property="dc:creator">
<$MTEntryAuthorDisplayName$></meta> on 
<meta property="dc:date" 
content='<$MTEntryDate format="%Y-%m-%dT%H:%M:%S"$>'>
<span class="separator">|</span> <a class="permalink" 

The result will look something like this:

<span class="post-footers">Posted by <meta property="dc:creator">
Bob DuCharme</meta> on 
<meta property="dc:date" content='2007-01-31T03:45:05'>
January 31, 2007 03:45 AM</meta></span> 
<span class="separator">|</span> <a class="permalink" 

The second meta element here has a great reason for using the content attribute to provide an alternative object for the triple about the Dublin Core date: it's using an ISO 8601 version of the date, which is more useful as metadata than one that spells out the month name, because it's easier to sort and to use in query criteria. I might have put this meta element in the document's head (where I still have some metadata), but wrapping it around the displayed date directly associates it with that very relevant part of the document. If I did the same for the dates shown with the comments about a weblog entry, someone could search for comments after 2007-02-01T12:00:00 and before 2007-02-01T15:00:00 and find one posted with a displayed date-time stamp of "February 1, 2007 at 2:30 PM".
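Here's a small Python sketch of why that format is convenient for sorting and range queries. The list of comment timestamps is invented for illustration (only the first and second entries echo dates mentioned in this post), and in_range is a hypothetical helper, not part of any RDFa tool.

```python
from datetime import datetime

# Hypothetical (metadata value, displayed text) pairs for three comments.
comments = [
    ("2007-02-01T14:30:00", "February 1, 2007 at 2:30 PM"),
    ("2007-01-31T03:45:05", "January 31, 2007 03:45 AM"),
    ("2007-02-01T16:10:00", "February 1, 2007 at 4:10 PM"),
]

def in_range(iso_value, start, end):
    """True when start < iso_value < end, comparing parsed datetimes."""
    value = datetime.fromisoformat(iso_value)
    return datetime.fromisoformat(start) < value < datetime.fromisoformat(end)

# Comments posted after noon and before 3 PM on February 1, 2007.
matches = [shown for iso, shown in comments
           if in_range(iso, "2007-02-01T12:00:00", "2007-02-01T15:00:00")]

# For this fixed-width format, plain string sorting is already chronological.
sorted_iso = sorted(iso for iso, _ in comments)
```

Doing the same with the displayed strings would mean parsing month names and AM/PM markers first, and a plain sort of those strings would put "February" ahead of "January".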

I disagree with Ben's assertion that "marking up the actual rendered data... is what RDFa is all about" [my emphasis]. The beauty of RDFa is that its design lets us do that and more with it. Providing metadata that is not redundant with content will serve a real business need in the publishing world. Inline RDFa is more difficult to generate automatically than a single block inserted into a document's head element (and what's that head element for, if not document metadata?), and I don't think that hand-crafted RDFa use will add up too quickly.

Also, for a block of metadata in the head element, it's easy to specify the full URL of the containing document as the subject of all those triples by wrapping the meta elements with another one that has the URL in an about attribute. The way I have it now, the subject of many of the triples is the empty string, which is understood to be the containing document. Understood by what, in the semantic machine-readable sense? If I write an app that pulls such triples from two different documents, it's going to see the same empty string subject for all of those triples unless I write extra code to fix that. One trick to get around this is an about attribute on the html start-tag with the full URL of the document; the triples come out great, but because I've never seen this done before, I wonder whether it's considered good or bad practice. Let me know if you have any ideas about this.
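The "extra code" wouldn't have to be much. As a minimal sketch, assuming the app knows the URL that each document was fetched from, Python's urllib.parse.urljoin resolves an empty-string subject to that URL, the same way a relative reference is resolved against a base. The resolve_subjects name, the second document URL, and the sample triples are all made up for illustration.

```python
from urllib.parse import urljoin

def resolve_subjects(triples, document_url):
    """Resolve each triple's subject against the URL of its source document.

    An empty subject becomes the document URL itself; a fragment such as
    "#comment-1" becomes that URL plus the fragment.
    """
    return [(urljoin(document_url, s), p, o) for s, p, o in triples]

doc_a = "http://www.snee.com/temp/test1.html"
doc_b = "http://example.org/other.html"  # hypothetical second document

triples_a = [("", "dc:title", "Generating RDFa from Movable Type, Part 2"),
             ("#comment-1", "dc:date", "2007-02-01T14:30:00")]
triples_b = [("", "dc:title", "Some other page")]

# Merge the two sets only after their subjects have been made absolute.
merged = resolve_subjects(triples_a, doc_a) + resolve_subjects(triples_b, doc_b)
```

After this step the two dc:title triples have different subjects, so they can go into one store without colliding.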

Meanwhile, I'm accumulating useful triples. Looking for ways to add RDFa to auto-generated HTML is fun; it looks like I already have a nice start.


(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

OT but important: can you please switch to black lettering? I have to magnify your text about five times before I can read it, and it's still not easy.


Nice work Bob!

The use-case you give for the @content attribute is a good one--that of making sure that search engines can find information that may not be present in the document in a form that a machine might recognise. Other examples would be to attach the date information to words like "tomorrow" or "last week", to attach names to titles like "the Prime Minister" or "the President", or full names to countries like "the US".

Another thing that might be of interest is that RDFa is acquiring the idea of a 'profile' that allows known non-QName values to be converted to QNames. If that's gobbledygook for anyone reading, I apologise...it just means that a preprocessor will find recognised words that are not qualified by a namespace and put them into one. Using this page as an example, it uses things like @rel="start" and @rel="prev", which are defined in HTML; an RDFa parser that uses the HTML profile preprocessor (still being defined, but currently called hGRDDL) will see those as the predicates h:start and h:prev, giving you even more triples to play with. :)

Great to see RDFa being put through its paces.

Best regards,


PS I almost forgot to mention: the triples you get in head should already include the full URL for the document, and not an empty string. There may be a problem with the processor you are using...which is it?

Mark Birbeck, formsPlayer

mark.birbeck@x-port.net * +44 (0) 20 7689 9232
www.formsPlayer.com * internet-apps.blogspot.com

standards. innovation.




I wouldn't expect the INRIA stylesheet to put the full URL in rdf:Description/@rdf:about, because an XSLT stylesheet has no way to know the name of the input document, but the full URL doesn't show up with RDFlib either, whether I run it locally or use Elias's web service (compare http://www.snee.com/temp/test1.html with http://torrez.us/services/rdfa/?url=http%3A%2F%2Fwww.snee.com%2Ftemp%2Ftest1.html).

So the full URL should be plugged in? That's good news.