
What data is your metadata about, and where is it?

If metadata really is data about data...

In a recent posting, I mentioned that I've been thinking lately about how some people doing metadata work (in particular, people doing RDF Schema and RDF/OWL ontologies) don't care much about whether there is any corresponding data to go with their metadata. We all dutifully define metadata as "data about data," but a lot of metadata out there is not really about any existing, useful data. Dan Connolly called it "ontologies for the sake of ontologies."

When describing this peeve of mine to Paul Prescod, he asked me "are you saying that people are creating metadata ontologies before they create the data and should do it the other way around?" I replied that too many people create a metadata ontology for data that doesn't exist, announce its availability and the kind of data it would be good for, and then move on to create more ontologies. For example, a year or so ago on the semantic web mailing list, someone posted an announcement about an RDF Schema that he had created, with a description of the kind of data it would be useful for. I privately e-mailed him suggesting that he create a file of RDF triples as sample instance data to demonstrate how to use his schema, and he sent back a very appreciative reply saying that this was a great idea and that he would go ahead and do it. Here's what I'm wondering: why did this have to be suggested to him? Why wasn't it self-evident that he should create sample data before announcing his schema on the mailing list? Why is it so common for people working with RDF Schema and RDF/OWL to think that if they build it, the data will come, simply because they announced that their work is available? When people develop relational schemas or XML DTDs or schemas, creating sample data that conforms to them is a normal step in testing the suitability of their work; why is it less obvious to people doing ontology work that there should be data to go with their metadata?
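To make the suggestion concrete, here is a minimal sketch of what "sample instance data to go with a schema" amounts to, using plain Python tuples as stand-in triples rather than any particular RDF toolkit. Every vocabulary name here (ex:Recording, ex:performer, and so on) is invented for illustration, not taken from anyone's actual schema:

```python
# Toy triples as plain 3-tuples; all ex: names are hypothetical.
RDF_TYPE = "rdf:type"

# The schema half: the sort of thing people announce on mailing lists.
schema = {
    ("ex:Recording", RDF_TYPE, "rdfs:Class"),
    ("ex:performer", RDF_TYPE, "rdf:Property"),
    ("ex:performer", "rdfs:domain", "ex:Recording"),
}

# The sample instance data that ought to accompany the announcement.
instances = {
    ("ex:rec42", RDF_TYPE, "ex:Recording"),
    ("ex:rec42", "ex:performer", "Charlie Parker"),
}

# A trivial sanity check: every property used in the instance data
# was actually declared in the schema.
declared = {s for (s, p, o) in schema
            if p == RDF_TYPE and o == "rdf:Property"}
used = {p for (s, p, o) in instances if p != RDF_TYPE}
assert used <= declared
print("instance data exercises these declared properties:", used)
```

Even a demo this small forces the schema author to answer the question the rant is about: what does conforming data actually look like?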

When I chose this topic for my five-minute rant at the XML Summer School, I quoted John Chelsom's statement earlier that week that the main purpose of metadata is to speed up and enrich the searching of resources. To give some credit to the people I'm ranting against, I wouldn't agree with this assertion 100%; some people are doing valuable work with pure metadata about medical conditions and potential treatments as they use RDF/OWL tools to find new relationships, but I think too many are designing metadata for nonexistent data that they somehow think they will inspire someone else to create.

In typical discussions about the lack of RDF data on the web, some people point out the progress in the development of tools that let us treat non-RDF data as triples, thereby adding this data to a potential semantic web. I think that this is great, but what I'd really like to see is RDF/OWL ontologies that describe this data so that we can get more value from that data. In his talk, John also described the concept of "turning content into knowledge by adding metadata and ontology." This would make a great mission statement for someone, and it gives us a clue about the appeal of designing metadata for non-existent data: it's easier. As with many IT projects, starting with a body of existing data and then creating a model that works well with it is messier and more difficult than starting with a blank slate, but from the potential semantic web to the internal systems of many, many companies, the greatest opportunities for the use of metadata are in building metadata around existing data.
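The "treat non-RDF data as triples" idea is mechanically simple, which a small sketch can show: each field of an ordinary record becomes one (subject, predicate, object) statement. The record and the ex: names below are made up for illustration; real tools (D2RQ-style mappers, GRDDL transforms, and the like) do this with far more care:

```python
# A made-up record, as it might come from a CSV row or database table.
record = {"id": "emp1001", "name": "Ada Lovelace", "dept": "analytics"}

# Mint a subject from the key field; every other field becomes a triple.
subject = "ex:" + record["id"]
triples = [(subject, "ex:" + field, value)
           for field, value in record.items() if field != "id"]

for t in triples:
    print(t)
```

Once the data is triples, an ontology describing what ex:name and ex:dept mean is exactly the piece that would let tools get more value from it — which is the half I'd like to see more of.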

In forthcoming postings here, I'll write about (or, more likely, ask about) the creation of RDF/OWL ontologies for existing sets of data and how those ontologies add value to that data. Please let me know, in comments here or by private e-mail, of any projects you know of that do this.


(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

Why is it so common for people working with RDF Schema and RDF/OWL to think that if they build it, the data will come, simply because they announced that their work is available?

Because people like making abstractions; it's an intellectually satisfying job... Look at programmers: every Java programmer loves creating frameworks, the more general the better, but they are not so keen to use them to build something useful.

When Bill de hÓra calls me out as one of the "markup people out there who can live in the RDF world" [1], I like to think he's characterizing folks who work with RDF, but strictly as an auxiliary to XML. IOW, it's exactly what you're saying: worry about what the data is, then decide what metadata representation and processing is needed to enhance its value. As an aside, I'd tend to include you, Eric van der Vlist and Edd Dumbill in Bill's list, and I'm not sure I'd include Shelley Powers, unless I misunderstand her position on certain things (quite possible).

Anyway, I've always been able to tolerate what I consider metadata fundamentalism from some RDF/topic maps folks. By that I mean folks who prefer to encode everything they process in such metadata technologies. I've heard advocates say that people don't need the semantic vagaries of XML formats when they can have the rigor of RDF/TM. I disagree, but what ultimately soured me on RDF was the related, but distinct problem of over-engineering (and over-theorizing) in the RDF world [2]. (Of course you know that, because you're one of those who commented :-) )

Good piece, Bob. Thanks.

[1] http://www.dehora.net/journal/2004/07/rdf_101.html
[2] http://copia.ogbuji.net/blog/2005-09-14/Is_RDF_mov


The reason I like using RDF is because I can (infinitely) defer committing to regular structures. I take my data and add properties as and when I need them. I pick and choose existing properties and classes, or make my own up. The few 'ontologies' I've written have been either loose collections of terms I couldn't find elsewhere, or descriptions mined from the dataset (so people know what's there).
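[That "defer committing to regular structures" advantage is easy to picture with a toy sketch — in a triple store, adopting a new property later is just adding statements, with no up-front schema change or table migration. All names here are invented for illustration:]

```python
# A graph as a bare set of triples; ex: names are hypothetical.
graph = set()

def add(s, p, o):
    graph.add((s, p, o))

# Day one: only the properties needed right now.
add("ex:photo7", "ex:title", "Sunrise over the bay")

# Weeks later: a new property, picked or coined as needed.
# Nothing already in the graph has to change to accommodate it.
add("ex:photo7", "ex:takenWith", "ex:oldNikon")

print(sorted(graph))
```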

But working ontology first seems to throw that advantage away. They're yearning for the shackles of XML schemas, when they ought to be enjoying the freedom of RDF schema.

"turning content into knowledge by adding metadata and ontology"

Been into a library lately?

Point taken about ontologies.

Maybe there is an opportunity here. Even small libraries have huge datasets, often in electronic form. Most accept volunteer assistance. Maybe some of this intellectual energy could be devoted to real metadata projects in real libraries, especially small libraries with small budgets (towns under 300k population, for example, schools, small colleges, etc.).



I'm as guilty as anyone of creating schemas without data, but I'm not sure it's necessarily a problem per se. For example, ages ago I hacked out a project ontology. It included a lot of guesswork about what I thought I would need. When I finally did some coding around it, working with instance data, I found that I was only using a fraction of the terms I'd defined. OK, so there was some wasted effort here, but if you consider the v0.1 of the ontology just a working sketch, it's not that bad.

So although I agree that the ratio of ontologies to instance data is pretty silly, I'm not convinced this is inherently bad. URIs aren't exactly expensive. Duplication of ontologies is generally undesirable, but I think probably unavoidable in the first few passes.

Perhaps (a big perhaps) when Damian mentions "yearning for the shackles of XML schemas" this also applies to the criticism of the ontology:data ratio. In the XML world schemas are treated as precious resources. Whereas in RDF, by making *any* statement you are making at least one schema/ontology level assertion (that the predicate resource is a property).
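[The point that any statement carries a schema-level assertion corresponds to the RDFS entailment rule that using a predicate at all entails it is a property. A toy version over tuple-triples, with invented ex: names:]

```python
# Some made-up instance data.
data = {("ex:doc1", "ex:reviewedBy", "ex:alice")}

# RDFS rule rdf1: from (s p o), infer (p rdf:type rdf:Property).
entailed = {(p, "rdf:type", "rdf:Property") for (s, p, o) in data}

print(entailed)
```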

On your metadata point, I reckon the more interesting side of RDF is where it's talking about things in general, rather than about documents (about things). But given the doc-nature of the current web, there is a lot of low-hanging fruit on the metadata side.

Re. ontologies for existing data - that's half the motivation behind the micromodels stuff. People are creating material in HTML, the microformats folks provide a way of making the data explicit, XSLT offers the bridge to RDF.

Danny -

Did you announce to the world that you had made a project ontology available before or after working with related instance data and discovering that only a fraction of it was useful?

More generally, is it better to announce to the world "I've written this ontology and made it available" as soon as possible, which in your case meant that the majority of it was not useful, or after putting it through some paces and identifying which parts are useful?

Of course this is a rhetorical question; the classic ontology developer answer would be "but some of the other parts might be useful to someone someday." This is just wishful thinking that only increases the amount of useless ontology work that people must wade through to find something that might work for them. I'm not looking for rigorous testing, but just a demo, or as you did, some coding around it.

In attitudes about ontology development today, there's way too much "might be useful to someone someday" with no desire to put a little effort into determining how useful it might be and to whom. Developers of other kinds of schemas or software ("coding"!) with this attitude would have a difficult time being taken seriously. I'm sure that sourceforge and download.com are full of software that does nothing for anyone but its authors, but at least it did something useful for them.