White paper on metadata standards

Not as confusing a choice as many think.

I recently wrote a white paper for Innodata Isogen titled "Content Metadata Standards: Libraries, Publishers, and More," which is available for free if you don't mind registering first. (If you do register, you'll find a nice choice of other white papers on topics such as DITA, content re-use, and ebooks.)

I've heard several people say "There are dozens and dozens of metadata standards out there! It's all so confusing!" It's really not that bad, and this paper addresses several key issues and tours the better-known standards. To summarize three of the main points:

  • Dublin Core is pretty central. Some people complain that it's too vague (for example, it has a "date" field, but date of what?) but being very generalized is what makes it so broadly useful. Most other metadata standards build on it for more specific uses.

  • It seems like half the metadata standards out there are administered by the US Library of Congress. Many of these standards build on the LoC's original MARC standard for bibliographic information, and others began at one library or university or another and moved to the LoC's stewardship as they grew. Many of them focus on increasingly modern needs such as digital scholarship. (More good news from the LoC: I just found out from a Rick Jelliffe posting that they're putting together a set of cool URIs for US federal legislation.)

  • The more industry-specific standards, by their very nature, make it relatively easy to identify whether they're relevant to what you as a publisher need. For example, if you're involved in magazine publishing, PRISM will be valuable; for book publishing, there's ONIX. ("Involved in" here could mean being such a publisher yourself, but it could also mean having such publishers as business partners selling you content or buying it from you.)
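
To make the Dublin Core point about the vague "date" field concrete, here is a minimal Python sketch of a record that carries both the generic dc:date element and the more specific dcterms:issued term. The two namespace URIs and term names are the standard DCMI ones; the "metadata" wrapper element, the build_record helper, the title, and the dates are all hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Standard DCMI namespaces: the original 15 elements and the qualified terms.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)


def build_record(title, date, issued=None):
    """Build a small Dublin Core record (hypothetical helper).

    `date` fills the generic dc:date element, which does not say what
    the date refers to. `issued`, if given, adds dcterms:issued, which
    states that the date is the publication date.
    """
    record = ET.Element("metadata")  # wrapper name is an assumption
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}date").text = date
    if issued is not None:
        ET.SubElement(record, f"{{{DCTERMS}}}issued").text = issued
    return record


# Hypothetical values, not taken from any real publication.
record = build_record("Example Report", "2009-03-01", issued="2009-03-15")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A consumer that only understands the generic element can still read dc:date, while one that knows the qualified terms can pick up the more precise dcterms:issued value, which is roughly how more specific standards can build on the generalized core.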

Other issues covered by the paper are the OCLC's five classes of metadata, which provide a nice framework when evaluating your own needs; content standards such as DocBook and DITA with built-in metadata slots; specialized vs. generalized metadata; and controlled, taxonomy-based keyword metadata vs. folksonomies.

If there are any important issues about metadata that a general publishing audience would want to know about but which aren't covered by the paper, please let me know.

Comments

(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

We are considering adding explicit support for Dublin Core metadata as part of the Publishers schema we are creating in the DocBook Subcommittee for Publishers. Thoughts on that approach?

I think ONIX might be a bit heavy-weight to add to our schema, though...

Forgot to add: Our reasoning behind adopting Dublin Core in the DocBook Publishers schema is:
1. interoperability
2. widely recognized standard. DocBook already has support for external standards, such as SVG and MathML, so why not for metadata?
3. tool support/integration. It should be easier for tool vendors to add support for a recognized industry standard.

--Scott

Hi Scott,

There's already a good chunk of Dublin Core in DocBook now, right[1]?

ONIX support would mean narrowing your definition of publisher to mean "(hard copy?) book publisher," and I'm sure you want to keep it broader than that.

Bob

[1] http://www.docbook.org/specs/cs-docbook-docbook-4.2.html#d0e652