Metadata data entry

Who (or what), and why.

How do we assign metadata to data? Ontologies often say "here is some information about the metadata we'd like to have for our data", but the actual assignment of metadata that conforms to an ontology is usually more work than developing the ontology. Who assigns this metadata, and why do they do it? You have three choices: people who do it because they're paid to, people who do it because they want to, and automated processes. I'm reading up on doing it with automated processes and will hopefully be reporting on this soon.

For now, the second choice is the most interesting because it's the newest and people are still trying to get a handle on it. (Don't miss the Talis interview with Thomas Vander Wal, who coined the term "folksonomies" and has thought very hard about their potential relationship to taxonomies.) In the early days of folksonomies, some people didn't like the idea that the assigned metadata might not conform to a specific taxonomy or ontology. Folksonomies trade query precision (if you can't know all the terms that may have been assigned to a resource, you can't be sure of finding it) for something that's often more valuable: lower resistance among volunteers to doing free data entry. It reminds me a bit of Tim Berners-Lee's acceptance of broken links when he designed a large-scale hypertext system: by questioning one of the original requirements for system integrity, you can end up with a large, inexpensive system that still helps people retrieve much of the information that they want.
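To make that tradeoff concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: the photo ids, the free-form tags, and the mapping to preferred terms are hypothetical, not taken from any real site. It only shows that a search on one free-form tag can miss resources that a search on a controlled term would find.

```python
# Hypothetical data: three photos of the same city, tagged freely by different people.
free_tags = {
    "photo1": {"nyc", "skyline"},
    "photo2": {"new-york", "skyline"},
    "photo3": {"newyork", "buildings"},
}

# Hypothetical controlled vocabulary: each free-form tag maps to one preferred term.
preferred_term = {
    "nyc": "New York City",
    "new-york": "New York City",
    "newyork": "New York City",
    "skyline": "Skylines",
    "buildings": "Buildings",
}


def search(index, term):
    """Return the sorted ids of resources whose tag set contains the term."""
    return sorted(rid for rid, tags in index.items() if term in tags)


def normalize(index, mapping):
    """Rewrite every tag to its preferred term (the step someone has to pay for)."""
    return {rid: {mapping[tag] for tag in tags} for rid, tags in index.items()}


controlled = normalize(free_tags, preferred_term)

folksonomy_hits = search(free_tags, "nyc")
taxonomy_hits = search(controlled, "New York City")

print(folksonomy_hits)  # ['photo1']
print(taxonomy_hits)    # ['photo1', 'photo2', 'photo3']
```

The free-form search isn't wrong, it's just incomplete: it finds one of the three photos, and nothing tells the searcher that two more exist. The mapping table is where the cost lives, because someone has to build and maintain it.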

The success of folksonomies and the web doesn't prove the original "requirements" to be wrong; it just proves that similar systems that don't meet those particular requirements can still be useful. Systems that do meet those requirements can be even more useful, but they're more expensive to create and hence to use. Attorneys searching the legal cases stored in LexisNexis or Westlaw have every right to expect that all the links work and that the keywords assigned to each case belong to a carefully maintained taxonomy. This is in their requirements, because if you plan to tell a judge "you should rule in favor of my client because another judge ruled in favor of a plaintiff in a nearly identical situation, and no one's ever overturned that ruling", you want to be really, really sure that no one's ever overturned it. That's why these products are expensive, and that's why people cheering for comprehensive free online versions of the law (which I'm all for) don't realize which features they'll be giving up if they switch to the free ones.

How do you get metadata assigned to a large volume of resources? Letting people assign arbitrary keywords might be almost enough, but you need to provide them with a little more incentive, such as making their own resources easier to find through the use of their own tags. For example, I can find my own pictures on Flickr, or my own bookmarks, by searching the tags that I created for them. (Another incentive, which doesn't play too well into the data integrity angle, is letting people assign silly, funny metadata: check out the tags assigned to Kevin "Mr. Britney Spears" Federline's CD on Amazon.)

Paying other people to tag resources with keywords has the advantage of ensuring that the metadata conforms to a worked-out structure. This makes the metadata (and hence the data) more valuable and isn't always as expensive as you might think, even when you need specific subject expertise in your metadata assigners. (Please excuse the brief plug for my employer, Innodata Isogen, and contact me if you'd like to know more.)

While some metadata is free, such as the size of a file and the last time it was edited, creation of new metadata is never completely free. If you're not paying people outright, you must come up with and then implement some system that makes people want to do your metadata data entry without being paid. What if someone created a site that let users make up tags, and no one did so? There are plenty of examples of such sites, which jumped on the Web 2.0 bandwagon as if letting users tag content were some sort of silver bullet. Despite what David Weinberger writes in Everything is Miscellaneous, simply letting people add metadata isn't enough. They need an incentive.

If you don't want to pay someone to paint your fence, you can do what Tom Sawyer did and convince others that it's a fun privilege for them to do it for you. Weinberger and Don Tapscott help lead cheers that metadata data entry is fun, but it seems to be the most fun for people making fun of Kevin Federline, and if it doesn't make your life more fun, it had better do something to make your life easier. Coming up with that incentive is the real silver bullet if you want to avoid writing a check for human labor or automated systems to do this work for you.


Another way is to make capturing metadata fun or competitive. See The ESP Game, for example, or Peekaboom. Both come from Luis von Ahn and his team.