My old friend Dale Waldt (I remember, immediately after the announcement of the existence of XML at SGML 1996, going up to my then-coworker Dale and asking "So what do we think?") recently posted an entry on the Gilbane XML blog titled Why Adding Semantics to Web Data is Difficult. A few days ago I posted a comment saying that the things that he saw as missing from semantic technologies are actually already there and working well, but my reply hasn't shown up yet, so after a bit of revision, I'm putting it here. For my blog entry categories, I've put this under "Publishing" because most of what I've written below is already familiar to people in the semantic web world, but not as widely known in the publishing world.
"Consider, though, that the schema in use can tell us the names of semantically defined elements, but not necessarily their meaning. I can tell you something about a piece of data by using the <income> tag, but how, in a schema, can I tell you it is a net <income> calculated using the guidelines of the US Internal Revenue Service, and therefore suitable for eFiling my tax return? For that matter, one system might use the element type name <net_income> while another might use <inc>."
This is why the semantic web is built around URLs, not just element names. If someone refers to a "title" and you don't know whether that person is an HR administrator who means "job title" or a realtor referring to the deed to a piece of property, you don't know what they mean. However, if I refer to a http://purl.org/dc/elements/1.1/title, you know that I mean the title of a work or resource, because the URL makes it clear that I'm referring to the Dublin Core sense of the term.
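To make the "title" point concrete, here is a minimal Python sketch. Only the Dublin Core URL comes from the paragraph above; the HR namespace, the subject URLs, and the values are made up for illustration. It shows that the bare word "title" matches both statements, while the full URL picks out exactly one sense.

```python
# Only DC_TITLE is a real, published identifier; everything under
# example.com is a hypothetical stand-in for illustration.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
HR_TITLE = "http://example.com/hr/terms/title"

# (subject, property, value) statements with invented values.
statements = [
    ("http://example.com/doc/42", DC_TITLE, "Annual Report"),
    ("http://example.com/staff/7", HR_TITLE, "Senior Editor"),
]


def values_of(prop, data):
    """Return the values of every statement that uses exactly this property URL."""
    return [value for _subject, p, value in data if p == prop]


def local_name(url):
    """Return the last path or fragment component of a URL."""
    return url.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


# Matching on the full URL is unambiguous.
work_titles = values_of(DC_TITLE, statements)
job_titles = values_of(HR_TITLE, statements)

# Matching on the bare name "title" lumps both senses together.
ambiguous = [value for _s, p, value in statements if local_name(p) == "title"]
```

Asking for `DC_TITLE` returns only the title of the work, while the lookup by local name cannot tell the job title from the document title.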
As I understand it, XBRL's goal was not so much to standardize the vocabularies of element type names as to standardize ways of identifying them. For example, in GE's XBRL financial statement, they chose to identify net income with the URL http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome and have this declared in a filed document. Instead of encouraging everyone to create their own new vocabularies, though, the XBRL effort did create a set of US GAAP taxonomies, and these are forming a core set of documented, commonly understood terminology for U.S. accounting.
"How will we know that elements labeled with <net_income> and <inc> are the same and should be handled as such?"
Let's assume that company X uses the term "net_income" and company Y uses the term "inc". When they publicly define what they mean by these terms using OWL ontologies or XBRL taxonomies, they avoid the confusion you describe by defining them with URLs, just as the OCLC did for Dublin Core terms, so let's say the terms' full names are http://www.x.com/ns/xbrl/net_income and http://www.y.com/some/path/inc. (Of course, if an XML document includes the namespace declarations xmlns:x="http://www.x.com/ns/xbrl/" and xmlns:y="http://www.y.com/some/path/", the element names can use the abbreviations x:net_income and y:inc.)
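As a rough illustration of how those abbreviations expand, here is a small Python sketch using the standard library's ElementTree parser. The two namespace URLs are the ones from the paragraph above; the <report> wrapper element and the figures 1200 and 950 are invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up document that uses the x: and y: prefixes declared above.
doc = """
<report xmlns:x="http://www.x.com/ns/xbrl/"
        xmlns:y="http://www.y.com/some/path/">
  <x:net_income>1200</x:net_income>
  <y:inc>950</y:inc>
</report>
"""

root = ET.fromstring(doc)


def full_name(element):
    """Join an element's namespace URL and local name into one full name."""
    tag = element.tag
    # ElementTree stores a namespaced tag as "{namespace}local".
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag


# Map each child's full name to its text content.
expanded = {full_name(child): child.text for child in root}
```

After parsing, `expanded` is keyed by http://www.x.com/ns/xbrl/net_income and http://www.y.com/some/path/inc, so the prefixes x: and y: themselves carry no meaning beyond pointing at those URLs.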
The following bit of OWL asserts that they're both the same as GE's term for net income, and a SPARQL query that uses the GE URL to say "get me net income figures" will get the others as well:
<owl:DatatypeProperty rdf:about="http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome">
  <owl:equivalentProperty>
    <owl:DatatypeProperty rdf:about="http://www.x.com/ns/xbrl/net_income"/>
  </owl:equivalentProperty>
  <owl:equivalentProperty>
    <owl:DatatypeProperty rdf:about="http://www.y.com/some/path/inc"/>
  </owl:equivalentProperty>
</owl:DatatypeProperty>
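To show what a query gains from those equivalence assertions, here is a toy Python sketch. It is not an OWL reasoner or a SPARQL engine, just a plain in-memory stand-in for the idea. The three property URLs are the ones used above; the filing URLs, the revenue property, and all the figures are invented for illustration.

```python
from collections import defaultdict

GE = "http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome"
X = "http://www.x.com/ns/xbrl/net_income"
Y = "http://www.y.com/some/path/inc"

# The two equivalentProperty assertions, as plain pairs.
equivalences = [(GE, X), (GE, Y)]

# Hypothetical (subject, property, value) facts with made-up figures.
facts = [
    ("http://example.com/filing/ge", GE, 100),
    ("http://example.com/filing/x", X, 20),
    ("http://example.com/filing/y", Y, 3),
    ("http://example.com/filing/x", "http://www.x.com/ns/xbrl/revenue", 50),
]


def equivalence_class(prop, pairs):
    """Return every property reachable from prop through equivalence pairs."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        # Equivalence is symmetric, so record both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, todo = {prop}, [prop]
    while todo:
        for nxt in neighbours[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def query(prop, data, pairs):
    """Return (subject, value) pairs for prop and all equivalent properties."""
    wanted = equivalence_class(prop, pairs)
    return sorted((s, v) for s, p, v in data if p in wanted)


# Asking for GE's net income property also picks up X's and Y's figures.
net_income = query(GE, facts, equivalences)
```

Because the equivalence is followed in both directions, starting the query from company X's or company Y's URL returns the same three figures, and the unrelated revenue fact is left out.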
This nicely demonstrates the potential of OWL as metadata that adds value to existing bodies of data.
OWL has been a standard for four years, and there are several implementations available that let you do this. (Speaking of semantics, in addition to defining such equivalences, OWL can also encode semantics.)
The great thing about OWL's relationship to XBRL is that much of XBRL is about defining taxonomies and semantics, and OWL is about building on such definitions to get more value out of data.
"Obviously an industry standard like XBRL (eXtensible Business Reporting Language) can help standardize vocabularies for element type names, but this cannot be the whole solution or XBRL use would be more widespread."
XBRL helps to standardize naming within the world of business reporting, but the need for vocabulary definition standards and tools goes well beyond that world. (The full set of XBRL specs is also a complex solution to a complex problem, which keeps adoption from becoming widespread very quickly.) The goal of RDFS was to help people define such vocabularies, but OWL provides a superset of RDFS and has slicker tools available, so people sometimes build OWL ontologies when they only need an RDFS vocabulary.
"I think the Semantic Web will require more than schemas and XML-aware search tools to reach its full potential in intelligent data and applications that process them. What is probably needed is a concerted effort to build semantic data and tools that can process these, including browsing, data storage, search, and classification tools."
For data storage and search, commercial and open source triplestore tools are available. (I recently mentioned that I've been blogging less because I've been looking into them.) For browsing, new semantic web Firefox plugins crop up all the time. I'll discuss classification next week, but as a hint, it turns around the question of what semantic web technology can bring to the publishing world: it's more about what semantic web technology can learn from the publishing world.