(semantic web) - semantics = linked data?

So much of the best "semantic web" technology has little to do with semantics.

When people talk about semantic technology, they're often talking about technology that has nothing to do with semantics. They're talking about the new possibilities that the RDF data model and the SPARQL query language add to distributed database applications, and there's a lot to talk about. As Jim Hendler once wrote,

My document can point at your document on the Web, but my database can't point at something in your database without writing special purpose code. The Semantic Web aims at fixing that.

Why do we describe technology for easier integration of machine-readable data on the web as "semantic"? I don't mean to pick on Jim—I had the quote handy because it's in my file of favorite quotes, and few understand the semantic add-ons to Linked Data that will make for a proper Semantic Web better than he does—but I don't see semantics necessarily playing much role in the technology evolving to let web databases easily point at each other. There are some semantics built into the middle third of all RDF triples, because the requirement that a predicate use a full URL means that I can't just say "title" there, leaving you to wonder whether I'm talking about a job title, the deed to a piece of property, or the title of a work; I have to say something like http://purl.org/dc/elements/1.1/title to make it clear that I mean the title of a work. In other words, I must make the semantics of the triple's predicate clear.
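To make the point about predicates concrete, here is a minimal Python sketch that models triples as plain tuples. The Dublin Core URL is the one mentioned above; the example.org subjects and the job-title predicate are hypothetical names invented purely for illustration.

```python
# Triples modeled as (subject, predicate, object) tuples.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"  # title of a work (URL from the text above)
JOB_TITLE = "http://example.org/hr/jobTitle"         # hypothetical predicate for a job title

triples = [
    ("http://example.org/book/1", DC_TITLE, "Moby-Dick"),   # hypothetical subject
    ("http://example.org/staff/42", JOB_TITLE, "Editor"),   # hypothetical subject
]

def objects_for(predicate, data):
    """Return the object of every triple that uses the given predicate."""
    return [o for (_s, p, o) in data if p == predicate]

# A bare "title" would mix these two up; the full URLs keep them apart.
work_titles = objects_for(DC_TITLE, triples)
job_titles = objects_for(JOB_TITLE, triples)
print(work_titles)
print(job_titles)
```

Because each predicate is a full URL, a consumer asking for titles of works never picks up job titles by accident, and no other logic about meaning is involved.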

Other than that, I don't see what's semantic about exposing data as triples and using SPARQL to get at it as described by Tim Berners-Lee's original essay on Linked Data principles, except that the general ideas are an outgrowth of the older idea of the Semantic Web. We're seeing now that as more data gets exposed and linked this way, more and more possibilities open up. Once enough data is linked using this technology, then there will be enough to work with to start making general-purpose semantic applications, but until then, the use of OWL and related technologies that really address semantics will be limited to niches. Companies such as TopQuadrant and Clark & Parsia are doing very interesting work in those niches, and they're blazing the trails for when the broader information technology and publishing worlds are ready to take advantage of the semantics of this linked data. (In a recent Semantic Web gang podcast, someone said that new technology traditionally moves from NASA to the military to corporations to independent end users, and that we're seeing the reverse with Semantic Web adoption. I guess he didn't know that NASA is a client of both TopQuadrant and Clark & Parsia.)

While Zepheira's web site certainly uses the word "semantic" a lot, they seem more focused on linked data technologies in helping their clients "integrate, navigate and manage information across personal, group and enterprise boundaries." I think that this is a better place for most developers to focus, at least for now, because there's a better chance of a medium- and even short-term payoff. That's the data infrastructure that actual semantic technologies can build on, so for now let's focus on the value of the infrastructure: data exposed (either publicly or behind the firewall across internal enterprise boundaries, which I believe is where Zepheira's been helping a lot of clients) in a standard way so that the growing number of tools built around those standards can take advantage of that data. This is just what the organizations in the Linking Open Data dataset cloud have been doing. There is plenty of payoff when applications can combine data from different sources to do things with no need for a central schema tying them together, and this is possible without any program logic addressing the semantics of that data.
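As a rough illustration of combining sources with no central schema, the sketch below merges two independently produced sets of triples and queries the result with a simple wildcard pattern. This is plain Python standing in for a triple store and SPARQL, not either of those technologies; the two data sets, the example.org and example.com names, and the price predicate are all hypothetical.

```python
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
PRICE = "http://example.com/shop/price"  # hypothetical predicate

# Two sources that know nothing about each other (hypothetical data).
publisher_data = {
    ("http://example.org/book/1", DC_TITLE, "Moby-Dick"),
    ("http://example.org/book/1", DC_CREATOR, "Herman Melville"),
}
bookshop_data = {
    ("http://example.org/book/1", PRICE, "9.99"),
}

# Merging two sets of triples is just set union; no schema mapping step.
merged = publisher_data | bookshop_data

def match(pattern, data):
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in data:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

def describe(subject, data):
    """Collect every predicate/object pair known about one subject."""
    return {p: o for (_s, p, o) in match((subject, None, None), data)}

book = describe("http://example.org/book/1", merged)
print(book)
```

The only thing the two sources had to agree on was the URI for the book. Nothing in the code interprets what a title or a price means.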

Of course the real semantic technologies such as OWL and inferencing engines build on that, so this will bring even cooler applications. Nevertheless, to evangelize the data infrastructure that this will build on and to allay the fears of enterprise IT people who remember pie-in-the-sky AI promises when they hear the word "semantic", telling them about Semantic Web technology without the semantic parts (a.k.a. Linked Data) looks like an easier sell to me.

Comments? Corrections? Is the full URL in predicates enough to say that any use of RDF triples qualifies as semantic technology? (If anyone tells me that I'm misunderstanding the term "semantics", I'll be tempted to say "well, that's just semantics", so be forewarned.)


When I talk with people about the (semantic) web, the word "semantic" always gets confused with its AI meaning... Linked Data works well for this... and there is an interesting trick in your title: semantic web - semantic = web (the real one, not 2.0, 3.0, and so on)



Why do people say "semantic" when there isn't much semantics in what they're doing? Well, marketing mojo is a perfectly reasonable answer. I don't mean to be cynical, since this seems an unobjectionable strategy.

The same goes for people's worries that "semantic" equals AI -- marketing is, by definition, a pragmatic endeavor, so if in some cases it makes sense to soft-sell that connection, one should soft-sell it. If in other cases -- which the anti-OWL crowd never seems to consider -- the connection between semantics and AI is a plus for a customer, then you can emphasize it.

For a more technical answer, re: Jim's thing about pointing at other people's databases... This is tricky. Even within organizations, or within parts of organizations, integrating directly with someone else's database is tricky, often introducing a tight coupling that you don't really want.

Using some "semantics" in this context really means integrating data models (or service interfaces) rather than integrating data sources directly, such that consumers and producers are sufficiently decoupled to be able to ignore some (though not all) changes in the underlying data.

The standard way to do this (and the way which is in line with historical trends in IT) is to have some declarative abstract representation of the data source, or database, and integrate with *that* thing, since it will tend to be more change resistant than the underlying thing it is an abstraction of. Hence the use of ontologies for integration, etc. In this usage pattern, a reasoner is (1) an aid to developing the ontology in the first place, and (2) a supplement to the code you write to integrate with the ontology instead of the thing it represents.

(So you get the reasoner to check that the model is logically consistent, to do subclass and subproperty inference, or most specific type realization, or inference explanation, in order to shorten the total amount of code you have to write, etc.)
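For readers who haven't seen it, the subclass inference mentioned above can be sketched in a few lines of Python. This is a toy fixed-point loop over two RDFS-style rules, not how any production reasoner is implemented; the rdf:type and rdfs:subClassOf URIs are the standard ones, while the example.org classes and the book are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/schema/"  # hypothetical namespace

triples = {
    (EX + "Novel", RDFS_SUBCLASSOF, EX + "Book"),
    (EX + "Book", RDFS_SUBCLASSOF, EX + "Work"),
    ("http://example.org/book/1", RDF_TYPE, EX + "Novel"),
}

def infer_types(data):
    """Apply two rules until nothing new appears:
    subClassOf is transitive, and instances inherit superclass types."""
    inferred = set(data)
    while True:
        new = set()
        subclass_pairs = [(s, o) for (s, p, o) in inferred if p == RDFS_SUBCLASSOF]
        for (a, b) in subclass_pairs:
            # Rule 1: a subClassOf b, b subClassOf d  =>  a subClassOf d
            for (c, d) in subclass_pairs:
                if b == c:
                    new.add((a, RDFS_SUBCLASSOF, d))
            # Rule 2: x type a, a subClassOf b  =>  x type b
            for (s, p, o) in inferred:
                if p == RDF_TYPE and o == a:
                    new.add((s, RDF_TYPE, b))
        if new <= inferred:  # fixed point reached
            return inferred
        inferred |= new

closure = infer_types(triples)
print(("http://example.org/book/1", RDF_TYPE, EX + "Work") in closure)
```

Integration code can then ask for everything of type Work without listing every subclass itself, which is the kind of code saving described above.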

RDFS gives you some abstraction constructs over the underlying messy reality, but if you're doing RDFS, you're not exactly semantics-less. OWL gives you more, obviously. ISO Common Logic gives you even more -- at least, in principle -- at the cost of some tradeoffs, etc.

But it's this problem of direct coupling of data sources that makes me think that the Linked Data thing, at least as I presently understand it, is not a useful approach for the sorts of things we're trying to do. Oh, and there's my skepticism about the claims of network effect: that once you get enough "linked data", some cool semantics effects emerge. I think there's no reason whatever to believe that will happen. Or, to put my skepticism in a weaker, falsifiable form: no one has explained, in sufficient detail, a plausible scenario whereby having a lot of "linked data" means you don't need to build models or ontologies, etc.

Oh, and PS: The rhetorical strategies around the notion of "niche" -- OWL is the "niche", Linked Data is the "mainstream" -- rely on a shared set of empirical data (or shared set of empirical *hunches and intuitions*) about what's getting used more often, when, where, etc. Apparently we don't share the same data or intuitions as you, Bob, such that OWL is the "niche" and Linked Data is the mainstream.

It may seem that way in the semweb blogger echo chamber, but it doesn't seem that way anywhere else, at least not to me. FYI. :>

Hi Kendall,

That's a good point about data integration. Saying that field W in database X is the same as field Y in database Z is not something to do lightly if you're doing updates based on those values, so understanding and documenting the semantics of those fields makes such an association much more robust.

I certainly don't believe that once you get enough linked data some cool semantics effects will emerge spontaneously; my point is that as more and more interesting public data sets become available and point at each other, there will be more opportunities to create ontologies that add value to that data and write apps that take advantage of that added value, which is where I think the real semantic goodness lies, at least in terms of apps with the potential for wide deployment.

Perhaps I should have described in more detail why I wrote this posting. I see organizations in industries like publishing (which I will address more specifically in the near future) asking what semantic technologies can do for them, but the concept of "semantic technologies" is such a vague blob to them that it becomes more difficult, both for them and for those helping them address the question, to line up potential actions with potential benefits. I think that breaking down the categories of semantic technologies into related units will make this easier, and my "(semantic web) - semantics = linked data" cut, while obviously broad and generalized, is an attempt at this.

I certainly don't feel that Linked Data is mainstream. The semantic web marketing that you mentioned has a much bigger head start. I used "niches" to describe those who can really take advantage of semantics at this point; I'm sure it will move beyond niches over time. (I look forward to it!) I see the potential value of exposing data in SPARQL endpoints, without necessarily defining an ontology for the data sets, as having more potential for publishers and many others for now. In other words, I'm not talking about what's getting used more often, but what I feel has more short-term potential. Just my opinion.

(I won't repeat what Kendall has already said re data integration strategies)

While I'm not a huge fan of the *word* "semantics" (many find it confusing or obscure), there are plenty of semantics intimately involved in all RDF-based linked data activities. At the heart of the SW effort is a project to make mechanically clearer what Web documents are telling us. A big part of this is to do with reference - knowing what real world entities are being described. Colloquially, "what they are about".
Linked data efforts care about that at least as much as the rest of the SW world: URIs for things, well known URIs for things, URIs for things that can be readily used to find good and machine-readable descriptions of those things, and so on. And at least to the extent they use FOAF constructs and habits, there's some modest but significant use of OWL too: the use of the 'inverse functional property' construct (e.g. isPrimaryTopicOf) to help point out identifying properties in a description, even if the property itself is not one known to an aggregator.
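A small sketch may help show what an inverse functional property buys an aggregator: if two descriptions share a value for such a property, they can be treated as describing the same thing. The Python below is a deliberately simplified version of that merging step (it only rewrites subjects, and it does not chain merges across several identifying properties). The FOAF property URIs are real; the blank node labels, names, and example.org page are hypothetical.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
FOAF_IS_PRIMARY_TOPIC_OF = "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"

# Properties the aggregator has been told are inverse functional.
INVERSE_FUNCTIONAL = {FOAF_IS_PRIMARY_TOPIC_OF}

# Hypothetical descriptions gathered from three different documents.
triples = [
    ("_:a", FOAF_NAME, "Alice Example"),
    ("_:a", FOAF_IS_PRIMARY_TOPIC_OF, "http://example.org/alice/home"),
    ("_:b", FOAF_NICK, "al"),
    ("_:b", FOAF_IS_PRIMARY_TOPIC_OF, "http://example.org/alice/home"),
    ("_:c", FOAF_NAME, "Carol Example"),
]

def smush(data, ifps):
    """Rewrite subjects that share an inverse functional property value
    so that they all use the first subject seen with that value."""
    canonical = {}  # (predicate, object) -> first subject seen with that pair
    alias = {}      # subject -> canonical subject
    for s, p, o in data:
        if p in ifps:
            alias[s] = canonical.setdefault((p, o), s)
    # Building a set drops the now-duplicated identifying triple.
    return sorted({(alias.get(s, s), p, o) for s, p, o in data})

merged = smush(triples, INVERSE_FUNCTIONAL)
for triple in merged:
    print(triple)
```

After merging, the nickname from the second description is attached to the same node as the name from the first, even though neither source mentioned the other.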

In general I'm pretty wary of encouraging SW enthusiasts to fracture into competing sub-tribes. There is too much "we don't need that fancy academic OWL" rhetoric floating around, which is to my mind as senseless as having Java users berate Javadoc and IDEs.

Even a tiny little vocabulary the size of FOAF is complex enough that internal contradictions and other mistakes are a real risk ('can documents be agents? are onlineaccounts documents? or agents? can they be both? can two different documents have the same foaf:sha1 value? why not?', etc.). OWL is a tool that can help us achieve clarity, and detect inclarities, in this area, regardless of whether there are "intelligent agents" running around at click-time drawing inferences and doing while-u-wait inferences.

And don't get me started on 'owl:sameAs', ...

In general I'm pretty wary of encouraging SW enthusiasts to fracture into competing sub-tribes. There is too much "we don't need that fancy academic OWL" rhetoric floating around, which is to my mind as senseless as having Java users berate Javadoc and IDEs.

Dan, I agree:


I used to apologize for the word "Semantic" in "Semantic Web", until a student in one of my classes who happened to be a professional linguist told me to stop apologizing. Why? Because, he told me, there are many meanings of the word "Semantics" in Linguistics, including speech acts, formal semantics, etc. But, he pointed out, all of them refer to one very simple notion of semantics - that a symbol can refer to something in the world. He went so far as to say that this was the fundamental notion of "Semantics" in linguistics. Other linguists might challenge that statement for linguistics in general, but it holds up in the Semantic Web. The basic idea of linked data is that you can refer to something in the world with a symbol (where a symbol is a URI).

This is the basis for the non-niche work that makes up the bulk of TopQuadrant's custom; in fact, as far as we are concerned, the jury is still out on the usefulness of "OWL and related technologies" in real enterprise applications. Our customers are getting on pretty well with the more basic notion of "Semantics".

There's a reason why Jim and I called our book "Working Ontologist" - we only refer to OWL inasmuch as it can be used to specify how datasources relate to one another.

TopQuadrant is certainly doing a lot of complex ontology-based work for NASA. Having said this, our business is about helping organizations harness (read - integrate, share, analyze) information distributed across systems and parties. Much of the work at NASA is about data integration.

The majority of our customers use pretty light ontologies/schemas. There is no way of getting away from some kind of a schema or structure: XML has it, spreadsheets have it, databases have it, etc. And this is what our customers are bringing together. TopBraid Suite generates an RDFS/OWL representation of the schemas used to interpret the data so that the data and its structure can be exposed in RDF for SPARQL queries, either by converting the data or by translating SPARQL queries into SQL. We see ourselves as a SPARQL company. Take a look, for example, at: http://topquadrant.com/sparqlmotion/

Most of the data sources our customers want to integrate tend to be internal, but some are external; from the technology perspective we do not really see any difference. One business difference is whether a customer does it to expose the data outside their organization on the World Wide Web. A few are considering doing this, but the majority want a more flexible way of integrating data and creating and exposing data services to other parties within their own and partnering organizations. Many also want to take advantage of the flexible schemas and databases they get from using RDFS/OWL, as opposed to the more rigid world of relational databases. We see the latter benefit being of considerable interest to companies involved in managing and publishing content and wanting to have flexible taxonomies and metadata.

Since the areas we see most developers focusing on are in line with what you have described in the post (managing, navigating and integrating information), I guess we agree on where the value is.

What I am not so sure about is the contrast you are drawing between this and the word “semantic”. If “semantic” is interpreted as a focus on complex description logic ontologies, then we see some of it here and there, but not much. We do see people wanting to express their business rules as part of the data integration and application development. For example, in a vocabulary management application there could be a rule that indicates that the “level” number of a topic needs to be changed in a certain way if it is moved within a hierarchy. TopBraid Suite makes it easy to automate this.

I'm speaking with my Uche hat on, because I think in many ways, Zepheira is an integration of differing perspectives (I'm, for example, more of an XML type than most). As for the word "semantics", it's interested me since even before I got heavily into RDF (just before, to be fair). The spark was Robin Cover's 1998 article, "XML and Semantic Transparency". I still use the term "semantic transparency" a lot to describe the gap left by the base layer of XML technology.

In the side bar to my 2000 article "An introduction to RDF" I was already speaking of the curse of the "S" word. In the end, as Kendall pointed out, it all comes down to marketing, and that's fine. Marketing is all about communication, and if "semantic" allows those visiting our site to understand what we want them to understand, we'll use the term.

I don't think there is a dichotomy between linked data and "semantics". As I've argued a lot in my "Thinking XML", I think that simple links, e.g. within schema definitions, to some source of semantic agreement, whether expressed in RDF or otherwise, are sufficient for most needs, and sufficient to make a huge change in the value of bodies of data. See, for example:

* http://www.ibm.com/developerworks/xml/library/x-tipdict.html
* http://www.ibm.com/developerworks/xml/library/x-think32.html

I think this is a very close correspondence to linked data. I know that e.g. Tim BL doesn't like my advocacy of Linked Data without insisting on every other spec built on top of RDF, but I don't see my lack of enthusiasm for SPARQL and OWL as schism-making. If the RDF community thinks itself healthy and viable, it's going to have to accommodate deep differences of technical opinion. We can't all just be a happy-happy-joy-joy coxless eight.

As for Zepheira, we cleave to the practical. All our architects can agree on the outlines of Linked Data, and I offer pretty sharp tools for using this to bring semantic transparency to XML, and re-animating XML and other large, dead bodies of data is generally a large concern for our clients, so both shoes fit rather nicely, and we're happy to wear them.

I continue to think that even Linked Data, by which we mean my pieces of data can link to your pieces of data across the web, is secondary. The big change is in database structure: from relational tables to graphs. It's in the way that my pieces of data link to my other pieces of data. This is the core thing RDF does (but not well enough), and what SPARQL exists to take advantage of (but not well enough). And yes, it's also the foundation for linking separate datasets across the web, but as long as we keep talking about it as primarily (exclusively?) an integration strategy, it will seem peripheral to most of the people who currently have the data...