RDF, The Semantic Web, and Linked Data

This essay is an attempt to tie together my articles and blog posts on semantic web related topics.

Bob DuCharme, last updated 24 September 2009

Resource Description Framework (RDF) is a W3C standard data structure for storing arbitrary data on the web and elsewhere.[1] Its key advantages include flexibility in describing different kinds of data, ease of aggregation, and the availability of standards supported by both commercial and open source software.

Several syntaxes are available for representing RDF data. RDF/XML was one of the earliest. Its frequent ugliness, along with the syntactic complexity it devoted to RDF features that ended up being little used, gave RDF a bad reputation in its early days, but RDF/XML doesn't have to be ugly.[2] Formats more popular today include N3, its sibling Turtle, N-Triples, and RDFa (formerly "RDF/a"[24]).

Machine-readable data on the public web

An RDF {subject, predicate, object} statement known as a "triple" names a resource and a bit of data about that resource. RDFa lets you embed these triples in HTML, and is becoming popular as a more flexible, scalable alternative to microformats.[11, 12, 3] (It can be used in other forms of XML such as DITA and DocBook as well.[80]) Examples of RDFa often show hand-coded markup, but it's easy enough to automate the addition of RDFa metadata with HTML generation systems such as Movable Type[4, 5] and the DITA Open Toolkit.[6] As more database-driven, non-static websites such as the news site Digg[7, 70] incorporate RDFa, they make more of their data and metadata available to applications in addition to making it available to eyeballs, which has been the main use of the web so far. As websites that deliver business-driving data such as schedules and prices make more of this data available using RDFa,[8] it will open up new possibilities for how we can use the web, especially now that Yahoo and Google have announced plans to parse and use metadata stored using RDFa.[71, 74]
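
The triple structure described at the start of this section can be sketched in a few lines of Python. This is only an illustration, not a full RDF library: the `Triple` class, the `to_ntriples` helper, and the `example.org` URIs are all made up for the example, and the escaping handles only backslashes and double quotes. The output uses the N-Triples syntax, which writes one triple per line.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type for this sketch)."""
    subject: str              # URI of the resource being described
    predicate: str            # URI naming the property
    obj: str                  # a URI or a plain literal value
    obj_is_uri: bool = False  # True when obj is a URI rather than a literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a line of N-Triples (simplified escaping)."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Two statements about the same (made-up) resource.
triples = [
    Triple("http://example.org/book/1",
           "http://purl.org/dc/elements/1.1/title",
           "An Example Book"),
    Triple("http://example.org/book/1",
           "http://purl.org/dc/elements/1.1/creator",
           "http://example.org/person/jane",
           obj_is_uri=True),
]

for t in triples:
    print(to_ntriples(t))
```

Because every statement has the same three-part shape, aggregating data from two sources amounts to concatenating their lists of triples.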

While RDFa lets you embed machine-readable data in web pages, the Linked Data movement provides a larger-scale approach to making data that fits the RDF model available on the public internet or across silos of intranets behind firewalls, and the number of sites making Linked Data available is constantly growing.[82] This data may be stored in a relational database[9, 10], but by providing an HTTP interface that references data using URIs[69] and accepts queries using the W3C standard SPARQL language, a "SPARQL endpoint" makes the data accessible to a wide variety of applications. Once you know the URL for a SPARQL endpoint, you can jump in and start exploring it with no need for special background on what kind of data is stored there.[13] Straightforward RDF development tools existed before SPARQL was available[44], but the existence of a standardized, well-supported query language for RDF means that, as with the opportunities that SQL provided for relational databases, people can extract data from databases according to their own conditions and terms with no need to build full applications. Of course, it also makes application development much easier, because SPARQL tools often combine easily with other tools.[66, 67, 68]
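
As a rough sketch of what talking to a SPARQL endpoint looks like, the following Python builds (but does not send) an HTTP GET request that carries a query in the `query` parameter, as the SPARQL protocol specifies. The endpoint URL is a placeholder and `build_sparql_request` is a hypothetical helper. The query itself is the kind of exploratory question you can ask without knowing anything about the data in advance: which classes of things are described here?

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint URL; substitute a real endpoint to try it out.
ENDPOINT = "http://example.org/sparql"

# An exploratory query: list up to ten classes that resources belong to.
QUERY = """
SELECT DISTINCT ?class
WHERE { ?s a ?class }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL query results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. That step is left out here because the endpoint above does not exist.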

A standard way to query that data

As with any computer language, you can build complex, powerful SPARQL queries, but simple SPARQL queries that get real work done are quite common.[15] (It's not unusual to find complaints about how complicated SPARQL queries can get posted on web pages that are themselves implemented with an impenetrable mess of spaghetti JavaScript code.[14])
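
To give a feel for how little a simple query asks of you, here is a toy Python model of the core idea behind a one-pattern SPARQL `SELECT`: a triple pattern with variables is compared against each triple, and every match yields a set of variable bindings. The `match` function, the `ex:` names, and the data are invented for illustration. A real SPARQL engine does far more (joins across patterns, filters, optional parts, and so on).

```python
def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Pattern terms that start with '?' are variables; all others must be equal
    to the corresponding part of the triple.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


data = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:name", "Alice"),
]

# Roughly: SELECT ?who WHERE { ?who ex:worksFor ex:acme }
results = [b["?who"] for b in match(("?who", "ex:worksFor", "ex:acme"), data)]
print(results)  # ['ex:alice', 'ex:bob']
```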

Once you know a little SPARQL, you can start querying interesting endpoints out there such as the Linked Movie Database[16] and DBpedia[17, 18, 14, 75, 76, 77], a community effort to create a SPARQL endpoint-accessible database from the structured data stored in Wikipedia Infoboxes (the fielded information in the gray boxes on the right side of many Wikipedia pages). These endpoints and the applications that use them are accumulating into a Linked Data Web that's already had its first conference.[19, 20] (I co-chaired this conference and interviewed several of its key speakers in the weeks leading up to it.[21, 22]) Like SQL, SPARQL is more than a query language, making it possible to create new data from existing data, which can be one of the more exciting aspects of application development with SPARQL.[81]
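
The idea of creating new data from existing data can be modeled in plain Python. The sketch below imitates what a SPARQL `CONSTRUCT` query does: it matches a pattern in existing triples and emits new triples built from the matches. The property names, the data, and the `construct_grandparents` function are assumptions made up for this example. The comment shows roughly what the equivalent query might look like.

```python
PARENT = "ex:hasParent"            # hypothetical property in the source data
GRANDPARENT = "ex:hasGrandparent"  # hypothetical property we want to derive


def construct_grandparents(triples):
    """Derive grandparent triples from parent triples.

    Roughly equivalent to:
        CONSTRUCT { ?c ex:hasGrandparent ?g }
        WHERE     { ?c ex:hasParent ?p . ?p ex:hasParent ?g }
    """
    # Index the existing data: child -> set of parents.
    parents = {}
    for s, p, o in triples:
        if p == PARENT:
            parents.setdefault(s, set()).add(o)

    # Join the index with itself to build the new triples.
    new_triples = set()
    for child, child_parents in parents.items():
        for parent in child_parents:
            for grandparent in parents.get(parent, ()):
                new_triples.add((child, GRANDPARENT, grandparent))
    return new_triples


existing = [
    ("ex:carol", PARENT, "ex:bob"),
    ("ex:bob", PARENT, "ex:alice"),
    ("ex:bob", "ex:name", "Bob"),
]

print(construct_grandparents(existing))
```

The result is itself a set of triples, so it can be added back to the original data or published on its own.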

Adding semantics to the shared data

While the idea of the Semantic Web predates the Linked Data movement, I find it useful to think of the Semantic Web as being the Linked Data web with the addition of standards-based semantics encoded to help you get more out of that data.[23] (As the idea of "semantics" becomes a buzzword for selling web-based technology, the "standards-based" part of this becomes more important.[50]) The infrastructure consisting of SPARQL endpoints, the SPARQL query language, and the best practice use of HTTP and URIs allows us to share data so that apps can find it; the RDF Schema Language (RDFS), the Web Ontology Language (OWL), and the software that supports them let you encode semantic information about your data[25, 45, 47] or the data of others, which is what makes the "web" part of this so exciting. As a W3C standard for encoding semantics and ontologies, OWL builds on a lot of pre-existing work[26] and moves that work beyond the academic world where it had been most comfortable to the world of commercial and open source software where we can all play with it.
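
A small example may help show what "encoding semantics" buys you. The Python sketch below applies one RDFS rule, that an instance of a class is also an instance of that class's superclasses, to a handful of triples until nothing new can be inferred. The `infer_types` function and the `ex:` data are invented for illustration. Real RDFS and OWL reasoners implement many more rules than this one.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the input triples plus rdf:type triples implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                if o == sub and (s, RDF_TYPE, sup) not in inferred:
                    inferred.add((s, RDF_TYPE, sup))
                    changed = True
    return inferred


data = [
    ("ex:rex", RDF_TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
]

result = infer_types(data)
print(("ex:rex", RDF_TYPE, "ex:Animal") in result)  # True
```

Nothing in the original data says that `ex:rex` is an animal. That fact comes from the schema statements, which could just as easily describe someone else's data as your own.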

The development of ontologies can seem intimidating, but in addition to the open-source[31, 32] and commercial[78, 79] tools available, the library science world has a long tradition of techniques and best practices for identifying key terms in a domain and their relationships. The influence of professional taxonomists has been a great benefit to the Semantic Web.[21, 33] More literature is becoming available to help as well.[72]

I've always been especially interested in OWL's potential role as metadata that lets you get more value out of existing data[34, 62, 63, 64], even when that data is stored in traditional relational databases.[27, 28, 29, 30, 43, 48] When the data is stored in a "triplestore" (a database specialized for the RDF data structure), you have more flexibility, and I've played with several of the excellent free triplestores now available.[35, 36, 37, 38, 39] Along with ease of setup and use, SPARQL support, RDFS and OWL support, and the added features of a commercial version, another interesting comparison point among these triplestores is their support for named graphs, or named sets of triples. It took me some time to appreciate how valuable these can be for updating data and tracking its provenance.[40, 41, 42]
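
The value of named graphs for updates and provenance can be illustrated with a toy in-memory store. In the Python sketch below, each triple is filed under the name of the graph it came from, so you can replace everything from one source in a single step and ask which sources assert a given triple. The `QuadStore` class, its method names, and the graph URIs are all hypothetical and do not reflect the API of any particular triplestore.

```python
from collections import defaultdict


class QuadStore:
    """Toy store that files each triple under a named graph."""

    def __init__(self):
        self._graphs = defaultdict(set)  # graph name -> set of triples

    def add(self, graph, triple):
        self._graphs[graph].add(triple)

    def replace_graph(self, graph, triples):
        """Update: swap out everything that came from one source."""
        self._graphs[graph] = set(triples)

    def sources_of(self, triple):
        """Provenance: which named graphs assert this triple?"""
        return sorted(g for g, ts in self._graphs.items() if triple in ts)

    def all_triples(self):
        """The merged view across all graphs."""
        return set().union(*self._graphs.values())


store = QuadStore()
price = ("ex:widget", "ex:price", "10")
store.add("http://example.org/graph/supplierA", price)
store.add("http://example.org/graph/supplierB", price)
store.add("http://example.org/graph/supplierB", ("ex:widget", "ex:color", "red"))

print(store.sources_of(price))

# Supplier A sends a fresh feed: replace its graph without touching B's data.
store.replace_graph("http://example.org/graph/supplierA",
                    [("ex:widget", "ex:price", "12")])
print(store.sources_of(price))
```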

Applying the technology

So what do you do with all this technology? What problems does it solve? One example is the integration of data across silos mentioned earlier.[46] Another example is the sharing and aggregation of data from sources that can't all be expected to follow the same structure, such as social networking sites.[49] As I've described ideas for other domains where semantic technology could be useful, such as legal publishing[51], XBRL work[65], and open source ERP[52] and CRM frameworks[53], I've received encouraging comments about progress already underway there.

Legal publishing is full of structured content that doesn't conform to a central model but must be shared among different groups, and this world has benefited from an XML-oriented approach since the days of SGML. Legal publishers also use a lot of metadata to help people find what they want. If semantic web technology helps you use metadata to get more value out of data, then it can provide an excellent complement to the kind of XML-based content technology used in legal publishing.[60, 61, 73, 75] As the publishing world moves more products to electronic delivery, Semantic Web technology can help other kinds of publishers track the relationships between their own content, licensed content, and user-generated content, and their rarely-static rights to use content from each of these categories. They can then use this relationship metadata to create new products.[54, 55]

Elsewhere in the publishing world, Adobe's XMP standard, a subset of RDF, makes it easier to connect several of the publishing world's most popular tools to this technology.[56, 57] Reuters' Calais can add RDF-based semantics to content at no cost[58, 59], adding even more possibilities to how a given set of content can bring business value or even entertainment to customers.

To build applications around this data, a growing number of open-source tools are available to mix and match with user interface tools[76]; commercial tools such as those of my employer, TopQuadrant,[78, 79] are becoming increasingly popular for the modeling, development, and deployment of Semantic Web applications.

footnotes