RDF, The Semantic Web, and Linked Data

This essay is an attempt to tie together my articles and blog posts on semantic web related topics.

Bob DuCharme, last updated 24 September 2009

Resource Description Framework (RDF) is a W3C standard data structure for storing arbitrary data on the web and elsewhere.[1] Its key advantages include flexibility in describing different kinds of data, ease of aggregation, and the availability of standards supported by both commercial and open source software.

Several syntaxes are available for representing RDF data. RDF/XML was one of the earliest. Its frequent ugliness, and the syntactic complexity it devoted to RDF features that ended up being little used, gave RDF a bad reputation in its early days, but RDF/XML doesn't have to be ugly.[2] Formats more popular today include N3, its sibling Turtle, N-Triples, and RDFa (formerly "RDF/a"[24]).

Machine-readable data on the public web

An RDF {subject, predicate, object} statement known as a "triple" names a resource and a bit of data about that resource. RDFa lets you embed these triples in HTML, and is becoming popular as a more flexible, scalable alternative to microformats.[11, 12, 3] (It can be used in other forms of XML such as DITA and DocBook as well.[80]) Examples of RDFa often show hand-coded markup, but it's easy enough to automate the addition of RDFa metadata with HTML generation systems such as Movable Type[4, 5] and the DITA Open Toolkit.[6] As more database-driven, non-static websites such as the news site Digg[7, 70] incorporate RDFa, they make more of their data and metadata available to applications in addition to making data available to eyeballs, which has been the main use of the web so far. As websites that deliver business-driver data such as schedules and prices make more of this data available using RDFa,[8] it will open up new possibilities for how we can use the web, especially now that Yahoo and Google have announced plans to parse and use metadata stored using RDFa.[71, 74]
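To make the {subject, predicate, object} idea concrete, here is a minimal sketch in plain Python that treats triples as tuples and aggregates two sources with a set union. The example.org URIs, the flight data, and the describe helper are all made up for illustration; real applications would use an RDF library and one of the syntaxes mentioned above.

```python
# Hypothetical data: every URI and value below is invented for illustration.
EX = "http://example.org/"

# Triples that one site might publish about a flight.
site_a = {
    (EX + "flight42", EX + "departs", "2009-10-01T09:30"),
    (EX + "flight42", EX + "price", "199.00"),
}

# Triples that a second site might publish about the same resource.
site_b = {
    (EX + "flight42", EX + "airline", "Example Air"),
    (EX + "flight42", EX + "price", "199.00"),  # same fact, stated again
}


def describe(triples, subject):
    """Return {predicate: sorted list of objects} for one subject."""
    result = {}
    for s, p, o in triples:
        if s == subject:
            result.setdefault(p, []).append(o)
    return {p: sorted(objects) for p, objects in result.items()}


# Aggregation is a set union: identical triples collapse into one.
merged = site_a | site_b
print(len(merged))
print(describe(merged, EX + "flight42"))
```

Because both sources use the same URI for the flight, no schema mapping is needed before the data can be combined, which is the ease of aggregation that makes the model attractive.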

While RDFa lets you embed machine-readable data in web pages, the Linked Data movement provides a larger-scale approach to making data that fits the RDF model available on the public internet or across silos of intranets behind firewalls, and the number of sites making Linked Data available is constantly growing.[82] This data may be stored in a relational database[9, 10], but by providing an HTTP interface that references data using URIs[69] and accepts queries using the W3C standard SPARQL language, a "SPARQL endpoint" makes the data accessible to a wide variety of applications. Once you know the URL for a SPARQL endpoint, you can jump in and start exploring it with no need for special background on what kind of data is stored there.[13] Straightforward RDF development tools existed before SPARQL was available[44], but the existence of a standardized, well-supported query language for RDF means that, as with the opportunities that SQL provided for relational databases, people can extract data from databases according to their own conditions and terms with no need to build full applications. Of course, it also makes application development much easier, because SPARQL tools often combine easily with other tools.[66, 67, 68]
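As a rough sketch of what talking to a SPARQL endpoint looks like, the following Python builds (but does not send) an HTTP GET request that passes a query in the "query" parameter, as the SPARQL protocol describes. The choice of the DBpedia endpoint URL, the exploratory query, and the build_sparql_request function name are my assumptions for illustration, and a given endpoint may or may not honor the JSON Accept header.

```python
from urllib.parse import urlencode
from urllib.request import Request

# An exploratory query that needs no prior knowledge of the data:
# list ten of the classes that resources in the store belong to.
EXPLORE_QUERY = """
SELECT DISTINCT ?class
WHERE { ?s a ?class }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build a GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format (assumed to be supported).
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed endpoint URL, used here only as an example.
request = build_sparql_request("http://dbpedia.org/sparql", EXPLORE_QUERY)
print(request.full_url)

# To actually send it (network access required):
#   import json, urllib.request
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

The point of the sketch is that the only endpoint-specific piece is the URL; the query and the request shape stay the same from one endpoint to the next.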

A standard way to query that data

As with any computer language, you can build complex, powerful SPARQL queries, and it's not unusual to find complaints about how complicated SPARQL queries can get (posted on web pages that are themselves implemented with an impenetrable mess of spaghetti JavaScript code[14]), but simple SPARQL queries that get real work done are quite common.[15]

Once you know a little SPARQL, you can start querying interesting endpoints out there such as the Linked Movie Database[16] and DBpedia[17, 18, 14, 75, 76, 77], a community effort to create a SPARQL endpoint-accessible database from the structured data stored in Wikipedia Infoboxes (the fielded information in the gray boxes on the right side of many Wikipedia pages). These endpoints and the applications that use them are accumulating into a Linked Data Web that's already had its first conference.[19, 20] (I co-chaired this conference and interviewed several of its key speakers in the weeks leading up to it.[21, 22]) Like SQL, SPARQL is more than a query language, making it possible to create new data from existing data, which can be one of the more exciting aspects of application development with SPARQL.[81]
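The idea of creating new data from existing data can be sketched without a SPARQL engine. The Python below imitates what a CONSTRUCT query does: it matches a pattern across existing triples and emits new ones. The example.org namespace, the family data, and the construct_grandparents function are hypothetical, and the SPARQL in the comment is only an approximate equivalent.

```python
EX = "http://example.org/"  # hypothetical namespace

data = {
    (EX + "alice", EX + "hasParent", EX + "bob"),
    (EX + "bob", EX + "hasParent", EX + "carol"),
    (EX + "dave", EX + "hasParent", EX + "carol"),
}

# Roughly the same thing expressed in SPARQL:
#   CONSTRUCT { ?x ex:hasGrandparent ?z }
#   WHERE     { ?x ex:hasParent ?y . ?y ex:hasParent ?z }


def construct_grandparents(triples):
    """Join hasParent triples on the shared middle resource."""
    parent_of = [(s, o) for s, p, o in triples if p == EX + "hasParent"]
    return {
        (x, EX + "hasGrandparent", z)
        for x, y in parent_of
        for y2, z in parent_of
        if y == y2
    }


new_triples = construct_grandparents(data)
print(new_triples)
```

The output is itself a set of triples, so it can be merged back into the original data or published as a new data set.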

Adding semantics to the shared data

While the idea of the Semantic Web predates the Linked Data movement, I find it useful to think of the Semantic Web as being the Linked Data web with the addition of standards-based semantics encoded to help you get more out of that data.[23] (As the idea of "semantics" becomes a buzzword for selling web-based technology, the "standards-based" part of this becomes more important.[50]) The infrastructure consisting of SPARQL endpoints, the SPARQL query language, and the best practice use of HTTP and URIs allows us to share data so that apps can find it; the RDF Schema Language (RDFS), the Web Ontology Language (OWL), and the software that supports them let you encode semantic information about your data[25, 45, 47] or the data of others, which is what makes the "web" part of this so exciting. As a W3C standard for encoding semantics and ontologies, OWL builds on a lot of pre-existing work[26] and moves that work beyond the academic world where it had been most comfortable to the world of commercial and open source software where we can all play with it.
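One small example of what encoded semantics buys you is class membership inferred from rdfs:subClassOf statements. The sketch below is a naive forward-chaining loop in plain Python, not how any particular RDFS or OWL tool works; the rdf:type and rdfs:subClassOf URIs are the standard ones, while the example.org classes and the infer_types function are made up for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace

triples = {
    (EX + "Novel", RDFS_SUBCLASS, EX + "Book"),
    (EX + "Book", RDFS_SUBCLASS, EX + "Publication"),
    (EX + "mobyDick", RDF_TYPE, EX + "Novel"),
}


def infer_types(triples):
    """Add implied type and subclass triples until nothing new appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, c in inferred:
            if p in (RDF_TYPE, RDFS_SUBCLASS):
                # If c is a subclass of d, then s relates to d the same way.
                for c2, p2, d in inferred:
                    if p2 == RDFS_SUBCLASS and c2 == c:
                        new.add((s, p, d))
        if new <= inferred:
            return inferred
        inferred |= new


result = infer_types(triples)
print((EX + "mobyDick", RDF_TYPE, EX + "Publication") in result)
```

A query for all publications would now find mobyDick even though the original data only said it was a novel, and the same schema triples could be applied to someone else's data just as easily.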

The development of ontologies can seem intimidating, but in addition to the open-source [31, 32] and commercial [78, 79] tools available, the library science world has a long tradition of techniques and best practices for identifying key terms in a domain and their relationships. The influence of professional taxonomists has been a great benefit to the Semantic Web.[21, 33] More literature is becoming available to help as well.[72]

I've always been especially interested in OWL's potential role as metadata that lets you get more value out of existing data[34, 62, 63, 64], even when that data is stored in traditional relational databases.[27, 28, 29, 30, 43, 48] When the data is stored in a "triplestore"—a database specialized for the RDF data structure—you have more flexibility, and I've played with several of the excellent free triplestores now available.[35, 36, 37, 38, 39] Along with ease of setup and use, SPARQL support, RDFS and OWL support, and the added features of a commercial version, another interesting comparison point among these triplestores is their support for named graphs, or named sets of triples. It took me some time to appreciate how valuable these can be for updating of data and tracking of data provenance.[40, 41, 42]
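To illustrate why named graphs help with updates and provenance, here is a minimal sketch of a store that keeps each set of triples under a graph name. The QuadStore class, its method names, and the supplier feed URIs are all hypothetical; real triplestores expose this through SPARQL and their own APIs.

```python
EX = "http://example.org/"  # hypothetical namespace


class QuadStore:
    """Toy store mapping a graph name to the set of triples in that graph."""

    def __init__(self):
        self._graphs = {}

    def load(self, graph_name, triples):
        """Replace the named graph's contents, leaving other graphs alone."""
        self._graphs[graph_name] = set(triples)

    def sources_of(self, triple):
        """Provenance: which named graphs assert this triple?"""
        return sorted(name for name, g in self._graphs.items() if triple in g)

    def all_triples(self):
        """The merge of every graph, ignoring where triples came from."""
        return set().union(*self._graphs.values())


store = QuadStore()
old_price = (EX + "widget", EX + "price", "10")
store.load(EX + "feeds/supplierA", {old_price})
store.load(EX + "feeds/supplierB", {old_price, (EX + "widget", EX + "color", "red")})
print(store.sources_of(old_price))

# Updating one source means replacing one graph, not hunting for stale triples.
store.load(EX + "feeds/supplierA", {(EX + "widget", EX + "price", "12")})
print(store.sources_of(old_price))
```

After the second load, the old price is still asserted by supplierB's graph only, so the store can answer both "what do we know?" and "who said it?".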

Applying the technology

So what do you do with all this technology? What problems does it solve? One example is the integration of data across silos mentioned earlier.[46] Another example is the sharing and aggregation of data from sources that can't all be expected to follow the same structure, such as social networking sites.[49] As I've described ideas for other domains where semantic technology could be useful, such as legal publishing[51], XBRL work[65], and open source ERP[52] and CRM frameworks[53], I've received encouraging comments about progress already underway there.

Legal publishing is full of structured content that doesn't conform to a central model but must be shared among different groups, and this world has benefited from an XML-oriented approach since the days of SGML. Legal publishers also use a lot of metadata to help people find what they want. If semantic web technology helps you use metadata to get more value out of data, then it can provide an excellent complement to the kind of XML-based content technology used in legal publishing.[60, 61, 73, 75] As the publishing world moves more products to electronic delivery, Semantic Web technology can help other kinds of publishers track the relationships between their own content, licensed content, and user-generated content, and their rarely-static rights to use content from each of these categories. They can then use this relationship metadata to create new products.[54, 55]

Elsewhere in the publishing world, Adobe's XMP standard, a subset of RDF, makes it easier to connect several of the publishing world's most popular tools to this technology.[56, 57] Reuters Calais can add RDF-based semantics to content at no cost[58, 59], adding even more possibilities to how a given set of content can bring business value or even entertainment to customers.

To build applications around this data, a growing number of open-source tools are available to mix and match with user interface tools[76]; commercial tools such as those of my employer, TopQuadrant,[78, 79] are becoming increasingly popular for the modeling, development, and deployment of Semantic Web applications.

Footnotes

[1] RDF: Store Metadata About Anything, Anywhere (Dr. Dobb's Journal, April 2005)

[2] Making Your XML RDF-Friendly (with John Cowan) (XML.com, October 2002)

[3] Introducing RDFa Part 2 (XML.com, April 4, 2007)

[4] Generating RDFa from Movable Type (blog, January 2007)

[5] Generating RDFa from Movable Type, Part 2 (blog, February 2007)

[6] Automated RDFa Output from DITA Open Toolkit (blog, August 2007)

[7] Digging RDFa (blog, April 2008)

[8] The future of RDFa (blog, February 2008)

[9] SPARQL and relational databases: getting started (blog, October 2008)

[10] Mapping relational data to RDF with D2RQ (blog, November 2006)

[11] RDF metadata in XHTML gets even easier (blog, June 2006)

[11] SPARQL and live relational data (blog, December 2008)

[12] Introducing RDFa Part 1 (XML.com, February 14, 2007)

[13] How you can explore a new set of linked data (blog, August 2008)

[14] Hey CNN, SPARQL isn't so difficult. (blog, January 2009)

[15] Learning more about SPARQL (blog, October 2008)

[16] SPARQL at the movies (blog, November 2008)

[17] Querying DBpedia (blog, November 2007)

[18] Querying wiki/dbpedia for presidents' ages at inauguration (blog, September 2008)

[19] Ask a good linked data development question, go to Linked Data Planet for free (blog, April 2008)

[20] A successful Linked Data Planet conference (blog, June 2008)

[21] An interview with Seth Earley about Linked Data (blog, June 2008)

[22] An interview with Uche Ogbuji about Linked Data (blog, June 2008)

[23] (semantic web) - semantics = linked data? (blog, October 2008)

[24] Putting semantics on the web (blog, January 2006)

[25] Adding semantics to make data more valuable (blog, May 2008)

[26] The "DL" in "OWL DL" (blog, October 2007)

[27] Relational Database Integration with RDF/OWL (DevX, July 2008)

[28] DevX article "Relational Database Integration with RDF/OWL" (blog, 2008)

[29] Integrating relational data into the semantic web (blog, May 2008)

[30] Integrating relational databases with RDF/OWL (blog, October 2006)

[31] Using the ontology editing tool SWOOP to edit taxonomies and thesaurii (blog, August 2008)

[32] SKOS and SWOOP: how (blog, August 2008)

[33] What is a taxonomy? (blog, July 2008)

[34] Adding metadata value with Pellet (blog, December 2008)

[35] Playing with some RDF stores (blog, January 2009)

[36] Getting started using Virtuoso as a triplestore (blog, February 2009)

[37] Getting started with Sesame (blog, February 2009)

[38] Getting started with Open Anzo (blog, March 2009)

[39] Getting started with AllegroGraph (blog, April 2009)

[40] Querying a set of named RDF graphs without naming the graphs (blog, March 2009)

[41] Some questions about RDF named graphs (blog, March 2009)

[42] Some use cases to implement using SPARQL graphs (blog, March 2009)

[43] Linking information to "missing" information in SPARQL (blog, November 2008)

[44] Building Metadata Applications with RDF (February 2003)

[45] RDFS without RDF/OWL? (blog, September 2006)

[46] RDF/OWL for data silo integration? (blog, August 2006)

[47] Some great W3C explanations of basic ontology concepts (blog, August 2007)

[48] XML 2006 paper done and available (blog, December 2006)

[49] RDF and social networks (blog, April 2008)

[50] What Shelley said (blog, October 2007)

[51] Law metadata on the web (blog, March 2006)

[52] Semantic Web project ideas number 3: Planning those enterprise resources. (blog, April 2007)

[53] Semantic Web project ideas number 2: Managing relationships with customers. (blog, April 2007)

[54] What can publishing and semantic web technology offer to each other? (blog, February 2009)

[55] DAM! Subversion! RDF? (OWL?) (blog, November 2006)

[56] Using (or not using) Adobe's XMP metadata format (blog, December 2005)

[57] New XMP spec (blog, October 2008)

[58] Having fun with Reuters Calais (blog, May 2008)

[59] Navigating Hollywood gossip with semantic technology (blog, June 2008)

[60] RDF versus XQuery (blog, December 2006)

[61] Schema language victory (and OWL) (blog, November 2006)

[62] using owl:imports (blog, August 2007)

[63] Semantic Web project ideas number 5: Use an existing ontology to make a web store easier to use. (blog, July 2007)

[64] Semantic Web project ideas number 4: Build an ontology and rules around a working taxonomy (blog, April 2007)

[65] Querying aggregated XBRL reports with SPARQL (blog, September 2008)

[66] Great survey of RDF/web development tools (blog, January 2007)

[67] Download SPARQL results directly into a spreadsheet (blog, October 2008)

[68] Semantic Web project ideas number 6: A form-driven front end to a SPARQL engine (blog, July 2007)

[69] Making up URIs (blog, January 2007)

[70] A belated Christmas wish: a SPARQL endpoint for Digg RDF (blog, December 2008)

[71] Google and RDFa: what and why (blog, May 2009)

[72] "Semantic Web for the Working Ontologist" (blog, May 2009)

[73] Big legal publishers and semantic web technology (blog, June 2009)

[74] SearchMonkey and RDFa (blog, June 2009)

[75] Court decision metadata and DBpedia (blog, July 2009)

[76] New developerWorks article: "Build Wikipedia query forms with semantic technology" (blog, July 2009)

[77] Modeling your data with DBpedia vocabularies (blog, July 2009)

[78] Joining TopQuadrant (blog, August 2009)

[79] Getting started with the TopQuadrant product line (blog, August 2009)

[80] DevX article on using RDFa with DocBook and DITA (blog, August 2009)

[81] Appreciating SPARQL CONSTRUCT more (blog, September 2009)

[82] Growth of the linked data cloud (blog, September 2009)