RESTful SPARQL queries of RDFa

No local parsing or querying software needed.

Facebook's OpenGraph, Google's Rich Snippets, BestBuy's use of the GoodRelations vocabulary and other recent events are boosting RDFa's popularity for storing machine-readable data in web pages. There are several tools and programming libraries available (not to mention built-in features of development platforms such as TopQuadrant's TopBraid Suite for application development) that let you extract the RDF triples from this RDFa markup and use them. I recently discovered, though, how easily I can extract this data and perform SPARQL queries on it using nothing but publicly available, RESTful web services. The web page where the RDFa is embedded doesn't even have to be well-formed HTML.

Getting the RDF triples out of the RDFa

The W3C's RDFa Distiller and Parser at http://www.w3.org/2007/08/pyRdfa/ has a form that lets you enter the URL of a web page and set various parameters before clicking the "Go!" button to see the triples stored in that web page. Once you do this, you'll see the RDF in your browser (a View Source may be necessary) and you'll also see, in your browser's navigation toolbar, the REST URL you would use to have the same program extract the triples without filling out the form first. (As the page tells you, "If you intend to use this service regularly on large scale, consider downloading the package and use it locally.")

For example, if you go to this form and enter the URL of TopQuadrant's products web page (http://www.topquadrant.com/products/TB_Suite.html), leaving all the other parameters at their default settings, clicking the "Go!" button will get you RDF/XML of the triples and, in the navigation toolbar, the URL used to retrieve them. I trimmed a few parameters off the URL and entered this shortened version directly into the browser, and it worked: http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fwww.topquadrant.com%2Fproducts%2FTB_Suite.html&format=pretty-xml. I'll come back to this URL below.
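
As a minimal sketch of how a script might build that shortened URL rather than copying it from the browser, the Python below uses only the standard library. The names DISTILLER and distiller_url are hypothetical, chosen for illustration; the uri and format parameters are the ones visible in the URL above.

```python
from urllib.parse import urlencode

# Address of the distiller's extraction service, as it appears in the URL above.
DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"

def distiller_url(page_url, rdf_format="pretty-xml"):
    """Return a GET URL asking the distiller for the triples in page_url.

    Hypothetical helper, not part of any library.
    """
    # urlencode percent-escapes the ':' and '/' characters in the page URL.
    return DISTILLER + "?" + urlencode({"uri": page_url, "format": rdf_format})

url = distiller_url("http://www.topquadrant.com/products/TB_Suite.html")
print(url)
```

The printed string should be the same shortened URL shown above, ready to paste into a browser or hand to wget or curl.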

Querying the RDF

The sparql.org SPARQLer web form lets you enter a SPARQL query, specify a set of RDF to query and the return format, and then retrieve the result. For example, when I specify my FOAF file at http://www.snee.com/bob/foaf.rdf as the data to query and the following as the query, the SPARQLer lists my name and airport code, because I'm the only person in my FOAF file with both pieces of information:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX air: <http://www.megginson.com/exp/ns/airports#>
SELECT ?personName ?airportCode WHERE {
  ?person foaf:name ?personName ; 
          foaf:nearestAirport ?airport . 
  ?airport air:iata ?airportCode . 
}

If you pick a non-default output format at the bottom of that form, then instead of the results being displayed in your browser, they may get saved to your disk. When doing this as a RESTful call (for example, when using wget or curl) note the &output= parameter in the URL and experiment with other settings besides the default of XML.
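
Here is a rough sketch of assembling such a RESTful call in Python. Some of it is assumption: the endpoint address http://sparql.org/sparql is a guess at what the form submits to (copy the real one from your browser after using the form), the "json" output value is one plausible alternative to the XML default, and SPARQLER and sparqler_url are hypothetical names. The query and default-graph-uri parameter names come from the SPARQL Protocol, and output is the parameter mentioned above.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint address guessed for illustration; check it against
# the URL your browser shows after you submit the SPARQLer form.
SPARQLER = "http://sparql.org/sparql"

QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX air: <http://www.megginson.com/exp/ns/airports#>
SELECT ?personName ?airportCode WHERE {
  ?person foaf:name ?personName ;
          foaf:nearestAirport ?airport .
  ?airport air:iata ?airportCode .
}
"""

def sparqler_url(query, graph_uri, output="xml"):
    """Return a GET URL that runs query against the RDF found at graph_uri.

    Hypothetical helper, not part of any library.
    """
    params = {
        "query": query,                  # the SPARQL query text
        "default-graph-uri": graph_uri,  # the RDF to run it against
        "output": output,                # result format ("json" is an assumed value)
    }
    # urlencode escapes spaces, newlines, '#', '<', '>' and '?' in the query.
    return SPARQLER + "?" + urlencode(params)

url = sparqler_url(QUERY, "http://www.snee.com/bob/foaf.rdf", output="json")
print(url)
```

Escaping matters here because the '#' in the air: namespace would otherwise be read as the start of a URL fragment and cut the query short.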

My foaf.rdf file is a static text file sitting on disk, but here's the cool part: I can enter any URI as the resource to query, as long as it identifies parsable RDF—for example, the URL above that gets RDF/XML out of the TopQuadrant products page.

Putting it together

The TopQuadrant products page mostly uses the GoodRelations and Yahoo! SearchMonkey Product vocabularies. (RDFa on other pages of the website uses other mixes of different vocabularies as appropriate; let's not take for granted how easy RDF makes it to do this.) If I want to use SPARQL to get a list of product names and descriptions from that page, I can take the URL above that extracts triples from the RDFa in the TopQuadrant products page, enter it as the "Target graph URI" value on the SPARQLer form, and put the following query into that form's "General SPARQL query" box:

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sm: <http://search.yahoo.com/searchmonkey/product/>

SELECT ?name ?description WHERE {
  ?product a sm:Product ;
           rdfs:label ?name ; 
           gr:description ?description . 
}

As with the RDFa Distiller and Parser, in addition to seeing the results of my query on the SPARQLer form, I'll see the URL in the navigation bar that I could have used to execute the same query against the same data with a single URL instead of using the form. This is the grander cool part: I can say "extract the RDF triples from the RDFa on that web page and then run this SPARQL query against it" all with a single URL.
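
To make the composition concrete, here is a hedged Python sketch that nests one service's URL inside the other's. The SPARQLer endpoint address is an assumption (take the real one from your browser after submitting the form), and rdfa_sparql_url is a hypothetical helper name. The point it illustrates is that the distiller URL gets escaped a second time when it becomes the value of default-graph-uri.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"
# ASSUMPTION: endpoint address guessed for illustration.
SPARQLER = "http://sparql.org/sparql"

QUERY = """PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sm: <http://search.yahoo.com/searchmonkey/product/>

SELECT ?name ?description WHERE {
  ?product a sm:Product ;
           rdfs:label ?name ;
           gr:description ?description .
}
"""

def rdfa_sparql_url(page_url, query, output="xml"):
    """Return one GET URL that extracts RDFa from page_url and queries it.

    Hypothetical helper, not part of any library.
    """
    # Step 1: the URL that makes the distiller return the page's triples.
    graph_uri = DISTILLER + "?" + urlencode(
        {"uri": page_url, "format": "pretty-xml"})
    # Step 2: hand that whole URL to the query service as the graph to query.
    # It is escaped again here, so an inner '%3A' becomes '%253A'.
    return SPARQLER + "?" + urlencode(
        {"query": query, "default-graph-uri": graph_uri, "output": output})

url = rdfa_sparql_url("http://www.topquadrant.com/products/TB_Suite.html", QUERY)
print(url)

# Decoding the outer layer once recovers the distiller URL intact.
params = parse_qs(urlsplit(url).query)
print(params["default-graph-uri"][0])
```

The second print shows what the query service would see as its graph URI: the distiller URL with its own escaping still in place.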

Of course it's a long, messy-looking URL because of the URL-escaping of things like the spaces and punctuation in the SPARQL query. Any modern programming or scripting language provides a function that does this for you, and I've already written a Perl script that does something pretty valuable with all this. More on that in a week or two.

5 Comments

I was intrigued by the title of this post only to find that the "RESTful"ness here is encoding some query into a URI that clients can GET. Maybe it should be titled "How to Encode SPARQL into URIs".


Subbu,

As a matter of fact, I didn't say how to encode SPARQL into URIs, and mentioned that most programming languages have a function that will do that for you. I described some services to call with those URIs once you have them, and how the use of two of these services could be combined in one call.

Maybe I have an oversimplified idea of what qualifies as RESTful, but if a process can instruct processes on other servers to provide specific machine-readable information using HTTP GETs, I thought that qualified. It certainly can play a role in a useful distributed application.

Bob


Hauntingly familiar:

http://www.semanticoverflow.com/questions/587/is-there-a-web-service-that-allow-me-to-run-sparql-against-a-xhtmlrdfa-website/588#588

There is something very pleasing about this sort of composition.


SPARQListas do some things over and over

select some set of resources
as you did in both queries:
(_, personName, _)
(_ , type, someClass)

i'd call your approach of encoding an arbitrary query-language into a querystring argument RPC-ish rather than REST-ful

GET already has one URI per request. in the first example, there's exactly one URI in the triple pattern, so a single querystring key can apply a "function that builds a (_ URI _) triple pattern". the second example can specify the second URI in the querystring

once we have the set of resources, pulling out the names, locations etc can be done with existing tools like CSS selectors or XPath. or much more concise RDF path-expression microsyntaxes that aren't ugly when smashed into a URL

obviously there are larger ad-hoc cases where you really want the power of full SPARQL, but the jump to the complexity of requiring a SPARQL engine is not necessary for some large swath of typical web needs. just like the world realized they didn't need SQL when a basic key/val hashtable store (with interesting sharding and distribution possibilities) would do

i like where you're going, i just think it can be taken a lot further, and since i personally was annoyed by the notion of having to flush REST down the drain in favor of SPARQL i decided to scratch the itch

even basic things in HTTP are unspecified, for example how do you in-band Accept: arguments like the content-type into the URI. i've seen ?output=, ?format=, appending the extension of the format to the URI before the querystring, and countless other variations.


it would be nice if there were some standards there. maybe full-fledged URI keys like myapi:format which could owl:sameAs some standard definition of what to do with that querystring arg..


Carmen,

In general, that all makes sense to me. I certainly didn't mean to flush REST down the drain. I wanted to encode a request for a resource into a URL that turned out to have a lot of extra stuff, so (as I said before) perhaps my idea of REST is too broad.

XPath (and for that matter, CSS) won't work, though, unless the data conforms to a very specific structure that the person writing the query can take for granted. Then, of course, the query can be a lot simpler. I've worked with XML long enough to know that getting a wide variety of people to follow a specific DTD/schema in a wide variety of cases is a lot easier said than done, which is why I like the flexibility that RDF offers. This flexibility does shift the processing load elsewhere--in the case of my example, to the query engine--but the query engine software to do the work is out there, and I see it making a contribution to some real data processing problems.