Federated SPARQL queries

Using a Jena extension.

Much of the promise of RDF and Linked Data is the ease of pulling data from multiple sources and combining it. I recently discovered the SERVICE extension that Jena adds to SPARQL, letting you send subqueries off to multiple SPARQL endpoints and then combine the results. Because a given SPARQL endpoint may be an interface to a triplestore or a relational data store or something else, the ability to query several endpoints with one query is very nice.

The Jena project's documentation page "ARQ - Basic Federated SPARQL Query" describes the use of this keyword. Before I start quoting from that page, I want to jump right in with an example that worked for me. In one query, it pulls birthday and spouse information about Arnold Schwarzenegger from DBpedia and a list of his movies and their release dates from the Linked Movie Database:

PREFIX imdb: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?birthDate ?spouseName ?movieTitle ?movieDate {
  { SERVICE <http://dbpedia.org/sparql>
    { SELECT ?birthDate ?spouseName WHERE {
        ?actor rdfs:label "Arnold Schwarzenegger"@en ;
               dbpo:birthDate ?birthDate ;
               dbpo:spouse ?spouseURI .
        ?spouseURI rdfs:label ?spouseName .
        FILTER ( lang(?spouseName) = "en" )
      }
    }
  }
  { SERVICE <http://data.linkedmdb.org/sparql>
    { SELECT ?actor ?movieTitle ?movieDate WHERE {
      ?actor imdb:actor_name "Arnold Schwarzenegger".
      ?movie imdb:actor ?actor ;
             dcterms:title ?movieTitle ;
             dcterms:date ?movieDate .
      }
    }
  }
}

You can run this query yourself at the sparql.org RDF Query Demo page.

Before you start modeling your own queries on this, it's worth reading the Jena documentation page mentioned above, especially the "Performance Considerations" part:

This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is. So the order of the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don't ask for the whole of a bookstore just to find book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.

As an example, both subqueries above specifically ask for information about Schwarzenegger instead of trying to scan the complete databases looking for matches.
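
The bookshop case from the quoted passage could be sketched roughly as follows. This is only an illustration: the local file name, the endpoint URL, and the ex: vocabulary are hypothetical, and it assumes the query engine evaluates the local pattern before the SERVICE block, as the passage's advice about ordering implies.

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX ex: <http://example.org/ns#>        # hypothetical vocabulary

SELECT ?title ?price
FROM <local-books.rdf>                      # hypothetical local RDF file
WHERE {
  # Matched against the local file first, so ?title is already bound
  # by the time the remote part of the query is reached.
  ?localBook dc:title ?title .

  # Hypothetical bookshop endpoint: the pattern sent to it is
  # constrained by the title instead of asking for every book it holds.
  SERVICE <http://example.org/bookshop/sparql> {
    ?book dc:title ?title ;
          ex:price ?price .
  }
}
```

Swapping the two patterns would describe the same results, but it would invite the engine to fetch the shop's whole catalog over HTTP before the local titles narrow anything down.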

Two parts of this trick are non-standard SPARQL but may become part of SPARQL 1.1: subqueries and the SERVICE keyword. As Lee Feigenbaum's slide on the latter points out, the SPARQL Working Group is using ARQ's SERVICE keyword as a starting point in thinking about how a query can target multiple endpoints.

The query above against the two SPARQL endpoints also works from within TopQuadrant's TopBraid Suite of products, so I'm sure I'll be using this technique more and more on work-related projects.

1 Comment

I knew we'd get you using Jena sooner or later. It's got the best sparql IMHO.

