Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries

And retrieving those triples much, much faster.
animals taxonomy

While preparing a demo for the upcoming Taxonomy Boot Camp conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don't need OPTIONAL blocks. As I wrote in the new "Query Efficiency and Debugging" chapter in the second edition of Learning SPARQL, "Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it." My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.

First, let's look at a simple version of the use case. DBpedia has a lot of SKOS taxonomy data in it, and at Taxonomy Boot Camp I'm going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint and I want to write a query to retrieve all the triples showing preferred labels and "has broader" values up to three levels down from the Mammal concept, assuming that the taxonomy's structure uses SKOS to represent its structure. The following query asks for all three levels of the taxonomy below Mammal, but it won't get the whole taxonomy:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}

As with any SPARQL query, it's only going to return triples for which all the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern {?level1 skos:broader v:Mammal}, there are no nodes that have Horse as a broader value, so there will be no match for {?level2 skos:broader v:Horse}. So, the Horse triples won't be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.

If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts and any descendants they have, even if they have no descendants, and OPTIONAL makes this possible. The following will do this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}

The problem: this doesn't scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from <http://dbpedia.org/resource/Category:Mammals>, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.

As a review, let's restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?

By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each retrieved its data in under a second, retrieving a total of 2,597 triples representing a taxonomy of 1,107 concepts:

# query 1
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}

# query 2
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}

# query 3
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
?level2 skos:broader/skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}

Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant's TopBraid EVN vocabulary manager. At the first level down, you can only see Afrosoricida, Australosphenida, and Bats in the picture; I then drilled down three more levels from there to show that Fictional bats has the single subcategory Silverwing.

As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, Bats as food) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn't retrieve any instance triples here, but it's great to know that they're available, and that this technique for avoiding CONSTRUCT graph patterns will serve me for much more than SKOS taxonomy work.

There has been plenty of talk lately on Twitter and in blogs about how it's not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-head discussions of this that I've seen in SEMANTiCS 2014 (part 3 of 3): Conversations; in my posting Semantic Web Journal article on DBpedia on this blog I described a great article that lists other options.) There's all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need to store locally when it is up. If you're unsure about the structure and connections of the data you're pulling down, OPTIONAL graph patterns seems like an obvious fix, but this trick for splitting up CONSTRUCT queries to avoid the use of OPTIONAL graph patterns means that you can pull down a lot more data lot more efficiently.

Stickin' to the UNION

October 16th update: Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn't. Luckily for me, John Walker suggested in the comments for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  

}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;  
    skos:prefLabel ?level2label .  
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}

There are two lessons here:

  • If you've figured out a way to do something better, don't be too satisfied too quickly—keep trying to make it even better.

  • UNION is going to be useful in more situations than I originally thought it would.

animals taxonomy

Please add any comments to this Google+ post.