Dividing and conquering SPARQL endpoint retrieval

With the VALUES keyword.

When I first tried SPARQL's VALUES keyword (at which point it was pretty new to SPARQL, having only recently been added to SPARQL 1.1) I demoed it with a fairly artificial example. I later found that it solved one particular problem for me by letting me create a little lookup table. Recently, it gave me huge help in one of the most classic SPARQL development problems of all: how to retrieve so much data from an endpoint that the first attempts at that retrieval resulted in timeouts.

The Wikidata:SPARQL query service/queries page includes an excellent Wikidata query to find latitudes and longitudes for places in Paris. You can easily modify this query to retrieve data about places within other cities, and I wanted to build on this query to make it retrieve additional available data about those places as well. While accounting for the indirection in the Wikidata query model made this a little more complicated, it wasn't much trouble to write.

The expanded query worked great for a city like Charlottesville, where I live, but for larger cities, the query was just asking for too much information from the endpoint and timed out. My new idea was to first ask for roughly the same information that the Paris query above does, and to then request additional data about those entities a batch at a time with a series of queries that use the VALUES keyword to specify each batch. (I've pasted a sample query requesting one batch below.)
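To make the batching idea concrete, here is a minimal Python sketch of how one batch of entity URIs could be turned into a VALUES block. The function name values_clause and its parameters are made up for this illustration; they are not part of the scripts in the zip file.

```python
def values_clause(variable, uris, indent="  "):
    """Return a SPARQL VALUES block that binds `variable` to each URI in `uris`.

    Hypothetical helper for illustration only.
    """
    if not uris:
        raise ValueError("a VALUES block needs at least one URI")
    lines = [f"{indent}VALUES ?{variable} {{"]
    # One full URI per line, wrapped in angle brackets.
    lines += [f"{indent}  <{uri}>" for uri in uris]
    lines.append(f"{indent}}}")
    return "\n".join(lines)


# Two entity URIs that form one (very small) batch.
batch = [
    "http://www.wikidata.org/entity/Q42537129",
    "http://www.wikidata.org/entity/Q30272197",
]
print(values_clause("s", batch))
```

The rest of each batch query stays the same from one batch to the next; only the list of URIs inside the VALUES block changes.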

It worked just fine. I put all the queries and other relevant files in a zip file for people who want to check it out, but it's probably not worth looking at too closely, because in a month or two I'll be replacing it with a Python version that does everything more efficiently. It's still worth explaining the steps in this version's shell script driver file, because the things I worked out for this prototype effort--despite its Perl scripting and extensive disk I/O--mean that the Python version should come together pretty quickly. That's what prototypes are for!

The driver shell script

Before running the shell script, you specify the Wikidata local name of the city to query near the top of the getCityEntities.rq SPARQL query file. (This is easier than it sounds--for example, to do it for Charlottesville, go to its Wikipedia page and click Wikidata item in the menu on the left to find that Q123766 is the local name.)

Once that's done, running the zip file's getCityData.sh shell script executes these main steps:

It uses a curl command to send the getCityEntities.rq CONSTRUCT query to the https://query.wikidata.org/sparql endpoint. The curl command saves the resulting triples in a file called cityEntities.ttl.
It uses ARQ to run the listSubjects.rq query on the new cityEntities.ttl file, specifying that the result should be a TSV file.
The results of listSubjects.rq get piped to a Perl script called makePart2Queries.pl. This creates a series of CONSTRUCT query files that ask Wikidata for data about entities listed in a VALUES section. It puts 50 entries in each file's VALUES section; this figure of 50 is stored in a $maxLines variable in makePart2Queries.pl where it can be reset if the endpoint is still timing out. This step also adds lines to a shell script called callTempQueries.sh, where each line uses curl to call one of the queries that uses VALUES to request a batch of data.
getCityData.sh next runs the callTempQueries.sh shell script to execute all of these new queries, storing the resulting triples in the file tempCityData.ttl.
The tempCityData.ttl file has plenty of good data, but it can be used to get additional relevant data, so the script's next line runs a query that creates a TSV file with a list of all of the classes found in tempCityData.ttl triples of the form {?instance wdt:P31 ?class}. (wdt:P31 is the Wikidata equivalent of rdf:type, indicating that a resource is an instance of a particular class.) That TSV file then drives the creation of a query that gets sent to the SPARQL endpoint to ask about the classes' parent and grandparent classes, and that data gets added to tempCityData.ttl.
Another ARQ call in the script uses a local query to find triple objects in the http://www.wikidata.org/entity/ namespace that don't have rdfs:label values, and the script then gets those labels. It only gets the English ones, but that's easy to change if you want labels in different or additional languages.
The script runs one final ARQ query on tempCityData.ttl: the classic SELECT * WHERE {?s ?p ?o}. This request for all the triples actually tidies up the Turtle data a bit, storing all the triples with common subjects together. It puts the result in cityData.ttl.
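The subject-listing and batch-splitting steps above could look roughly like this in Python. This is a sketch under assumptions, not a port of makePart2Queries.pl: it assumes the TSV follows the standard SPARQL 1.1 TSV results layout (a header row such as ?s, then one URI in angle brackets per row), and the names read_subjects and split_into_batches are hypothetical. The max_lines parameter plays the role of the Perl script's $maxLines variable.

```python
def read_subjects(tsv_text):
    """Extract subject URIs from the TSV results of a one-column SELECT query."""
    subjects = []
    for line in tsv_text.splitlines()[1:]:  # skip the header row
        cell = line.split("\t")[0].strip()
        # Keep only cells that are URIs, which the TSV format writes as <...>.
        if cell.startswith("<") and cell.endswith(">"):
            subjects.append(cell[1:-1])
    return subjects


def split_into_batches(items, max_lines=50):
    """Split `items` into consecutive lists of at most `max_lines` entries."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    return [items[i:i + max_lines] for i in range(0, len(items), max_lines)]


# Placeholder data: 120 made-up entity URIs standing in for real query results.
sample_tsv = "?s\n" + "\n".join(
    f"<http://www.wikidata.org/entity/Q{n}>" for n in range(1, 121)
)
batches = split_into_batches(read_subjects(sample_tsv), max_lines=50)
print(len(batches), "batch queries would be generated")
```

Lowering max_lines is the equivalent of resetting $maxLines when the endpoint is still timing out: more queries get generated, and each one asks for less.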

One running theme of some of the shell script's steps is the retrieval of labels associated with qnames. Wikidata has a lot of triples like {wd:Q69040 wd:P361 wd:Q16950} that are just three qnames, so retrieved data will have more value to applications if people and processes can find out what each qname refers to.
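As a rough illustration of the label check, the following Python sketch finds entity URIs that appear as triple objects but never as the subject of an rdfs:label triple. It assumes the triples have already been parsed into plain (subject, predicate, object) string tuples; the function name unlabeled_entities and the sample label value are invented for the example.

```python
WD_ENTITY = "http://www.wikidata.org/entity/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def unlabeled_entities(triples):
    """Return entity-namespace objects that have no rdfs:label triple of their own."""
    labeled = {s for s, p, o in triples if p == RDFS_LABEL}
    referenced = {o for s, p, o in triples if o.startswith(WD_ENTITY)}
    return sorted(referenced - labeled)


# A tiny sample: one triple made only of identifiers, plus a placeholder label.
sample = [
    (WD_ENTITY + "Q69040", "http://www.wikidata.org/prop/direct/P361", WD_ENTITY + "Q16950"),
    (WD_ENTITY + "Q69040", RDFS_LABEL, "placeholder label"),
]
print(unlabeled_entities(sample))
```

The URIs this returns are the ones a follow-up query would need to ask the endpoint about.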

The main shell script has other housekeeping steps such as recording of the start and end times and deletion of the temporary files. I had more ideas for things to add, but I'll save those for the Python version.

The Python version won't just be a more efficient version of my use of VALUES to do batch retrievals of data that might otherwise time out. It will demonstrate, more nicely, something that only gets hinted at in this mess of shell and Perl scripts: the ability to automate the generation of SPARQL queries that build on the results of previously executed queries so that they can all work together as a pipeline to drive increasingly sophisticated RDF application development.

Here is a sample of one of the queries created to request data about one batch of entities within the specified city:

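# The wdt:, psv:, wikibase:, and schema: prefixes used below are predeclared
# by the Wikidata query service, so they need no PREFIX lines here.
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>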

CONSTRUCT
{ ?s ?p ?o. 
  ?s ?p1 ?o1 . 
  ?s wgs84:lat ?lat . 
  ?s wgs84:long ?long .
  ?p rdfs:label ?pname .
  ?s wdt:P31 ?class .   
}
WHERE {
  VALUES ?s {
<http://www.wikidata.org/entity/Q42537129>
<http://www.wikidata.org/entity/Q30272197>
# about 48 more of those here...
}
  # wdt:P131 means 'located in the administrative territorial entity'.
  ?s wdt:P131+ ?geoEntityWikidataID .
  ?s p:P625 ?statement . # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .

  # Reduce the indirection used by Wikidata triples. Based on Tommy Potter query
  # at http://www.snee.com/bobdc.blog/2017/04/the-wikidata-data-model-and-yo.html.
  ?s ?directClaimP ?o .                   # Get the truthy triples. 
  ?p wikibase:directClaim ?directClaimP . # Find the wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates.

  # the following VALUES clause is actually faster than just
  # having specific triple patterns for those 3 p1 values.
  ?s ?p1 ?o1 .
  VALUES ?p1 {
    schema:description
    rdfs:label        
    skos:altLabel
  }

  ?s wdt:P31 ?class . # Class membership. Pull this and higher level classes out in later query.
  
  # If only English names desired
  FILTER (isURI(?o1) || lang(?o1) = 'en' )
  # For English + something else, follow this pattern: 
  # FILTER (isURI(?o1) || lang(?o1) = 'en' || lang(?o1) = 'de')

  FILTER(lang(?pname) = 'en')
}
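For comparison with the curl calls, here is a hedged Python sketch of preparing the HTTP request for a query like the one above using only the standard library. The helper name build_request and the User-Agent string are made up, and the sketch assumes the endpoint accepts the query as a "query" parameter on a GET request and returns Turtle for CONSTRUCT queries when asked for text/turtle. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, accept="text/turtle"):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    headers = {
        "Accept": accept,
        # Hypothetical client name; use something that identifies your own script.
        "User-Agent": "cityDataSketch/0.1 (example)",
    }
    return Request(url, headers=headers)


request = build_request("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 1")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. With a long VALUES list, the URL gets long as well, which is one more reason to keep the batch size modest.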

Neon sign picture by Jeremy Brooks on Flickr (CC BY-NC 2.0)


bobdc.blog

Bob DuCharme's weblog, mostly on technology for representing and linking information.
