22 July 2018

Dividing and conquering SPARQL endpoint retrieval

With the VALUES keyword.

VALUES neon sign

When I first tried SPARQL's VALUES keyword (it was pretty new at the time, having only recently been added in SPARQL 1.1), I demoed it with a fairly artificial example. I later found that it solved one particular problem for me by letting me create a little lookup table. Recently, it gave me huge help with one of the most classic SPARQL development problems of all: retrieving so much data from an endpoint that the first attempts at that retrieval time out.
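As a quick reminder of that lookup-table pattern, here is a minimal made-up example (not from either of those earlier posts) that pairs abbreviations with full names entirely within the query; in a real query you would join a block like this against other triple patterns to translate codes found in the data:

SELECT ?code ?cityName WHERE {
  # an inline two-column lookup table
  VALUES (?code ?cityName) {
    ("cho" "Charlottesville")
    ("rva" "Richmond")
  }
}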

The Wikidata:SPARQL query service/queries page includes an excellent Wikidata query to find latitudes and longitudes for places in Paris. You can easily modify this query to retrieve places within other cities, and I wanted to build on it to retrieve additional available data about those places as well. While accounting for the indirection in the Wikidata query model made this a little more complicated, it wasn't much trouble to write.
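To give a sense of that query's shape without reproducing it exactly, here is a condensed sketch of the Paris version (the real one on that page differs in its details); wd:Q90 is Paris, and swapping in another city's identifier retargets the query. The wd:, wdt:, p:, psv:, wikibase:, and bd: prefixes are all predefined at the query.wikidata.org endpoint:

SELECT ?place ?placeLabel ?lat ?long
WHERE {
  ?place wdt:P131+ wd:Q90 .               # located, perhaps indirectly, in Paris
  ?place p:P625 ?statement .              # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}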

The expanded query worked great for a city like Charlottesville, where I live, but for larger cities, the query was just asking for too much information from the endpoint and timed out. My new idea was to first ask for roughly the same information that the Paris query above does, and to then request additional data about those entities a batch at a time with a series of queries that use the VALUES keyword to specify each batch. (I've pasted a sample query requesting one batch below.)

It worked just fine. I put all the queries and other relevant files in a zip file for people who want to check it out, but it's probably not worth looking at too closely, because in a month or two I'll be replacing it with a Python version that does everything more efficiently. It's still worth explaining the steps in this version's shell script driver file, because the things I worked out for this prototype effort--despite its Perl scripting and extensive disk I/O--mean that the Python version should come together pretty quickly. That's what prototypes are for!

The driver shell script

Before running the shell script, you specify the Wikidata local name of the city to query near the top of the getCityEntities.rq SPARQL query file. (This is easier than it sounds--for example, to do it for Charlottesville, go to its Wikipedia page and click Wikidata item in the menu on the left to find that Q123766 is the local name.)

Once that's done, running the zip file's getCityData.sh shell script executes these main steps:

  1. It uses a curl command to send the getCityEntities.rq CONSTRUCT query to the https://query.wikidata.org/sparql endpoint. The curl command saves the resulting triples in a file called cityEntities.ttl.

  2. It uses ARQ to run the listSubjects.rq query on the new cityEntities.ttl file, specifying that the result should be a TSV file.

  3. The results of listSubjects.rq get piped to a Perl script called makePart2Queries.pl. This creates a series of CONSTRUCT query files that ask Wikidata for data about entities listed in a VALUES section. It puts 50 entries in each file's VALUES section; this figure of 50 is stored in a $maxLines variable in makePart2Queries.pl where it can be reset if the endpoint is still timing out. This step also adds lines to a shell script called callTempQueries.sh, where each line uses curl to call one of the queries that uses VALUES to request a batch of data.

  4. getCityData.sh next runs the callTempQueries.sh shell script to execute all of these new queries, storing the resulting triples in the file tempCityData.ttl.

  5. The tempCityData.ttl file has plenty of good data, but it can be used to get additional relevant data, so the script's next line runs a query that creates a TSV file with a list of all of the classes found in tempCityData.ttl triples of the form {?instance wdt:P31 ?class}. (wdt:P31 is the Wikidata equivalent of rdf:type, indicating that a resource is an instance of a particular class.) That TSV file then drives the creation of a query that gets sent to the SPARQL endpoint to ask about the classes' parent and grandparent classes (see the sketch after this list), and that data gets added to tempCityData.ttl.

  6. Another ARQ call in the script uses a local query to check for triple objects in the http://www.wikidata.org/entity/ namespace that don't have rdfs:label values and then retrieves those labels--or at least the English ones; the query is easy to adjust if you want labels in other or additional languages.

  7. The script runs one final ARQ query on tempCityData.ttl: the classic retrieve-everything pattern {?s ?p ?o} (as a CONSTRUCT query, so that the output is Turtle rather than a table of variable bindings). This request for all the triples actually tidies up the Turtle data a bit, grouping triples that share a subject together. It puts the result in cityData.ttl.
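Here is a simplified sketch of the kind of query that step 5 generates (not the actual generated query from the zip file); wdt:P279 is Wikidata's "subclass of" property, and the QIDs in the VALUES block stand in for the class list that comes from the TSV file:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT {
  ?class wdt:P279 ?parent .
  ?parent wdt:P279 ?grandparent .
}
WHERE {
  VALUES ?class {     # class QIDs pulled from the TSV file go here
    wd:Q3918
    wd:Q16917
  }
  ?class wdt:P279 ?parent .
  OPTIONAL { ?parent wdt:P279 ?grandparent . }
}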

One running theme of some of the shell script's steps is the retrieval of labels associated with qnames. Wikidata has a lot of triples like {wd:Q69040 wd:P361 wd:Q16950} that are just three qnames, so retrieved data will have more value to applications if people and processes can find out what each qname refers to.
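One way to do that kind of label backfilling is a query along these lines--a simplified sketch, not the one in the zip file--which you could run with ARQ against the local Turtle file, letting the SERVICE clause ask the Wikidata endpoint for English labels of entity-namespace objects that don't have one locally:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT { ?o rdfs:label ?label }
WHERE {
  ?s ?p ?o .
  FILTER(isIRI(?o) && STRSTARTS(STR(?o), "http://www.wikidata.org/entity/"))
  FILTER NOT EXISTS { ?o rdfs:label ?existingLabel }
  SERVICE <https://query.wikidata.org/sparql> {
    ?o rdfs:label ?label .
    FILTER(lang(?label) = "en")
  }
}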

The main shell script has other housekeeping steps such as recording of the start and end times and deletion of the temporary files. I had more ideas for things to add, but I'll save those for the Python version.

The Python version won't just be a more efficient version of my use of VALUES to do batch retrievals of data that might otherwise time out. It will demonstrate, more nicely, something that only gets hinted at in this mess of shell and Perl scripts: the ability to automate the generation of SPARQL queries that build on the results of previously executed queries so that they can all work together as a pipeline to drive increasingly sophisticated RDF application development.

Here is a sample of one of the queries created to request data about one batch of entities within the specified city:

PREFIX p:        <http://www.wikidata.org/prop/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX psv:      <http://www.wikidata.org/prop/statement/value/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema:   <http://schema.org/>
PREFIX wgs84:    <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:     <http://www.w3.org/2004/02/skos/core#>
# (The Wikidata endpoint predefines the wdt:, psv:, wikibase:, and schema:
# prefixes, but declaring them keeps the query self-contained.)

CONSTRUCT
{ ?s ?p ?o. 
  ?s ?p1 ?o1 . 
  ?s wgs84:lat ?lat . 
  ?s wgs84:long ?long .
  ?p rdfs:label ?pname .
  ?s wdt:P31 ?class .   
}
WHERE {
  VALUES ?s {
<http://www.wikidata.org/entity/Q42537129>
<http://www.wikidata.org/entity/Q30272197>
# about 48 more of those here...
}
  # wdt:P131 means 'located in the administrative territorial entity' .
  ?s wdt:P131+ ?geoEntityWikidataID .  
  ?s p:P625 ?statement . # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .

  # Reduce the indirection used by Wikidata triples. Based on Tommy Potter query
  # at http://www.snee.com/bobdc.blog/2017/04/the-wikidata-data-model-and-yo.html.
  ?s ?directClaimP ?o .                   # Get the truthy triples. 
  ?p wikibase:directClaim ?directClaimP . # Find the wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates.

  # the following VALUES clause is actually faster than just
  # having specific triple patterns for those 3 p1 values.
  ?s ?p1 ?o1 .
  VALUES ?p1 {
    schema:description
    rdfs:label        
    skos:altLabel
  }

  ?s wdt:P31 ?class . # Class membership. Pull this and higher level classes out in later query.
  
  # If only English names desired
  FILTER (isURI(?o1) || lang(?o1) = 'en' )
  # For English + something else, follow this pattern: 
  # FILTER (isURI(?o1) || lang(?o1) = 'en' || lang(?o1) = 'de')

  FILTER(lang(?pname) = 'en')
}  

Neon sign picture by Jeremy Brooks on Flickr (CC BY-NC 2.0)


Please add any comments to this Google+ post.

17 June 2018

Running and querying my own Wikibase instance

Querying it, of course, with SPARQL.

When Dario Taraborelli tweeted about how quickly he got a local Wikibase instance and SPARQL endpoint up and running with wikibase-docker, he inspired me to give it a shot, and it was surprisingly easy and fun.

I have minimal experience with docker. As instructed by wikibase-docker's README page, I installed docker and docker-compose. (When I got to the Test Docker Installation part of the Get Started, Part 1: Orientation and setup page for setting up docker, the hello-world app gave me a "permission denied" problem, but this solution described at Techoverflow solved it. I did have to reboot, as it suggested.)

Continuing along with the wikibase-docker README, when I clicked "http://localhost:8181" under Accessing your Wikibase instance and the Query Service UI, it was pretty cool to see my own local running instance of the wiki.

Moving along in the README, I clicked "Create a new item" before I clicked "Create a new property", but when I saw that the new item's property list offered no choices, I realized that I should define some properties before creating any items. Properties and items can have names, aliases, and descriptions in a wide range of spoken languages, and Wikibase includes a nice selection of data types.

After defining a property and creating items that had a value for that property, the "Query Service UI @ http://localhost:8282" link on the README led to a web form where I could enter a SPARQL query. I entered SELECT * WHERE { ?s ?p ?o} and saw the default triples that were part of the store as well as triples about the items and property that I had created.

The "Get an RDF dump from wikibase" docker command on the README page did just fine. Reviewing the triples in its output, I saw that the created entities fit the Wikidata data model described at Wikibase/DataModel/Primer, which I wrote about at The Wikidata data model and your SPARQL queries.

It took me some time (and a tweet) to realize that the "Query Service Backend (Behind a proxy)" URL listed on the README file was the URL for the SPARQL endpoint. The first query I tried after that worked with no problem:

curl http://localhost:8989/bigdata/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
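URL-decoded, the query parameter in that request is just the following, which asks for every distinct predicate in the store:

SELECT DISTINCT ?p WHERE { ?s ?p ?o }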

It was also easy to access this server from my phone across my home wifi when I substituted the machine's name or IP address for "localhost" in the URLs above. The web interface was the same on a phone as on a big screen; the MediaWiki project's Mobiles, tablets and responsive design manual page describes some options for extending the interface. If someone out there is looking for UI work and has some time on their hands, contributing some phone and tablet responsiveness to this open source project would be a great line on their résumé.

And finally, while the docker version of this is quick to get up and running, if you're going far with your own MediaWiki installation, you'll want to look over the Installation instructions for the regular, non-docker version.

After I did these experiments and wrote my first draft of this, I discovered the medium.com posting Wikibase for Research Infrastructure -- Part 1 by Pratt Institute librarian and researcher Matt Miller. His piece describes a nice use case of following through on creating a Wikibase application and points to some handy Python scripts for automating the creation of classes and other structures from spreadsheets. His use case happens to be built around one of my favorite available RDF-related data sources: the Linked Jazz Project. I look forward to Part 2.

It's great to have such a comprehensive system running on my local machine, complete with a web interface that lets non-RDF people create and edit any data they want and, for the RDF people, a SPARQL interface to let them pull and manipulate that data. For more serious dataset development, the MediaWiki project includes some helpful documentation about how to define your own classes and associated properties and forms. (July 20th note: that page is actually about Semantic MediaWiki, which I played around with a few years ago--apparently I didn't keep my notes on that and Wikibase as organized as I should have.)

Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. The dockerized version of Wikibase looks like a big step in this direction.


Please add any comments to this Google+ post.

28 May 2018

RDF* and SPARQL*

Reification can be pretty cool.

triple within a triple

After I posted Reification is a red herring (and you don't need property graphs to assign data to individual relationships) last month, I had an amusingly difficult time explaining to my wife how that would generate so much Twitter activity. This month I wanted to make it clear that I'm not opposed to reification in and of itself, and I wanted to describe the fun I've been having playing with Olaf Hartig and Bryan Thompson's RDF* and SPARQL* extensions to these standards to make reification more elegant.

In that post, I said that in many years of using RDF I've never needed to use reification because, for most use cases where it was a candidate solution, I was better off using RDFS to declare classes and properties that reflected the use case domain instead of going right to the standard reification syntax (awkward in any standardized serialization) that let me create triples about triples. My soapbox ranting in that post focused on the common argument that the property graph approach of systems like Tinkerpop and Neo4j is better than RDF because achieving similar goals in RDF would require reification; as I showed, it doesn't.

But reification can still be very useful, especially in the world of metadata. (I am slightly jealous of the metadata librarians of the world for having the word "metadata" in their job title--it sounds even cooler in Canada: Bibliothécaire aux métadonnées.) If metadata is data about data, and more and more of the Information Science world is taking advantage of linked data technologies, then triples about triples are bound to be useful for provenance, curation, and all kinds of scholarship about datasets.

The conclusion of my blog post mentioned how, just as I was finishing it up, I discovered Olaf Hartig and Bryan Thompson's 2014 paper Foundations of an Alternative Approach to Reification in RDF and Blazegraph's implementation of it. I decided to play with this a bit in Blazegraph in order to get a hands-on appreciation of what was possible, and I like it. (Olaf recently mentioned on Twitter that these capabilities are being added into Apache Jena as well, so this isn't just a Blazegraph thing.)

As I described in Trying out Blazegraph two years ago, it's pretty simple to download the Blazegraph jar, start it up, load RDF data, and query it. For my RDF* experiments, I started up Blazegraph and created a Blazegraph namespace with a mode of rdr and then did my first few experiments there.

I started with the examples in Olaf's slides RDF* and SPARQL*: An Alternative Approach to Statement-Level Metadata in RDF. To make the slides visually cleaner, he left out full URIs and prefixes, so I added some to properly see the querying in action. I loaded his slide 15 data into my new Blazegraph namespace, specifying a format of Turtle-RDR. The double angle brackets that you see here are the RDF* extension that lets us create triples that are themselves resources that we can use as subjects and objects of other triples:

@prefix d: <http://www.learningsparql.com/ns/data/> .
<<d:Kubrik d:influencedBy d:Welles>> d:significance 0.8 ;
      d:source <https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick> .

This data tells us that the triple about Kubrik being influenced by Welles has a significance of 0.8 and a source at an article on nofilmschool.com.

I then executed the following query, based on Olaf's from slide 16, with no problem:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x WHERE {
  <<?x d:influencedBy d:Welles>> d:significance ?sig .
  FILTER (?sig > 0.7)
}

In this case, the use of the double angle brackets is the SPARQL* extension that lets us do the same thing that this syntax does in RDF*. This query asks for whoever was named as being influenced by Welles in statements that have a significance greater than 0.7. The query worked just fine in Blazegraph.

SPARQL* also lets you query for the components of triples that are being treated as independent resources. From Olaf's slide 17, this next query asks for whoever was influenced by Welles and the significance and source of any returned statements, and it worked fine with the data above:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?sig ?src WHERE {
  <<?x d:influencedBy d:Welles>> d:significance ?sig ;
  d:source ?src .
}

His slide 18 query returns the same result as that one, but takes the syntax a bit further by binding the triple pattern about someone being influenced by Welles to a variable and then using that variable as the subject of additional triple patterns:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?sig ?src WHERE {
  BIND(<<?x d:influencedBy d:Welles>> AS ?t)
  ?t  d:significance ?sig ;
      d:source ?src .
}

Moving on to more easy experiments, I found that all the examples on the Blazegraph page Reification Done Right worked exactly as shown there. That page also provides some nice background for ways to use RDF* and SPARQL* in Blazegraph.

Blazegraph lets you do inferencing, so I couldn't resist mixing that with RDF* and SPARQL*. I had to create a new Blazegraph namespace that not only had a Mode of rdr but also had the "Inference" box checked upon creation, and then I loaded this data:

@prefix d:    <http://www.learningsparql.com/ns/data/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<<d:s1 d:p1 d:o1>> a d:Class2 .
<<d:s2 d:p2 d:o2>> a d:Class3 .

d:Class2 rdfs:subClassOf d:Class1 . 
d:Class3 rdfs:subClassOf d:Class1 . 

It creates two triples that are themselves resources, with one being an instance of Class2 and the other an instance of Class3. Two final triples tell us that each of those classes is a subclass of Class1. The following query asked for triples that are instances of Class1, and even though the data has no explicit triples about Class1 instances, Blazegraph did the inferencing and found both of them:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?y ?z WHERE {
   <<?x ?y ?z>> a d:Class1 . 
}

After doing this inferencing, I was thinking that OWL metadata and inferencing about such triples should open up a lot of new possibilities, but I realized that none of those possibilities are necessarily new: they'll just be easier to implement than they would have been using the old method of reification that used four triples to represent one. Still, being easier to implement counts for plenty, and I think that metadata librarians and other people doing work to build value around existing triples now have a reasonable syntax and some nice tools to explore this.
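For comparison, here is a sketch of what the earlier influencedBy query would look like if the same significance value had been stored with the old rdf:Statement reification vocabulary, where each reified statement needs its rdf:type plus its rdf:subject, rdf:predicate, and rdf:object values:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX d:   <http://www.learningsparql.com/ns/data/>

SELECT ?x WHERE {
  ?st rdf:type rdf:Statement ;
      rdf:subject ?x ;
      rdf:predicate d:influencedBy ;
      rdf:object d:Welles ;
      d:significance ?sig .
  FILTER (?sig > 0.7)
}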


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Archives

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0