23 April 2017

The Wikidata data model and your SPARQL queries

Reference works to get you taking advantage of the fancy parts quickly.


Last month I promised that I would dig further into the Wikidata data model, its mapping to RDF, and how we can take advantage of this with SPARQL queries. I had been trying to understand the structure of the data based on the RDF classes and properties I saw and the documentation that I could find, and some of the vocabulary discussing these issues confused me--for example, RDF is about describing resources, but I was seeing lots of references to entities, which can mean slightly different things in different branches of computer science. But, as Daniel Kinzler explained to me, "The Wikidata (or technically, Wikibase) data model is not defined in terms of RDF"; RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.

Wikibase, as described by its home page, "is a collection of applications and libraries for creating, managing and sharing structured data...Wikibase was developed for and is used by Wikidata, the free knowledge base and Wikipedia, the encyclopedia that anyone can edit." The same page describes Wikidata as one of the "projects powered by Wikibase", along with the Europeana EAGLE project and DroidWiki.

The Wikibase/DataModel document is fairly long and detailed, and I would suggest starting instead with the Wikibase/DataModel/Primer. The Primer describes how "Entities are the basic elements of the knowledge base" and how "there are two predefined kinds of Entities: Items and Properties" (both of which RDF people consider to be resources). The document goes on to describe the information that can be associated with items and properties.

I had originally found their RDF Dump Format document abstruse and confusing, but it was easier to follow after I read the Wikibase data model primer because I had a better idea of the dump format's basis. It's even easier to follow if you just skim the Dump Format document to get a general idea of what it covers and then go to the Wikidata query service/User Manual, where you'll get an even faster start querying Wikidata. (Their sample queries that I described last month also help a lot.) The User Manual describes the declared prefixes, some nice tricks for taking advantage of different kinds of labels, how to work with geo data, available endpoints that you can federate into your queries, and more. It also provides more context for understanding the Dump Format document.

The Data Model document describes the fundamental role of statements in the Wikibase data model. (Longstanding members of the RDF community will enjoy Kingsley Idehen's continuation of my thread with Daniel, in which Kingsley insists that Wikidata is a collection of reified RDF statements, and Daniel says that, well, no, not really. They eventually agree to disagree.) The RDF Dump Format document describes two statement types that are important to how we treat Wikidata as an RDF repository but are also potentially very confusing. The first type is known as a truthy statement, or "direct claim"; these are simple triples that assert facts. The other statement type is the full statement, which is used to "represent all data about the statement in the system".

As one way to quickly recognize the difference, Wikidata usually uses specific namespaces in specific places in both truthy and full statements. For example, the namespace http://www.wikidata.org/prop/direct/, abbreviated with the prefix wdt:, is usually used for the predicate of a truthy statement. (The Dump Format document has a nice list of all of these in its Predicates section. As you work with this data, you'll often go back to the Prefixes used section of the RDF Dump Format document and the Full list of prefixes section that follows it.)

Here's an example of the two kinds of statements that Daniel provided me: the triple {wd:Q64 wdt:P1376 wd:Q183} is a truthy triple saying that Berlin is the capital of Germany. Here is the full version of that statement:

wds:Q64-43CCD3D6-F52E-4742-B0E3-BCA671B69D2C a wikibase:Statement,
                 wikibase:BestRank ;
   wikibase:rank wikibase:PreferredRank ;
   ps:P1376 wd:Q183 ;
   prov:wasDerivedFrom wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 .

wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 a wikibase:Reference ;
   pr:P248 wd:Q451546 ;
   pr:P958 "Artikel 2 (1)" .
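Given that structure, a query along these lines gets you from the item to the full statement. This is just a sketch; it assumes the p: prefix (http://www.wikidata.org/prop/), which links an item to its statement nodes, along with the ps: and wikibase: terms shown in the Turtle above:

SELECT ?statement ?rank ?source WHERE 
{
  wd:Q64 p:P1376 ?statement .                           # Berlin's "capital of" statement node,
  ?statement ps:P1376 wd:Q183 ;                         # whose value is Germany,
             wikibase:rank ?rank .                      # along with its rank
  OPTIONAL { ?statement prov:wasDerivedFrom ?source }   # and its reference, if any
}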
      

To understand this better, I wanted to see this for a different statement: the fact that bebop musician Tommy Potter played the bass. First, I clicked the "Wikidata item" link on Potter's Wikipedia page and substituted /entity/ for /wiki/, as I described in my February blog entry Getting to know Wikidata, to get the URI that represents him: http://www.wikidata.org/entity/Q1369941.

However, after doing this, finding the triple about the instrument he played wasn't as simple as you might think. A query for {wd:Q1369941 ?p ?o} (using the prefix substitution for brevity) retrieves all the triples about him, but they're the "truthy" ones, whose predicates are known as direct claim predicates. Three of these triples described him as a Jazzbassist, a contrebassiste de jazz, and a contrabbassista statunitense, but none listed the bass as the instrument that he played in any language. Querying for those predicates as subjects--that is, asking for more information about the properties used in the truthy triples, such as whether they have rdfs:label values in different languages--showed very little. It turned out that, to learn more about these properties, I had to look for triples that had them as objects, with a predicate of wikibase:directClaim linking the actual Wikidata data model property to the predicate used in the direct claim. When I then queried for triples that had these Wikidata data model properties as subjects, I found plenty.

To put these relationships to use, I entered the following query to find out more about Tommy Potter:

SELECT ?pname ?o ?olabel WHERE 
{
  wd:Q1369941 ?directClaimP ?o .          # Get the truthy triples.
  ?p wikibase:directClaim ?directClaimP . # Find the Wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates
  FILTER ( lang(?pname) = "en" )          # and their labels, in English.
  OPTIONAL {
     ?o rdfs:label ?olabel  
     FILTER ( lang(?olabel) = "en" )
  }
}

The result of this query is a mostly human-readable statement of facts about him. You could substitute the URI for just about any Wikidata entity as the subject in that first triple pattern to see information about that entity. You could also view the property names in languages other than English, which is a big advantage of the Wikibase data model.

If you send your browser to the http://www.wikidata.org/entity/Q1369941 URI that represents Potter, you will get redirected to a Wikidata page with a nicely formatted human-readable version of the data about Potter at https://www.wikidata.org/wiki/Q1369941. On the other hand, if you add .ttl (or .nt or .rdf) to the end of the /entity/ version of the URI, you'll get RDF of all the data about Potter, including the full representations with triples that use terms such as wikibase:BestRank and prov:wasDerivedFrom, just like the full version of the data above about Berlin being the capital of Germany.
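From the command line, that looks something like the following sketch; curl's -L flag makes it follow the redirect from the /entity/ URI to the page that actually serves the serialized data:

curl -L http://www.wikidata.org/entity/Q1369941.ttl   # Turtle
curl -L http://www.wikidata.org/entity/Q1369941.rdf   # RDF/XML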

After looking at the full data about Potter, I found that some queries to learn more about it returned less than I expected. I eventually learned from the WDQS data differences section of the RDF Dump Format document that "Data nodes (wdata:Q2) are not stored... This is done for performance reasons."

After all this exploration, I still haven't gotten to the kinds of structural queries I've been planning on--for example, looking for instances based on their class's relationship(s) to other classes. The Stack Exchange question How to include sub-classes in a Wikidata SPARQL query?, which has a solid answer, looks pretty inspirational. I'm looking forward to playing with it.

Meanwhile, as you use SPARQL to play with Wikidata, you're going to see a lot of cryptic resource names, like wdt:P279 in the Stack Exchange answer, and you'll wonder what their human-readable names are. I created the form below to help me with the prefixes I used the most. You can use this form yourself (for example, enter P279 in the wdt: field and press Enter), but you'd probably be best off copying it from this page's source into your own page that you can customize.
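If you'd rather not use a form, a query along these lines does the same lookup; this sketch uses the wikibase:directClaim link described above to get from the wdt: predicate back to the property entity that carries the labels:

SELECT ?label WHERE 
{
  ?property wikibase:directClaim wdt:P279 ;  # the property behind this direct claim predicate
            rdfs:label ?label .
  FILTER ( lang(?label) = "en" )
}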

It turns out that wdt:P279 means "subclass of". This is something I'll certainly be getting to know better in the future.

[Lookup form with fields for wd:, wdt:, and p: identifiers]
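As a preview of those structural queries, the kind of pattern used in that Stack Exchange answer combines wdt:P31 ("instance of") with a wdt:P279* property path so that instances of a class's subclasses get matched as well. Here's a sketch of that pattern; the class wd:Q34379 (which I believe is "musical instrument") is just an illustrative choice of mine:

SELECT ?item ?itemLabel WHERE 
{
  ?item wdt:P31/wdt:P279* wd:Q34379 .   # an instance of the class or of any of its subclasses
  ?item rdfs:label ?itemLabel .
  FILTER ( lang(?itemLabel) = "en" )
}
LIMIT 20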

Please add any comments to this Google+ post.

26 March 2017

Wikidata's excellent sample SPARQL queries

Learning about the data, its structure, and more.

[Image: part of the Genghis Khan descendants graph]

Last month I finally got to know Wikidata more and saw that it has a lot of great stuff to explore. I've continued to explore the data and its model using two strategies: exploring the ontology built around the data and playing with the sample queries.

Exploring the ontology takes some work. I'll describe the resources available for this (and the ontology!) in greater detail when I have a better handle on it all. For sample queries, I have my own queries that I use to explore a dataset, as I described in the "Exploring the Data" section of the Learning SPARQL chapter "A SPARQL Cookbook", but the wise people behind Wikidata have done much better than this by giving us a page of sample queries that highlight some of the data and syntax available.

The sample queries range from simple to complex, and each has a "Try it!" link that loads the query into the query form. (Before you get too far into the list of queries, note that the RDF Dump Format documentation page, which I will describe more next time, has a list of the URIs represented by the prefixes in the queries.)

Here are some that I particularly liked after my brief tour:

  • The second example query, for data about Horses, is a good example of the excellent commenting that you will find in many of the sample queries.

  • The Recent Events query nicely demonstrates how Wikidata models time and how a query can use that to identify events within a particular time window--in the case of this sample query, between 0 and 31 days ago.

  • The Popular eye colors one demonstrates the use of Default views--special comments that the Wikidata Query Service understands as directives about how to present the data. (A sketch of a query that uses one appears after this list.) The eye color query's directive of "#defaultView:BubbleChart" means that running the query on https://query.wikidata.org will (quickly!) give you this:

    [Image: bubble chart produced by the eye color query]

    Popular surnames among humans creates another nice bubble chart.

  • The Even more cats, with pictures query that follows the eye color one uses an ImageGrid defaultView to create the following, finally filling the gap between "SPARQL" and "cat pictures" that has bedeviled web technology for so long:

    [Image: grid of cat pictures produced by the query]

    The remaining six defaultViews also look like a lot of fun.

  • The Children of Genghis Khan sample query uses the Graph defaultView to display Khan's children and grandchildren, with images of them when available, in a graph that lets you zoom and drag nodes around. A piece of it is shown above. The Music Genres query after that is similar. The line graph resulting from the Number of bands by year and genre query is also interesting.
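Here is a sketch of the kind of query behind the eye color example, mostly to show how the defaultView directive rides along as a comment. The identifiers are my assumptions--wd:Q5 for "human" and wdt:P1340 for the eye color property--so check the sample itself for the real thing:

#defaultView:BubbleChart
SELECT ?colorLabel (COUNT(?person) AS ?count) WHERE 
{
  ?person wdt:P31 wd:Q5 ;        # people (instances of "human")
          wdt:P1340 ?color .     # who have a recorded eye color
  ?color rdfs:label ?colorLabel .
  FILTER ( lang(?colorLabel) = "en" )
}
GROUP BY ?colorLabel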

After getting this far, I hadn't even seen 10% of the sample queries, but I did find the answer to my original question about how to get to know the range of possibilities with SPARQL queries of Wikidata better. (One more nice sample query that I wanted to mention is not on the samples page but on the User Manual one: an example of Geospatial searches that lists airports within 100km of Berlin.)

To really learn about how Wikidata executes SPARQL queries, the SPARQL query service/query optimization page provides good background on how Blazegraph, the triplestore and query engine that Wikidata's SPARQL endpoint uses, goes about executing the queries. (I found it pretty gutsy of this page's authors to add a "Try it!" link after a sample query that the page itself says will time out.) As I wrote in the "Query Efficiency and Debugging" chapter of "Learning SPARQL", query engines often optimize for you. Their methods for doing so are how these query engines try to distinguish themselves from each other, so learning more about the one that you're using is worth it when you're dealing with large-scale data like Wikidata. The "SPARQL query service/query optimization" page also describes how adding an explain keyword to the query URL will get you a report on how it parses and optimizes your query.
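If I'm reading that page correctly, this means taking a request like the curl example in the February entry below and adding the explain parameter to the URL, along these lines (a sketch; the exact parameter form is documented on that page):

curl "https://query.wikidata.org/sparql?explain&query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010"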

As much as I'd like to keep playing with the sample queries, I'm going to dig into the Wikidata data model and its mapping to RDF next. Watch this space...


Please add any comments to this Google+ post.

26 February 2017

Getting to know Wikidata

First (SPARQL-oriented) steps.


I've written so often about DBpedia here that a few times I considered writing a book about it. As I saw Wikidata get bigger and bigger, I kept postponing the day when I would dig in and learn more about this Wikipedia sibling project. I've finally done this, starting with a few basic steps and one extra fun one:

  • Learn how to hit the SPARQL endpoint from an operating system command line with curl

  • Explore, if available, the web form front end to the endpoint

  • Learn how to find the identifier for whatever I like (a band, a person, a concept) so that I can create queries about it

  • Automate the finding of the identifier when looking at a Wikipedia page

Wikidata SPARQL queries from the command line

For that first task, you can append an escaped version of your query to https://query.wikidata.org/sparql?query= and pass that to curl. For example, doing it with the query "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10" gives you this:

        curl https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010
      

That command line retrieves the result in the default XML format. curl's -H option lets you add HTTP header information to your request; for example, adding -H "Accept: text/csv" after curl on the command line above retrieves a CSV version of the result set instead of XML.
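For example, a CSV-returning version of the command above might look like this:

curl -H "Accept: text/csv" https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010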

Web form front end for entering Wikidata SPARQL queries

https://query.wikidata.org/ is one of the nicest web forms I've ever seen for entering SPARQL queries. It offers color coding, auto-completion, and drop-down menus of tools, prefixes, and help.

When I enter a query like the one above into this form and click the Run button, the form runs the query and shows a URL in the browser's address bar that incorporates the query. Pasting that full URL into another browser address bar takes me to the query form and enters that query (see this for an example), but doesn't execute it the way DBpedia does in the same situation--with the Wikidata form, you still need to click that Run button. If anyone knows of some parameter that I can add to the Wikidata URL to make this happen, I'd love to hear about it; I could then use it to replace the delivery of the handful of JSON in the scriptlet described below. March 4 update: I have learned from Jonas M. Kress that appending the escaped query to "https://query.wikidata.org/embed.html#" gives you a URL that will execute the query directly, like this.
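Using the query from the curl example above, that pattern produces a URL like the following sketch, where everything after the # is the same escaped query:

https://query.wikidata.org/embed.html#SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010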

Finding the identifier for a resource starting at its Wikipedia page

Feb 27 update: it looks like I went to a lot of unnecessary trouble when I should have paid closer attention to the Wikipedia pages themselves, which now have a "Wikidata item" link on the left. I learned about this from Raffaele Messuti, who also told me that a Ctrl+option+g keystroke will do the same thing. This keystroke combination didn't work for me using a Das Keyboard under Ubuntu with either Chrome or Firefox, but may for you. The important thing is the nice link from every Wikipedia page to the corresponding Wikidata page, although you'll want to substitute "/entity/" for "/wiki/" in the Wikidata URL to get the actual entity URI.

When viewing a Wikipedia page for something, you can usually find that thing's DBpedia URI by rearranging the Wikipedia URL a little. Almost six years ago I automated this in a scriptlet that takes a browser from a Wikipedia page to the DBpedia URI for the page's subject in one click.

The use of English terms from the Wikipedia URLs in the corresponding DBpedia URIs worked pretty well for a bottom-up, easily crowd-sourced bootstrapping of the DBpedia URI design, but the English basis and the problems introduced by the occasional use of punctuation are not ideal. The Wikidata team did more initial design of the URI structure and went with the best practice of not incorporating actual names. (My favorite explication of this practice is on slides 41 and 42 of this BBC slide deck.) For example, while the DBpedia URI for "house" is http://dbpedia.org/resource/House, the Wikidata one is http://www.wikidata.org/entity/Q3947.

So if we can't go from a Wikipedia page to a Wikidata URI by manipulating a string version of the Wikipedia URL, how do we do it? The Wikibase/Indexing/RDF Dump Format page explains a lot about the structure of the data, and its Sitelinks section describes how a triple with a predicate of schema:about links a Wikipedia page to the Wikidata URI for the entity being described. If I want to know the URI for the concept of House and I know the concept's Wikipedia URL, I can enter the query "SELECT ?uri WHERE { <https://en.wikipedia.org/wiki/House> schema:about ?uri }". (You can try it in the Wikidata query form by clicking here.)

Automating that

To go from a Wikipedia page to a Wikidata URI in one click, I needed to embed a SPARQL query about the page's schema:about value in a scriptlet that would send the query to the Wikidata SPARQL endpoint. (I would have liked to send it to the query form and execute that, but as I described above, I couldn't work out how to trigger the running of the query from the submitted URL.) I did get this to work, and you can drag this link to your Chrome bookmarks bar: wp -> wikidata.

The scriptlet is a bit limited, though:

  • It returns a small handful of JSON instead of just the URI, which I would have preferred.

  • When used with Chrome, it displays the JSON in the browser. In a brief test with Firefox, the browser offered to download the JSON instead of displaying it.

  • I mentioned above how Wikipedia and DBpedia use English words in their URL identifiers, and this often includes disambiguation language, so the scriptlet doesn't work on those. For example, adding the string "Asteroid" to the base URL "https://en.wikipedia.org/wiki/" will give you the Wikipedia URL for the English-language page describing minor planets, and if you're looking at the Wikipedia page for that, my new scriptlet will work just fine. However, if you add the string "Rock" to the same base URL, you get the URL for a Wikipedia disambiguation page. If you are viewing the Wikipedia page for Rock (geology), my scriptlet's little bit of string manipulation that constructs a SPARQL query to send to the Wikidata endpoint won't have enough to go on.

The scriptlet is about 180 characters of JavaScript that does the following (a rough sketch in JavaScript appears after these steps):

  1. For the current location in the browser (that is, the URL of the displayed Wikipedia page) replace any underscores with %2520. This is the escaped version of the escaped version of a space character, which I discovered is necessary through trial and error.

  2. Escape the remainder of that URL as necessary.

  3. Insert the result into a SPARQL query of the form SELECT ?uri WHERE {<escaped-url> schema:about ?uri}

  4. Create a SPARQL endpoint GET request URL by appending all that to "https://query.wikidata.org/sparql?query=" and add "&format=json" at the end. (I tried "&format=csv" but instead of displaying the result Chrome offered to download it.)

  5. Set location.href to the result. This "sends" the browser to the constructed URL, which should then display the result of the query in JSON.
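Here is a rough, uncompressed sketch of those steps, not the actual 180-character version; the variable names are just illustrative. It swaps underscores for the escaped space %20 and then escapes the whole query, which is what yields the doubly escaped %2520 mentioned in step 1:

javascript:(function () {
  // 1. In the current Wikipedia URL, swap underscores for escaped spaces.
  var page = location.href.replace(/_/g, '%20');
  // 2./3. Build the schema:about query around that URL.
  var query = 'SELECT ?uri WHERE {<' + page + '> schema:about ?uri}';
  // 4. Escape the query, append it to the endpoint URL, and ask for JSON.
  var url = 'https://query.wikidata.org/sparql?query=' +
            encodeURIComponent(query) + '&format=json';
  // 5. Send the browser there to display the JSON result.
  location.href = url;
})();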

Once I could find the URIs to represent the resources I was interested in, it was time to start querying for information about them. In my next blog entry, I'll talk about exploring Wikidata and its RDF-related resources with SPARQL. There are definitely some great features there.


Please add any comments to this Google+ post.