The Wikidata data model and your SPARQL queries

Reference works to get you taking advantage of the fancy parts quickly.
RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.

Last month I promised that I would dig further into the Wikidata data model, its mapping to RDF, and how we can take advantage of this with SPARQL queries. I had been trying to understand the structure of the data based on the RDF classes and properties I saw and the documentation that I could find, and some of the vocabulary discussing these issues confused me--for example, RDF is about describing resources, but I was seeing lots of references to entities, which can mean slightly different things in different branches of computer science. But, as Daniel Kinzler explained to me, "The Wikidata (or technically, Wikibase) data model is not defined in terms of RDF"; RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.

Wikibase, as described by its home page, "is a collection of applications and libraries for creating, managing and sharing structured data...Wikibase was developed for and is used by Wikidata, the free knowledge base and Wikipedia, the encyclopedia that anyone can edit." The same page describes Wikidata as one of the "projects powered by Wikibase", along with the europeana eagle project and Droid wiki.

The Wikibase/DataModel document is fairly long and detailed, and I would suggest starting instead with the Wikibase/DataModel/Primer. The Primer describes how "Entities are the basic elements of the knowledge base" and how "there are two predefined kinds of Entities: Items and Properties" (both of which RDF people consider to be resources). The document goes on to describe the information that can be associated with items and properties.

I had originally found their RDF Dump Format document abstruse and confusing, but it was easier to follow after I read the Wikibase data model primer because I had a better idea of the dump format's basis. It's even easier to follow if you just skim the Dump Format document to get a general idea of what it covers and then go to the Wikidata query service/User Manual, where you'll get an even faster start querying Wikidata. (Their sample queries that I described last month also help a lot.) The User Manual describes the declared prefixes, some nice tricks for taking advantage of different kinds of labels, how to work with geo data, available endpoints that you can federate into your queries, and more. It also provides more context for understanding the Dump Format document.

The Data Model document describes the fundamental role of statements in the Wikibase data model. (Longstanding members of the RDF community will enjoy Kingsley Idehen's continuation of my thread with Daniel, in which Kingsley insists that Wikidata is a collection of reified RDF statements, and Daniel says that, well, no, not really. They eventually agree to disagree.) The RDF Dump Format document describes two statement types that are important to how we treat Wikidata as an RDF repository but are also potentially very confusing. The first type is known as a truthy statement, or "direct claim"; these are simple triples that assert facts. The other statement type is the full statement, which is used to "represent all data about the statement in the system".

As one way to quickly recognize the difference, Wikimedia usually uses specific namespaces in specific places in both truthy and full statements. For example, the namespace http://www.wikidata.org/prop/direct/, which is abbreviated using the prefix wdt:, is usually used for the predicate of a truthy statement. (The Dump format document has a nice list of all of these in the Predicates section. As you work with this data, you'll often go back to the Prefixes used section of the RDF Dump Format and also the Full list of prefixes section that follows it.)

Here's an example of the two kinds of statements that Daniel provided me: the triple {wd:Q64 wdt:P1376 wd:Q183} is a truthy triple saying that Berlin is the capital of Germany. Here is the full version of that statement:

wds:Q64-43CCD3D6-F52E-4742-B0E3-BCA671B69D2C a wikibase:Statement,
                 wikibase:BestRank ;
   wikibase:rank wikibase:PreferredRank ;
   ps:P1376 wd:Q183 ;
   prov:wasDerivedFrom wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 .

wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 a wikibase:Reference ;
   pr:P248 wd:Q451546 ;
   pr:P958 "Artikel 2 (1)" .
      

To understand this better, I wanted to see this for a different statement: the fact that bebop musician Tommy Potter played the bass. First, I clicked the "Wikidata item" link on Potter's Wikipedia page and substituted /entity/ for /wiki/, as I described in my February blog entry Getting to know Wikidata, to get the URI that represents him: http://www.wikidata.org/entity/Q1369941.

However, after doing this, it wasn't as simple as you might think to find the triple about the instrument he played. A query for {wd:Q1369941 ?p ?o} (using the prefix substitution for brevity) retrieves all the triples about him, but they're the "truthy" ones, in which the predicates are known as direct claim predicates. Three of these triples described him as a Jazzbassist, a contrebassiste de jazz, and a contrabbassista statunitense, but none listed the "bass" as the instrument that he played in any language. Queries about the predicates themselves--that is, queries for triples where the properties used by these triples were the objects so that I could learn more about the truthy triples I retrieved about Potter (for example, whether the properties have rdfs:label values in different languages)--showed very little information. It turned out that, to learn more about these properties, I could look for triples that had these properties as objects, with a predicate of wikibase:directClaim linking the actual Wikidata data model property to the predicate used in the direct claim. When I queried for triples that had these Wikidata data model properties as subjects so that I could learn more about them, I found plenty.

To put these relationships to use, I entered the following query to find out more about Tommy Potter:

SELECT ?pname ?o ?olabel WHERE 
{
  wd:Q1369941 ?directClaimP ?o .          # Get the truthy triples.
  ?p wikibase:directClaim ?directClaimP . # Find the Wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates
  FILTER ( lang(?pname) = "en" )          # and their labels, in English.
  OPTIONAL {
     ?o rdfs:label ?olabel  
     FILTER ( lang(?olabel) = "en" )
  }
}

The result of this query is a mostly-human readable statement of facts about him. You could substitute the URI for just about any Wikidata entity as the subject in that first triple pattern to see information about that entity. You could also view the property names in other languages besides English, which is a big advantage of the Wikibase data model.

If you send your browser to the http://www.wikidata.org/entity/Q1369941 URI that represents Potter, you will get redirected to a Wikidata page with a nicely formatted human-readable version of data about Potter at https://www.wikidata.org/wiki/Q1369941. On the other hand, if you add .ttl (or .nt or .rdf) to the end of the /entity/ version of the URI, you'll get RDF of all the data about Potter, including the full representations with triples that include predicates such as wikibase:BestRank and prov:wasDerivedFrom, just like the full version of the data above about Berlin being the capital of Germany.

After looking at the full data about Potter, some queries to find out more about it often found less than what I expected. I eventually learned from the WDQS data differences section of the RDF Dump Format document that "Data nodes (wdata:Q2) are not stored... This is done for performance reasons."

After all this exploration, I still haven't gotten to the kinds of structural queries I've been planning on--for example, looking for instances based on their class's relationship(s) to other classes. The Stack Exchange question How to include sub-classes in a Wikidata SPARQL query?, which has a solid answer, looks pretty inspirational. I'm looking forward to playing with it.

Meanwhile, as you use SPARQL to play with Wikidata, you're going to see a lot of cryptic resource names, like wdt:P279 in the Stack Exchange answer, and you'll wonder what their human-readable name is. I created the form below to help me with the prefixes I used the most. You can use this form yourself (for example, enter P279 in the wdt: field and press Enter), but you'd probably be best off copying it from this page's source into your own page that you can customize.

It turns out that wdt:P279 means "subclass of". This is something I'll certainly be getting to know better in the future.

wd:
wdt:
p:

Please add any comments to this Google+ post.