Getting to know Wikidata

First (SPARQL-oriented) steps.

I've written so often about DBpedia here that a few times I considered writing a book about it. As I saw Wikidata get bigger and bigger, I kept postponing the day when I would dig in and learn more about this Wikipedia sibling project. I've finally done this, starting with a few basic steps and one extra fun one:

  • Learn how to hit the SPARQL endpoint from an operating system command line with curl

  • Explore, if available, the web form front end to the endpoint

  • Learn how to find the identifier for whatever I like (a band, a person, a concept) so that I can create queries about it

  • Automate the finding of the identifier when looking at a Wikipedia page

Wikidata SPARQL queries from the command line

For that first task, you can append an escaped version of your query to https://query.wikidata.org/sparql?query= and pass that to curl. For example, doing it with the query "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10" gives you this:

        curl "https://query.wikidata.org/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010"

That command line retrieves the result in the default XML format. curl's -H option lets you add HTTP header information to your request; for example, adding '-H "Accept: text/csv"' after 'curl' on the command line above retrieves a CSV version of the result set instead of XML.
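
The same request can be sketched in Python, which also shows where the escaped query string comes from. This is a minimal sketch, not the only way to do it: the helper names build_query_url and fetch_csv and the User-Agent string are made up for illustration, and fetch_csv needs network access, so it is defined here but not called.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql?query="


def build_query_url(sparql):
    """Append the percent-encoded query to the endpoint URL."""
    # safe="" also escapes "/", which quote() leaves alone by default.
    return ENDPOINT + quote(sparql, safe="")


def fetch_csv(sparql):
    """Send the query with an Accept header, like curl's -H option."""
    request = Request(
        build_query_url(sparql),
        headers={
            "Accept": "text/csv",
            # Hypothetical identifying string; substitute your own.
            "User-Agent": "wikidata-first-steps-example/0.1",
        },
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


url = build_query_url("SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10")
print(url)
```

The printed URL should be the same one passed to curl above. Calling fetch_csv with the same query string would return the CSV text of the result set.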

Web form front end for entering Wikidata SPARQL queries

https://query.wikidata.org/ is one of the nicest web forms I've ever seen for entering SPARQL queries. It offers color coding, auto-completion, and drop-down menus of tools, prefixes, and help.

When I enter a query like the one above into this form and click the Run button, the form runs the query and shows a URL in the browser's address bar that incorporates the query. Pasting that full URL into another browser address bar takes me to the query form and enters that query (see this for an example), but doesn't execute it the way DBpedia does in the same situation--with the Wikidata form, you still need to click that Run button. If anyone knows of some parameter that I can add to the Wikidata URL to make this happen, I'd love to hear about it; I could then use it to replace the delivery of the handful of JSON in the scriptlet described below. March 4 update: I have learned from Jonas M. Kress that appending the escaped query to "https://query.wikidata.org/embed.html#" gives you a URL that will execute the query directly, like this.
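
Building such a self-executing URL is just string concatenation once the query is escaped. A small Python sketch, assuming only the embed.html# base mentioned in the update (the function name embed_url is made up for illustration):

```python
from urllib.parse import quote

EMBED_BASE = "https://query.wikidata.org/embed.html#"


def embed_url(sparql):
    """Return a URL that runs the query directly, per the March 4 update."""
    # The escaped query goes after the "#", so it travels as a URL fragment.
    return EMBED_BASE + quote(sparql, safe="")


print(embed_url("SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 10"))
```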

Finding the identifier for a resource starting at its Wikipedia page

Feb 27 update: it looks like I went to a lot of unnecessary trouble when I should have paid closer attention to the Wikipedia pages themselves, which now have a "Wikidata item" link on the left. I learned about this from Raffaele Messuti, who also told me that a Ctrl+option+g keystroke will do the same thing. This keystroke combination didn't work for me using a Das Keyboard under Ubuntu with either Chrome or Firefox, but it may work for you. The important thing is the nice link from every Wikipedia page to the corresponding Wikidata page, although you'll want to substitute "/entity/" for "/wiki/" in the Wikidata URL to get the actual entity URI.
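
That substitution is easy to script. Here is a minimal Python sketch; the function name entity_uri is made up, and switching the scheme from https to http is an assumption made so that the result matches the form of the entity URIs quoted in this post (such as http://www.wikidata.org/entity/Q3947).

```python
def entity_uri(wikidata_page_url):
    """Turn a Wikidata page URL into the corresponding entity URI."""
    # Swap the first "/wiki/" for "/entity/", as described above.
    uri = wikidata_page_url.replace("/wiki/", "/entity/", 1)
    # Assumption: use the http scheme, matching the URIs quoted in this post.
    if uri.startswith("https://"):
        uri = "http://" + uri[len("https://"):]
    return uri


print(entity_uri("https://www.wikidata.org/wiki/Q3947"))
```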

When viewing a Wikipedia page for something, you can usually find that thing's DBpedia URI by rearranging the Wikipedia URL a little. Almost six years ago I automated this in a scriptlet that takes a browser from a Wikipedia page to the DBpedia URI for the page's subject in one click.

The usage of the English terms from the Wikipedia URLs in the corresponding DBpedia URIs worked pretty well for a bottom-up, easily crowd-sourced bootstrapping of the DBpedia URI design, but the English basis and the problems introduced by the occasional use of punctuation are not ideal. The Wikidata team did more initial design of the URI structure and went with the best practice of not incorporating actual names. (My favorite explication of this practice is on slides 41 and 42 of this BBC slide deck.) For example, while the DBpedia URI for "house" is http://dbpedia.org/resource/House, the Wikidata one is http://www.wikidata.org/entity/Q3947.

So if we can't go from a Wikipedia page to a Wikidata URI by manipulating a string version of the Wikipedia URL, how do we do it? The Wikibase/Indexing/RDF Dump Format page explains a lot about the structure of the data, and its Sitelinks section describes how a triple with a predicate of schema:about links a Wikipedia page to the Wikidata URI for the entity being described. If I want to know the URI for the concept of House and I know the concept's Wikipedia URL, I can enter the query "SELECT ?uri WHERE { <https://en.wikipedia.org/wiki/House> schema:about ?uri }". (You can try it in the Wikidata query form by clicking here.)
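
To reuse that query for other pages, the Wikipedia URL can be dropped into a template. The Python sketch below is only an illustration: the function name sitelink_query is made up, and the explicit PREFIX line is an addition for endpoints or tools that don't already know the schema: prefix (the query above works in the Wikidata form without it).

```python
SCHEMA_PREFIX = "PREFIX schema: <http://schema.org/>"


def sitelink_query(wikipedia_url):
    """Build a query asking which entity a Wikipedia page is about."""
    # A ">" or whitespace inside the URL would break the <...> IRI syntax.
    if ">" in wikipedia_url or any(ch.isspace() for ch in wikipedia_url):
        raise ValueError("URL cannot be written as a SPARQL IRI: " + wikipedia_url)
    return (
        SCHEMA_PREFIX + "\n"
        + "SELECT ?uri WHERE { <" + wikipedia_url + "> schema:about ?uri }"
    )


print(sitelink_query("https://en.wikipedia.org/wiki/House"))
```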

Automating that

To go from a Wikipedia page to a Wikidata URI in one click, I needed to embed a SPARQL query about the page's schema:about value in a scriptlet that would send the query to the Wikidata SPARQL endpoint. (I would have liked to send it to the query form and execute that, but as I described above, I couldn't work out how to trigger the running of the query from the submitted URL.) I did get this to work, and you can drag this link to your Chrome bookmarks bar: wp -> wikidata.

The scriptlet is a bit limited, though:

  • It returns a small handful of JSON instead of just the URI, which I would have preferred.

  • When used with Chrome, it displays the JSON in the browser. In a brief test with Firefox, the browser offered to download the JSON instead of displaying it.

  • I mentioned above how Wikipedia and DBpedia use English words in their URL identifiers, and this often includes disambiguation language, so the scriptlet doesn't work on those. For example, adding the string "Asteroid" to the base URL "https://en.wikipedia.org/wiki/" will give you the Wikipedia URL for the English-language page describing minor planets, and if you're looking at the Wikipedia page for that, my new scriptlet will work just fine. However, if you add the string "Rock" to the same base URL, you get the URL for a Wikipedia disambiguation page. If you are viewing the Wikipedia page for Rock (geology), my scriptlet's little bit of string manipulation that constructs a SPARQL query to send to the Wikidata endpoint won't have enough to go on.
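
On the first limitation: pulling the bare URI out of that handful of JSON takes only a few lines outside the browser. The Python sketch below assumes the standard SPARQL JSON results layout (results, then bindings, then one object per variable); the sample document is hand-written for illustration rather than captured from the endpoint, and the function name first_uri is made up.

```python
import json


def first_uri(json_text, variable="uri"):
    """Return the first value bound to `variable`, or None if there are no rows."""
    bindings = json.loads(json_text)["results"]["bindings"]
    if not bindings:
        return None
    return bindings[0][variable]["value"]


# Hand-written sample in the SPARQL JSON results layout (not real endpoint output).
sample = """
{
  "head": {"vars": ["uri"]},
  "results": {"bindings": [
    {"uri": {"type": "uri", "value": "http://www.wikidata.org/entity/Q3947"}}
  ]}
}
"""

print(first_uri(sample))
```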

The scriptlet is about 180 characters of JavaScript that does the following:

  1. For the current location in the browser (that is, the URL of the displayed Wikipedia page) replace any underscores with %2520. This is the escaped version of the escaped version of a space character; I discovered through trial and error that this is necessary.

  2. Escape the remainder of that URL as necessary.

  3. Insert the result into a SPARQL query of the form SELECT ?uri WHERE {<escaped-url> schema:about ?uri}

  4. Create a SPARQL endpoint GET request URL by appending all that to "https://query.wikidata.org/sparql?query=" and add "&format=json" at the end. (I tried "&format=csv" but instead of displaying the result Chrome offered to download it.)

  5. Set location.href to the result. This "sends" the browser to the constructed URL, which should then display the result of the query in JSON.
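
The steps above can be sketched in Python to show where the %2520 comes from. This is a reconstruction under assumptions, not the scriptlet's actual JavaScript: the function name scriptlet_url is made up, and steps 1 and 2 are folded together by writing each underscore as %20 inside the IRI and then escaping the whole query once, which turns %20 into %2520.

```python
from urllib.parse import quote

ENDPOINT = "https://query.wikidata.org/sparql?query="


def scriptlet_url(page_url):
    """Build the endpoint URL that the numbered steps describe."""
    # Steps 1 and 2 (one reading): a "_" in the page URL stands for a space,
    # written as "%20" inside the IRI. Escaping the whole query afterwards
    # turns the "%" into "%25", so "%20" ends up as "%2520".
    iri = page_url.replace("_", "%20")
    # Step 3: insert the IRI into the query template.
    query = "SELECT ?uri WHERE {<" + iri + "> schema:about ?uri}"
    # Step 4: append the escaped query to the endpoint and ask for JSON.
    return ENDPOINT + quote(query, safe="") + "&format=json"


# Step 5 has no direct equivalent here; webbrowser.open(url) from the
# standard library would be the closest analogue to setting location.href.
url = scriptlet_url("https://en.wikipedia.org/wiki/Minor_planet")
print(url)
```

One consequence of this reading: a page URL that already contains percent-escapes would have them escaped a second time, so the sketch is only meant for simple titles made of letters and underscores.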

Once I could find the URIs to represent the resources I was interested in, it was time to start querying for information about them. In my next blog entry, I'll talk about exploring Wikidata and its RDF-related resources with SPARQL. There are definitely some great features there.

