25 June 2017

Creating Wide CSV files with SPARQL

Lots of columns and commas, but all in the right place.


I recently decided to copy my address book, which I have in an RDF file, to Google Contacts. The basic steps are pretty straightforward:

  1. In Google Contacts, create an entry with test data in every field: TestGiveName, TestFamilyName, testemail@whatever.com, and so forth.

  2. Export the contacts as a CSV file. The current default "preview" version of Google Contacts doesn't allow this yet, but you can "go to old version" and then find Export on the More drop-down menu.

  3. In the exported CSV, move the test entry created in step 1 to the second line, just under the field names.

  4. Using the field names and test entry as a guide, write a SPARQL query that returns the relevant information from the RDF address book file in the order shown in the exported file.

  5. Execute the query, requesting CSV output.

  6. Replace the query output's header row with the header row from the original exported file and then import the result into Google Contacts.

Step 4 seemed a bit intimidating. With something like 88 columns in step 2's exported CSV, I knew that messing up one comma (for example, putting the 47th piece of information after the 47th comma instead of before it) would mess up all the information after it. I have made plenty of mistakes like this when creating wide-body CSV before.

I had a great idea, though, that made it much simpler: I created the SELECT statement from the first line of the exported CSV. I copied that line to a text editor, replaced the spaces in the field names with underscores, removed the hyphens (not allowed in SPARQL variable names), and then replaced each comma with a space and a question mark to turn the name after it into a variable name. Finally, I manually added a question mark to the very first name (the global replace in the previous step didn't do that because there was no comma there) and added the word SELECT before it, and I had the SELECT statement that my query needed.
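For example (sketching the process with just the first few field names--the ones behind the query's first five variables below), a header line beginning

Name,Given Name,Additional Name,Family Name,Yomi Name

comes out as

SELECT ?Name ?Given_Name ?Additional_Name ?Family_Name ?Yomi_Name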

This way, before I'd even begun implementing the logic to pull each piece of data out of the address book RDF, I knew that when I did they would come out in the right places.

Adding two bits of that logic to a WHERE clause gave me this:

PREFIX  v: <http://www.w3.org/2006/vcard/ns#>

SELECT ?Name ?Given_Name ?Additional_Name ?Family_Name ?Yomi_Name
       ?Given_Name_Yomi ?Additional_Name_Yomi ?Family_Name_Yomi ?Name_Prefix
       ?Name_Suffix ?Initials ?Nickname ?Short_Name ?Maiden_Name ?Birthday
        # 21 more lines of variable names
       ?Custom_Field_2__Value ?Custom_Field_3__Type ?Custom_Field_3__Value
WHERE {
          ?entry v:family-name ?Family_Name . 
          ?entry v:given-name  ?Given_Name .
}

When I ran arq with this command,

arq --query addrbook2csv.rq --data addrbook.rdf --results=CSV

it gave me CSV output with the ?Family_Name and ?Given_Name values right where they needed to be for Google Contacts to import them properly.

I wish I could say that the rest of the query development was just a matter of adding triple patterns like the ?Family_Name and ?Given_Name ones shown above, but it got more complicated because of the ad hoc structure of my address book data. I needed a UNION, lots of OPTIONAL blocks, and even some nested OPTIONAL blocks that I'm not proud of. Still, I was a bit proud that I came up with this simple way to make sure that all the values came out in the right places in this fairly complicated target output.
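To give an idea of what those OPTIONAL blocks look like, here's a minimal sketch, assuming (just for illustration) that the address book data uses the vCard v:nickname and v:bday properties; entries lacking those values still come out as rows, just with those columns left empty:

SELECT ?Given_Name ?Family_Name ?Nickname ?Birthday
WHERE {
  ?entry v:family-name ?Family_Name .
  ?entry v:given-name  ?Given_Name .
  OPTIONAL { ?entry v:nickname ?Nickname . }  # these properties may be absent,
  OPTIONAL { ?entry v:bday     ?Birthday . }  # so OPTIONAL keeps the row anyway
}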


Please add any comments to this Google+ post.

29 May 2017

Instead of writing SPARQL queries for Wikidata--query for them!

Queries as data to help you get at more data.

Let's say, hypothetically, that you want to execute a SPARQL query that lists all of Wikimedia's portraits with fruit. Wikimedia does have a category for this, so what would be the quickest way to come up with the query?

If you click the Wikidata item link on this category's page, you'll see all the data about it that you can retrieve with a SPARQL query to the Wikimedia endpoint, as I've described in my last few blog entries. The cool thing for this particular resource is that one property is called Wikidata SPARQL query equivalent, and its value is the query that will retrieve a list of the portraits with fruit. In other words, Wikidata has a triple that looks like this:

subject:   wd:Q29789760 (the Wikidata category "portraits with fruit")
predicate: p:P3921 ("Wikidata SPARQL query equivalent")
object:    "SELECT DISTINCT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q838948 . ?item wdt:P136/wdt:P31?/wdt:P279* wd:Q134307 . ?item wdt:P180/wdt:P31?/wdt:P279* wd:Q3314483 . }"

Paste that object value into the Wikidata query service, and you can run it to get a list of the portraits.
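In fact, you can use a query to fetch the query: this minimal sketch, run at the Wikidata query service, should retrieve the stored query string for that category item, using the same wdt:P3921 direct claim predicate that the counting query below uses:

SELECT ?query WHERE {
  wd:Q29789760 wdt:P3921 ?query .   # the stored "Wikidata SPARQL query equivalent" value
}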

That may seem like a lot of trouble to get this list, but that's not really the point. This query gives you a head start in developing more sophisticated queries on the topic.

When I wondered how many Wikimedia resources used this predicate, I found that the ones using it were easier to understand if they also had an rdfs:label value. So, I entered this query to count the subjects that had both:

SELECT (count(*) as ?count) WHERE { 
  ?s wdt:P3921 ?o ;
     rdfs:label ?label .
  }

Two weeks ago there were 316, but as I write this there are almost a hundred more, so the number is growing at a good pace.
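If you'd rather browse these resources than just count them, a small variation on that query--again, just a sketch--lists each one with an English label next to its stored query:

SELECT ?s ?label ?query WHERE {
  ?s wdt:P3921 ?query ;
     rdfs:label ?label .
  FILTER ( lang(?label) = "en" )   # one label per row, in English
}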

The idea of a SPARQL query as an object in an RDF triple is not new. It's part of the Shapes Constraint Language (SHACL), as demonstrated by one of its test cases. SHACL is a W3C specification that lets you specify constraints on data--for example, to validate that certain properties are required for instances of a particular class and that others are optional. (This is a lot more difficult using OWL.) I'll be looking at SHACL more closely in the coming months; meanwhile, I'll be keeping an eye on the SPARQL queries being added to Wikidata where we can retrieve them with our own SPARQL queries.


Please add any comments to this Google+ post.

23 April 2017

The Wikidata data model and your SPARQL queries

Reference works to get you taking advantage of the fancy parts quickly.


Last month I promised that I would dig further into the Wikidata data model, its mapping to RDF, and how we can take advantage of this with SPARQL queries. I had been trying to understand the structure of the data based on the RDF classes and properties I saw and the documentation that I could find, and some of the vocabulary discussing these issues confused me--for example, RDF is about describing resources, but I was seeing lots of references to entities, which can mean slightly different things in different branches of computer science. But, as Daniel Kinzler explained to me, "The Wikidata (or technically, Wikibase) data model is not defined in terms of RDF"; RDF standards were used to describe the Wikibase model that was developed independently of W3C standards.

Wikibase, as described by its home page, "is a collection of applications and libraries for creating, managing and sharing structured data...Wikibase was developed for and is used by Wikidata, the free knowledge base and Wikipedia, the encyclopedia that anyone can edit." The same page describes Wikidata as one of the projects powered by Wikibase, along with the Europeana EAGLE project and DroidWiki.

The Wikibase/DataModel document is fairly long and detailed, and I would suggest starting instead with the Wikibase/DataModel/Primer. The Primer describes how "Entities are the basic elements of the knowledge base" and how "there are two predefined kinds of Entities: Items and Properties" (both of which RDF people consider to be resources). The document goes on to describe the information that can be associated with items and properties.

I had originally found their RDF Dump Format document abstruse and confusing, but it was easier to follow after I read the Wikibase data model primer because I had a better idea of the dump format's basis. It's even easier to follow if you just skim the Dump Format document to get a general idea of what it covers and then go to the Wikidata query service/User Manual, where you'll get an even faster start querying Wikidata. (Their sample queries that I described last month also help a lot.) The User Manual describes the declared prefixes, some nice tricks for taking advantage of different kinds of labels, how to work with geo data, available endpoints that you can federate into your queries, and more. It also provides more context for understanding the Dump Format document.

The Data Model document describes the fundamental role of statements in the Wikibase data model. (Longstanding members of the RDF community will enjoy Kingsley Idehen's continuation of my thread with Daniel, in which Kingsley insists that Wikidata is a collection of reified RDF statements, and Daniel says that, well, no, not really. They eventually agree to disagree.) The RDF Dump Format document describes two statement types that are important to how we treat Wikidata as an RDF repository but are also potentially very confusing. The first type is known as a truthy statement, or "direct claim"; these are simple triples that assert facts. The other statement type is the full statement, which is used to "represent all data about the statement in the system".

As one way to quickly recognize the difference, Wikimedia usually uses specific namespaces in specific places in both truthy and full statements. For example, the namespace http://www.wikidata.org/prop/direct/, abbreviated with the prefix wdt:, is usually used for the predicate of a truthy statement. (The Dump Format document has a nice list of all of these in its Predicates section. As you work with this data, you'll often go back to the Prefixes used section of the RDF Dump Format document and also to the Full list of prefixes section that follows it.)

Here's an example of the two kinds of statements that Daniel provided me: the triple {wd:Q64 wdt:P1376 wd:Q183} is a truthy triple saying that Berlin is the capital of Germany. Here is the full version of that statement:

wds:Q64-43CCD3D6-F52E-4742-B0E3-BCA671B69D2C a wikibase:Statement,
                 wikibase:BestRank ;
   wikibase:rank wikibase:PreferredRank ;
   ps:P1376 wd:Q183 ;
   prov:wasDerivedFrom wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 .

wdref:ba76a7c0f885fa85b10368696ab4ac89680aa073 a wikibase:Reference ;
   pr:P248 wd:Q451546 ;
   pr:P958 "Artikel 2 (1)" .
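Given the full version, you can ask for its pieces with a query. Here's a minimal sketch that assumes, as the RDF Dump Format document describes, that the http://www.wikidata.org/prop/ namespace (prefix p:) links an item to its statement nodes; it retrieves the statement node, its rank, and any reference behind Berlin's "capital of" value:

SELECT ?statement ?rank ?ref WHERE {
  wd:Q64 p:P1376 ?statement .                           # Berlin's "capital of" statement node
  ?statement ps:P1376 wd:Q183 ;                         # the statement's value: Germany
             wikibase:rank ?rank .                      # e.g. wikibase:PreferredRank
  OPTIONAL { ?statement prov:wasDerivedFrom ?ref . }    # the reference node, if any
}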
      

To understand this better, I wanted to see this for a different statement: the fact that bebop musician Tommy Potter played the bass. First, I clicked the "Wikidata item" link on Potter's Wikipedia page and substituted /entity/ for /wiki/, as I described in my February blog entry Getting to know Wikidata, to get the URI that represents him: http://www.wikidata.org/entity/Q1369941.

However, after doing this, finding the triple about the instrument he played wasn't as simple as you might think. A query for {wd:Q1369941 ?p ?o} (using the prefix substitution for brevity) retrieves all the triples about him, but they're the "truthy" ones, in which the predicates are known as direct claim predicates. Three of these triples described him as a Jazzbassist, a contrebassiste de jazz, and a contrabbassista statunitense, but none listed the "bass" as the instrument that he played in any language. I tried querying for triples that had these direct claim predicates as subjects, hoping to learn more about them--for example, whether they have rdfs:label values in different languages--but those queries showed very little information. It turned out that the actual Wikidata data model properties are separate resources, linked to the direct claim predicates by triples with a predicate of wikibase:directClaim, so to find them I had to look for triples that had the direct claim predicates as objects. When I then queried for triples that had these Wikidata data model properties as subjects, I found plenty.

To put these relationships to use, I entered the following query to find out more about Tommy Potter:

SELECT ?pname ?o ?olabel WHERE 
{
  wd:Q1369941 ?directClaimP ?o .          # Get the truthy triples.
  ?p wikibase:directClaim ?directClaimP . # Find the Wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates
  FILTER ( lang(?pname) = "en" )          # and their labels, in English.
  OPTIONAL {
     ?o rdfs:label ?olabel  
     FILTER ( lang(?olabel) = "en" )
  }
}

The result of this query is a mostly-human readable statement of facts about him. You could substitute the URI for just about any Wikidata entity as the subject in that first triple pattern to see information about that entity. You could also view the property names in other languages besides English, which is a big advantage of the Wikibase data model.

If you send your browser to the http://www.wikidata.org/entity/Q1369941 URI that represents Potter, you will get redirected to a Wikidata page with a nicely formatted human-readable version of data about Potter at https://www.wikidata.org/wiki/Q1369941. On the other hand, if you add .ttl (or .nt or .rdf) to the end of the /entity/ version of the URI, you'll get RDF of all the data about Potter, including the full representations with triples that use terms such as wikibase:BestRank and prov:wasDerivedFrom, just like the full version of the data above about Berlin being the capital of Germany.

After looking at the full data about Potter, I found that some queries meant to explore it further returned less than I expected. I eventually learned from the WDQS data differences section of the RDF Dump Format document that "Data nodes (wdata:Q2) are not stored... This is done for performance reasons."

After all this exploration, I still haven't gotten to the kinds of structural queries I've been planning on--for example, looking for instances based on their class's relationship(s) to other classes. The Stack Exchange question How to include sub-classes in a Wikidata SPARQL query?, which has a solid answer, looks pretty inspirational. I'm looking forward to playing with it.
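As a first taste, here's a minimal sketch of the property path pattern from that answer. I'm using wd:Q34379, which as far as I can tell is the item for "musical instrument" (substitute whatever class you actually care about); the query asks for items that are instances of that class or of any of its subclasses, however many levels down:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q34379 .   # instance of the class or of any subclass of it
  ?item rdfs:label ?itemLabel .
  FILTER ( lang(?itemLabel) = "en" )
}
LIMIT 20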

Meanwhile, as you use SPARQL to play with Wikidata, you're going to see a lot of cryptic resource names, like wdt:P279 in the Stack Exchange answer, and you'll wonder what their human-readable names are. I created the form below to help me with the prefixes I used the most. You can use this form yourself (for example, enter P279 in the wdt: field and press Enter), but you'd probably be best off copying it from this page's source into your own page that you can customize.
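If you'd rather stay in the query editor, the wikibase:directClaim trick from the Tommy Potter query above works here too; this little sketch should return the English label of the Wikidata property behind wdt:P279:

SELECT ?label WHERE {
  ?p wikibase:directClaim wdt:P279 ;   # find the data model property behind the
     rdfs:label ?label .               # direct claim predicate, then get its label
  FILTER ( lang(?label) = "en" )
}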

It turns out that wdt:P279 means "subclass of". This is something I'll certainly be getting to know better in the future.

(Lookup form with fields for the wd:, wdt:, and p: prefixes.)

Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Archives

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists