Creating wide CSV files with SPARQL

Lots of columns and commas, but all in the right place.

I recently decided to copy my address book, which I have in an RDF file, to Google Contacts. The basic steps are pretty straightforward:

  1. In Google Contacts, create an entry with test data in every field: TestGivenName, TestFamilyName, testemail@whatever.com, and so forth.

  2. Export the contacts as a CSV file. The "preview" version of Google Contacts that is currently the default doesn't offer this yet, but you can "go to old version" and then find Export on the More drop-down menu.

  3. In the exported CSV, move the test entry created in step 1 to the second line, just under the field names.

  4. Using the field names and test entry as a guide, write a SPARQL query that returns the relevant information from the RDF address book file in the order shown in the exported file.

  5. Execute the query, requesting CSV output.

  6. Replace the query output's header row with the header row from the original exported file, and then import the result into Google Contacts.

Step 4 seemed a bit intimidating. With something like 88 columns in step 2's exported CSV, I knew that messing up one comma (for example, putting the 47th piece of information after the 47th comma instead of before it) would mess up all the information after it. I have made plenty of mistakes like this when creating wide-body CSV before.

I had a great idea, though, that made it much simpler: I created the SELECT statement from the first line of the exported CSV. I copied that line to a text editor, replaced the spaces in the field names with underscores, removed the hyphens (not allowed in SPARQL variable names), and then replaced each comma with a space and a question mark to turn the name after it into a variable name. Finally, I manually added a question mark to the very first name (the global replace in the previous step didn't do that because there was no comma there) and added the word SELECT before it, and I had the SELECT statement that my query needed.
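For example, here is a condensed version of that transformation using just a few of the field names (the real header line has the 88 or so of them). A line like this:

Name,Given Name,Additional Name,Family Name,Custom Field 3 - Type,Custom Field 3 - Value

ends up as this:

SELECT ?Name ?Given_Name ?Additional_Name ?Family_Name ?Custom_Field_3__Type ?Custom_Field_3__Value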

This way, before I'd even begun implementing the logic to pull each piece of data out of the address book RDF, I knew that when I did, the values would come out in the right places.

Adding two bits of that logic to a WHERE clause gave me this:

PREFIX  v: <http://www.w3.org/2006/vcard/ns#>

SELECT ?Name ?Given_Name ?Additional_Name ?Family_Name ?Yomi_Name
       ?Given_Name_Yomi ?Additional_Name_Yomi ?Family_Name_Yomi ?Name_Prefix
       ?Name_Suffix ?Initials ?Nickname ?Short_Name ?Maiden_Name ?Birthday
        # 21 more lines of variable names
       ?Custom_Field_2__Value ?Custom_Field_3__Type ?Custom_Field_3__Value
WHERE {
          ?entry v:family-name ?Family_Name . 
          ?entry v:given-name  ?Given_Name .
}

When I ran arq with this command,

arq --query addrbook2csv.rq --data addrbook.rdf --results=CSV

it gave me CSV output with the ?Family_Name and ?Given_Name values right where they needed to be for Google Contacts to import them properly.

I wish I could say that the rest of the query development was just a matter of adding triple patterns like the ?Family_Name and ?Given_Name ones shown above, but it got more complicated because of the ad hoc structure of my address book data. I needed a UNION, lots of OPTIONAL blocks, and even some nested OPTIONAL blocks that I'm not proud of. Still, I was a bit proud that I came up with this simple way to make sure that all the values came out in the right places in this fairly complicated target output.
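The specifics depend on the shape of the data, but here is a small sketch of the OPTIONAL and UNION pattern. The v:nickname and v:email properties and the two alternative email structures are assumptions for illustration; they aren't necessarily how any given address book file models these values:

PREFIX v:   <http://www.w3.org/2006/vcard/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?Given_Name ?Family_Name ?Nickname ?Email
WHERE {
    ?entry v:family-name ?Family_Name .
    ?entry v:given-name  ?Given_Name .
    # OPTIONAL keeps entries that lack a nickname in the results;
    # for them, ?Nickname is unbound, which comes out as an empty
    # CSV field.
    OPTIONAL { ?entry v:nickname ?Nickname . }
    # When the same logical value is stored in two different shapes
    # (here, an email address as a plain literal or hanging off an
    # intermediate node), a UNION inside the OPTIONAL matches either.
    OPTIONAL {
      { ?entry v:email ?Email . }
      UNION
      { ?entry v:email ?e .
        ?e rdf:value ?Email . }
    }
}

In the real query, each of these variables would be one of the 88 or so names in the big SELECT list, so its values would land in the right column no matter which branch of the pattern matched.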


Please add any comments to this Google+ post.