Linking information to "missing" information in SPARQL

Or, as the SQL people call it, doing outer joins.
flickr picture titled 'outer join'

Last month, I described the SPARQL approach to some basic relational database queries in SPARQL and relational databases: getting started. Today I want to talk about SPARQL's equivalent of the SQL outer join, a bit of syntax that lets you add the phrase "and these corresponding fields if they're there" to a query. I'll use the same "world" database that I used in that posting's examples.

First, to review a simple join, the following SQL query asks for the names of the countries as well as the names of the cities that are the capitals of those countries:

      SELECT country.name, city.name 
      FROM country, city
      WHERE country.capital = city.id;
    

This query returns 232 rows, beginning with these five:

 Afghanistan                            Kabul
 Netherlands                            Amsterdam
 Netherlands Antilles                   Willemstad
 Albania                                Tirana
 Algeria                                Alger

If we ask the world database to just list country names, like this,

SELECT name FROM country;

we get 239 rows, because the result includes "countries" that have no capital such as Antarctica and Bouvet Island. So how would you tell an SQL system "Show me all the countries, and if they have them, their capitals"? Using a technique called an outer join:

      SELECT country.name, city.name
      FROM country LEFT OUTER JOIN city ON
      country.capital = city.id;
    

A left outer join asks for all relevant columns in the table on the "left" (that is, the first one mentioned in the query) and any information from the other table that matches. (A right outer join asks for all columns from the second table and any relevant ones from the first one, and a full outer join asks for all relevant columns from either table.) The result of our sample left outer join query includes all the rows from our first query above with seven extra rows for the "countries" that have no capital. These have NULL listed where the capital name would go, as this excerpt shows:

 Virgin Islands, U.S.                          Charlotte Amalie
 Zimbabwe                                      Harare
 Palestine                                     Gaza
 Antarctica                                    NULL
 Bouvet Island                                 NULL

How can we do this in SPARQL? I found it to be easier than the SQL version, which took me several tries to run with no syntax mistakes.

First, let's review the SPARQL version of the first SQL query above, which asks for the countries in the database that have capitals, listed with those capitals:

      SELECT ?countryName ?cityName
      WHERE {
        ?s1 vocab:country_Name ?countryName;
            vocab:country_Capital ?capital.
        ?s2 vocab:city_Name ?cityName;
            vocab:city_ID ?capital.
      }
    

Like the SQL version, it pulls out 232 rows. The SPARQL version of "list all the countries and, if they're there, the capitals" looks like this:

      SELECT ?countryName ?cityName
      WHERE {
        ?s1 vocab:country_Name ?countryName.

      OPTIONAL {
        ?s1 vocab:country_Capital ?capital.
        ?s2 vocab:city_Name ?cityName;
            vocab:city_ID ?capital.
        }
      }
    

The corresponding excerpt looks like this, with hyphens showing where it found no values:

"Virgin Islands, U.S."  "Charlotte Amalie"
"Zimbabwe"              "Harare"
"Palestine"             "Gaza"
"Antarctica"            -
"Bouvet Island"         -

I think that the word OPTIONAL followed by a pair of curly braces is a much more intuitive way to say "and the following, if they're there" than "LEFT OUTER JOIN ON", but I'm increasingly biased.

5 Comments

I agree that "OPTIONAL" is better than "LEFT OUTER JOIN ON", but this is a pretty low standard, and I don't think you can reasonably claim that your SPARQL query as a whole is any more "intuitive" than the SQL version. It's strictly more complicated both syntactically and semantically, and I bet a reasonably adept SPARQL programmer (imagining that more than 10 of these exist) would take more trial and error to get it right than an equivalent SQL programmer would need to get the SQL version working.

Also, one of the victories SPARQL frequently claims over SQL is the elimination of joins. This is a misleading claim to begin with, since the reason you don't need most ordinary joins in SPARQL is that they've have been moved into the data model. But even so, the fact that you had to change your SPARQL query to get the effect of a different SQL join shows that SPARQL hasn't actually eliminated joins.

Whereas (I say in the spirit of language-comparison), in Thread this query would be:

Country|Capital

That's all. "Country" means "get all nodes of type Country", "|Country" means "and for each of those nodes, calculate and return the results of following its Capital arc". It makes no difference whether a country has 1 capital, no captial, 25 capitals, etc. (Your SPARQL version will produce N *different* pairs for countries with N capitals, right? Ugly.)

And if you want to see it get even uglier, do the SPARQL query for finding those 7 countries without capitals.

Thread's version:

Country:!(.Capital)


With the word "Thread" referring to a connected series of postings on virtually any topic so often, it's very difficult to find out more about this query language with a web search, so it's an unfortunate choice of name. Can you point me to implementations and data to query with those implementations so that I can try it?


The bigger problem with searching for info on Thread, in this case, is that it's part of a project still under development! So you can't try it out yet, and the spec hasn't been published. There's a blog post about it at

http://www.furia.com/page.cgi?type=log#id311

based on a talk I did at the Web 3.0 conference last month, and I gave some other examples in comments on your movie-query SPQARL post. It should be available for first public experimentation in Q1 some time. Possibly this makes it too annoying for me to be talking about it in advance at all, but I'm impatient, and I think most query languages get discussed before they're built, so hopefully that moves it to within the bounds of acceptability.

But if you'd rather not have me distracting your SQL/SPARQL contrast with a phantom alternative, just say so and I'll leave you alone!


To be honest, while language elegance is certainly a great thing, I'm not going to defend SPARQL on elegance points. Its main appeal to me is the increasing amount of data and implementations available, so while I've obviously been discussing its syntax, that's really just a means to an end: helping people take advantage of all that data.

I will keep an eye on http://www.furia.com. (In fact, to gather data to query against with your Thread implementation, you may very well end up using SPARQL!)


This issue gets complicated if someone fills the capital field with something when the country region doesn't really have a capital. There are no good universal standards for how to indicate a null and/or missing and/or unknown value.

Also some countries have multiple capitals. You see a couple examples just in the sample data above. Palestine's capital is Gaza if you use the location of the Palestine government's heaqdquarters as the capital. But the Palestinians think that Jerusalem is their capital, and the offices in Gaza are just a temporary expedient. A less exotic example is the Netherlands: Amsterdam is its capital for most purposes, but some parts of the government are headquartered in The Hague.