SPARQL at the movies

Using SPARQL to answer a few questions that IMDB won't help much with.

Last week I mentioned that the Linked Movie Database SPARQL endpoint would be fun to play with, and it has been. I used my spreadsheetSPARQL interface to send the following queries to their SPARQL endpoint.

One degree of Kevin Bacon

The following lists all the actors who have appeared in a movie with Kevin Bacon. Or, in more SPARQLy terms, it says "show me ?actorName (with no repeats) where ?kb is the ID for Kevin Bacon, and a given movie has ?kb and ?actor in it, and ?actor has the name ?actorName, but don't show me ?actorName if the actor is ?kb" (that is, don't list Kevin himself):


  ?kb    <> "Kevin Bacon".

  ?movie <> ?kb;
         <> ?actor.

  ?actor <> ?actorName.

  FILTER (?kb != ?actor).

I won't show you all of the 240 names that get returned, but here are the first few:

Vincent D'Onofrio
Daniel Stern
John C. Reilly
J. T. Walsh
Michael Gross
William Windom
Michael Tucker
Stephen Lang

Versatile actor(s)

Which actors have appeared in both a John Waters movie and a Steven Spielberg movie? (Assign the URI for each director to a variable, find the URI for any actors who worked with both directors, and get the actors' names.)


  ?dir1      <> "John Waters".

  ?dir2      <> "Steven Spielberg".

  ?dir1movie <> ?dir1;
             <> ?actor.

  ?dir2movie <> ?actor;
             <> ?dir2.

  ?actor     <> ?actorName.

This query returns only one result: Darren E. Burrows, who was in both Amistad and Cry-Baby. (Fans of the old television show "Northern Exposure" might remember him as Ed Chigliak.)

Woody Allen's favorite actors

To list everyone who had ever been in a Woody Allen movie, I might use the DISTINCT keyword so that each was only listed once, but I wanted the repetition so that I could see who had been in how many of these movies:

SELECT ?actorName WHERE {

  ?woody  <> "Woody Allen".

  ?movie  <> ?woody;
          <> ?actor.

  ?actor <> ?actorName.

Adding count(?actorName) after ?actorName caused an error, so either this SPARQL extension isn't supported by the Linked Movie Database's SPARQL implementation or I was doing something wrong. Either way, I got what I wanted by copying the output of the query above to a file called waactors.txt and piping that through the following at a Windows command prompt:

type waactors.txt | sort | uniq -c | sort /r

This sorts the names, reduces the list to show each name once with a count of how many times it occurred, and then does a reverse sort to put the names that occurred the most at the top of the list. (The Linux equivalent would be "cat waactors.txt | sort | uniq -c | sort -r".) The resulting list began with these names:

     26 Woody Allen
     12 Mia Farrow
      7 Diane Keaton
      6 Julie Kavner
      5 Dianne Wiest
      4 Louise Lasser
      3 Tony Roberts
      3 Scarlett Johansson
      3 Judy Davis
      3 Alan Alda

Now we have a clear picture of who his favorite actor is. When I saw that the voice of Marge Simpson beat everyone but Allen himself and his two most famous leading ladies, I wondered what six movies she appeared in. The answer was easy to find out:

SELECT  ?movieName WHERE {

  ?woody  <> "Woody Allen".

  ?actor <> "Julie Kavner".

  ?movie  <> ?woody;
          <> ?actor;
          <> ?movieName.

The answer:

Don't Drink the Water
Deconstructing Harry
Hannah and Her Sisters
New York Stories
Radio Days
Shadows and Fog

Moving further down the list of Allen's favorite actors, I'm going to go out on a limb and predict that Scarlett Johansson moves ahead of Tony Roberts before Alan Alda does.


Actually, you have to use "sort -n -r" on Posix systems to force the first field to be interpreted numerically. (I do a lot of this sort of thing.)

Hi John,

I didn't need the -n, I assume because the uniq -c output right-justifies the numbers with leading spaces that make the numeric sort come out the way I want it.

Very cool that you actually managed to get computed answers to these questions.

Not so cool, it seems to me, that the questions are so cumbersome to frame.

I will now tantalize (appall? annoy?) you with what these queries would look like in Thread, the path-based query-language I'm working on at ITA:

"actors who have appeared in a movie with Kevin Bacon":

Actor:=Kevin Bacon.Movie.Actor

"actors who have appeared in both a John Waters movie and a Steven Spielberg movie":

Actor:(.Movie.Director:=John Waters:=Steven Spielberg)

"actors who have appeared in Woody Allen movies":

Director:=Woody Allen.Movie.Actor

"actors who have appeared in Woody Allen movies, with counts"

Actor|Appearances=(.Movie:(.Director:=Woody Allen)._Count):Appearances>0#Appearances

or, more cleverly:

Director:=Woody Allen.Movie/Actor|Movies=(.Nodes._Count)#Movies

You can probably get the gist of most of these without a very detailed explanation of syntax. Except the last one, which is the most fun:

Director - get all nodes of type Director
:=Woody Allen - narrow this list down to the one named Woody Allen
.Movie - get all this director's movies
/Actor - group these movies by actor (each movie will appear in multiple groups...)
|Movies=(.Nodes._Count) - calculate the number of grouped movies in each actor's group, and call this "Movies"
#Movies - and sort the set of actor/movie-groups by these counts

No real points until it's publicly available, obviously. But intriguing, maybe?

Oh, sorry, the Waters/Spielberg example would be:

Actor:(.Movie.Director:=John Waters,=Steven Spielberg._Count:=2)


Actor:(.Movie.Director:=John Waters):(.Movie.Director:=Steven Spielberg)

or, for that matter

Director:=John Waters,=Steven Spielberg/(.Movie.Actor):(.nodes._Count:=2)

Thanks for the interesting post. The SPARQL endpoint has been shifted back to the D2RQ server:

well… i visit your website first time and found this site very usefull and intresting !

well… you guys doing nice work and i just want to say that keep rocking and keep it up !!!!