Finding Europeana audio with SPARQL

And video!

When I first heard about the SPARQL endpoint for the Europeana aggregation of data about European cultural artifacts, the first example mentioned was an MP3 audio file of a Slovenian version of O sole mio. I happened to be in the middle of packing for a family visit over Christmas and immediately tweeted "Lots of holiday stuff to do, but the new Ontotext Europeana SPARQL endpoint points to MP3s! So tempting..." This past Sunday morning I finally made some time to explore it more, and I found 6,219 audio files.

The following query pulls down data about 100 of them (which 100 you pull depends on the OFFSET value). This XSLT stylesheet converts the SPARQL XML version of the query results to a simple HTML file that shows the title, creator, and source of each one, with each title being a hypertext link to the audio file itself. Following some of these links, I found folk music, classical music, interviews, and plenty of Finnish spoken word material where I had no idea what they were saying.

Here is the query itself:

PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/> 

SELECT ?title ?mediaURL ?creator ?source WHERE {
  ?resource edm:type "SOUND" ;
            ore:proxyIn ?proxy ;
            dc:title ?title ;
            dc:creator ?creator ;
            dc:source ?source .
  ?proxy edm:isShownBy ?mediaURL .
}
OFFSET 600
LIMIT 100

This link runs the query with an offset of 3000, and this web page shows the result of running the stylesheet on the query results when run with an offset of 600 as above. As you'll see and hear by following that page's links, that batch seems to be mostly Norwegian folk music.
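
One caveat about paging: the query has no ORDER BY clause, and the SPARQL specification does not guarantee that different OFFSET values select predictable, non-overlapping batches unless the results are sorted. The following variant is a minimal sketch of stable paging. Sorting on ?mediaURL is an assumption made for illustration, and any variable that is bound in every row would do.

```sparql
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?mediaURL ?creator ?source WHERE {
  ?resource edm:type "SOUND" ;
            ore:proxyIn ?proxy ;
            dc:title ?title ;
            dc:creator ?creator ;
            dc:source ?source .
  ?proxy edm:isShownBy ?mediaURL .
}
# Assumption: sort key chosen only to make successive pages repeatable.
ORDER BY ?mediaURL
OFFSET 600
LIMIT 100
```

Sorting a large result set can be slower than an unsorted query, so this trades some speed for repeatable batches.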

A few notes:

  • As I mentioned in the tweet, it's running Ontotext's OWLIM triplestore. This made it the first large public endpoint that I've seen with SPARQL 1.1 support, which was great to see. I didn't need any 1.1 features for the query above, but did for others on my way there—for example, to find out that there were 6,219 audio files.

  • About half of the audio URLs had "mp3" at the end. When I tried some of the audio URLs that didn't, they seemed to play audio just fine, but there may be some that don't link to playable audio.

  • The proxy parts of the query deal with a level of indirection that was necessary because the site federates data from other sites. Documentation of the data model is available (well, it isn't reachable on the morning of January 13th, but Google has a cached copy), but I got to the query above by various hit-and-miss experiments starting with one that looked for resources whose names ended with ".mp3".

  • The web-based front end to the Europeana SPARQL endpoint did some nice parentheses matching and color-coding of syntax as I entered queries. It doesn't compare with TopBraid Composer's SPARQL view, which has command completion and other IDE-oriented features, but it was impressive for a field on a web form.
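
Two of the notes above mention queries that aren't shown in this post: the count of audio files and the early experiment with names ending in ".mp3". The sketches below are plausible reconstructions, not the queries actually run. The first uses the SPARQL 1.1 COUNT aggregate. Counting distinct resources and the variable name ?audioCount are assumptions.

```sparql
PREFIX edm: <http://www.europeana.eu/schemas/edm/>

# SPARQL 1.1 aggregate: how many resources are typed as audio?
SELECT (COUNT(DISTINCT ?resource) AS ?audioCount) WHERE {
  ?resource edm:type "SOUND" .
}
```

The second looks for any triple whose object is a URI ending in ".mp3", using the SPARQL 1.1 STRENDS function. Checking the object position rather than the subject is an assumption. Because the pattern is unconstrained, a query like this may be slow or time out on a large store, so the LIMIT matters.

```sparql
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  # Keep only URI objects whose string form ends in ".mp3".
  FILTER(isIRI(?o) && STRENDS(STR(?o), ".mp3"))
}
LIMIT 20
```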

There is plenty more metadata available in addition to the title, creator, and source that my query requests for each resource; I encourage you to try variations on the query to explore it. Other possible edm:type values are TEXT, IMAGE, VIDEO and 3D. (The two 3D resources were a 70-meg two-page PDF and a 59-meg eight-page one, each showing a church in Cyprus. Viewed with Adobe Reader, some of the images could be rotated, I think.)
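
To see the edm:type values and how many resources carry each one, a SPARQL 1.1 grouping query along these lines should work. It is a sketch: the variable names are made up for illustration, and it assumes the types are stored as plain literals, as in the query above.

```sparql
PREFIX edm: <http://www.europeana.eu/schemas/edm/>

# One row per edm:type value, with the number of resources that have it.
SELECT ?type (COUNT(?resource) AS ?resourceCount) WHERE {
  ?resource edm:type ?type .
}
GROUP BY ?type
ORDER BY DESC(?resourceCount)
```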

As a SPARQL geek's alternative to YouTube, the 166,872 resources with an edm:type value of "VIDEO" are a tempting way to kill some time. Just substitute "VIDEO" for "SOUND" in the query above and you'll be off and running. (Don't forget that LIMIT keyword, though—be polite and don't ask for too much at once.)
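
For reference, here is what that substitution might look like. It is a sketch with one deliberate variation: dc:creator and dc:source are wrapped in OPTIONAL on the assumption that some video records lack them, so those records are not silently dropped. Whether that is true of the Europeana data is not something this post establishes.

```sparql
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?mediaURL ?creator ?source WHERE {
  ?resource edm:type "VIDEO" ;
            ore:proxyIn ?proxy ;
            dc:title ?title .
  ?proxy edm:isShownBy ?mediaURL .
  # Assumption: creator and source may be missing for some videos,
  # so they are optional here rather than required as in the audio query.
  OPTIONAL { ?resource dc:creator ?creator }
  OPTIONAL { ?resource dc:source ?source }
}
LIMIT 100
```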


Please add any comments to this Google+ post.