21 January 2010

Using the ARQ SPARQL processor from the command line

With the Jena extensions.

I recently described how to execute Federated SPARQL queries that use Jena extensions that we'll hopefully see added to the SPARQL 1.1 standard. I showed a sample query and suggested that you try it at the sparql.org RDF Query Demo page.

For local, command-line use of SPARQL, I've used the Jena ARQ query engine for years, but my sample federated query didn't work with it, and now I know why: the sparql.bat file that comes with the distribution invokes the processor in a strictly standards-compliant mode without the extensions enabled. I thought I'd have to write and compile some Java code to use the extensions, but my co-worker Jeremy Carroll pointed out that the sparql.bat file in ARQ's bat subdirectory runs the arq.sparql class, like this,

java -cp %CP% arq.sparql %*

and that running the arq.arq class instead enables the extensions. Then I noticed the arq.bat file in the same directory as sparql.bat, and this is exactly what it does. There are more batch files in there, and a web search on their names led me to an ARQ - Command Line Applications documentation page, which will be handy.

Using arq.bat instead of sparql.bat, the sample federated query works as written (tested with ARQ 2.8.2), and so do LET assignment and extension functions, making it possible to use ARQ in real semantic web application development with no need to do Java coding around the Jena API.

(Thanks again, Jeremy!)

12 January 2010

Live stock ticker data in RDF

Well, on a 20-minute delay.

I've played with finance.yahoo.com's feed of CSV stock ticker data before and recently had an idea that was so simple that I'm surprised that no one's done it before: why not write a script that passes along a request for this data but converts the result to RDF before returning it? So I did.

A URL like http://www.rdfdata.org/cgi/stockquotes.cgi?symbols=BUD,IBM,SNE asks for recent ticker information about the stock symbols listed in the comma-separated value list. The stockquotes.cgi script adds the parameters to the appropriate stub to create a URL like http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&e=.csv&s=BUD,IBM,SNE, uses this URL to retrieve the CSV results, converts them to RDF/XML, and sends that back to the original requester with a MIME type of application/rdf+xml. The whole script, with white space and comments, wasn't even 100 lines. You can click the first link in this paragraph to see an example of it in action.
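The script itself isn't shown here, so what follows is only a minimal Python sketch of the same idea, not the actual stockquotes.cgi. The sq: namespace, the property names, and the column order assumed for the f=sl1d1t1ohgv format string (symbol, last price, date, time, open, high, low, volume) are all assumptions made for illustration, and the sample row is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical ontology namespace; the real service's ontology is not shown in the post.
SQ_NS = "http://example.org/ns/stockquote#"
# Assumed column order for the f=sl1d1t1ohgv format string.
FIELDS = ["symbol", "lastPrice", "date", "time", "open", "high", "low", "volume"]
# Feed URL stub quoted in the post; the symbol list gets appended to it.
STUB = "http://download.finance.yahoo.com/d/quotes.csv?f=sl1d1t1ohgv&e=.csv&s="


def qname(ns, local):
    """Return an ElementTree-style qualified name: {namespace}local."""
    return "{%s}%s" % (ns, local)


def yahoo_csv_url(symbols):
    """Append the requested ticker symbols to the feed URL stub."""
    return STUB + ",".join(symbols)


def csv_to_rdfxml(csv_text):
    """Convert CSV quote rows into an RDF/XML string, one resource per row."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("sq", SQ_NS)
    root = ET.Element(qname(RDF_NS, "RDF"))
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        desc = ET.SubElement(
            root,
            qname(RDF_NS, "Description"),
            {qname(RDF_NS, "about"): SQ_NS + row[0]},
        )
        for name, value in zip(FIELDS, row):
            ET.SubElement(desc, qname(SQ_NS, name)).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample row standing in for a live response from the feed.
    sample = '"IBM",125.50,"1/21/2010","4:00pm",126.00,127.10,125.00,7000000\n'
    print(yahoo_csv_url(["BUD", "IBM", "SNE"]))
    print(csv_to_rdfxml(sample))
```

A CGI wrapper around this would fetch the URL from yahoo_csv_url(), pass the response body to csv_to_rdfxml(), and return the result with a Content-Type of application/rdf+xml.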

I haven't done anything with the rdfdata.org domain name in a while, so I thought that would be a nice place for this. I've already used this little web service in a work-related demo that combines and cross-references RDF data from multiple sources, because after all, that's one of the things that RDF is so good at.

Is this a "semantic web service"? All it does is convert the data returned by a Yahoo feed into a different syntax and pass it along. I did throw together a little ontology to name the properties, but it doesn't add a lot of semantics. On the other hand, my script's output syntax is based on a semantic web standard, and it makes the data easier to use in semantic web applications, so I suppose it might count as a semantic web service.

I hope this is useful to others, and I hope that more people look for opportunities to convert live feeds of useful data in simple formats into live feeds of RDF.

4 January 2010

Federated SPARQL queries

Using a Jena extension.

Much of the promise of RDF and Linked Data is the ease of pulling data from multiple sources and combining it. I recently discovered the SERVICE extension that Jena adds to SPARQL, letting you send subqueries off to multiple SPARQL endpoints and then combine the results. Because a given SPARQL endpoint may be an interface to a triplestore or a relational data store or something else, the ability to query several endpoints with one query is very nice.

The Jena project's ARQ - Basic Federated SPARQL Query describes the use of this keyword. Before I start quoting from that page, I wanted to jump right in with an example that worked for me to pull birthday and spouse information about Arnold Schwarzenegger from DBpedia and a list of his movies and their release dates from Linked Movie Database in one query:

PREFIX imdb: <http://data.linkedmdb.org/resource/movie/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?birthDate ?spouseName ?movieTitle ?movieDate {
  { SERVICE <http://dbpedia.org/sparql>
    { SELECT ?birthDate ?spouseName WHERE {
        ?actor rdfs:label "Arnold Schwarzenegger"@en ;
               dbpo:birthDate ?birthDate ;
               dbpo:spouse ?spouseURI .
        ?spouseURI rdfs:label ?spouseName .
        FILTER ( lang(?spouseName) = "en" )
      }
    }
  }
  { SERVICE <http://data.linkedmdb.org/sparql>
    { SELECT ?actor ?movieTitle ?movieDate WHERE {
        ?actor imdb:actor_name "Arnold Schwarzenegger" .
        ?movie imdb:actor ?actor ;
               dcterms:title ?movieTitle ;
               dcterms:date ?movieDate .
      }
    }
  }
}

You can run this query yourself at the sparql.org RDF Query Demo page.

Before you start modeling your own queries on this, it's worth reading the Jena documentation page mentioned above, especially the "Performance Considerations" part:

This feature is a basic building block to allow remote access in the middle of a query, not a general solution to the issues in distributed query evaluation. The algebra operation is executed without regard to how selective the pattern is. So the order of the query will affect the speed of execution. Because it involves HTTP operations, asking the query in the right order matters a lot. Don't ask for the whole of a bookstore just to find book whose title comes from a local RDF file - ask the bookshop a query with the title already bound from earlier in the query.

As an example, both subqueries above specifically ask for information about Schwarzenegger instead of trying to scan the complete databases looking for matches.
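To make the "HTTP operations" point concrete, here is a rough Python sketch of what one such remote request could look like under the standard SPARQL Protocol, where the query text travels in a query parameter of a GET request. How ARQ actually issues its SERVICE requests isn't described here, so treat this as an illustration of the protocol, not of ARQ's internals; sparql_get_url is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query text URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


# A selective subquery: the actor's name is already bound, so the endpoint
# only has to return matches for one person, not its whole dataset.
selective = (
    "PREFIX imdb: <http://data.linkedmdb.org/resource/movie/>\n"
    'SELECT ?actor WHERE { ?actor imdb:actor_name "Arnold Schwarzenegger" . }'
)

url = sparql_get_url("http://data.linkedmdb.org/sparql", selective)
print(url)

# Decoding the URL recovers the original query text.
assert parse_qs(urlsplit(url).query)["query"][0] == selective
```

Everything the endpoint returns has to come back over the network, which is why a subquery with the name already bound is so much cheaper than one that asks for every actor and filters locally.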

Two parts of this trick are non-standard SPARQL, but may become part of SPARQL 1.1: subqueries and the SERVICE keyword. As a Lee Feigenbaum slide about the latter points out, the SPARQL Working Group is using ARQ's SERVICE keyword as a starting point in thinking about how a query can target multiple endpoints.

My query above of the two different SPARQL endpoints also works from within TopQuadrant's TopBraid Suite of products, so I'm sure I'll be using this on work-related projects more and more.
