8 March 2014

Easier querying of strings with RDF 1.1

In which a spoonful of syntactic sugar makes the string querying go down a bit easier.

If it looks and walks and talks like a string...

The recent publication of RDF 1.1 specifications fifteen years and three days after RDF 1.0 became a Recommendation has not added many new features to RDF, although it has made a few new syntaxes official, and there were no new documents about the SPARQL query language. The new Recommendations did clean up a few odds and ends, and one bit of cleanup officially removes an annoying impediment to straightforward querying of strings.

Near the beginning of chapter 5 of my book Learning SPARQL, I wrote

Discussions are currently underway at the W3C about potentially doing away with the concept of the plain literal and just making xsd:string the default datatype, so that "this" and "this"^^xsd:string would mean the same thing.

When dealing with the difference between simple literals and those that were explicitly cast as xsd:string values, casting in one direction or the other with the str() and xsd:string() functions gave us a workaround, but once all the query engines catch up with RDF 1.1 we won't have to work around this anymore.

The 2011 document StringLiterals/LanguageTaggedStringDatatypeProposal describes the problem in more detail, but here's a short example. Imagine that you want to query for the author of one of the works listed in these triples:

@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ls:  <http://learningsparql.com/id#> . 

ls:i1001 dc:creator "Jane Austen" ;
         dc:title "Persuasion" .
ls:i1002 dc:creator "Nathaniel Hawthorne" ;
         dc:title "The Scarlet Letter"^^xsd:string .

For example, let's say you want to know who wrote "The Scarlet Letter" and you enter this query:

PREFIX dc:  <http://purl.org/dc/elements/1.1/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 

SELECT ?author WHERE { 
  ?work  dc:title "The Scarlet Letter" ; 
         dc:creator ?author . 
}

Using a SPARQL engine that was strictly compliant with RDF 1.0, this query wouldn't find anything, because the dc:title value of ls:i1002 is the typed literal "The Scarlet Letter"^^xsd:string and not the untyped string that the query was looking for. If a similar query asked for the author of "Persuasion"^^xsd:string, it wouldn't find anything, because the query is looking for a string that has been explicitly typed as an xsd:string, and in the data the value is an untyped literal.
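The str() workaround mentioned earlier can be sketched with a small Python helper that builds a query comparing lexical forms instead of literal terms. The function names here are hypothetical and only illustrate the idea; note that str() also drops language tags, so this filter would match "The Scarlet Letter"@en as well.

```python
def escape_sparql_string(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def author_query_by_title(title):
    """Build a query that matches a title by its lexical form.

    str(?title) strips any datatype, so the FILTER succeeds whether the
    stored value is a simple literal or an xsd:string typed literal.
    """
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?author WHERE {\n"
        "  ?work dc:title ?title ;\n"
        "        dc:creator ?author .\n"
        f'  FILTER(str(?title) = "{escape_sparql_string(title)}")\n'
        "}"
    )


if __name__ == "__main__":
    print(author_query_by_title("The Scarlet Letter"))
```

The price of this workaround is that the engine has to test every dc:title value instead of looking up a single term, which is one reason the RDF 1.1 cleanup is welcome.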

This, in fact, is what happens with release 2.6.4 of Sesame, the version currently on my hard disk. Sesame is now up to 2.7.10, and, seeing the change coming, may have accounted for it by now. ARQ and the TopBraid platform stopped distinguishing between simple literals and typed string literals several years ago.

Treating the simple literal and typed string versions of a string as the same thing is now officially what's supposed to happen. According to section 3.3 of the new RDF 1.1 Concepts and Abstract Syntax Recommendation, "Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string". In other words, if it looks and walks and talks like a string, treat it like a string.
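To make the difference concrete, here is a minimal Python sketch that models a literal as a pair of lexical form and datatype IRI. It is a toy model, not how any particular engine implements term comparison, and it leaves out language-tagged strings entirely; all of the function names are made up for this illustration.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def literal(lexical, datatype=None):
    """Model a literal as a (lexical form, datatype IRI or None) pair."""
    return (lexical, datatype)


def rdf11_normalize(lit):
    """Apply the RDF 1.1 rule: a simple literal has datatype xsd:string."""
    lexical, datatype = lit
    return (lexical, datatype or XSD_STRING)


def same_term_rdf10(a, b):
    """RDF 1.0-style comparison: a missing datatype differs from xsd:string."""
    return a == b


def same_term_rdf11(a, b):
    """RDF 1.1-style comparison: normalize both sides first."""
    return rdf11_normalize(a) == rdf11_normalize(b)


if __name__ == "__main__":
    stored = literal("The Scarlet Letter", XSD_STRING)
    queried = literal("The Scarlet Letter")
    print(same_term_rdf10(stored, queried))  # False
    print(same_term_rdf11(stored, queried))  # True
```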

With this update, there's nothing to hold back other SPARQL engines from treating simple literals and typed string literals the same way. This is going to make the development of a lot of SPARQL queries a little bit simpler.


Please add any comments to this Google+ post.

9 February 2014

Querying my own MP3, image, and other file metadata with SPARQL

And a standard part of Ubuntu.

Ubuntu has a utility called Tracker that makes it easy to search your hard disk, a bit like the old Google Desktop with a few extra features. One extra feature ranks among the coolest SPARQL applications I've ever seen: the ability to execute SPARQL queries against data extracted from files on your hard disk.

Anarchy paper lantern

To install it, I did a sudo apt-get install of tracker-gui to get the base parts of tracker and then did a similar installation of tracker-utils to get the SPARQL query utility. Next, I added the Ubuntu applications "Desktop search" and "search and indexing" as applications and used the latter to search and index 94 GB of MP3s and some image files. The indexing took a few hours. (tracker-control -S was a handy command for checking on the indexing progress.) The worldofgnome.org page Indexing preferences in GNOME 3.8 was helpful for understanding the indexing options.

Once the file metadata is indexed, the tracker-sparql command-line utility lets you query it. For example, the following runs the query stored in bea.spq against the metadata:

tracker-sparql -f bea.spq

(The tracker-sparql help said that I was also supposed to include -q to show that it was a SPARQL query, but it seemed to work fine without this command line switch.) The following shows bea.spq, a query for artist names that begin with "Bea", allowing for an optional "The " before that:

PREFIX nmm: <http://www.tracker-project.org/temp/nmm#>
SELECT DISTINCT ?artistName WHERE {
  ?artist a nmm:Artist .
  ?artist nmm:artistName ?artistName .
  FILTER(regex(?artistName, "^(The )?Bea"))
}

Here is the output:

Results:
  Beachwood Sparks
  Beastie Boys/Beck/Dust Brothers
  Beastie Boys/Dust Brothers
  Beatles
  The Beach Boys
  The Beastie Boys
  The Beatles
  The Beatniks
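To see what the pattern in that FILTER accepts and rejects, the same regular expression can be tried in Python. SPARQL's regex() uses XPath regular expression syntax rather than Python's, but for a pattern this simple the two behave the same way. The last three names in the list are hypothetical additions of mine, not values from my collection.

```python
import re

# Pattern copied from bea.spq: an optional "The " prefix, then "Bea".
PATTERN = re.compile(r"^(The )?Bea")

# The first four names come from the output above; the last three are
# hypothetical names added to show what the pattern rejects.
names = [
    "Beachwood Sparks",
    "Beatles",
    "The Beach Boys",
    "The Beatniks",
    "Count Basie",
    "The Bangles",
    "Captain Beefheart",
]

# Like regex() without the "i" flag, this match is case-sensitive.
matches = [name for name in names if PATTERN.search(name)]

if __name__ == "__main__":
    for name in matches:
        print(name)
```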

One frustrating thing about tracker-sparql is that it rejects certain queries because, as it tells us, "Unrestricted predicate variables not supported." In my experience, this meant that you couldn't have a variable in a triple pattern's predicate position if there was another one in the subject position. So, for example, while I know that the Dust Brothers have worked with the Beastie Boys and Beck separately, I've never heard of all of them working together, but I couldn't enter a query to see which work was created by an artist with a nmm:artistName value of "Beastie Boys/Beck/Dust Brothers". I did try dc:contributor, nmm:performer, and some other properties that were used to connect an artist to a work, but with no luck. (My guess: it was some sort of remix that combined a few Dust Brothers works.)

This was a fun query, asking what values of "genre" were stored in my MP3s:

PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
SELECT DISTINCT ?genre WHERE
{
  ?work nfo:genre ?genre
}

The results:

Results:
  Jazz
  Rock
  Classical
  New Wave
  Avantgarde
  Pop
  Salsa
  Blues
  Soundtrack
  RETRO SWING
  Swing
  Country
  Other
  Sound Clip
  jazz
  Latin
  Lo-Fi
  Rock & Roll
  Hip-Hop
  Techno-Industrial
  Euro-Techno
  Booty Bass
  Alternative
  Reggae
  Indian
  Podcast
  Electronic

This can lead to a real rabbit hole of additional queries as I wonder "what do I have in that category?" but I'll spare you that part.

tracker-sparql has a few command line options that are shortcuts to common queries for exploring a dataset. For example, -c lists classes, and gave me a list of 230. A query for distinct rdf:type values showed only 67 being used in my file metadata, so I assume that -c refers to classes that are declared in an internal schema. The tracker-stats utility shows how many instances each class has. (The "SEE ALSO" section of the help page for tracker-store had the best list I could find of the various tracker utilities.)

The tracker indexer also pulls fairly typical metadata out of image files. Unfortunately, it doesn't pull latitude and longitude data out when present, but it does let you add and query tag values in images. I played with this using the image file above, which shows a paper lantern with the anarchy symbol that I saw in San Francisco's Chinatown during the 2010 Semantic Technologies conference. Using the tracker-tag utility, I added a tag to the image like this:

tracker-tag --add=anarchy /my/path/semtech/2010/pics/IMG_5257.jpg

This added the following triples to the dataset:

@prefix nao:  <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#> .
@prefix tr:   <http://www.tracker-project.org/ontologies/tracker#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<urn:uuid:5aa32bbc-7f08-da08-3bbd-8ae6650411fb> nao:hasTag  
  <urn:uuid:a49c693c-d439-529b-8e27-296d589e905c> . 

<urn:uuid:a49c693c-d439-529b-8e27-296d589e905c>
  tr:added "2014-01-18T22:31:44Z" ;
  tr:modified 7170 ;
  rdf:type rdfs:Resource ;
  rdf:type  nao:Tag ;
  nao:prefLabel "anarchy" . 

The first triple says that the image resource has a particular tag, and the remaining triples tell us about that tag. It was nice to see that the tag is a resource and not just a string, so it can be renamed without losing its relationships with tagged resources. It also means that the tag itself can have additional metadata assigned to it such as skos:broader values to create a taxonomy hierarchy. And of course, there are all kinds of possibilities for SPARQL queries about what is tagged with what. (It would be fun to pull a set of nao:Tag resource triples into TopBraid EVN and really turn them into a proper SKOS taxonomy.)
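The benefit of modeling the tag as a resource can be sketched in a few lines of Python that treat the dataset as a set of (subject, predicate, object) tuples. This is only an illustration with hypothetical helper names, not how Tracker stores or updates its data; the new label "anarchism" is also made up.

```python
NAO = "http://www.semanticdesktop.org/ontologies/2007/08/15/nao#"
IMAGE = "urn:uuid:5aa32bbc-7f08-da08-3bbd-8ae6650411fb"
TAG = "urn:uuid:a49c693c-d439-529b-8e27-296d589e905c"

# A subset of the triples shown above, as (subject, predicate, object).
triples = {
    (IMAGE, NAO + "hasTag", TAG),
    (TAG, NAO + "prefLabel", "anarchy"),
}


def rename_tag(graph, tag, new_label):
    """Return a new graph with the tag's label replaced; links are untouched."""
    pref = NAO + "prefLabel"
    kept = {t for t in graph if not (t[0] == tag and t[1] == pref)}
    return kept | {(tag, pref, new_label)}


def tagged_with(graph, label):
    """Find the resources whose tag carries the given label."""
    pref, has_tag = NAO + "prefLabel", NAO + "hasTag"
    tags = {s for (s, p, o) in graph if p == pref and o == label}
    return {s for (s, p, o) in graph if p == has_tag and o in tags}


if __name__ == "__main__":
    renamed = rename_tag(triples, TAG, "anarchism")
    print(tagged_with(renamed, "anarchism"))
```

Only the nao:prefLabel triple changes; the nao:hasTag triple that connects the image to the tag stays exactly as it was.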

A few random closing notes:

  • I tried a few SPARQL 1.1 features like BIND and contains() with no luck, but the tracker-sparql help page does show that the count() function and SPARQL UPDATE are supported. I tried adding a triple with an UPDATE request, but I couldn't get it to work. If it were possible to add arbitrary triples about existing resources, we could store additional data about them such as the skos:broader values mentioned above and triples about the latitude and longitude where the picture was taken, which ExifTool can extract from image files. Apache Tika, which I've written about here before, would also be great to throw into the mix.

  • It's interesting that the resources were identified with URNs instead of URLs.

  • The Adrian Perez blog post Some Tracker + SPARQL bits has some good tips, and it points to two blog entries by Adrien Bustany that describe some nice predicate functions built into Tracker's SPARQL engine.

  • It was nice to see the Nepomuk ontology used here. Talk about a semantic desktop! (Since writing the first draft of this, I have learned that the next generation of Nepomuk is not using RDF, which I was sorry to hear.) It would be nice to see a schema for the Tracker-specific classes and properties; the http://www.tracker-project.org/ontologies base URI used for some of the namespaces currently doesn't go anywhere. (If someone can point me to such a schema, I'd be happy to update this.)

  • The metadata that the indexer pulled from a PDF on my hard disk included the complete text of the PDF stored using the nie:plainTextContent property. That could be very useful for searches and text extraction.

Playing with this dataset, I could stay busy for hours even if I limited myself to SPARQL queries about my own MP3s. Assigning, querying, and curating tags would also be a lot of fun to play with; while I assigned one to a JPEG file above, tags can be assigned to any resource. For example, imagine running some text analytics on nie:plainTextContent values to come up with tag values to assign to that PDF. And, if music files have an artist property and PDFs have a plainTextContent property, there are probably plenty of other properties that are specific to certain file types and reveal interesting things about them, especially when queried with SPARQL to find patterns among the values of the files in your own collection.


7 January 2014

Storing and querying RDF in Neo4j

Hands-on experience with another NoSQL database manager.

In the typical classification of NoSQL databases, the "graph" category is one that was not covered in the "NoSQL Databases for RDF: An Empirical Evaluation" paper that I described in my last blog entry. (Several were "column-oriented" databases, which I always thought sounded like triple stores; the "table" part of the way people describe these always sounded to me like a stretched metaphor designed to appeal to relational database developers.) A triplestore is a graph database, and Brazilian software developer Paulo Roberto Costa Leite has developed a SPARQL plugin for Neo4j, the most popular of the NoSQL graph databases. This gave me enough incentive to install Neo4j and play with it and the SPARQL plugin.

To quote Neo4j's home page, it's "a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j runs a thousand times faster than relational databases." According to the popular NoSQL introduction Seven Databases in Seven Weeks, Neo4j "can store tens of billions of nodes and as many edges." The ability to distribute a database across a cluster is another thing that makes Neo4j popular.

From what I can tell, at least on Windows, you don't want the installer version of Neo4j on its download page, because that doesn't create a plugins directory where you can add the SPARQL one, so get the zip version. I got release 1.9.5 of that one.

I don't know much about Neo4j except some basics that I read in the "Seven Databases" book, so please forgive any basic misunderstandings or big deviations from standard Neo4j practices. Once I installed it and started it up with bin\neo4j.bat, I sent a browser to the main screen at http://localhost:7474 to make sure that I had installed it properly. This all worked fine; installation was really just a matter of unzipping, once I determined the right distribution to unzip.

To install the SPARQL plugin, I downloaded the distribution zip file from its github page (not to be confused with the project's github page, which has the source), unzipped that inside of the neo4j-community-1.9.5\plugins folder, and restarted neo4j (that is, I shut it down with a ^C in the terminal window that it created when I started it up, then started it again the same way I did originally).

Inserting data

I like to use curl to test RESTful (or REST-ish) interfaces, and found that I had better luck interacting with Neo4j by using curl from the cygwin sh shell under Windows than using it with the native Windows command line prompt. Following some examples in the SPARQL plugin's documentation, I tried the following, which successfully inserted some data. (Assume that all curl command lines shown here were actually executed as a single line.)

curl -X POST -H Content-Type:application/json -H Accept:application/json \
  --data-binary @sampledata.txt \
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad

The sampledata.txt file named in that command line had this in it:

{ 
  "s" : "http://neo4j.org#jim",  
  "p" : "http://neo4j.org#knows",  
  "o" : "http://neo4j.org#mitch",  
  "c" : "http://neo4j.org" 
}

Note that it's inserting a quad, not a triple, with "c" being a named graph. I'm guessing that the "c" stands for "context" because the plugin uses a lot of Sesame jar files.
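If you would rather generate these request bodies than type them by hand, a minimal Python sketch might look like the following. The helper names are hypothetical. The literal_term function produces the quoted and typed literal forms used in the curl examples that follow, and its escaping of embedded quotes and backslashes is my assumption, because I did not check how the plugin parses those.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"


def literal_term(value, datatype=None):
    """Wrap a value in quotes, optionally appending ^^<datatype IRI>."""
    # Assumption: embedded backslashes and quotes need N-Triples-style escapes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    term = f'"{escaped}"'
    if datatype:
        term += f"^^<{datatype}>"
    return term


def quad_payload(s, p, o, c):
    """Serialize one quad as a JSON body with the s, p, o, and c keys."""
    return json.dumps({"s": s, "p": p, "o": o, "c": c})


if __name__ == "__main__":
    print(quad_payload(
        "http://neo4j.org#joe",
        "http://learningsparql.com/ns/data#hireDate",
        literal_term("2012-11-09", XSD + "date"),
        "http://learningsparql.com/ns/data#test1/",
    ))
```

Letting json.dumps do the JSON-level quoting avoids the hand-written backslashes that a literal value otherwise needs inside the JSON string.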

The following successfully inserted a similar quad, this time specified on the command line:

curl -X POST -H Content-Type:application/json -H Accept:application/json \
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad \
   -d '{  "s" : "http://neo4j.org#joe",  "p" : "http://neo4j.org#knows",
   "o" : "http://neo4j.org#sara",  "c" : "http://neo4j.org"}'

This worked to insert a literal string,

curl -X POST -H Content-Type:application/json -H Accept:application/json \
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad -d \
  '{  "s" : "http://neo4j.org#joe",  "p" : "http://learningsparql.com/ns/data#lastName",
  "o" : "\"Schmoe\"",  "c" : "http://learningsparql.com/ns/data#test1/"}'

and this inserted a value with an explicit type:

curl -X POST -H Content-Type:application/json -H Accept:application/json \
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad -d \
  '{  "s" : "http://neo4j.org#joe",  "p" : "http://learningsparql.com/ns/data#hireDate",
  "o" : "\"2012-11-09\"^^<http://www.w3.org/2001/XMLSchema#date>",  "c" :
  "http://learningsparql.com/ns/data#test1/"}'

Querying

With this SPARQL query stored in neo4jquery1.json,

{
  "query" : "SELECT * WHERE { ?s <http://neo4j.org#knows> ?o .}"
}

I entered this at the cygwin sh prompt,

curl -X POST -H Content-Type:application/json -H Accept:application/json \
   --data-binary @neo4jquery1.json \
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/execute_sparql

and got this result:

[ {
  "s" : "http://neo4j.org#jane",
  "o" : "http://neo4j.org#jim"
}, {
  "s" : "http://neo4j.org#joe",
  "o" : "http://neo4j.org#sara"
} ]

I found it best to execute queries from a stored file like that, because although JSON won't let me spread a string (in this case, the query itself) across multiple lines, it was still a little easier than packing it into a curl command line with the other parameters.
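One way around the single-line restriction is to keep the query in a readable multi-line form and let a short script produce the JSON. The following Python sketch is just one possible approach, with a hypothetical function name. It collapses all whitespace, which is only safe for queries that have no # comments and no string literals whose internal spacing matters.

```python
import json


def query_payload(sparql):
    """Collapse a multi-line query to one line and wrap it in JSON.

    Collapsing whitespace is only safe when the query has no # comments
    and no string literals whose internal spacing matters.
    """
    one_line = " ".join(sparql.split())
    return json.dumps({"query": one_line}, indent=2)


QUERY = """
SELECT * WHERE {
  ?s <http://neo4j.org#knows> ?o .
}
"""

if __name__ == "__main__":
    # Redirect this output to a file such as neo4jquery1.json for curl.
    print(query_payload(QUERY))
```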

A similar command line executed this query, which specifies the named graph whose triples should be returned:

{
  "query" : "SELECT * WHERE { GRAPH <http://neo4j.org>  {?s ?p ?o }  }"
}

I tried a few random SPARQL 1.1 features such as BIND and COUNT, and they worked fine. Because most of the Sesame JAR files say "2.6.10," which is only a little more than a year old, I'm guessing that the support of the SPARQL 1.1 query language is pretty complete.

The plugin currently does not support the SPARQL UPDATE language. Deleting the data inserted above would require the use of native Neo4j commands, which would require you to know the internal Neo4j identifiers used for the nodes and edges that represent RDF resources and predicates. Perhaps a bit ironically to RDF people, these identifiers are URIs, but they will rarely be universally unique; for example, my URI http://neo4j.org#mitch was actually stored with the URI http://localhost:7474/db/data/node/7, a URI that very likely refers to other resources on other Neo4j installations that use the default system name and port number of localhost:7474. (I assume that much of Paulo's work in building the query plugin was mapping from the SPARQL URI references to the internal Neo4j references.)

The plugin, JSON, and the future

You've probably noticed that all the input and output to this SPARQL plugin is always JSON: you send data and queries to Neo4j embedded in JSON, and your results are JSON, but not the W3C SPARQL Query Results JSON Format. This use of JSON isn't specific to Paulo's plugin, but a default for the Neo4j REST API, which currently provides the context for all SPARQL-oriented communication with a Neo4j server. While the plugin's documentation refers to an endpoint, it's not a SPARQL endpoint in the sense that it supports the SPARQL Protocol (the "P" in "SPARQL"), but an endpoint that, at this point, has its own interface for accepting SPARQL queries and delivering results.
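Until the plugin speaks the standard format, client code that expects it could convert the results itself. Here is a rough Python sketch of such a conversion to the head/results/bindings layout of the W3C format. The function name is hypothetical, and the type guessing is a loud assumption: the plugin output shown above carries no type information, so the sketch treats anything that looks like an http or https IRI as a URI and everything else as a literal, and I have not checked how the plugin actually returns literals.

```python
def to_sparql_results_json(rows):
    """Convert a list of {variable: value} rows to the W3C results layout."""
    variables = []
    for row in rows:
        for name in row:
            if name not in variables:
                variables.append(name)

    def binding(value):
        # Assumption: no type information is available, so guess "uri"
        # for values that look like absolute http(s) IRIs.
        kind = "uri" if value.startswith(("http://", "https://")) else "literal"
        return {"type": kind, "value": value}

    return {
        "head": {"vars": variables},
        "results": {
            "bindings": [
                {name: binding(value) for name, value in row.items()}
                for row in rows
            ]
        },
    }


if __name__ == "__main__":
    rows = [{"s": "http://neo4j.org#joe", "o": "http://neo4j.org#sara"}]
    print(to_sparql_results_json(rows))
```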

The insert_quad and execute_sparql methods shown above are currently the only two that the plugin offers, and as you might guess from the singular form of "insert_quad," it can only insert one at a time. For now, inserting multiple quads will mean either multiple calls to this method or digging down into the lower levels of the plugin.
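The "multiple calls" option is easy enough to script. This Python sketch, using only the standard library, loops over a list of quads and posts each one to the insert_quad URL used in the curl examples. The function names are hypothetical, I have only run the curl versions against the plugin myself, and there is no error handling or retry logic here.

```python
import json
import urllib.request

INSERT_URL = "http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad"


def post_quad(quad, url=INSERT_URL):
    """POST one quad as JSON, mirroring the curl calls shown earlier."""
    request = urllib.request.Request(
        url,
        data=json.dumps(quad).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def insert_quads(quads, send=post_quad):
    """Insert (s, p, o, c) tuples one call at a time; return how many were sent."""
    count = 0
    for s, p, o, c in quads:
        send({"s": s, "p": p, "o": o, "c": c})
        count += 1
    return count
```

The send parameter makes it possible to try the loop without a running server, for example by passing a list's append method to collect the bodies that would have been posted.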

So, while this plugin has a ways to go before people can get serious work done with it, it's still a great start and fun to play with. I don't want to finish this with a discussion of the RDF features that it's missing, but instead with some mentions of the cool Neo4j things that would be great to try with RDF. I've already mentioned the ease with which data can apparently be distributed across clusters; another is Neo4j's built-in shortest path algorithm(s), something I've always wanted for an RDF store.

I look forward to Paulo's future work, and I'd like to thank him for helping this Neo4j neophyte get this far with Neo4j and with his plugin.

