Storing and querying RDF in Neo4j

Hands-on experience with another NoSQL database manager.

In the typical classification of NoSQL databases, the "graph" category is one that was not covered in the "NoSQL Databases for RDF: An Empirical Evaluation" paper that I described in my last blog entry. (Several were "column-oriented" databases, which I always thought sounded like triplestores; the "table" part of the way people describe them always sounded to me like a stretched metaphor designed to appeal to relational database developers.) A triplestore is a graph database, and Brazilian software developer Paulo Roberto Costa Leite has developed a SPARQL plugin for Neo4j, the most popular of the NoSQL graph databases. This gave me enough incentive to install Neo4j and play with it and the SPARQL plugin.

While this plugin has a ways to go before people can get serious work done with it, it's still a great start and fun to play with.

To quote Neo4j's home page, it's "a robust (fully ACID) transactional property graph database. Due to its graph data model, Neo4j is highly agile and blazing fast. For connected data operations, Neo4j runs a thousand times faster than relational databases." According to the popular NoSQL introduction Seven Databases in Seven Weeks, Neo4j "can store tens of billions of nodes and as many edges." The ability to distribute a database across a cluster is another thing that makes Neo4j popular.

From what I can tell, at least on Windows, you don't want the installer version of Neo4j on its download page, because that doesn't create a plugins directory where you can add the SPARQL one, so get the zip version. I got release 1.9.5 of that one.

I don't know much about Neo4j except some basics that I read in the "Seven Databases" book, so please forgive any basic misunderstandings or big deviations from standard Neo4j practices. Once I installed it and started it up with bin\neo4j.bat, I sent a browser to the main screen at http://localhost:7474 to make sure that I had installed it properly. This all worked fine; installation was really just a matter of unzipping, once I determined the right distribution to unzip.

To install the SPARQL plugin, I downloaded the distribution zip file from its github page (not to be confused with the project's github page, which has the source), unzipped that inside of the neo4j-community-1.9.5\plugins folder, and restarted neo4j (that is, I shut it down with a ^C in the terminal window that it created when I started it up, then started it again the same way I did originally).

Inserting data

I like to use curl to test RESTful (or REST-ish) interfaces, and found that I had better luck interacting with Neo4j by using curl from the cygwin sh shell under Windows than using it with the native Windows command line prompt. Following some examples in the SPARQL plugin's documentation, I tried the following, which successfully inserted some data. (Assume that all curl command lines shown here were actually executed as a single line.)

curl -X POST -H Content-Type:application/json -H Accept:application/json 
  --data-binary @sampledata.txt 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad 

The sampledata.txt file named in that command line had this in it:

{ 
  "s" : "http://neo4j.org#jim",  
  "p" : "http://neo4j.org#knows",  
  "o" : "http://neo4j.org#mitch",  
  "c" : "http://neo4j.org" 
}

Note that it's inserting a quad, not a triple, with "c" being a named graph. I'm guessing that the "c" stands for "context" because the plugin uses a lot of Sesame jar files.

The following successfully inserted a similar quad, this time specified directly on the command line:

curl -X POST -H Content-Type:application/json -H 
   Accept:application/json 
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad  
   -d '{  "s" : "http://neo4j.org#joe",  "p" : "http://neo4j.org#knows",  
   "o" : "http://neo4j.org#sara",  "c" : "http://neo4j.org"}'

This worked to insert a literal string,

curl -X POST -H Content-Type:application/json -H Accept:application/json 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad -d 
  '{  "s" : "http://neo4j.org#joe",  "p" : "http://learningsparql.com/ns/data#lastName", 
  "o" : "\"Schmoe\"",  "c" : "http://learningsparql.com/ns/data#test1/"}'

and this inserted a value with an explicit type:

curl -X POST -H Content-Type:application/json -H Accept:application/json 
  http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad  -d 
  '{  "s" : "http://neo4j.org#joe",  "p" : "http://learningsparql.com/ns/data#hireDate", 
  "o" : "\"2012-11-09\"^^<http://www.w3.org/2001/XMLSchema#date>",  "c" : 
  "http://learningsparql.com/ns/data#test1/"}'
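Getting the quoting right for these literal object values inside a JSON string is fiddly by hand. As a non-authoritative sketch, a JSON library can do the escaping for you; the values here are the same ones used in the typed-literal curl command above:

```python
import json

# Let json.dumps handle the backslash-escaping of the inner quotes
# that the plugin's N-Triples-style object values require.
payload = json.dumps({
    "s": "http://neo4j.org#joe",
    "p": "http://learningsparql.com/ns/data#hireDate",
    "o": '"2012-11-09"^^<http://www.w3.org/2001/XMLSchema#date>',
    "c": "http://learningsparql.com/ns/data#test1/",
})
print(payload)
```

The printed string is exactly what went into the -d argument of the curl command, escaped quotes and all.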

Querying

With this SPARQL query stored in neo4jquery1.json,

{
  "query" : "SELECT * WHERE { ?s <http://neo4j.org#knows> ?o .}"
}

I entered this at the cygwin sh prompt,

curl -X POST -H Content-Type:application/json -H Accept:application/json  
   --data-binary @neo4jquery1.json 
   http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/execute_sparql

and got this result:

[ {
  "s" : "http://neo4j.org#jane",
  "o" : "http://neo4j.org#jim"
}, {
  "s" : "http://neo4j.org#joe",
  "o" : "http://neo4j.org#sara"
} ]

I found it best to execute queries from a stored file like that; although JSON won't let me spread a string (in this case, the query itself) across multiple lines, it was still easier than packing the query into the curl command line along with the other parameters.
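If you're scripting against this interface, the result array is easy to consume. Here's a minimal Python sketch that parses the sample output shown above, assuming only that the plugin returns the plain array-of-objects shape it returned for my query:

```python
import json

# The plugin returns results as a plain JSON array of binding objects
# (not the W3C SPARQL Query Results JSON Format), so parsing is one call.
results = json.loads("""[ {
  "s" : "http://neo4j.org#jane",
  "o" : "http://neo4j.org#jim"
}, {
  "s" : "http://neo4j.org#joe",
  "o" : "http://neo4j.org#sara"
} ]""")

for row in results:
    print(row["s"], "knows", row["o"])
```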

A similar command line executed this query, which specifies the named graph whose triples should be returned:

{
  "query" : "SELECT * WHERE { GRAPH <http://neo4j.org>  {?s ?p ?o }  }"
}

I tried a few random SPARQL 1.1 features such as BIND and COUNT, and they worked fine. Because most of the Sesame JAR files say "2.6.10," which is only a little more than a year old, I'm guessing that the support of the SPARQL 1.1 query language is pretty complete.

The plugin currently does not support the SPARQL UPDATE language. Deleting the data inserted above would require the use of native Neo4j commands, which would require you to know the internal Neo4j identifiers used for the nodes and edges that represent RDF resources and predicates. Perhaps a bit ironically to RDF people, these identifiers are URIs, but they will rarely be universally unique; for example, my URI http://neo4j.org#mitch was actually stored with the URI http://localhost:7474/db/data/node/7, a URI that very likely refers to other resources on other Neo4j installations that use the default system name and port number of localhost:7474. (I assume that much of Paulo's work in building the query plugin was mapping from the SPARQL URI references to the internal Neo4j references.)

The plugin, JSON, and the future

You've probably noticed that all the input and output to this SPARQL plugin is JSON: you send data and queries to Neo4j embedded in JSON, and your results are JSON, but not the W3C SPARQL Query Results JSON Format. This use of JSON isn't specific to Paulo's plugin, but is the default for the Neo4j REST API, which currently provides the context for all SPARQL-oriented communication with a Neo4j server. While the plugin's documentation refers to an endpoint, it's not a SPARQL endpoint in the sense that it supports the SPARQL Protocol (the "P" in "SPARQL"), but an endpoint that, at this point, has its own interface for accepting SPARQL queries and delivering results.

The insert_quad and execute_sparql methods shown above are currently the only two that the plugin offers, and as you might guess from the singular form of "insert_quad," it can only insert one at a time. For now, inserting multiple quads will mean either multiple calls to this method or digging down into the lower levels of the plugin.
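Until the plugin offers batch insertion, a small script can make those multiple calls for you. This is a hypothetical sketch (the quad_payload helper and the loop are mine, not part of the plugin); it only builds and prints the per-quad JSON bodies, leaving the actual HTTP POST to curl or urllib:

```python
import json

# Hypothetical helper: build the one-quad JSON body that the plugin's
# insert_quad method expects.
def quad_payload(s, p, o, c):
    return json.dumps({"s": s, "p": p, "o": o, "c": c})

quads = [
    ("http://neo4j.org#jim", "http://neo4j.org#knows",
     "http://neo4j.org#mitch", "http://neo4j.org"),
    ("http://neo4j.org#joe", "http://neo4j.org#knows",
     "http://neo4j.org#sara", "http://neo4j.org"),
]

# One POST per quad; each printed body would go to
# http://localhost:7474/db/data/ext/SPARQLPlugin/graphdb/insert_quad
for q in quads:
    print(quad_payload(*q))
```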

So, while this plugin has a ways to go before people can get serious work done with it, it's still a great start and fun to play with. I don't want to finish this with a discussion of the RDF features that it's missing, but instead with some mentions of the cool Neo4j things that would be great to try with RDF. I've already mentioned the ease with which data can apparently be distributed across clusters; another is Neo4j's built-in shortest path algorithm(s), something I've always wanted for an RDF store.

I look forward to Paulo's future work, and I'd like to thank him for helping this Neo4j neophyte get this far with Neo4j and with his plugin.

6/23/14 update: I have just discovered Michael B's Importing ttl (Turtle) ontologies in Neo4j from over a year ago. It describes things mostly in terms of Java source code, so I'm not about to jump in and try it out right away, but it will make a good resource for people interested in using RDF in Neo4j. And, the fact that he's an IBM employee makes it more interesting.


Please add any comments to this Google+ post.