Storing (and querying) RDF in NoSQL database managers

Interesting progress, carefully measured.
"...we are confident that NoSQL databases will present an ever growing opportunity to store and manage RDF data in the cloud."

A little over a year ago, in a blog entry titled SPARQL and Big Data (and NoSQL), I wrote this:

What I'd love to see, and have heard about tentative steps toward, would be SPARQL endpoints for some of these NoSQL database systems. The D2RQ and R2RML work have accomplished things that should be easier for graph-oriented NoSQL databases like Neo4J and, if I understand the quote above [from Edd Dumbill's Planning for Big Data] correctly, for column-oriented NoSQL databases as well. Google searches on SPARQL and either Hadoop, Neo4J, HBase, or Cassandra show that some people have been discussing and even doing a bit of coding on several of these.

Discussions and bits of coding are nice, but I recently found something much better in a paper titled "NoSQL Databases for RDF: An Empirical Evaluation" (pdf)—a methodical comparison of the storage and querying of RDF in different NoSQL systems. This ISWC 2013 paper, written by ten authors from four universities in four countries, included this in its abstract:

This work is, to the best of our knowledge, the first systematic attempt at characterizing and comparing NoSQL stores for RDF processing. In the following, we describe four different NoSQL stores and compare their key characteristics when running standard RDF benchmarks on a popular cloud infrastructure using both single-machine and distributed deployments.

The paper then describes the storage and querying of RDF using four NoSQL setups: HBase with Jena for querying; HBase with Hive as the query engine (with Jena's ARQ parsing the queries before they are converted to HiveQL); CumulusRDF (Cassandra with Sesame); and Couchbase. The study also includes the 4store triplestore so that the authors could compare their NoSQL benchmarks with those of a native RDF triplestore. (As you might guess from its name, 4store is actually a quad store. Speaking of quads, while adding links to this paragraph, I found that four of the technologies listed here are each their own Apache project.)
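To give a rough feel for what "storing RDF in a NoSQL store" can mean, here is a minimal sketch of one common idea: writing each triple under several key orderings so that a lookup by subject, predicate, or object becomes a single key fetch. This is not taken from the paper and is not a description of how any of the systems above lay out their data. The class `TripleIndex`, its methods, and the sample `ex:` identifiers are hypothetical names made up for illustration, and plain Python dictionaries stand in for the key-value or column-family store.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory stand-in for a key-value store holding RDF triples.

    Assumption for illustration only: each triple is written three times,
    keyed by subject, by predicate, and by object.
    """

    def __init__(self):
        self.spo = defaultdict(set)  # subject   -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object    -> {(subject, predicate)}

    def add(self, s, p, o):
        """Store one triple under all three key orderings."""
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching a pattern; None is a wildcard."""
        if s is not None:
            # Subject is bound: one lookup in the subject-keyed index.
            candidates = ((s, p2, o2) for p2, o2 in self.spo.get(s, ()))
        elif p is not None:
            # Predicate is bound: one lookup in the predicate-keyed index.
            candidates = ((s2, p, o2) for o2, s2 in self.pos.get(p, ()))
        elif o is not None:
            # Only the object is bound: use the object-keyed index.
            candidates = ((s2, p2, o) for s2, p2 in self.osp.get(o, ()))
        else:
            # Nothing is bound: fall back to a full scan.
            candidates = ((s2, p2, o2)
                          for s2, row in self.spo.items()
                          for p2, o2 in row)
        # Apply whichever bound terms the chosen index did not cover.
        return sorted(t for t in candidates
                      if (p is None or t[1] == p) and (o is None or t[2] == o))


store = TripleIndex()
store.add("ex:paper", "dc:creator", "ex:juan")
store.add("ex:paper", "dc:title", "NoSQL Databases for RDF")
store.add("ex:juan", "foaf:name", "Juan Sequeda")
creators = store.match(p="dc:creator")
```

The trade-off in a layout like this is extra storage and write cost in exchange for cheap single-key reads. A SPARQL query with several triple patterns would still need joins on top of these lookups, which is the part that a query layer has to supply.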

The benchmarks and testing environments are all rigorously documented in the paper. You can read these details yourself, so I'll skip ahead to the end of their conclusion: "we are confident that NoSQL databases will present an ever growing opportunity to store and manage RDF data in the cloud."

I didn't recognize many of the authors' names, but I certainly recognized that of Juan Sequeda of the University of Texas and Capsenta. His PhD work at UT, which led to Capsenta's Ultrawrap product, makes Juan about the most qualified person I can think of to perform this kind of methodical review of the potential value of NoSQL database managers for storing and querying RDF, so I'm glad that he and his co-authors are doing this. Additional good news is that they've made "all results, as well as [their] source code, how-to guides, and EC2 images to rerun [their] experiments" available on their project's web site for others to build on, and it looks like they have continued that work since publishing the paper. I look forward to further reports from them as efforts to store RDF in NoSQL database managers move forward.

