SPARQL and Big Data (and NoSQL)

How to pursue the common ground?

I think it's obvious that SPARQL and other RDF-related technologies have plenty to offer to the overlapping worlds of Big Data and NoSQL, but this doesn't seem as obvious to people who focus on those areas. For example, the program for this week's Strata conference makes no mention of RDF or SPARQL. The more I look into it, the more I see that this flexible, standardized data model and query language align very well with what many of those people are trying to do.

But we semantic web types can't blame them for not noticing. If you build a better mousetrap, the world won't necessarily beat a path to your door, because people first have to find out about your mousetrap and what it does better. This requires marketing, which requires talking to those people in language that they understand, so I've been reading up on Big Data and NoSQL in order to better appreciate what they're trying to do and how.

A great place to start is the excellent (free!) booklet Planning for Big Data by Edd Dumbill. (Others contributed a few chapters.) Early on, he describes data that "doesn't fit the strictures of your database architectures" as a good candidate for Big Data approaches, which is a promising opening for us. Here are a few longer quotes that I found interesting, starting with these two paragraphs from the section titled "Ingesting and Cleaning," which follow a discussion about collecting data from multiple sources (something else that RDF and SPARQL are good at):

Once the data is collected, it must be ingested. In traditional business intelligence (BI) parlance, this is known as Extract, Transform, and Load (ETL): the act of putting the right information into the correct tables of a database schema and manipulating certain fields to make them easier to work with.

One of the distinguishing characteristics of big data, however, is that the data is often unstructured. That means we don’t know the inherent schema of the information before we start to analyze it. We may still transform the information — replacing an IP address with the name of a city, for example, or anonymizing certain fields with a one-way hash function — but we may hold onto the original data and only define its structure as we analyze it.

With my long history as an XML guy (which is how I know Edd, the former editor of XML.com), I know that ideas about "structured" vs. "unstructured" data are very relative: one person's structured data is another person's unstructured data, especially if the first person is an XML guy and the second is an RDBMS person, and the term "semi-structured" becomes the compromise adjective. I'll coin a new term that seems to get no relevant Google hits: "minimally structured." If there's just enough structure to get a toehold and build from there, your data is minimally structured. And RDFS is excellent if we want to "define [data's] structure as we analyze it." This can be done very incrementally, and OWL can take you many increments further.

Some of that minimal structure can be inferred and made explicit. For example, if you have data about people's genders and about who is the parent of whom, you can infer father and mother relationships (and grandfather, and aunt, and...) and even classes, by defining a Grandfather class as the set of instances that have a gender of male and have children who have children. I might say that this is creating new information, but a relational database person would say that it's not; it's just making implicit information explicit. Relational database people put a lot of effort into avoiding the explicit storage of information that can otherwise be inferred, but a relational database is a very closed world, so new possibilities of things to infer within a given set of data don't come up often. Accumulation of RDF from multiple sources can be very dynamic, making it easier to create new wholes that are greater than the sum of their parts (made greater by this kind of inferencing), which opens up new possibilities for patterns to find in different combinations of data.
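
To make the gender-and-parent example concrete, here is a minimal sketch in plain Python rather than in RDFS or OWL. The names, the property names (gender, parentOf, fatherOf, motherOf), and the infer function are all made up for illustration; a real system would express these rules in OWL or as SPARQL CONSTRUCT queries and leave the work to a reasoner or triplestore.

```python
# Triples are plain (subject, predicate, object) tuples.
# All people and property names here are hypothetical.
triples = {
    ("alice", "gender", "female"),
    ("bob", "gender", "male"),
    ("carol", "gender", "female"),
    ("dave", "gender", "male"),
    ("alice", "parentOf", "carol"),
    ("bob", "parentOf", "carol"),
    ("carol", "parentOf", "dave"),
}


def infer(triples):
    """Return triples implied by the gender and parentOf facts but not yet stored."""
    gender = {s: o for s, p, o in triples if p == "gender"}
    parent_of = [(s, o) for s, p, o in triples if p == "parentOf"]
    has_children = {parent for parent, _ in parent_of}

    inferred = set()
    for parent, child in parent_of:
        # Parent plus gender gives the more specific relationship.
        if gender.get(parent) == "male":
            inferred.add((parent, "fatherOf", child))
        elif gender.get(parent) == "female":
            inferred.add((parent, "motherOf", child))
        # Grandfather: a male who has a child who has a child.
        if gender.get(parent) == "male" and child in has_children:
            inferred.add((parent, "type", "Grandfather"))
    return inferred - triples


inferred = infer(triples)
for triple in sorted(inferred):
    print(triple)
```

Nothing in the output was stored explicitly, which is the relational person's point about implicit information; merging in triples from another source and calling infer again is all it takes to find more.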

Another quote from Edd's book:

Even where there’s not a radical data type mismatch, a disadvantage of the relational database is the static nature of its schemas. In an agile, exploratory environment, the results of computations will evolve with the detection and extraction of more signals. Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.

So do triplestores, which give you the best of both worlds: with no need for a schema, you can accumulate data and query it using a standardized query language, and then if you want you can incrementally add schema metadata (often based on query results) to aid further queries.
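
As a rough illustration of that workflow, here is a toy in-memory store in plain Python. It is only a sketch: the store, add, match, and infer_types names and the employee data are invented for this example, and match stands in for what a real triplestore would do with a SPARQL triple pattern. The one piece of real vocabulary is rdfs:domain, which says that anything using a property belongs to the named class.

```python
store = set()  # a hypothetical, in-memory stand-in for a triplestore


def add(s, p, o):
    """Store one triple; no schema has to exist first."""
    store.add((s, p, o))


def match(s=None, p=None, o=None):
    """Return stored triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# 1. Accumulate data with no schema at all.
add("emp1", "name", "Ann")
add("emp1", "office", "Boston")
add("emp2", "name", "Raj")
add("emp2", "twitter", "@raj")  # a property nobody planned for

# 2. Query the data to see which properties are actually in use.
properties = sorted({p for _, p, _ in match()})

# 3. Add a little schema, stored as ordinary triples next to the data.
add("office", "rdfs:domain", "Employee")


def infer_types():
    """Apply rdfs:domain: using a property implies membership in its domain class."""
    domains = {s: o for s, _, o in match(p="rdfs:domain")}
    return sorted(
        {(s, "rdf:type", domains[p]) for s, p, _ in match() if p in domains}
    )


print(properties)
print(infer_types())
```

The schema triple in step 3 was chosen after looking at the query result in step 2, and adding it required no changes to the data already stored.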

Another quote on this topic:

NoSQL databases are frequently called “schemaless,” because they don’t have the formal schema associated with relational databases. The lack of a formal schema, which typically has to be designed before any code is written, means that schemaless databases are a better fit for current software development practices, such as agile development. Starting from the simplest thing that could possibly work and iterating quickly in response to customer input doesn’t fit well with designing an all-encompassing data schema at the start of the project. It’s impossible to predict how data will be used, or what additional data you’ll need as the project unfolds.

Again, all very easy with RDF-based technology, where in addition to the choices of "assemble a big schema before you start developing" and "just blow off schemas, because they impair flexibility" you can work with a middle ground of little bits of schema metadata added when you need them as you go along.

From what I've heard of the various classes of NoSQL databases, graph-oriented ones like Neo4J sound the closest to triplestores, which are also storing graphs. This description of another class of NoSQL databases really caught my attention, though:

Cassandra and HBase are usually called column-oriented databases, though a better term is a “sparse row store.” In these databases, the equivalent to a relational “table” is a set of rows, identified by a key. Each row consists of an unlimited number of columns; columns are essentially keys that let you look up values in the row. Columns can be added at any time, and columns that are unused in a given row don’t occupy any storage. NULLs don’t exist.

This is the "equivalent to a relational 'table'"? It sounds more like the equivalent to a set of triples grouped by subject. Properties (predicates) are essentially keys that let you look up values associated with subjects; you can add property name/value pairs to a subject at any time, because they don't depend on some schema, and properties that aren't used for a given resource don't occupy any storage. (And NULLs don't exist.)
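
The parallel is easy to see in a few lines of plain Python. This sketch groups some made-up triples by subject and ends up with something that looks a lot like the sparse rows described in the quote. The data and the group_by_subject function are hypothetical, and the sketch simplifies by keeping only one value per property, where RDF allows several.

```python
from collections import defaultdict

# Hypothetical triples: (subject, predicate, object).
triples = [
    ("row1", "name", "Ann"),
    ("row1", "email", "ann@example.com"),
    ("row2", "name", "Raj"),
    ("row2", "phone", "555-0100"),
]


def group_by_subject(triples):
    """Group triples into {subject: {predicate: object}}, one sparse row per subject."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s][p] = o  # simplification: a repeated property overwrites the earlier value
    return dict(rows)


rows = group_by_subject(triples)

# A new "column" can be added to one row at any time, with no schema change.
rows["row2"]["office"] = "Boston"

# row1 has no phone: the key is simply absent, not stored as a NULL.
print("phone" in rows["row1"])
print(rows)
```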

What I'd love to see, and have heard about tentative steps toward, would be SPARQL endpoints for some of these NoSQL database systems. The D2RQ and R2RML work has accomplished things for relational databases that should be easier for graph-oriented NoSQL databases like Neo4J and, if I understand the quote above correctly, for column-oriented NoSQL databases as well. Google searches on SPARQL and any of Hadoop, Neo4J, HBase, or Cassandra show that some people have been discussing and even doing a bit of coding on several of these. (In addition to the column- and graph-oriented NoSQL databases, another category is the "document-oriented" ones, so AllegroGraph's interface to MongoDB is an excellent sign of progress in this direction.) What can we do to encourage more of this kind of interaction?
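
The core of such a mapping for a column-oriented store could be quite small. The following plain-Python sketch turns sparse rows into triples by minting URIs from a table name, row key, and column name. It is not how D2RQ or R2RML is configured, and it doesn't use any real HBase or Cassandra API; the BASE namespace, the rows_to_triples function, and the sample rows are all assumptions made for illustration.

```python
BASE = "http://example.com/"  # hypothetical namespace for minted URIs


def rows_to_triples(table, rows):
    """Yield one (subject, predicate, object) triple per populated column."""
    for key, columns in rows.items():
        subject = f"{BASE}{table}/{key}"
        for column, value in columns.items():
            yield (subject, f"{BASE}{table}#{column}", value)


# Made-up sparse rows, as a column-oriented store might return them.
rows = {
    "u1": {"name": "Ann"},
    "u2": {"name": "Raj", "phone": "555-0100"},
}

triples = list(rows_to_triples("users", rows))
for triple in triples:
    print(triple)
```

Unused columns produce no triples at all, so the sparseness of the rows carries over directly; the harder part of a real endpoint would be translating SPARQL queries into efficient lookups against the underlying store.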

I have a lot more research to do, so I just started reading Eric Redmond and Jim Wilson's Seven Databases in Seven Weeks. I will report back on further ideas I have. Meanwhile I'd appreciate hearing anyone else's opinions on how Big Data and NoSQL technology and standards-based semantic technology can better take advantage of what each other has to offer.

Please add any comments to this Google+ post.

bobdc.blog

Bob DuCharme's weblog, mostly on technology for representing and linking information.
