IBM's DB2 as a triplestore

Surprisingly easy to set up and use, but requiring lots of Java coding for any real application development.

I thought it was pretty big news for the semantic web world when IBM announced that release 10.1 of their venerable DB2 database manager could function as an RDF triplestore, but it seems that few others—not even, apparently, IBM staff responsible for marketing semantic technology—agreed with me. More on this below.

RDF on DB2

IBM invented relational databases, and DB2 has been their main relational database product for almost thirty years. It runs on mainframes, PCs, Linux, the iSeries (descendants of the AS/400) and other platforms. (Although DB2 has also worked as an XML repository since 2006, with support for XQuery and XPath, I have not been aware of any shops using it for that instead of, say, MarkLogic or eXist. I assume it's used for more transaction-oriented XML as opposed to content for publishing.) In addition to functioning as a triplestore, DB2 10.1 supports SPARQL 1.0 and a few of the more SQL-friendly features of SPARQL 1.1.

I found the free version of DB2 for Windows to be fairly easy to download and install. I didn't have to do anything special to get my downloaded copy to support RDF; after I finished the default installation, my hard disk had a \Program Files\IBM\SQLLIB\rdf directory with a lib subdirectory full of jar files and a bin subdirectory of batch files that call those jar files.

"RDF application development for IBM data servers" appears to be the main documentation page for DB2's RDF support, but I used the developerWorks tutorial "Resource description framework application development in DB2 10 for Linux, UNIX, and Windows" as my guide to getting started, in particular to find out which Jena and ARQ jar files to add to the rdf/lib directory to make everything work properly.

The tutorial has you use "IBM Data Studio", their Eclipse-based DB2 administration interface, after you finish your initial setup. I couldn't get certain menu choices described by the article to show up in the copy of Data Studio that I downloaded, but with some generous email help from the article's lead author, Mario Briggs, I ultimately managed to do everything I wanted to without Data Studio.

(The developerWorks article is actually just Part 1, and I look forward to Part 2. Remember, though, that the article is more oriented toward explaining RDF to DB2 users than vice versa, and it also assumes that your main use of DB2's RDF storage will be from Java code that you write yourself. I limited myself to the batch files in the bin directory and two that Mario sent, and did manage to load and query some data.)

The "Prerequisites for creating RDF stores" section of the tutorial article lists some very technical setup details to perform, but step 2 after that describes a script that takes care of these steps for you, for example by creating the DB2 database RDFSAMPL that each of my command line examples below refers to. (Note that the script is called dbsetup.sql, not setup.sql as the article currently says. Also, in Windows 7, you can't do this in just any command line window, but must do it from one opened by right-clicking a command line window icon and picking "Run as administrator".) That was not the first time that I did something specified by the article, saw that it didn't work, and then read in the paragraphs after that about changes to make to the displayed command to make it work with my configuration. So, if you get stuck in the tutorial, read ahead a little before you get too frustrated.

If you run a batch file from \Program Files\IBM\SQLLIB\rdf\bin with no parameters, it displays help about the available parameters, so that will tell you more details about the steps that I executed below.

Once I had the RDFSAMPL database defined using the dbsetup.sql script, running the following command from the bin directory mentioned above created an RDF store in RDFSAMPL named myrdfstore (I had set the password values when I first installed DB2):

createrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password

The bin directory includes a createrdfstoreandloader.bat batch file to create and load data at once, but I usually used the loadrdfstore.bat batch file (available here with ".txt" added to the filename for easier downloading) that Mario sent me. For example, this next command loaded some data into that RDF store and gave a report about how many triples were loaded:

loadrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password \temp\ex029.rdf

Right now, DB2 can load RDF/XML and N-Triples files, but not Turtle. As far as I can tell, without custom Java coding there is currently no way to add triples to an RDF store that already has triples in it or to add triples to named graphs. See the documentation for more on the relevant Java libraries and calls.
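If a store really does have to be loaded in one pass, one workaround is to gather all of your triples into a single N-Triples file before running the loader. What follows is only a minimal sketch of writing that format, in Python for brevity; it is not part of IBM's distribution or the tutorial. The function names and the example.org URIs are made up for illustration, and non-ASCII characters are written as \u escapes on the assumption that the loader may expect ASCII-only N-Triples.

```python
def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= code <= 0x7E:
            out.append(ch)                    # printable ASCII passes through
        elif code <= 0xFFFF:
            out.append("\\u%04X" % code)      # other BMP characters
        else:
            out.append("\\U%08X" % code)      # characters beyond the BMP
    return "".join(out)


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_uri) tuples.

    Subjects and predicates are treated as URIs; the object is written
    as a URI or as a plain literal depending on the fourth element.
    """
    lines = []
    for subject, predicate, obj, object_is_uri in triples:
        if object_is_uri:
            obj_term = "<%s>" % obj
        else:
            obj_term = '"%s"' % escape_literal(obj)
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj_term))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data; the example.org URIs are placeholders.
    sample = [
        ("http://example.org/id/1", "http://example.org/ns#name", "Richard", False),
        ("http://example.org/id/1", "http://example.org/ns#knows",
         "http://example.org/id/2", True),
    ]
    print(to_ntriples(sample), end="")
```

Redirecting the output to a file such as \temp\sample.nt should give you something to hand to the loader. How the loader tells N-Triples from RDF/XML (by file extension or otherwise) is not something verified here, so check the batch file's help output first.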

Another short yet crucial batch file that Mario sent me was queryrdfstore (available here). This next command uses it to run the query shown and displays the results along with a count of the milliseconds it took:

queryrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password "SELECT DISTINCT ?p WHERE { ?s ?p ?o }"

(Keep in mind that the files that Mario sent me may not work with future versions of DB2's RDF support; that's why they were left out of the basic distribution. I'm sure future releases will include some sort of equivalent.) Instead of a quoted query, you can supply the name of a file with the SPARQL query stored in it:

queryrdfstore myrdfstore -db RDFSAMPL -user db2admin -password mydb2password myquery1.rq
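The create, load, and query commands above all share the same store name and connection arguments, so they are easy to drive from a small script. The following is a rough sketch in Python, not anything supplied by IBM. It uses only the arguments that appear in the commands above; the helper names build_command and run are invented for the example, and it assumes the batch files (including the two that Mario sent) are all sitting in the rdf\bin directory.

```python
import subprocess

# Values taken from the commands shown above.
STORE = "myrdfstore"
DB = "RDFSAMPL"
USER = "db2admin"
PASSWORD = "mydb2password"


def build_command(tool, *extra):
    """Return the argument list for one of the batch files in rdf/bin."""
    return [tool, STORE, "-db", DB, "-user", USER, "-password", PASSWORD, *extra]


def run(command, bin_dir):
    """Run one batch file from bin_dir and return its exit code.

    Intended for Windows only: shell=True lets cmd.exe resolve the .bat
    file. Because the working directory becomes bin_dir, any file
    arguments should be absolute paths.
    """
    return subprocess.run(command, cwd=bin_dir, shell=True).returncode


if __name__ == "__main__":
    # Dry run: print the three commands instead of executing them.
    steps = [
        build_command("createrdfstore"),
        build_command("loadrdfstore", r"\temp\ex029.rdf"),
        build_command("queryrdfstore", "SELECT DISTINCT ?p WHERE { ?s ?p ?o }"),
    ]
    for step in steps:
        print(step)
```

To actually execute a step, pass it to run() along with the path to your bin directory, from a command window opened with "Run as administrator" if your setup requires that. Keeping the password in the script is only acceptable for a throwaway laptop installation like the one described here.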

For now it looks like IBM isn't that interested in selling DB2 and its RDF triplestore features to the semantic web crowd. For example, shortly before the big Semantic Technologies Conference last June in San Francisco, semanticweb.com's Eric Franzon interviewed Bernie Spang, IBM's Director of Strategy and Marketing for Database Software and Systems, in an article titled "RDF Support in IBM’s DB2". Spang talked more in big-picture terms, which is his job, and the article concludes by pointing out that IBM is a Gold Sponsor of the San Francisco conference. However, when I went to the IBM booth at the conference to ask about the RDF triplestore support in DB2, the two guys in the booth were genuinely surprised to hear that this had been added to DB2. (They were there to sell IBM's Enterprise Content Management product.) They did give me some excellent wind-up IBM robots, though.

When I see a title of "DB2-RDF (NoSQL Graph) Support in DB2 LUW 10.1" on another page on the developerWorks site, I can better see the logic of IBM's approach: they're saying "Hey, we can do NoSQL", a message that can appeal to a bigger audience than a marketing effort focused on us semantic web geeks, especially when you consider the huge base of existing DB2 users who are wondering about the new database technologies getting the most buzz lately.

I'm still very happy that IBM chose to go with a W3C standards-based approach to supporting NoSQL graph databases. I especially appreciate this direction because a lot of the NoSQL crowd seems unaware of what RDF and SPARQL technology can offer them. (Why, and what can we do about it? That's another blog entry, but feel free to add comments here with your own theories.) I just think it's great that I can store and query RDF on my laptop using one of the most respected database management packages without spending a dime, and that if I really want to scale up, I can do it with the same software on an IBM mainframe.

[Image: wind-up IBM schwag robots]

Please add any comments to this Google+ post.