Relational database integration with RDF/OWL

Bob DuCharme
Paper accompanying XML 2006 presentation

RDF/OWL ontologies let you describe data and relationships between data items—for example, that two classes of data items are equivalent, or that one is a subset of another. Common examples given of this typically deal with complex knowledge domains such as pharmacology, but we can use these ontologies with simple, straightforward data that typical companies have in their systems right now, often using popular relational database packages. When you can add metadata to existing data and then use that metadata to query the data collection, you can get more value out of that data. With data being the most important asset of many organizations, the use of standards-based technology to do this is becoming more and more attractive.

In some research on who had actually used RDF/OWL to implement such an integration, I learned of several examples, but none that could be examined closely, so I decided to do one myself. My primary goal in this project was to use RDF/OWL to integrate two relational databases and then perform queries against the aggregate collection to answer realistic questions that could not be answered without the addition of an RDF/OWL ontology. Secondary goals included:

Doing it all with free, portable, open-source software
Making all the relevant files available so that others could reproduce my results (contained in the owlrdbms.zip file—see its readme.txt file for instructions on using its files to execute the steps described in this paper)
Doing it with address book data, which has relevance to everyone
Doing it all without any specialized programming (I did write a short XSLT 1.0 stylesheet utility, which is included in the zip file, but managed to avoid any coding that required compiling and deployment)

Use Cases

For sample data to load into the MySQL relational database package, I created two collections of address book data, one based on the structure of the data stored by the Eudora e-mail program and another based on the structure of the address book data stored by Microsoft Outlook 2003. Despite the different field names used by the two packages (for example, businessState for the work address part of one database's record and workState for the other), I wanted to issue a single query against the aggregate collection of data to answer each of the following questions:

Who works in NY state?
List any phone numbers (home, mobile, business, etc.) that I have for Alfred Adams.
Find all information about Bobby Fischer at 2304 Eighth Lane, even if the other database lists him as Robert L. Fischer of 2304 8th Ln.

Once the data was loaded in MySQL, the following steps made these queries possible:

Use the D2RQ interface (I've done this with release 0.3 and release 0.3.1) to extract an RDF representation of the relational data. All that's really necessary is one record from each of the two one-table databases, which lets the next step see the data structures in use.
Use the open-source ontology editor SWOOP (I used release 2.2.1) to automatically generate the mundane, repetitive parts of the ontology.
Continuing with SWOOP, identify relationships between data properties in the pair of databases that enable the execution of the use case queries.
Separate out the RDF/OWL syntax generated by SWOOP into a file separate from the instance data to allow its use with future versions of the instance data as the address book databases get updated.
Use the RDF/OWL reasoner Pellet to issue queries in the W3C standard SPARQL query language against the data, using the generated ontology to take the ontology-defined relationships into account when generating answers.

Sample data

I performed the following steps to generate each of the two databases:

In each of the two email programs, create a new address book entry and fill out every field with a clue about which field was being filled out.
Save the address book as a comma separated value (CSV) file and examine the results.
Using a python script that includes a list of most popular given names, surnames, city names, etc. in the US, generate sample data that corresponds to the exported CSV file. (More details available here.)
When possible (with Outlook, but not Eudora) import the generated file into the address book to check its structure.
Because the SPARQL queries for the second and third use cases will be looking for specific data among the randomly generated data, add that data to the CSV files. (The readme.txt included with the zip file has these lines and describes where to add them.)
An SQL script included with owlrdbms.zip creates two MySQL databases (I used version 14.12 under Windows) and loads each CSV file into one.

Using the D2RQ interface

D2RQ 0.3 included two programs. (Later versions included more, but these two are still the core ones.) You send your SPARQL requests to the main one, an HTTP server called d2r-server, and it passes along the corresponding SQL statements to your MySQL, Oracle, PostgreSQL, or other relational database package. D2RQ needs a mapping file to to convert your SPARQL request into the appropriate SQL query, and the second D2RQ program, generate-mapping, is a utility that generates this mapping file for you in n3 format. You can use the generated program as-is, but as we'll see, various tweaks to it can make it do more for you.

The following two commands tell generate-mapping to log in to MySQL, look at the data dictionary of the indicated databases, and generate the mapping files that let you issue SPARQL queries against the Eudora and Outlook databases (substitute your own MySQL username and password):

generate-mapping -o eudoraMapping.ttl -u myID -p myPassword jdbc:mysql://localhost/eudora 
generate-mapping -o outlookMapping.ttl -u myID -p myPassword jdbc:mysql://localhost/outlook

Because the point of this project is to query the two databases together, the next step is to combine eudoraMapping.ttl and outlookMapping.ttl into one mapping file. Each assigns the namespace prefix "map" to a URL for the mapping file, so in eudoraMapping.ttl I changed this to "emap" throughout and in the other file, to "omap". Both files also assign the prefix "vocab" to a namespace used for vocabulary terms, so I changed this prefix to "eud" in one and "out" in the other and I changed the namespace URL (which wasn't a proper one to begin with) from "vocab" to "eudora" in one and to "outlook" in the other.

At this point, I combined the two files into one comboMapping.ttl file. I moved all the namespace declarations to the top and removed the redundant ones.

Let's take a look at one of the mapping entries—one of the only two that we'll be editing:

emap:entries_email1 a d2rq:PropertyBridge;
   d2rq:belongsToClassMap emap:entries;
   d2rq:property eud:entries_email1;
   d2rq:column "entries.email1";
     .

It assigns various predicate/object pairs to the subject emap:entries_email1, which is a d2rq:PropertyBridge. As the D2RQ User Manual and Language Specification describes, property bridges "relate database table columns to RDF properties." As the entry shows, the default generated property bridge is d2rq:column, which is for properties with literal values. A selection of alternatives is available, including d2rq:uriPattern for properties with URI values, which replaces the d2rq.column property in the revised version of the emap:entries_email1 entry below.

Calling an email address such as joe@whatever.com a URI isn't enough, because software that treats it as a URI won't like it, so the value of the d2rq:uriPattern that replaces the d2rq:column part below inserts a "mailto:" prefix so that the value really will be a URI.

The final line of the mapping entry tells D2RQ a condition for generating one of these mappings: that the data is not an empty string. If any address book entry is missing an email1 value, we don't want the string "mailto:" by itself being generated as its entries_email1 value.

emap:entries_email1 a d2rq:PropertyBridge;
   d2rq:belongsToClassMap emap:entries;
   d2rq:property eud:entries_email1;
#  d2rq:column "entries.email1";
   d2rq:uriPattern "mailto:@@entries.email1@@";
   d2rq:condition "entries.email1 <> ''";
   .

After making these changes to the emap:entries_email1 property bridge, make similar ones to the omap:email2Address one. (For a more thorough database integration, you'd want to do this to all email addresses throughout combomapping.ttl.)

Now we're ready to start up the D2RQ server, telling it to translate SPARQL queries to SQL using the combomapping.ttl file:

d2r-server comboMapping.ttl

(I generated the mapping files and started up the server in the directory where the D2RQ programs are stored.) At this point, if you installed D2RQ using the default configuration, entering the URL http://localhost:2020/ into a browser will get you started playing interactively with the server. What we really want from the server at this point is an RDF representation of enough of the address book data to let SWOOP do the boring parts of ontology generation for us. The following query (shown in both regular and the URL form that you would pass to the D2RQ server) would pull all the data:

CONSTRUCT { ?s ?p ?o } 
WHERE { ?s ?p ?o }

http://localhost:2020/sparql?query=CONSTRUCT+%7B+*+%7D+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D

(We're not ready to run the query yet—there's still a bit of background to cover.) You really only need one record from each database to show SWOOP the structures you'll be working with, so if you have a lot of data you may want to look at your CSV files for specific values to query for and then modify the query above to only pull those values.

Once all the setup is done and the ontology is created, each time you query the database, you'll want to query the database as combined with the ontology so that Pellet can do reasoning with the data based on facts in the ontology. To more easily pull the data from the server and combine it with an ontology, I created a short XSLT 1.0 stylesheet (included in the zip file) named rdfcat.xsl. When run against a file like the following, this stylesheet combines the output into a single RDF file:

<rdfcat xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="myfile1.rdf"/>
  <xi:include href="myfile2.rdf"/>
  <xi:include href="myfile3.rdf"/>
</rdfcat>

The stylesheet assumes that you're using an XSLT processor such as libxslt that also implements the W3C XInclude specification. The stylesheet also adds an OWL declaration for each resource it finds, calling it a member of the owl:Thing class so that the OWL reasoner doesn't complain about resources that haven't been declared as a member of a class.

To pull RDF from D2RQ and save it in a file, we run rdfcat.xsl with the following input:

<!-- rdfcat1.xml -->
<rdfcat xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="http://localhost:2020/sparql?query=CONSTRUCT+%7B+%3Fs+%3Fp+%3Fo+%7D+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D"/>
  <!--xi:include href="properties.owl"/-->
</rdfcat>

The first xi:include line includes the URL we saw before, which will retrieve the RDF from the D2RQ server. The second points at an ontology to combine with it, but that line is commented out because we don't have this ontology yet. The following invocation of xsltlib's xsltproc program runs the stylesheet and saves the result in the a file named pulled.rdf:

xsltproc -o pulled.rdf -xinclude rdfcat.xsl rdfcat1.xml

Creating an ontology from our data

Start up SWOOP, and then pick Load/Ontology from its File menu. You're not really loading an ontology, but instead the RDF file that you just pulled from D2RQ: pulled.rdf.

This may take SWOOP a few minutes, unless you modified the query above to pull down a subset of each database being integrated. Cutting down on this import time is the main incentive to make such a modification to the query.

Once it's done, immediately pick Save As from the File menu, and save the file as postSwoop.rdf. (If you don't explicitly name the directory in which to save the file, SWOOP may not put it where you want it.)

Take a look at both pulled.rdf and postSwoop.rdf with a text editor. Although you didn't do anything to the pulled.rdf "ontology" that you loaded, SWOOP did, adding OWL declarations to postSwoop.rdf for all the properties it found.

Now it's time to use SWOOP to enhance the ontology so that SPARQL queries against the combined databases can do things that SQL couldn't do to the MySQL versions of the databases. We'll start by adding something to the ontology that says that the workState column of the eudora database's entries table is equivalent to the businessState column of the outlook database's entries table.

Select the Property Tree tab in the lower left of SWOOP and click on eudora:entries_workState. Some information about this property is displayed in the main SWOOP pane, but there isn't much to display yet. Click on "Add" next to "Equivalent to:", select outlook:entries_businessState, and click the "Add and Close" button. You'll see it appear on the main pane, but it's not official until you click the Apply Changes button at the bottom. Save the edited ontology by picking Save from the File menu or by pressing Ctrl+S, and you can then search for "equivalent" in postSwoop.rdf to see how SWOOP saved this.

For a more thorough job of integrating the two databases, you would also define eudora:entries_firstName as equal to outlook:entries_firstName, eudora:entries_lastName as equal to outlook:entries_lastName, and many other equivalencies. The more relationships you can identify between the two databases, the more tightly they'll be integrated. The Bobby Fisher use case will work more nicely if you define at least these two.

To make it possible to find all of Alfred Adams' phone numbers, regardless of which ones we have stored for him, we want to indicate that all the phone properties have some semantics in common. To do this, we'll going to create a new "phone" property and then make all other phone properties subproperties of that. Click the yellow "P" at the left of the SWOOP workspace to display the New Entity dialog box. Set the "Property Type" to OWL Datatype Property, because phone numbers are simple strings. Set "subProperty-of" to "None", which is the first choice, because our new property isn't a subproperty of any other. Based on the URIs that D2RQ generated from my address book, I replaced the default Logical URI on the dialog box with http://localhost:2020/resource/entries/phone. I left the other fields blank. Click the dialog box's Apply and then Close buttons, and you should see "entries:phone" appear at the bottom of the Property Tree. (If it doesn't end up like this, you can always pick Remove and start again.)

After you add this new property, make sure it's selected, and then click on "Add" next to "Superproperty of". (OWL has no "superproperty" property—when you tell SWOOP that entries:phone is a superproperty of outlook:entries_homePhone, it will actually store the fact that outlook:entries_homePhone is a subproperty of entries:phone. It's just faster in the SWOOP interface to indicate that entries:phone is a superproperty of sixteen other properties than it is to go to each of those sixteen and mark it as a subproperty of entries:phone.) In the Specify Property dialog box, Ctrl-click will let you select multiple properties as subproperties of entries:phone. Select the following, and then click "Add and Close":

outlook:entries_primaryPhone
outlook:entries_businessPhone
eudora:entries_otherPhone
outlook:entries_mobilePhone
outlook:entries_carPhone
outlook:entries_homePhone
eudora:entries_workMobile
outlook:entries_businessFax
outlook:entries_homeFax
outlook:entries_otherPhone
eudora:entries_phone
outlook:entries_businessPhone2
outlook:entries_businessFax
eudora:entries_mobile
outlook:entries_homePhone2
eudora:entries_workPhone

Click "Apply Changes" and Save your work.

To let Pellet discover that Bobby Fischer and Robert L. Fischer are the same person, you've already done half the work by adding new information about the emap:entries_email1 and omap:entries_email2Address property bridges in the mapping file. We need to have the ontology specify that the properties used for personal email addresses are inverse functional properties, which means that only one instance of a class can have a particular email1 value. This way, an email1 value of "bobby416@gmail.com" for both Bobby and Robert will mean that they're the same person.

When using OWL DL, a property that is an inverse functional property must be an object property, not a datatype property (that is, a URL and not a string, which is why we tweaked the mapping file for this entry the way we did. Of the three levels of OWL—OWL Lite, OWL DL, and OWL Full—OWL DL is more powerful than Lite and Full is apparently difficult to develop software for, so the OWL software developers' Goldilocks approach of going with the one in the middle has most OWL developers working with OWL DL.) Select eudora:entries_email1 in the property tree tab, and then click "Add" next to "Attributes" at the bottom of the main pane. Select "Inverse Functional" and the Yes button on the "Specify Property Attribute" dialog box, and then click "Apply Changes" at the bottom of the main pane and save your work. Follow these same steps to designate outlook:entries_email2Address as an inverse functional property, and then define outlook:entries_email2Address as and eudora:entries_email1 to be equivalent the same way you defined the equivalent property pairs earlier.

Querying the rules and data

Now that your postSwoop.pdf file has a combination of data and OWL rules, you could issue Pellet SPARQL queries against it, but we want to use this ontology with future versions of the database as its contents evolves, so comment out all the parts of the file that SWOOP didn't generate (the parts with address book data about specific people) and save the file as properties.owl. Once it all works, you can delete the commented parts; if you get XML parsing errors, it's probably because you commented out a bit more than you should. You also may have to replace entity references such as &owl; with the string owl: that they actually represent if you commented out all of the entity declarations that SWOOP generates.

You may remember that the rdfcat.xsl stylesheet was run on a file called rdfcat1.xml, which had an entry for properties.owl commented out because it didn't exist yet. Now that the properties.owl file does exist, we run rdfcat.xsl on rdfcat2.xsl, which is identical to rdfcat1.xsl except that properties.owl is no longer commented out:

xsltproc -o newpulled.rdf -xinclude rdfcat.xsl rdfcat2.xml

Now, assuming we still have the D2RQ server up and running, we're ready to run the SPARQL queries that demonstrate the use cases. The nyworkers.spq file holds the following query:

Pellet wants the source file specified with a URL, which explains the long path name below. To reproduce this, modify the path name to point to your own copy of newpulled.rdf:

Because of the equivalence relationship that you defined, Pellet should list the subjects for both the e:entries_workState triples and the out:entries_businessState triples that have a value of "NY". You won't know if it's correct unless you check how many of those are in each of your two databases, but I found that the random data generation script put four or five in each database, so if this query retrieves more than six or seven, it's good news.

Pellet will probably also output a few suggestions for OWL statements to add to make newpulled.rdf a little more OWL DL compliant. The rdfcat.xsl stylesheet adds rules to account for a few of these, and it could probably use a few more. (If you accidentally used rdfcat1.xsl instead of rdfcat2.xsl, you'll see a lot more of them, because properties.owl includes most of the statements that Pellet wants to see.)

The following shows AAphone.spq, which asks for all phone numbers for Alfred Adams, whether the database has his home phone, work phone, mobile phone, or any other phone numbers. The URL assigned to the "e:" prefix is the one we used when defining the new phone property, the superproperty of the various phone number properties.

The last use case, stored in FisherData.spq in the zip file, asks for all data about the Bobby Fisher entry, which is in the eudora database. Because we defined eudora:entries_email1 as an inverse functional property, Pellet knows that only one entry can have an email1 value of mailto:bobby416@gmail.com. Because we defined eudora:entries_email1 as equivalent to out:entries_email2Address, Pellet will pull Robert L. Fisher data from the outlook database for any other properties that we defined as equivalent to eudora properties.

Further work

The most important short-term step is finding more worthwhile use cases to implement. Metadata for metadata's sake does not justify the trouble of adding it. For this project, I tried to have a use case demonstrating how each bit of ontology information added would let a user get more out of the database than they would have otherwise. For example, I could have the company home pages and other web addresses listed in the data represented as object properties instead of as datatype properties, but what useful new query would this enable? Perhaps it could be used as a key value to check which employees work at the same company, because abbreviation and punctuation issues can make company name spellings less consistent than company home page URLs.

One useful new thing to demonstrate would be how to indicate that a field in one database is not equivalent to a certain field in another database, but a subset of it. For example, if an international address book was incorporated into this data, we'd want to show that American zip codes are postal codes, but that all postal codes are not necessarily zip codes. (This could be easily done by making each zip code property a subset of a postal code property, much as we did with the phone properties.)

Demonstrating additional property attributes besides the inverse functional property would be also valuable. In a different version of this demo, I added data indicating which entry represented the spouse of which other entry. With Jane Smith's entry pointing to Joe Smith as her spouse, "spouse" being defined as a symmetric property, and no home phone number or spouse listed for Joe, I could still query for his home phone number and get Jane's because I had defined a rule saying that if someone didn't have a home phone number but their spouse did, then a query for that person's home phone number should return their spouse's number. It would be nice to have similar demonstrations of the functional and transitive properties.

Another important step would be to do all this with multi-table databases. Few databases are stored in a single table; people use spreadsheets for that.

To make these techniques useful in a production environment, research is necessary to let it scale up further. Instead of dumping all of the relational data to an RDF representation each time the database is changed, to allow Pellet to query an up-to-date version of the data, it would be nice to translate SPARQL queries on the fly to SQL queries, letting us issue SPARQL queries directly against the relational data. This, in fact, is what D2RQ was designed to do. However, I don't believe that any existing query converters implement the OWL-based reasoning that is part of Pellet, and this reasoning is necessary to use the OWL assertions in combination with the data to answer the use case queries.