13 February 2015

Driving Hadoop data integration with standards-based models instead of code

RDFS models!

Note: I wrote this blog entry to accompany the IBM Data Magazine piece mentioned in the first paragraph, so for people following the link from there this goes into a little more detail on what RDF, triples, and SPARQL are than I normally would on this blog. I hope that readers already familiar with these standards will find the parts about doing the inferencing on a Hadoop cluster interesting.

RDF and Hadoop logos

In a short piece in IBM Data Magazine titled Scale up Your Data Integration with Data Models and Inferencing, I give a high-level overview of why the use of W3C standards-based models can provide a more scalable alternative to using code-driven transformations when integrating data from multiple sources:

  • When this process is driven by code generated from models (instead of by the models themselves), evolution of that code makes it more brittle and turns the original models into out-of-date system documentation.

  • Mature commercial and open-source tools are available to infer, for example, that a LastName value from one database and a last_name value from another can both be treated as values of FamilyName from a central canonical data model.

  • After running such a conversion with these models, modifying the conversion to accommodate additional input data often means simply expanding the unifying model, with no need for new code.

  • It can work on a Hadoop cluster with little more than a brief Python script to drive it all.

Here, we'll look at an example of how this can work. I'm going to show how I used these techniques to integrate data from the SQL Server sample Northwind database's "Employees" table with data from the Oracle sample HR database's "EMPLOYEES" table. These use different names for similar properties, and we'll identify the relationships between those properties in a model that uses a W3C standard modeling language. Next, a Python script will use this model to combine data from the two different employee tables into one dataset that conforms to a common model. Finally, we'll see that a small addition to the model, with no new code added to the Python script, lets the script integrate additional data from the different databases. And, we'll do this all on a Hadoop cluster.

The data and the model

RDF represents facts in three-part {entity, property name, property value} statements known as triples. We could, for example, say that employee 4 has a FirstName value of "Margaret", but RDF requires that the entity and property name identifiers be URIs to ensure that they're completely unambiguous. URIs usually look like URLs, but instead of being Uniform Resource Locators, they're Uniform Resource Identifiers, merely identifying resources instead of naming a location for them. This means that while some of them might look like web addresses, pasting them into a web browser's address bar won't necessarily get you a web page. (RDF encourages you to represent property values as URIs as well, making it easier to connect triples into graphs that can be traversed and queried. Doing this to connect triples from different sources is another area where RDF shines in data integration work.)

The use of domain names in URIs, as with Java package names, lets an organization control the naming conventions around their resources. When I used D2R—an open source middleware tool that can extract data from popular relational database packages—to pull the employees tables from the Northwind and HR databases, I had it build identifiers around my own snee.com domain name. Doing this, it created entity-name-value triples such as {<http://snee.com/vocab/SQLServerNorthwind#employees_4> <http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName> "Margaret"}. A typical fact pulled out of the HR database was {<http://snee.com/vocab/OracleHR#employees_191> <http://snee.com/vocab/schema/OracleHR#employees_first_name> "Randall"}, which tells us that employee 191 in that database has a first_name value of "Randall". If the HR database also had an employee number 4 or used a column name of first_name, the use of the URIs would leave no question as to which employee or property was being referenced by each triple.
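To get a feel for how such triples are handled programmatically, here is a minimal sketch of my own (not part of the integration job) that builds the Northwind triple about employee 4 in memory, using the rdflib Python library that also appears in the Hadoop script later in this entry:

import rdflib

g = rdflib.Graph()
# The three parts of the triple: entity, property name, and property value.
g.add((
    rdflib.URIRef("http://snee.com/vocab/SQLServerNorthwind#employees_4"),
    rdflib.URIRef("http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName"),
    rdflib.Literal("Margaret"),
))
print(len(g))   # the graph now holds one triple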

It was simplest to have D2R pull the entire tables, so in addition to the first and last names of each employee, I had it pull all the other data in the Northwind and HR employee tables. To integrate this data, we'll start with just the first and last names, and then we'll see how easy it is to broaden the scope of our data integration.

RDF offers several syntaxes for recording triples. RDF/XML was the first to become standardized, but has fallen from popularity as simpler alternatives became available. The simplest syntax, called N-Triples, spells out one triple per line with full URIs and a period at the end, just like a sentence stating a fact would end with a period. Below you can see some of the data about employee 122 from the HREmployees.nt file that I pulled from the HR database's employees table. (For this and the later N-Triples examples, I've added carriage returns to each line to more easily fit them here.)

<http://snee.com/vocab/OracleHR#employees_122> 
<http://snee.com/vocab/schema/OracleHR#employees_department_id> 
<http://snee.com/vocab/OracleHR#departments_50> .

<http://snee.com/vocab/OracleHR#employees_122> 
<http://snee.com/vocab/schema/OracleHR#employees_first_name> "Payam" .

<http://snee.com/vocab/OracleHR#employees_122> 
<http://snee.com/vocab/schema/OracleHR#employees_hire_date> 
"1995-05-01"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://snee.com/vocab/OracleHR#employees_122>
<http://snee.com/vocab/schema/OracleHR#employees_last_name> "Kaufling" .

<http://snee.com/vocab/OracleHR#employees_122> 
<http://snee.com/vocab/schema/OracleHR#employees_phone_number> "650.123.3234" .

The NorthwindEmployees.nt file pulled by D2R represents the Northwind employees with the same syntax as the HREmployees.nt file but uses URIs appropriate for that data, with "SQLServerNorthwind" in their base URI instead of "OracleHR".

For a target canonical integration model, I chose the schema.org model, designed by a consortium of major search engines for the embedding of machine-readable data into web pages. The following shows the schemaOrgPersonSchema.ttl file, where I've stored an excerpt of the schema.org model describing the Person class using the W3C standard RDF Schema (RDFS) language. I've added carriage returns to some of the rdfs:comment values to fit them here:

@prefix schema: <http://schema.org/> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:     <http://purl.org/dc/terms/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

schema:Person a             rdfs:Class;
        rdfs:label          "Person";
        dc:source           <http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#source_rNews>;
        rdfs:comment        "A person (alive, dead, undead, or fictional).";
        rdfs:subClassOf     schema:Thing;
        owl:equivalentClass <http://xmlns.com/foaf/0.1/Person> .

schema:familyName a           rdf:Property ;
        rdfs:comment          "Family name. In the U.S., the last name of an Person. 
          This can be used along with givenName instead of the Name property." ;
        rdfs:label            "familyName" ;
        schema:domainIncludes schema:Person ;
        schema:rangeIncludes  schema:Text .

schema:givenName a           rdf:Property ;
       rdfs:comment          "Given name. In the U.S., the first name of a Person. 
         This can be used along with familyName instead of the Name property." ;
       rdfs:label            "givenName" ;
       schema:domainIncludes schema:Person ;
       schema:rangeIncludes  schema:Text .

schema:telephone a           rdf:Property ;
       rdfs:comment          "The telephone number." ;
       rdfs:label            "telephone" ;
       schema:domainIncludes schema:ContactPoint , schema:Organization , 
                             schema:Person , schema:Place ;
       schema:rangeIncludes  schema:Text .

Note that the RDFS "language" is really just a set of properties and classes to use in describing data models, not a syntax. I could have done this with the N-Triples syntax mentioned earlier, but this excerpt from schema.org uses RDF's Turtle syntax to describe the class and properties. Turtle is similar to N-Triples but offers a few shortcuts to reduce verbosity:

  • You can declare prefixes to stand in for common parts of URIs, so that rdfs:label means the same thing as <http://www.w3.org/2000/01/rdf-schema#label>.

  • A semicolon means "here comes another triple with the same subject as the last one", letting you list multiple facts about a particular resource without repeating the resource's URI or prefixed name.

  • The keyword "a" stands in for the prefixed name rdf:type, so that the first line after the prefix declarations above says that the resource schema:Person has a type of rdfs:Class (that is, that it's an instance of the rdfs:Class class and is therefore a class itself). The first line about schema:familyName says that it has an rdf:type of rdf:Property, and so forth.

Although Turtle is now the most popular syntax for representing RDF, I used N-Triples for the employee instance data because the use of one line per triple, with no dependencies on prefix declarations or anything else on previous lines, means that a Hadoop system can split up an N-Triples file at any line breaks that it wants to without hurting the integrity of the data.
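As a quick illustration of the difference (this little conversion is mine, not part of the integration job), rdflib can parse Turtle and write the same triples back out as N-Triples, assuming a local copy of the schemaOrgPersonSchema.ttl file in the working directory:

import rdflib

g = rdflib.Graph()
g.parse("schemaOrgPersonSchema.ttl", format="turtle")   # read the Turtle shown above
# Each line of the output file holds one complete triple with full URIs,
# so Hadoop can split the file at any line break without losing information.
g.serialize(destination="schemaOrgPersonSchema.nt", format="nt")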

What if schema.org couldn't accommodate my complete canonical model? For example, it has no Employee class; what if I wanted to add one that has a hireDate property as well as the other properties shown above? I could simply add triples saying that Employee was a subclass of schema:Person and that hireDate was a property associated with my new class.

I wouldn't add these modifications directly to the file storing the schema.org model, but instead put them in a separate file so that I could manage local customizations separately from the published standard. (The ability to combine different RDF datasets that use the same syntax—regardless of their respective data models—by just concatenating the files is another reason that RDF is popular for data integration.) This is the same strategy I used to describe my canonical model integration information, storing the following four triples in the integrationModel.ttl file to describe the relationship of the relevant HR and Northwind properties to the schema.org model:

@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix schema:   <http://schema.org/> . 
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix nw:       <http://snee.com/vocab/schema/SQLServerNorthwind#> .

oraclehr:employees_first_name rdfs:subPropertyOf schema:givenName  . 
oraclehr:employees_last_name  rdfs:subPropertyOf schema:familyName . 
nw:employees_FirstName        rdfs:subPropertyOf schema:givenName  . 
nw:employees_LastName         rdfs:subPropertyOf schema:familyName . 

(Note that in RDF, any resource that can be represented by a URI can have properties assigned to it, including properties themselves. This file uses this ability to say that the two oraclehr properties and the two nw properties shown each have an rdfs:subPropertyOf value.) At this point, with my schemaOrgPersonSchema.ttl file storing the excerpt of schema.org that models a Person and my integrationModel.ttl file modeling the relationships between schema:Person and the Northwind and HR input data, I have all the data modeling I need to drive a simple data integration.

The Python script and the Hadoop cluster

Hadoop's streaming interface lets you configure MapReduce logic using any programming language that can read from standard input and write to standard output, so because I knew of a Python library that could do RDFS inferencing, I wrote the following mapper routine in Python:

#!/usr/bin/python

# employeeInferencing.py: read employee data and models relating it to 
# schema.org, then infer and output schema.org version of relevant facts.

# sample execution:
# cat NorthwindEmployees.nt HREmployees.nt | employeeInferencing.py > temp.ttl

# Reads ntriples from stdin and writes ntriples results to 
# stdout so that it can be used as a streaming Hadoop task. 

import sys
import rdflib
import RDFClosure

diskFileGraph = rdflib.Graph()        # Graph to store data and models

# Read the data from standard input
streamedInput = ""
for line in sys.stdin:
    streamedInput += line
diskFileGraph.parse(data=streamedInput,format="nt")

# Read the modeling information
diskFileGraph.parse(
  "http://snee.com/rdf/inferencingDataIntegration/schemaOrgPersonSchema.ttl",
  format="turtle")
diskFileGraph.parse(
  "http://snee.com/rdf/inferencingDataIntegration/integrationModel.ttl",
  format="turtle")

# Do the inferencing
RDFClosure.DeductiveClosure(RDFClosure.RDFS_Semantics).expand(diskFileGraph)

# Use a SPARQL query to extract the data that we want to return: any
# statements whose properties are associated with the schema:Person
# class. (Note that standard RDFS would use rdfs:domain for this, but
# schema.org uses schema:domainIncludes.)

queryForPersonData = """
PREFIX schema: <http://schema.org/> 
CONSTRUCT { ?subject ?personProperty ?object }
WHERE { 
  ?personProperty schema:domainIncludes schema:Person .
  ?subject ?personProperty ?object .
}"""

personData = diskFileGraph.query(queryForPersonData)

# Add the query results to a graph that we can output.
personDataGraph  = rdflib.Graph()
for row in personData:
    personDataGraph.add(row)

# Send the result to standard out.
personDataGraph.serialize(sys.stdout, format="nt")

After importing the sys library to allow reading from standard input and writing to standard output, the script imports two more libraries: RDFLib, the most popular Python library for working with RDF, and RDFClosure from the related OWL-RL project, which can do inferencing from RDFS modeling statements as well as inferencing that uses the Web Ontology Language (OWL), a more expressive superset of RDFS. (Other available tools for doing RDFS and OWL inferencing include TopQuadrant's TopSPIN engine, Ontotext's OWLIM, and Clark & Parsia's Pellet.) After initializing diskFileGraph as a graph to store the triples that the script will work with, the script reads any N-Triples data fed to it via standard input into this graph and then reads in the schemaOrgPersonSchema.ttl and integrationModel.ttl files of modeling data described above. The identifiers for these files, http://snee.com/rdf/inferencingDataIntegration/schemaOrgPersonSchema.ttl and http://snee.com/rdf/inferencingDataIntegration/integrationModel.ttl, are not just URIs in the RDF sense but actual URLs: send your browser to either one and you'll find a copy of the file stored at that location. That's where the script reads them from.

Next, the script computes the deductive closure of the triples aggregated from standard input and the modeling information. For example, when it sees the triple {<http://snee.com/vocab/OracleHR#employees_122> <http://snee.com/vocab/schema/OracleHR#employees_last_name> "Kaufling"} and the triple {oraclehr:employees_last_name rdfs:subPropertyOf schema:familyName}, it infers the new triple {<http://snee.com/vocab/OracleHR#employees_122> schema:familyName "Kaufling"}. Because the inference engine's job is to infer new triples based on all the relevant ones it can find, newly inferred triples may make new inferences possible, so it continues inferencing until there is nothing new that it can infer from the existing set—it has achieved closure.
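Here is a minimal standalone sketch of that particular inference, using the same rdflib and RDFClosure calls as the mapper script above (newer releases of the OWL-RL package expose the same classes under the module name owlrl):

import rdflib
import RDFClosure   # from the OWL-RL package

data = """
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema:   <http://schema.org/> .
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix hr:       <http://snee.com/vocab/OracleHR#> .

# one instance triple plus one modeling triple
hr:employees_122 oraclehr:employees_last_name "Kaufling" .
oraclehr:employees_last_name rdfs:subPropertyOf schema:familyName .
"""

g = rdflib.Graph()
g.parse(data=data, format="turtle")

# Compute the RDFS deductive closure, just as employeeInferencing.py does.
RDFClosure.DeductiveClosure(RDFClosure.RDFS_Semantics).expand(g)

hr     = rdflib.Namespace("http://snee.com/vocab/OracleHR#")
schema = rdflib.Namespace("http://schema.org/")
# The engine should have added {hr:employees_122 schema:familyName "Kaufling"}.
print((hr.employees_122, schema.familyName, rdflib.Literal("Kaufling")) in g)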

At this point, the script will have all of the original triples that it read in plus the new ones that it inferred, but I'm going to assume that applications using data conforming to the canonical model are only interested in that data and not in all the other input. To extract the relevant subset, the script runs a query in SPARQL, the query language from the RDF family of W3C standards. As with SQL, it's common to see SPARQL queries that begin with SELECT statements listing columns of data to return, but this Python script uses a CONSTRUCT query instead, which returns triples instead of columns of data. The query's WHERE clause identifies the triples that the query wants by using "triple patterns", or triples that include variables as wildcards to describe the kinds of triples to look for, and the CONSTRUCT part describes what should be in the triples that get returned.

In this case, the triples to return are any whose predicate value has a schema:domainIncludes value of schema:Person—in other words, any property associated with the schema:Person class. As the comment in the code says, it's more common for RDFS and OWL models to use the standard rdfs:domain property to associate properties with classes, but this can get messy when associating a particular property with multiple classes, so the schema.org project defined their own schema:domainIncludes property for this.

This SPARQL query could be extended to implement additional logic if necessary. For example, if one database had separate lastName and firstName fields and another had a single name field with values of the form "Smith, John", then string manipulation functions in the SPARQL query could concatenate the lastName and firstName values with a comma or split the name value at the comma to create new values. This brings the script past strict model-based mapping to include transformation, but most independently-developed data models don't line up neatly enough to describe their relationships with nothing but simple mappings.
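For instance, here is a sketch of a CONSTRUCT query along those lines, splitting a hypothetical single-field name value into the two schema.org properties. (The ex: namespace and ex:name property are invented for illustration; they aren't in the Northwind or HR data.) It could be handed to diskFileGraph.query() just like queryForPersonData:

queryToSplitNames = """
PREFIX schema: <http://schema.org/>
PREFIX ex:     <http://example.org/>
CONSTRUCT { ?subject schema:familyName ?familyName ;
                     schema:givenName  ?givenName . }
WHERE {
  ?subject ex:name ?name .                  # e.g. "Smith, John"
  BIND(STRBEFORE(?name, ",") AS ?familyName)
  BIND(STRAFTER(?name, ", ") AS ?givenName)
}"""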

The data returned by the query and stored in the personData variable is not one of RDFLib's Graph() structures like the diskFileGraph instance that it has been working with throughout the script, so the script creates a new instance called personDataGraph and adds the data from personData to it. Once this is done, all that's left is to output this graph's contents to standard out in the N-Triples format, identified as "nt" in the call to the serialize method.

In a typical Hadoop job, the data returned by the mapper routine is further processed by a reducer routine, but to keep this example simple I created a dummyReducer.py script that merely copied the returned data through unchanged:

#!/usr/bin/python
# dummyReducer.py: just copy stdin to stdout

import sys

for line in sys.stdin:
    sys.stdout.write(line)

Running it, expanding the model, and running it again

With my two Python scripts, my two modeling files, and one file of data from each of the two databases' employee tables, I had everything I needed to have Hadoop integrate the data to the canonical model using RDFS inferencing. I set up a four-node Hadoop cluster using the steps described in part 1 and part 2 of Hardik Pandya's "Setting up Hadoop multi-node cluster on Amazon EC2", formatted the distributed file system, and copied the NorthwindEmployees.nt and HREmployees.nt files to the /data/employees directory on that file system. Because the employeeInferencing.py script would be passed to the slave nodes to run on the subsets of input data sent to those nodes, I also installed the RDFLib and OWL-RL Python modules that this script needed on the slave nodes. Then, with the Python scripts stored in /home/ubuntu/dataInt/ on the cluster's master node, I was ready to run the job with the following command (split over six lines here to fit on this page) on the master node:

hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar 
  -file /home/ubuntu/dataInt/employeeInferencing.py 
  -mapper /home/ubuntu/dataInt/employeeInferencing.py 
  -file /home/ubuntu/dataInt/dummyReducer.py 
  -reducer /home/ubuntu/dataInt/dummyReducer.py 
  -input /data/employees/* -output /data/myOutputDir

After running that, the following copied the result from the distributed file system to a run1.nt file in my local filesystem:

hadoop dfs -cat /data/myOutputDir/part-00000 > outputCopies/run1.nt

Here are a few typical lines from run1.nt:

<http://snee.com/vocab/OracleHR#employees_100> 
<http://schema.org/familyName> "King" .	

<http://snee.com/vocab/OracleHR#employees_100> 
<http://schema.org/givenName> "Steven" .	

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://schema.org/familyName> "Fuller" .	

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://schema.org/givenName> "Andrew" .	

The entire file is all schema:givenName and schema:familyName triples about the resources from the Oracle HR and SQL Server Northwind databases.

This isn't much so far, because the output has only the first and last name values from the two source databases, but here's where it gets more interesting: we add the following two lines to the copy of integrationModel.ttl stored on the snee.com server:

oraclehr:employees_phone_number rdfs:subPropertyOf schema:telephone .  
nw:employees_HomePhone          rdfs:subPropertyOf schema:telephone . 

Then, with no changes to the Python scripts or anything else, re-running the same command on the Hadoop master node (with a new output directory parameter) produces a result with lines like this:

<http://snee.com/vocab/OracleHR#employees_100> 
<http://schema.org/familyName> "King" .

<http://snee.com/vocab/OracleHR#employees_100> 
<http://schema.org/givenName> "Steven" .

<http://snee.com/vocab/OracleHR#employees_100> 
<http://schema.org/telephone> "515.123.4567" .

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://schema.org/givenName> "Andrew" .

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://schema.org/familyName> "Fuller" .

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://schema.org/telephone> "(206) 555-9482" .

Expanding the scope of the data integration required no new coding in the Python script—just an expansion of the integration model. The integration is truly being driven by the model, and not by procedural transformation code. And, adding a completely new data source wouldn't be any more trouble than adding the phone data was above; you only need to identify which properties of the new data source correspond to which properties of the canonical data model.

Modeling more complex relationships for more complex mapping

All the inferencing so far has been done with just one property from the RDFS standard: rdfs:subPropertyOf. RDFS offers additional modeling constructs that let you do more. As I mentioned earlier, schema.org does not define an Employee class, but if my application needs one, I can use RDFS to define it in my own namespace as a subclass of schema:Person. Also, the Northwind employee data has an nw:employees_HireDate property that I'd like to associate with my new class. I can do both of these by adding these two triples to integrationModel.ttl, shown here with a prefix declaration to make the triples shorter:

@prefix emp: <http://snee.com/vocab/employees#> .
emp:Employee rdfs:subClassOf schema:Person . 
nw:employees_HireDate rdfs:domain emp:Employee .

The SPARQL query in employeeInferencing.py only looked for properties associated with the schema:Person class, so after expanding it a bit to request the Employee and class membership triples as well, running the inferencing script shows us that the RDFClosure engine has inferred these new triples about Andrew Fuller:

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://snee.com/vocab/employees#Employee> .

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://schema.org/Person> .

<http://snee.com/vocab/SQLServerNorthwind#employees_2> 
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_HireDate> 
"1992-08-14T00:00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

In other words, because he has an nw:employees_HireDate value, it inferred that he is an instance of the class emp:Employee, and because that's a subclass of schema:Person, we see that he is also a member of that class.

The W3C's OWL standard adds properties beyond those defined by RDFS to further describe your data, as well as special classes and the ability to define your own classes to use in describing your data. For example, if the HR database's departments table had a related property so that you could specify that the shipping department is related to the receiving department, then specifying in our integration model that {oraclehr:related rdf:type owl:SymmetricProperty} would tell the RDFClosure engine that this property is symmetric and that it should infer that the receiving department was related to the shipping department. (When telling RDFClosure's DeductiveClosure method to do OWL inferencing in addition to RDFS inferencing, pass it an RDFS_OWLRL_Semantics parameter instead of RDFS_Semantics.)

OWL also includes an owl:inverseOf property that can help with data integration. For example, imagine that the Northwind database had an nw:manages property that let you say things like {emp:jack nw:manages emp:shippingDepartment}, but the HR database identified the relationship in the opposite direction with an oraclehr:managedBy relationship used in triples of the form {emp:receivingDepartment oraclehr:managedBy emp:jill}. When you tell an OWL engine that these two properties are the inverse of each other with the triple {oraclehr:managedBy owl:inverseOf nw:manages}, it will infer from the triples above that {emp:shippingDepartment oraclehr:managedBy emp:jack} and that {emp:jill nw:manages emp:receivingDepartment}.
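Here is a sketch of that last example using RDFClosure's combined RDFS and OWL RL semantics mentioned above. (The emp:jack, emp:jill, and department resources, along with the nw:manages and oraclehr:managedBy properties, are made up for illustration, just as in the prose.)

import rdflib
import RDFClosure   # from the OWL-RL package

data = """
@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix emp:      <http://snee.com/vocab/employees#> .
@prefix nw:       <http://snee.com/vocab/schema/SQLServerNorthwind#> .
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .

emp:jack nw:manages emp:shippingDepartment .
emp:receivingDepartment oraclehr:managedBy emp:jill .
oraclehr:managedBy owl:inverseOf nw:manages .
"""

g = rdflib.Graph()
g.parse(data=data, format="turtle")
RDFClosure.DeductiveClosure(RDFClosure.RDFS_OWLRL_Semantics).expand(g)

emp      = rdflib.Namespace("http://snee.com/vocab/employees#")
nw       = rdflib.Namespace("http://snee.com/vocab/schema/SQLServerNorthwind#")
oraclehr = rdflib.Namespace("http://snee.com/vocab/schema/OracleHR#")

# Both inferred triples described above should now be in the graph.
print((emp.shippingDepartment, oraclehr.managedBy, emp.jack) in g)
print((emp.jill, nw.manages, emp.receivingDepartment) in g)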

When processing of the input is distributed over multiple nodes, as with a Hadoop cluster, this inferencing has some limitations. For example, the owl:TransitiveProperty class lets me say that an ex:locatedIn property is transitive by using a triple such as {ex:locatedIn rdf:type owl:TransitiveProperty}. Then, when an OWL engine sees that {ex:chair38 ex:locatedIn ex:room47} and that {ex:room47 ex:locatedIn ex:building6}, it can infer that {ex:chair38 ex:locatedIn ex:building6}. When distributing the processing across a Hadoop cluster, however, the {ex:chair38 ex:locatedIn ex:room47} triple may get sent to one node and the {ex:room47 ex:locatedIn ex:building6} triple to another, so neither will have enough information to infer which building the chair is in. So, when you review the RDFS and OWL standards for properties and classes that you can use to describe the data that you want to integrate on a distributed Hadoop system, keep in mind which of these can do their inferencing based on a single triple of instance data input and which require multiple triples. (The Reduce step of a MapReduce job, where above I just put a dummy script to copy the data through, would be a potential place to do additional inferencing based on the output of the mapping steps done on the distributed Hadoop nodes.)

Other tools for working with RDF on Hadoop

There have been other projects for taking advantage of the RDF data model on Hadoop before I tried this, and there are more coming along. At ApacheCon Europe in 2012, Cloudera's Paolo Castagna (formerly of Kasabi, Talis, and HP Labs in Bristol, which is quite an RDF pedigree) gave a talk titled "Handling RDF data with tools from the Hadoop ecosystem" (slides PDF) where he mostly covered the application of popular Hadoop tools to N-Triples files, but he also described his jena-grande project to mix the Apache Jena RDF library with these tools. At the 2014 ApacheCon, YarcData's Rob Vesse gave a talk titled "Quadrupling Your Elephants: RDF and The Hadoop Ecosystem" (slides PDF), which reviewed tools for using RDF on Hadoop and described the Jena Hadoop RDF tools project, which has since been renamed as Jena Elephas. (Rob described Paolo's jena-grande as a "useful reference & inspiration in developing the new stuff".)

The kind of scripting that I did with Hadoop's streaming interface is a great way to get Hadoop tasks up and running quickly, but more serious Hadoop applications are typically written in Java, as I've described in a recent blog entry, and by bringing the full power of Jena to this kind of development, Elephas will open up some great new possibilities for taking advantage of the RDF data model (and SPARQL, and RDFS, and OWL) on Hadoop. I'm definitely looking forward to seeing where that leads.


Please add any comments to this Google+ post.

20 January 2015

R (and SPARQL), part 2

Retrieve data from a SPARQL endpoint, graph it and more, then automate it.

In the future whenever I use SPARQL to retrieve numeric data I'll have some much more interesting ideas about what I can do with that data.

In part 1 of this series, I discussed the history of R, the programming language and environment for statistical computing and graph generation, and why it's become so popular lately. The many libraries that people have contributed to it are a key reason for its popularity, and the SPARQL one inspired me to learn some R to try it out. Part 1 showed how to load this library, retrieve a SPARQL result set, and perform some basic statistical analysis of the numbers in the result set. After I published it, it was good to see its comments section fill up with a nice list of projects that combine R and SPARQL.

If you executed the sample commands from Part 1 and saved your session when quitting out of R (or in the case of what I was doing last week, RGui), all of the variables set in that session will be available for the commands described here. Today we'll look at a few more commands for analyzing the data, how to plot points and regression lines, and how to automate it all so that you can quickly perform the same analysis on different SPARQL result sets. Again, corrections welcome.

My original goal was to find out how closely the number of employees in the companies making up the Dow Jones Industrial Average correlated with the net income, which we can find out with R's cor() function:

> cor(queryResult$netIncome,queryResult$numEmployees)
[1] 0.1722887

A correlation figure close to 1 or -1 indicates a strong correlation (a negative correlation indicates that one variable's values tend to go in the opposite direction of the other's—for example, if incidence of a certain disease goes down as the use of a particular vaccine goes up) and 0 indicates no correlation. The correlation of 0.1722887 is much closer to 0 than it is to 1 or -1, so we see very little correlation here. (Once we automate this series of steps, we'll find stronger correlations when we focus on specific industries.)

More graphing

We're going to graph the relationship between the employee and net income figures, and then we'll tell R to draw a straight line that fits as closely as possible to the pattern created by the plotted values. This is called a linear regression model, and before we do that we tell R to calculate some data necessary for this task with the lm() ("linear model") function:

> myLinearModelData <- lm(queryResult$numEmployees~queryResult$netIncome) 

Next, we draw the graph:

> plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
   ylab="# of employees", main="Dow Jones Industrial Average companies")

As with the histogram that we saw in Part 1, R offers many ways to control the graph's appearance, and add-in libraries let you do even more. (Try a Google image search on "fancy R plots" to get a feel for the possibilities.) In the call to plot() I included three parameters to set a main title and labels for the X and Y axes, and we see these in the result:

DJIA plot

We can see more intuitively what the cor() function already told us: that there is minimal correlation between employee counts and net income in the companies comprising the Dow Jones Industrial Average.

Let's put the data that we stored in myLinearModelData to use. The abline() function can use it to add a regression line to our plot:

> abline(myLinearModelData)  
DJIA plot with regression line

When you type in function calls such as sd(queryResult$numEmployees) and cor(queryResult$netIncome,queryResult$numEmployees), R prints the return values as output, but you can use the return values in other operations. In the following, I've replotted the graph with the cor() function call's result used in a subtitle for the graph, concatenated onto the string "correlation: " with R's paste() function:

 plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
   ylab="# of employees", main="Dow Jones Industrial Average companies",
   sub=paste("correlation: ",cor(queryResult$numEmployees,
             queryResult$netIncome),sep=""))

(The paste() function's sep argument here shows that we don't want any separator between our concatenated pieces. I'm guessing that paste() is more typically used to create delimited data files.) R puts the subtitle at the image's bottom:

DJIA plot with subtitle

Instead of plotting the graph on the screen, we can tell R to send it to a JPEG, BMP, PNG, or TIFF file. Calling a graphics device function such as jpeg() before doing the plot tells R to send the results to a file, and dev.off() turns off the "device" that writes to the image file.
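For example, here's a minimal sketch (the output file name is just a placeholder) of sending the plot and regression line shown above to a JPEG file instead of the screen:

jpeg("c:/temp/djiaPlot.jpg")
plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
     ylab="# of employees", main="Dow Jones Industrial Average companies")
abline(myLinearModelData)
dev.off()

The full script in the next section does exactly this, building the output file name from the category name.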

Automating it

Now we know nearly enough commands to create a useful script. The remainder are just string manipulation functions that I found easy enough to look up when I needed them, although having a string concatenation command called paste() is another example of the odd R terminology that I warned about last week. Here is my script:

library(SPARQL) 

category <- "Companies_in_the_Dow_Jones_Industrial_Average"
#category <- "Electronics_companies_of_the_United_States"
#category <- "Financial_services_companies_of_the_United_States"

query <- "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject <http://dbpedia.org/resource/Category:DUMMY-CATEGORY-NAME> ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,',','') AS ?employees)  # lose commas
     FILTER ( lang(?label) = 'en' )
     FILTER(contains(?netIncomeDollars,'E'))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # Original query on following line had two 
     # slashes, but R needed both escaped.
     FILTER(!(regex(?numEmployees,'\\\\d+')))
}
ORDER BY ?numEmployees"

query <- sub(pattern="DUMMY-CATEGORY-NAME",replacement=category,x=query)

endpoint <- "http://dbpedia.org/sparql"
resultList <- SPARQL(endpoint,query)
queryResult <- resultList$results 
correlationLegend=paste("correlation: ",cor(queryResult$numEmployees,
                         queryResult$netIncome),sep="")
myLinearModelData <- lm(queryResult$numEmployees~queryResult$netIncome) 
plotTitle <- chartr(old="_",new=" ",x=category)
outputFilename <- paste("c:/temp/",category,".jpg",sep="")
jpeg(outputFilename)
plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
     ylab="number of employees", main=plotTitle,cex.main=.9,
     sub=correlationLegend)
abline(myLinearModelData) 
dev.off()

Instead of hardcoding the URI of the industry category whose data I wanted, my script has DUMMY-CATEGORY-NAME, a string that it substitutes with the category value assigned at the script's beginning. The category value here is "Companies_in_the_Dow_Jones_Industrial_Average", with the setting of two other potential category values commented out so that we can easily try them later. (R, like SPARQL, uses the # character for commenting.) I also used the category value to create the output filename.

An additional embellishment to the sequence of commands that we entered manually is that the script stores the plot title in a plotTitle variable, replacing the underscores in the category name with spaces. Because this sometimes resulted in titles that were too wide for the plot image, I added cex.main=.9 as a plot() argument to reduce the title's size.

With the script stored in /temp/myscript.R, entering the following at the R prompt runs it:

source("/temp/myscript.R")

If I don't have an R interpreter up and running, I can run the script from the operating system command line by calling rscript, which is included with R:

rscript /temp/myscript.R

After it runs, my /temp directory has this Companies_in_the_Dow_Jones_Industrial_Average.jpg file in it:

DJIA plot from script

When I uncomment the script's second category assignment line instead of the first and run the script again, it creates the file Electronics_companies_of_the_United_States.jpg:

data on U.S. electronics companies

There's better correlation this time: almost .5. Fitting two particular outliers onto the plot meant that R crowded enough points into the lower left to make a bit of a blotch; I did find through experimentation that the plot() command offers parameters to display only the points within a particular range of values on the horizontal or vertical axis, making it easier to show a zoomed view.

Here's what we get when querying about Financial_services_companies_of_the_United_States:

data on U.S. financial services companies

We see the strongest correlation yet: over .84. I suppose that at financial services companies, hiring more people is more likely to increase revenue than in other sectors, because you can provide (and charge for) a higher volume of services. This is only a theory, but that's why people use statistical analysis packages: to look for patterns that can suggest theories. It's great to know that such a powerful open-source package can do this with data retrieved from SPARQL endpoints.

If I were going to run this script from the operating system command line regularly, then instead of setting the category value at the beginning of the script, I would pass it to rscript as an argument along with the script name.
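For example, here's a sketch of what the top of the script might look like if it read the category from the command line, falling back to the Dow Jones category when no argument is given (this argument handling is my own addition, not part of the script above):

args <- commandArgs(trailingOnly=TRUE)
if (length(args) > 0) {
    category <- args[1]
} else {
    category <- "Companies_in_the_Dow_Jones_Industrial_Average"
}

An invocation such as "rscript /temp/myscript.R Electronics_companies_of_the_United_States" would then produce the corresponding JPEG file with no editing of the script.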

Learning more about R

Because of R's age and academic roots, there is a lot of stray documentation around, often in LaTeXish-looking PDFs from several years ago. Many introductions to R are aimed at people in a specific field, and I suppose my blog entries here fall in this category.

The best short, modern tour of R that I've found recently is Sharon Machlis's six-part series beginning at Beginner's Guide to R: Introduction. Part six points to many other places to learn about R, ranging from blog entries to complete books to videos, and reviewing the list now I see more entries that I hadn't noticed before and that look worth investigating.

Her list is where I learned about Jeffrey M. Stanton's Introduction to Data Science, an excellent introduction to both Data Science and to the use of R to execute common data science analysis tasks. The link here goes to an iTunes version of the book, but there's also a PDF version, which I read beginning to end.

The R Programming Wikibook makes a good quick reference work, especially when you need a particular function for something; see the table of contents down its right side. I found myself going back to the Text Processing page there several times. The four-page "R Reference Card" (pdf) by Tom Short is also worth printing out.

Last week I mentioned John D. Cook's R language for programmers, a blog entry that will help anyone familiar with typical modern programming languages get over a few initial small humps more quickly when learning R.

I described Machlis's six-part series as "short" because there are so many full-length books on R out there, including free ones like Stanton's and several offerings from O'Reilly and Manning. I've read the first few chapters of Manning's R in Action by Robert Kabacoff and find it very helpful so far. Apparently a new edition is coming out in March, so if you're thinking of buying it you may want to wait or else get the early access edition. Manning's Practical Data Science with R also looks good, but it assumes a bit of R background (in fact, it recommends "R in Action" as a starting point), and a real beginner to this area would be better off starting with Stanton's free book mentioned above.

O'Reilly has several books on R, including an R Cookbook whose very task-oriented table of contents is worth skimming, as well as an accompanying R Graphics Cookbook.

I know that I'll be going back to several of these books and web pages, because in the future whenever I use SPARQL to retrieve numeric data I'll have some much more interesting ideas about what I can do with that data.

fancy R plots on Google

Please add any comments to this Google+ post.

13 January 2015

R (and SPARQL), part 1

Or, R for RDF people.

R is a programming language and environment for statistical computing and graph generation that, despite being over 30 years old, has gotten hot lately because it's an open-source, cross-platform tool that brings a lot to the world of Data Science, a recently popular field often associated with the analytics aspect of the drive towards Big Data. The large, active community around R has developed many add-on libraries, including one for working with data retrieved from SPARQL endpoints, so I thought I'd get to know R well enough to try that library. I first learned about this library from SPARQL with R in Less than 5 Minutes, which describes Semantic Web and Linked Data concepts to people familiar with R in order to demonstrate what they can do together; my goal here is to explain R to people familiar with RDF for the same reason. (Corrections to any misuse of statistical terminology are welcome.)

an open-source, cross-platform tool that brings a lot to the world of Data Science

R has also been called "GNU S," and first appeared in 1993 as an implementation of a statistical programming language developed at Bell Labs in 1976 known as S. (This is cuter if you know that the C programming language was also developed at Bell Labs as a successor to a language called B.) Its commercial competition includes Stata, SAS, and SPSS, all of which have plenty to fear from R as its power and reputation grow while its cost stays at zero. According to a recent article in Nature on R's growing popularity among scientists, "In the past decade, R has caught up with and overtaken the market leaders."

Downloading and installing R on a Windows machine gave me an icon that opened up the RGui windowed environment, which contains a console window where you enter commands that add other windows within RGui as needed for graphics. (The distribution also includes an executable that you can run from your operating system command line; as we'll see next week, you can use this to run scripts as well.) Most discussions of R recommend the open source RStudio as a more serious IDE for R development, but RGui was enough for me to play around.

Some of R's syntax is a bit awkward in places, possibly because of its age—some of its source code is written in Fortran, and it actually lets you call Fortran subroutines. I found some of its terminology awkward as well, probably because it was designed for statisticians rather than for programmers accustomed to typical modern programming languages. I highly recommend the quick tour of syntax quirks in R language for programmers by John D. Cook for such people when they're getting started with R.

For example, where I think of a table or a spreadsheet as consisting of rows and columns, R describes a data frame of observations and variables, meaning essentially the same thing. Of the simpler structures that come up in R, a vector is a one-dimensional set (I almost said "array" or "list" instead of "set" but these have different, specific meanings in R) of values of the same type, a matrix is a two-dimensional version, and an array is a version with three or more dimensions. A data frame looks like a matrix but "columns can be different modes" (that is, different properties and types), as described on the Data types page of the Quick-R website. The same page says that "data frames are the main structures you'll use to store datasets," which makes sense when you consider their similarity to spreadsheets, relational database tables, and, in the RDF world, SPARQL result sets.
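A few throwaway commands typed at the R prompt show the difference between these structures (these little examples are mine, not tied to any particular dataset):

> v <- c(10, 20, 30)                        # vector: one dimension, one mode
> m <- matrix(1:6, nrow=2)                  # matrix: two dimensions, one mode
> a <- array(1:24, dim=c(2, 3, 4))          # array: three dimensions here
> df <- data.frame(name=c("a", "b"),        # data frame: columns can have
                   count=c(1, 2))           #   different modes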

I don't want to make too much of what may look like quirky terminology and syntax to people accustomed to other modern programming languages. I have come to appreciate the way R makes the most popular statistical operations so easy to carry out—even easier than Excel or LibreOffice Calc, which have a surprising number of basic statistical operations built in.

Retrieving data from a SPARQL endpoint

Below I've walked through a session of commands entered at an R command line that you can paste into an R session yourself, not counting the > prompt shown before each command. Let's say that, using data retrieved from DBpedia, I'm wondering if there's a correlation between the number of employees and the amount of net income in a given set of companies. (I only used U.S. companies to make it easier to compare income figures.) Typically, companies with more employees have more net income, but do they correlate more closely in some industries than others? R lets you quantify and graph this correlation very easily, and along the way we'll see a few other things that it can do.

To start, I install the SPARQL package with this command, which starts up a wizard that loads it from a remote mirror:

> install.packages("SPARQL")  

After R installed the package, I loaded it for use in this session. The help() function can tell us more about an installed package:

> library(SPARQL)
> help(package="SPARQL") 

The help() function pops up a browser window with documentation of the topic passed as an argument. You can pass any function name to help() as well, so you can enter something like help(library) or even help(help).

Analyzing the result

The next command uses R's <- assignment operator to assign a big multi-line string to the variable query. The string holds a SPARQL query that will be sent to DBpedia; you can run the same query on DBpedia's SNORQL interface to get a preview of the data (the query sent by that link is slightly different—see the last SPARQL comment in the query below):

> query <- "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject <http://dbpedia.org/resource/Category:Companies_in_the_Dow_Jones_Industrial_Average> ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,',','') AS ?employees)  # lose commas
     FILTER ( lang(?label) = 'en' )
     FILTER(contains(?netIncomeDollars,'E'))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # original query on following line had two slashes, but 
     # R needed both escaped
     FILTER(!(regex(?numEmployees,'\\\\d+')))
}
ORDER BY ?numEmployees"

The query asks for the net income and employee count figures for companies that comprise the Dow Jones Industrial Average. The SPARQL comments within the query describe the query's steps in more detail.

Next, we assign the endpoint's URL to the endpoint variable and call the SPARQL package's SPARQL() function to send the query to that endpoint, storing the result in a resultList variable:

> endpoint <- "http://dbpedia.org/sparql"
> resultList <- SPARQL(endpoint,query)
> typeof(resultList)
[1] "list"

The third command there, and R's output, show that resultList has a type of list, which is described on the Data types page mentioned earlier as an "ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name." (Compare this with a vector, where everything must have the same type, or in R-speak, the same mode.)

The next command uses the very handy summary() function to learn more about what the SPARQL() function put into the resultList variable:

> summary(resultList)
           Length Class      Mode
results    3      data.frame list
namespaces 0      -none-     NULL

It shows a list of two things: our query results and an empty list of namespaces. Because we don't care about the empty list of namespaces, we'll make it easier to work with the results part by pulling it out and storing it in its own queryResult variable using the $ operator to identify the part of resultList that we want. Then, we use the str() function to learn more about what's in there:

> queryResult <- resultList$results 
> str(queryResult)
'data.frame':   27 obs. of  3 variables:
 $ label       : chr  "\"Visa Inc.\"@en" "\"The Travelers Companies\"@en" ...
 $ numEmployees: int  8500 30500 32900 44000 62800 64600 70000 ...
 $ netIncome   : num  2.14e+09 2.47e+09 8.04e+09 2.22e+09 5.36e+09 ...

The output tells us that it's a data frame, mentioned earlier as "the main structures you'll use to store datasets," with 27 obs[ervations] and 3 variables (that is, rows and columns).

The summary() function tells us some great stuff about a data frame—a set of information that would be much more work to retrieve if the same data was loaded into a spreadsheet program:

> summary(queryResult)
    label            numEmployees       netIncome        
 Length:27          Min.   :   8500   Min.   :2.144e+09  
 Class :character   1st Qu.:  72500   1st Qu.:4.863e+09  
 Mode  :character   Median : 107600   Median :8.040e+09  
                    Mean   : 205227   Mean   :1.050e+10  
                    3rd Qu.: 171711   3rd Qu.:1.530e+10  
                    Max.   :2200000   Max.   :3.258e+10  

The SPARQL query's SELECT statement asked for the label, numEmployees, and netIncome values, and we see some interesting information about the values returned for these, especially the numeric ones: the minimum, maximum, and mean (average) values of each, as well as the boundary values if you split the returned values as closely as possible into four even groups known in statistics as quartiles. The first quartile value marks the boundary between the bottom quarter and the next quarter, the median splits the values in half, and the third quartile splits the top quarter from the third one.
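If you want just those boundary values, R's quantile() function returns the minimum, the three quartile boundaries, and the maximum directly; it's a quick one to try yourself, and its output should line up with the numEmployees column of the summary above:

> quantile(queryResult$numEmployees)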

We can very easily ask for the variance—a measure of how far apart all the values are spread from the mean—as well as the standard deviation, a useful measurement for describing how far any specific value is from the mean:

> var(queryResult$numEmployees)
[1] 167791342395
> sd(queryResult$numEmployees)
[1] 409623.4

Our first plot: a histogram

For our first step into graphics, we'll create a histogram, which illustrates the distribution of values. As with all R graphics, there are plenty of parameters available to control the image's appearance, but we can get a pretty useful histogram by sticking with the defaults:

hist(queryResult$netIncome) 

When running this interactively, RGui opens up a new window and displays the image there:

histogram generated with R

Next week we'll learn how to plot the specific points in the data, how to make the graph titles look nicer, and how to quantify the correlation between the two sets of values. (If you've been entering the commands shown here, then when you quit R with the quit() command or by picking Exit from RGui's File menu, it offers to save your workspace image for re-use the next time you start it up, so all of the variables that were set in a session like this will still be available in the next session.) We'll also see how to automate this series of steps to make it easier to generate a graph, with the correlation figure included, as a JPEG file. This automation will make it easier to graph the results and find the correlation figures for different industries. Finally, I'll list the best resources I found for learning R—there are a lot of them out there, of wildly varying quality.

Meanwhile, you can gaze at this R plot of a Mandelbrot set from R's Wikipedia page, which includes all the commands necessary to generate it:

Mandelbrot image generated with R

Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists