R (and SPARQL), part 1

Or, R for RDF people.

R is a programming language and environment for statistical computing and graph generation that, despite being over 30 years old, has gotten hot lately because it's an open-source, cross-platform tool that brings a lot to the world of Data Science, a recently popular field often associated with the analytics aspect of the drive towards Big Data. The large, active community around R has developed many add-on libraries, including one for working with data retrieved from SPARQL endpoints, so I thought I'd get to know R well enough to try that library. I first learned about this library from SPARQL with R in Less than 5 Minutes, which describes Semantic Web and Linked Data concepts to people familiar with R in order to demonstrate what they can do together; my goal here is to explain R to people familiar with RDF for the same reason. (Corrections to any misuse of statistical terminology are welcome.)

an open-source, cross-platform tool that brings a lot to the world of Data Science

R has also been called "GNU S," and first appeared in 1993 as an implementation of a statistical programming language developed at Bell Labs in 1976 known as S. (This is cuter if you know that the C programming language was also developed at Bell Labs as a successor to a language called B.) Its commercial competition includes Stata, SAS, and SPSS, all of whom have plenty to fear from R as its its power and reputation grow while its cost stays at zero. According to a recent article in Nature on R's growing popularity among scientists, "In the past decade, R has caught up with and overtaken the market leaders."

Downloading and installing R on a Windows machine gave me an icon that opened up the RGui windowed environment, which contains a console window where you enter commands that add other windows within RGui as needed for graphics. (The distribution also includes an executable that you can run from your operating system command line; as we'll see next week, you can use this to run scripts as well.) Most discussions of R recommend the open source RStudio as a more serious IDE for R development, but RGui was enough for me to play around.

Some of R's syntax is a bit awkward in places, possibly because of its age—some of its source code is written in Fortran, and it actually lets you call Fortran subroutines. I found some of its terminology to be awkward as well, but probably because it was designed for statisticians and not for programmers accustomed to typical modern programming languages. I highly recommend the quick tour of syntax quirks in R language for programmers by John D. Cook for such people when they're getting started with R.

For example, where I think of a table or a spreadsheet as consisting of rows and columns, R describes a data frame of observations and variables, meaning essentially the same thing. Of the simpler structures that come up in R, a vector is a one-dimensional set (I almost said "array" or "list" instead of "set" but these have different, specific meanings in R) of values of the same type, a matrix is a two-dimensional version, and an array is a three-dimensional version. A data frame looks like a matrix but "columns can be different modes" (that is, different properties and types), as described on the Data types page of the Quick-R website. The same page says that "data frames are the main structures you'll use to store datasets," which makes sense when you consider their similarity to spreadsheets, relational database tables, and, in the RDF world, SPARQL result sets.

I don't want to make too much of what may look like quirky terminology and syntax to people accustomed to other modern programming languages. I have come to appreciate the way R makes the most popular statistical operations so easy to carry out—even easier than Excel or LibreOffice Calc, which have a surprising amount of basic statistical operations built in.

Retrieving data from a SPARQL endpoint

Below I've walked through a session of commands entered at an R command line that you can paste into an R session yourself, not counting the > prompt shown before each command. Let's say that, using data retrieved from DBpedia, I'm wondering if there's a correlation between the number of employees and the amount of net income in a given set of companies. (I only used U.S. companies to make it easier to compare income figures.) Typically, companies with more employees have more net income, but do they correlate more closely in some industries than others? R lets you quantify and graph this correlation very easily, and along the way we'll see a few other things that it can do.

To start, I install the SPARQL package with this command, which starts up a wizard that loads it from a remote mirror:

> install.packages("SPARQL")  

After R installed the package, I loaded it for use in this session. The help() function can tell us more about an installed package:

> library(SPARQL)
> help(package="SPARQL") 

The help() function pops up a browser window with documentation of the topic passed as an argument. You can pass any function name to help() as well, so you can enter something like help(library()) or even help(help).

Analyzing the result

The next command uses R's <- assignment operator to assign a big multi-line string to the variable query. The string holds a SPARQL query that will be sent to DBpedia; you can run the same query on DBpedia's SNORQL interface to get a preview of the data (the query sent by that link is slightly different—see the last SPARQL comment in the query below):

> query <- "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?label ?numEmployees ?netIncome  
  ?s dcterms:subject <http://dbpedia.org/resource/Category:Companies_in_the_Dow_Jones_Industrial_Average> ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,',','') AS ?employees)  # lose commas
     FILTER ( lang(?label) = 'en' )
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # original query on following line had two slashes, but 
     # R needed both escaped
ORDER BY ?numEmployees"

The query asks for the net income and employee count figures for companies that comprise the Dow Jones Industrial Average. The SPARQL comments within the query describe the query's steps in more detail.

Next, we assign the endpoint's URL to the endpoint variable and call the SPARQL package's SPARQL() function to send the query to that endpoint, storing the result in a resultList variable:

> endpoint <- "http://dbpedia.org/sparql"
> resultList <- SPARQL(endpoint,query)
> typeof(resultList)
[1] "list"

The third command there, and R's output, show that resultList has a type of list, which is described on the Data types page mentioned earlier as an "ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name." (Compare this with a vector, where everything must have the same type, or in R-speak, the same mode.)

The next command uses the very handy summary() function to learn more about what the SPARQL() function put into the resultList variable:

> summary(resultList)
           Length Class      Mode
results    3      data.frame list
namespaces 0      -none-     NULL

It shows a list of two things: our query results and an empty list of namespaces. Because we don't care about the empty list of namespaces, we'll make it easier to work with the results part by pulling it out and storing it in its own queryResult variable using the $ operator to identify the part of resultList that we want. Then, we use the str() function to learn more about what's in there:

> queryResult <- resultList$results 
> str(queryResult)
'data.frame':   27 obs. of  3 variables:
 $ label       : chr  "\"Visa Inc.\"@en" "\"The Travelers Companies\"@en" ...
 $ numEmployees: int  8500 30500 32900 44000 62800 64600 70000 ...
 $ netIncome   : num  2.14e+09 2.47e+09 8.04e+09 2.22e+09 5.36e+09 ...

The output tells us that it's a data frame, mentioned earlier as "the main structures you'll use to store datasets," with 27 obs[ervations] and 3 variables (that is, rows and columns).

The summary() function tells us some great stuff about a data frame—a set of information that would be much more work to retrieve if the same data was loaded into a spreadsheet program:

> summary(queryResult)
    label            numEmployees       netIncome        
 Length:27          Min.   :   8500   Min.   :2.144e+09  
 Class :character   1st Qu.:  72500   1st Qu.:4.863e+09  
 Mode  :character   Median : 107600   Median :8.040e+09  
                    Mean   : 205227   Mean   :1.050e+10  
                    3rd Qu.: 171711   3rd Qu.:1.530e+10  
                    Max.   :2200000   Max.   :3.258e+10  

The SPARQL query's SELECT statement asked for the label, numEmployees, and netIncome values, and we see some interesting information about the values returned for these, especially the numeric ones: the minimum, maximum, and mean (average) values of each, as well the boundary values if you split the returned values as closely as possible into four even groups known in statistics as quartiles. The first quartile value marks the boundary between the bottom quarter and the next quarter, the median splits the values in half, and the third quartile splits the top quarter from the third one.

We can very easily ask for the variance—a measure of how far apart all the values are spread from the mean—as well as the standard deviation, a useful measurement for describing how far any specific value is from the mean:

> var(queryResult$numEmployees)
[1] 167791342395
> sd(queryResult$numEmployees)
[1] 409623.4

Our first plot: a histogram

For our first step into graphics, we'll create a histogram, which illustrates the distribution of values. As with all R graphics, there are plenty of parameters available to control the image's appearance, but we can get a pretty useful histogram by sticking with the defaults:


When running this interactively, RGui opens up a new window and displays the image there:

histogram generated with R

Next week we'll learn how to plot the specific points in the data, how to make the graph titles look nicer, and how to quantify the correlation between the two sets of values. (If you've been entering the commands shown here, then when you quit R with the quit() command or by picking Exit from RGui's File menu, it offers to save your workspace image for re-use the next time you start it up, so all of the variables that were set in a session like this will still be available in the next session.) We'll also see how to automate this series of steps to make it easier to generate a graph, with the correlation figure included, as a JPEG file. This automation will make it easier to graph the results and find the correlation figures for different industries. Finally, I'll list the best resources I found for learning R—there are a lot of them out there, of wildly varying quality.

Meanwhile, you can gaze at this R plot of a Mandelbrot set from R's Wikipedia page, which includes all the commands necessary to generate it:

Mandelbrot image generated with R

Please add any comments to this Google+ post.