20 January 2015

R (and SPARQL), part 2

Retrieve data from a SPARQL endpoint, graph it and more, then automate it.

In the future whenever I use SPARQL to retrieve numeric data I'll have some much more interesting ideas about what I can do with that data.

In part 1 of this series, I discussed the history of R, the programming language and environment for statistical computing and graph generation, and why it's become so popular lately. The many libraries that people have contributed to it are a key reason for its popularity, and the SPARQL one inspired me to learn some R to try it out. Part 1 showed how to load this library, retrieve a SPARQL result set, and perform some basic statistical analysis of the numbers in the result set. After I published it, its comments section filled up with a nice list of projects out there that combine R and SPARQL.

If you executed the sample commands from Part 1 and saved your session when quitting out of R (or, as I was doing last week, RGui), all of the variables set in that session will be available for the commands described here. Today we'll look at a few more commands for analyzing the data, see how to plot points and regression lines, and then automate it all so that you can quickly perform the same analysis on different SPARQL result sets. Again, corrections welcome.

My original goal was to find out how closely the number of employees in the companies making up the Dow Jones Industrial Average correlated with the net income, which we can find out with R's cor() function:

> cor(queryResult$netIncome,queryResult$numEmployees)
[1] 0.1722887

A correlation figure close to 1 or -1 indicates a strong correlation (a negative correlation means that one variable's values tend to move in the opposite direction of the other's; for example, if incidence of a certain disease goes down as the use of a particular vaccine goes up) and 0 indicates no correlation. Our figure of 0.1722887 is much closer to 0 than it is to 1 or -1, so we see very little correlation here. (Once we automate this series of steps, we'll find stronger correlations when we focus on specific industries.)
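
To get a feel for those extremes, here are two toy examples entered at the R prompt (the vectors are made up purely for illustration):

> cor(c(1,2,3,4), c(2,4,6,8))   # the second vector rises in lockstep with the first
[1] 1
> cor(c(1,2,3,4), c(8,6,4,2))   # the second vector falls as the first rises
[1] -1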

More graphing

We're going to graph the relationship between the employee and net income figures, and then we'll tell R to draw a straight line that fits as closely as possible to the pattern created by the plotted values. This is called a linear regression model, and before we do that we tell R to calculate some data necessary for this task with the lm() ("linear model") function:

> myLinearModelData <- lm(queryResult$numEmployees~queryResult$netIncome) 

Next, we draw the graph:

> plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
   ylab="# of employees", main="Dow Jones Industrial Average companies")

As with the histogram that we saw in Part 1, R offers many ways to control the graph's appearance, and add-in libraries let you do even more. (Try a Google image search on "fancy R plots" to get a feel for the possibilities.) In the call to plot() I included three parameters to set a main title and labels for the X and Y axes, and we see these in the result:

DJIA plot

We can see more intuitively what the cor() function already told us: that there is minimal correlation between employee counts and net income in the companies comprising the Dow Jones Industrial Average.

Let's put the data that we stored in myLinearModelData to use. The abline() function can use it to add a regression line to our plot:

> abline(myLinearModelData)  
DJIA plot with regression line

When you type in function calls such as sd(queryResult$numEmployees) and cor(queryResult$netIncome,queryResult$numEmployees), R prints the return values as output, but you can use the return values in other operations. In the following, I've replotted the graph with the cor() function call's result used in a subtitle for the graph, concatenated onto the string "correlation: " with R's paste() function:

> plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
   ylab="# of employees", main="Dow Jones Industrial Average companies",
   sub=paste("correlation: ",cor(queryResult$numEmployees,
             queryResult$netIncome),sep=""))

(The paste() function's sep argument here shows that we don't want any separator between our concatenated pieces. I'm guessing that paste() is more typically used to create delimited data files.) R puts the subtitle at the image's bottom:

DJIA plot with subtitle

Instead of plotting the graph on the screen, we can tell R to send it to a JPEG, BMP, PNG, or TIFF file. Calling a graphics device function such as jpeg() before doing the plot tells R to send the results to a file, and dev.off() turns off the "device" that writes to the image file.
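
For example, here's a minimal sketch of sending the plot we just made to a file (the filename is just for illustration):

> jpeg("c:/temp/dowJonesPlot.jpg")   # plotting output now goes to this file
> plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
   ylab="# of employees", main="Dow Jones Industrial Average companies")
> abline(myLinearModelData)
> dev.off()                          # close the JPEG "device" so that the file gets written

We'll see this same pattern in the script below.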

Automating it

Now we know nearly enough commands to create a useful script. The remainder are just string manipulation functions that I found easy enough to look up when I needed them, although having a string concatenation command called paste() is another example of the odd R terminology that I warned about last week. Here is my script:

library(SPARQL) 

category <- "Companies_in_the_Dow_Jones_Industrial_Average"
#category <- "Electronics_companies_of_the_United_States"
#category <- "Financial_services_companies_of_the_United_States"

query <- "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject <http://dbpedia.org/resource/Category:DUMMY-CATEGORY-NAME> ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,',','') AS ?employees)  # lose commas
     FILTER ( lang(?label) = 'en' )
     FILTER(contains(?netIncomeDollars,'E'))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # Original query on following line had two 
     # slashes, but R needed both escaped.
     FILTER(!(regex(?numEmployees,'\\\\d+')))
}
ORDER BY ?numEmployees"

query <- sub(pattern="DUMMY-CATEGORY-NAME",replacement=category,x=query)

endpoint <- "http://dbpedia.org/sparql"
resultList <- SPARQL(endpoint,query)
queryResult <- resultList$results 
correlationLegend=paste("correlation: ",cor(queryResult$numEmployees,
                         queryResult$netIncome),sep="")
myLinearModelData <- lm(queryResult$numEmployees~queryResult$netIncome) 
plotTitle <- chartr(old="_",new=" ",x=category)
outputFilename <- paste("c:/temp/",category,".jpg",sep="")
jpeg(outputFilename)
plot(queryResult$netIncome,queryResult$numEmployees,xlab="net income",
     ylab="number of employees", main=plotTitle,cex.main=.9,
     sub=correlationLegend)
abline(myLinearModelData) 
dev.off()

Instead of hardcoding the URI of the industry category whose data I wanted, my script has DUMMY-CATEGORY-NAME, a placeholder string that it replaces with the category value assigned at the script's beginning. The category value here is "Companies_in_the_Dow_Jones_Industrial_Average", with the assignment of two other potential category values commented out so that we can easily try them later. (R, like SPARQL, uses the # character for commenting.) I also used the category value to create the output filename.

An additional embellishment to the sequence of commands that we entered manually is that the script stores the plot title in a plotTitle variable, replacing the underscores in the category name with spaces. Because this sometimes resulted in titles that were too wide for the plot image, I added cex.main=.9 as a plot() argument to reduce the title's size.

With the script stored in /temp/myscript.R, entering the following at the R prompt runs it:

source("/temp/myscript.R")

If I don't have an R interpreter up and running, I can run the script from the operating system command line by calling Rscript, which is included with R:

Rscript /temp/myscript.R

After it runs, my /temp directory has this Companies_in_the_Dow_Jones_Industrial_Average.jpg file in it:

DJIA plot from script

When I uncomment the script's second category assignment line instead of the first and run the script again, it creates the file Electronics_companies_of_the_United_States.jpg:

data on U.S. electronics companies

There's better correlation this time: almost .5. Fitting two particular outliers onto the plot meant that R crowded enough points into the lower left to make a bit of a blotch; with some experimentation I found that the plot() command's xlim and ylim parameters let you display only the points within a particular range of values on the horizontal or vertical axis, making it easier to show a zoomed view.

Here's what we get when querying about Financial_services_companies_of_the_United_States:

data on U.S. financial services companies

We see the strongest correlation yet: over .84. I suppose that at financial services companies, hiring more people is more likely to increase revenue than in other typical sectors, because you can provide (and charge for) a higher volume of services. This is only a theory, but that's why people use statistical analysis packages: to look for patterns that can suggest theories. It's great to know that such a powerful open-source package can do this with data retrieved from SPARQL endpoints.

If I were going to run this script from the operating system command line regularly, then instead of setting the category value at the beginning of the script, I would pass it to Rscript as an argument along with the script name.
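
Here's a sketch of how that could look, using R's commandArgs() function; the trailingOnly=TRUE argument limits the result to the arguments that follow the script name. (This is an assumption about how I'd restructure the script, not something in the version shown above.)

# at the top of the script, replacing the hardcoded category assignments:
args <- commandArgs(trailingOnly=TRUE)
category <- args[1]   # e.g. "Electronics_companies_of_the_United_States"

The script would then be run with a command line such as Rscript /temp/myscript.R Financial_services_companies_of_the_United_States.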

Learning more about R

Because of R's age and academic roots, there is a lot of stray documentation around, often in LaTeXish-looking PDFs from several years ago. Many introductions to R are aimed at people in a specific field, and I suppose my blog entries here fall into that category.

The best short, modern tour of R that I've found is Sharon Machlis's six-part series beginning at Beginner's Guide to R: Introduction. Part six points to many other places to learn about R, ranging from blog entries to complete books to videos, and reviewing the list now I see entries I hadn't noticed before that look worth investigating.

Her list is where I learned about Jeffrey M. Stanton's Introduction to Data Science, an excellent introduction both to Data Science and to the use of R for common data science analysis tasks. The link here goes to an iTunes version of the book, but there's also a PDF version, which I read beginning to end.

The R Programming Wikibook makes a good quick reference work, especially when you need a particular function for something; see the table of contents down its right side. I found myself going back to the Text Processing page there several times. The four-page "R Reference Card" (pdf) by Tom Short is also worth printing out.

Last week I mentioned John D. Cook's R language for programmers, a blog entry that will help anyone familiar with typical modern programming languages get over a few initial small humps more quickly when learning R.

I described Machlis's six-part series as "short" because there are so many full-length books on R out there, such as free ones like Stanton's and several offerings from O'Reilly and Manning. I've read the first few chapters of Manning's R in Action by Robert Kabacoff and find it very helpful so far. Apparently a new edition is coming out in March, so if you're thinking of buying it you may want to wait or else get the early access edition. Manning's Practical Data Science with R also looks good, but it assumes a bit of R background (in fact, it recommends "R in Action" as a starting point), and a real beginner to this area would be better off starting with Stanton's free book mentioned above.

O'Reilly has several books on R, including an R Cookbook whose very task-oriented table of contents is worth skimming, as well as an accompanying R Graphics Cookbook.

I know that I'll be going back to several of these books and web pages, because in the future whenever I use SPARQL to retrieve numeric data I'll have some much more interesting ideas about what I can do with that data.


Please add any comments to this Google+ post.

13 January 2015

R (and SPARQL), part 1

Or, R for RDF people.

R is a programming language and environment for statistical computing and graph generation that, despite roots going back to the 1970s, has gotten hot lately because it's an open-source, cross-platform tool that brings a lot to the world of Data Science, a recently popular field often associated with the analytics aspect of the drive towards Big Data. The large, active community around R has developed many add-on libraries, including one for working with data retrieved from SPARQL endpoints, so I thought I'd get to know R well enough to try that library. I first learned about this library from SPARQL with R in Less than 5 Minutes, which describes Semantic Web and Linked Data concepts to people familiar with R in order to demonstrate what they can do together; my goal here is to explain R to people familiar with RDF for the same reason. (Corrections to any misuse of statistical terminology are welcome.)

an open-source, cross-platform tool that brings a lot to the world of Data Science

R has also been called "GNU S," and first appeared in 1993 as an implementation of S, a statistical programming language developed at Bell Labs in 1976. (This is cuter if you know that the C programming language was also developed at Bell Labs as a successor to a language called B.) R's commercial competition includes Stata, SAS, and SPSS, all of which have plenty to fear from R as its power and reputation grow while its cost stays at zero. According to a recent article in Nature on R's growing popularity among scientists, "In the past decade, R has caught up with and overtaken the market leaders."

Downloading and installing R on a Windows machine gave me an icon that opened up the RGui windowed environment, which contains a console window where you enter commands; RGui adds other windows as needed for graphics. (The distribution also includes an executable that you can run from your operating system command line; as we'll see next week, you can use this to run scripts as well.) Most discussions of R recommend the open-source RStudio as a more serious IDE for R development, but RGui was enough for me to play around.

Some of R's syntax is awkward in places, possibly because of its age (some of its source code is written in Fortran, and it actually lets you call Fortran subroutines). I found some of its terminology awkward as well, probably because it was designed for statisticians rather than for programmers accustomed to typical modern programming languages. For such people getting started with R, I highly recommend the quick tour of syntax quirks in John D. Cook's R language for programmers.

For example, where I think of a table or a spreadsheet as consisting of rows and columns, R describes a data frame of observations and variables, meaning essentially the same thing. Of the simpler structures that come up in R, a vector is a one-dimensional set (I almost said "array" or "list" instead of "set," but those have different, specific meanings in R) of values of the same type, a matrix is a two-dimensional version, and an array generalizes this to three or more dimensions. A data frame looks like a matrix but "columns can be different modes" (that is, different properties and types), as described on the Data types page of the Quick-R website. The same page says that "data frames are the main structures you'll use to store datasets," which makes sense when you consider their similarity to spreadsheets, relational database tables, and, in the RDF world, SPARQL result sets.
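
For example, here are toy versions of these structures entered at the R prompt (all of the values are made up):

> v <- c(8500, 30500, 32900)                       # vector: every value has the same mode
> m <- matrix(1:6, nrow=2, ncol=3)                 # matrix: two dimensions, one mode
> a <- array(1:24, dim=c(2,3,4))                   # array: three (or more) dimensions
> df <- data.frame(label=c("a","b","c"), count=v)  # data frame: columns can have different modes
> dim(m)
[1] 2 3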

I don't want to make too much of what may look like quirky terminology and syntax to people accustomed to other modern programming languages. I have come to appreciate the way R makes the most popular statistical operations so easy to carry out, even easier than Excel or LibreOffice Calc, which have a surprising number of basic statistical operations built in.

Retrieving data from a SPARQL endpoint

Below, I've walked through a session of commands entered at an R command line; you can paste them into an R session yourself, minus the > prompt shown before each command. Let's say that, using data retrieved from DBpedia, I'm wondering whether there's a correlation between the number of employees and the amount of net income in a given set of companies. (I only used U.S. companies to make it easier to compare income figures.) Typically, companies with more employees have more net income, but do the two correlate more closely in some industries than in others? R lets you quantify and graph this correlation very easily, and along the way we'll see a few other things that it can do.

To start, I install the SPARQL package with this command, which starts up a wizard that loads it from a remote mirror:

> install.packages("SPARQL")  

After R installed the package, I loaded it for use in this session. The help() function can tell us more about an installed package:

> library(SPARQL)
> help(package="SPARQL") 

The help() function pops up a browser window with documentation of the topic passed as an argument. You can pass any function name to help() as well, so you can enter something like help(library) or even help(help).

Analyzing the result

The next command uses R's <- assignment operator to assign a big multi-line string to the variable query. The string holds a SPARQL query that will be sent to DBpedia; you can run the same query on DBpedia's SNORQL interface to get a preview of the data (the query sent by that link is slightly different—see the last SPARQL comment in the query below):

> query <- "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?label ?numEmployees ?netIncome  
WHERE {
  ?s dcterms:subject <http://dbpedia.org/resource/Category:Companies_in_the_Dow_Jones_Industrial_Average> ;
     rdfs:label ?label ;
     dbo:netIncome ?netIncomeDollars ;
     dbpprop:numEmployees ?numEmployees . 
     BIND(replace(?numEmployees,',','') AS ?employees)  # lose commas
     FILTER ( lang(?label) = 'en' )
     FILTER(contains(?netIncomeDollars,'E'))
     # Following because DBpedia types them as dbpedia:datatype/usDollar
     BIND(xsd:float(?netIncomeDollars) AS ?netIncome)
     # original query on following line had two slashes, but 
     # R needed both escaped
     FILTER(!(regex(?numEmployees,'\\\\d+')))
}
ORDER BY ?numEmployees"

The query asks for the net income and employee count figures for companies that comprise the Dow Jones Industrial Average. The SPARQL comments within the query describe the query's steps in more detail.

Next, we assign the endpoint's URL to the endpoint variable and call the SPARQL package's SPARQL() function to send the query to that endpoint, storing the result in a resultList variable:

> endpoint <- "http://dbpedia.org/sparql"
> resultList <- SPARQL(endpoint,query)
> typeof(resultList)
[1] "list"

The third command there, and R's output, show that resultList has a type of list, which is described on the Data types page mentioned earlier as an "ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name." (Compare this with a vector, where everything must have the same type, or in R-speak, the same mode.)
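
A quick made-up example shows the difference:

> mixedList <- list("IBM", 434246, TRUE)   # a list can mix modes
> typeof(mixedList)
[1] "list"
> mixedVector <- c("IBM", 434246, TRUE)    # a vector coerces everything to a single mode
> typeof(mixedVector)
[1] "character"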

The next command uses the very handy summary() function to learn more about what the SPARQL() function put into the resultList variable:

> summary(resultList)
           Length Class      Mode
results    3      data.frame list
namespaces 0      -none-     NULL

It shows a list of two things: our query results and an empty list of namespaces. Because we don't care about the empty list of namespaces, we'll make it easier to work with the results part by pulling it out and storing it in its own queryResult variable using the $ operator to identify the part of resultList that we want. Then, we use the str() function to learn more about what's in there:

> queryResult <- resultList$results 
> str(queryResult)
'data.frame':   27 obs. of  3 variables:
 $ label       : chr  "\"Visa Inc.\"@en" "\"The Travelers Companies\"@en" ...
 $ numEmployees: int  8500 30500 32900 44000 62800 64600 70000 ...
 $ netIncome   : num  2.14e+09 2.47e+09 8.04e+09 2.22e+09 5.36e+09 ...

The output tells us that it's a data frame, mentioned earlier as "the main structures you'll use to store datasets," with 27 obs[ervations] and 3 variables (that is, rows and columns).

The summary() function tells us some great stuff about a data frame—a set of information that would be much more work to retrieve if the same data was loaded into a spreadsheet program:

> summary(queryResult)
    label            numEmployees       netIncome        
 Length:27          Min.   :   8500   Min.   :2.144e+09  
 Class :character   1st Qu.:  72500   1st Qu.:4.863e+09  
 Mode  :character   Median : 107600   Median :8.040e+09  
                    Mean   : 205227   Mean   :1.050e+10  
                    3rd Qu.: 171711   3rd Qu.:1.530e+10  
                    Max.   :2200000   Max.   :3.258e+10  

The SPARQL query's SELECT statement asked for the label, numEmployees, and netIncome values, and we see some interesting information about the values returned for these, especially the numeric ones: the minimum, maximum, and mean (average) values of each, as well as the boundary values you get if you split the returned values as evenly as possible into the four groups known in statistics as quartiles. The first quartile value marks the boundary between the bottom quarter and the next quarter, the median splits the values in half, and the third quartile separates the top quarter from the third one.
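
If you want just those boundary values, the quantile() function computes them directly; I've omitted its output here, but it shows the same five numEmployees boundary values that summary() reported (possibly with a little more precision):

> quantile(queryResult$numEmployees)         # the 0%, 25%, 50%, 75%, and 100% values
> quantile(queryResult$numEmployees, 0.75)   # just the third quartile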

We can very easily ask for the variance (a measure of how spread out the values are around the mean) as well as the standard deviation, the square root of the variance, which is expressed in the same units as the data and is handier for describing how far a typical value falls from the mean:

> var(queryResult$numEmployees)
[1] 167791342395
> sd(queryResult$numEmployees)
[1] 409623.4

Our first plot: a histogram

For our first step into graphics, we'll create a histogram, which illustrates the distribution of values. As with all R graphics, there are plenty of parameters available to control the image's appearance, but we can get a pretty useful histogram by sticking with the defaults:

> hist(queryResult$netIncome) 

When running this interactively, RGui opens up a new window and displays the image there:

histogram generated with R

Next week we'll learn how to plot the specific points in the data, how to make the graph titles look nicer, and how to quantify the correlation between the two sets of values. (If you've been entering the commands shown here, then when you quit R with the quit() command or by picking Exit from RGui's File menu, it offers to save your workspace image for re-use the next time you start it up, so all of the variables that were set in a session like this will still be available in the next session.) We'll also see how to automate this series of steps to make it easier to generate a graph, with the correlation figure included, as a JPEG file. This automation will make it easier to graph the results and find the correlation figures for different industries. Finally, I'll list the best resources I found for learning R—there are a lot of them out there, of wildly varying quality.

Meanwhile, you can gaze at this R plot of a Mandelbrot set from R's Wikipedia page, which includes all the commands necessary to generate it:

Mandelbrot image generated with R

Please add any comments to this Google+ post.

13 December 2014

Hadoop

What it is and how people use it: my own summary.

Hadoop logo

The web offers plenty of introductions to what Hadoop is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.

Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is trademarked, apparently) and MapReduce. The HDFS lets you distribute storage across multiple systems, and MapReduce lets you distribute processing across multiple systems: your "Map" logic runs on the distributed nodes, and your "Reduce" logic then gathers up and summarizes the map results, with a master node coordinating the whole job.

This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase "Big Data."

Writing Hadoop applications

Hardcore Hadoop usage often means writing the map and reduce tasks as Java programs that must import special Hadoop libraries and play by Hadoop rules; see the source of the Apache Hadoop Wiki's Word Count program for an example. (Word count programs are ubiquitous in Hadoop primers.) Then, once you've started up the Hadoop background processes, you use Hadoop command line utilities to point at the JAR file holding your map and reduce logic and to say where on the HDFS to look for input and where to put output. While your program runs, you can check on its progress with web interfaces to the various background processes.

Instead of coding and compiling your own JAR file, one nice option is to use the hadoop-streaming-*.jar one that comes with the Hadoop distribution to hand off the processing to scripts you've written in just about any language that can read from standard input and write to standard output. There's no need for these scripts to import any special Hadoop libraries. I found it very easy to go through Michael G. Noll's Writing an Hadoop MapReduce Program in Python tutorial (creating yet another word count program) after first doing his Running Hadoop on Ubuntu Linux (Single-Node Cluster) tutorial to set up a small Hadoop environment. (If you try one of the many Hadoop tutorials you can find on the web, make sure to run the same version of Hadoop that the tutorial's author did. The 2.* Hadoop releases are different enough from the 1.* ones that if you try to set up a distributed file system and share processing across it using a recent release while following instructions written using a 1.* release, there are more opportunities for problems. I had good luck with Hardik Pandya's "How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2," split into Part 1 and Part 2, when I used the same release that he did.)
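
To make the streaming idea concrete, here is a sketch of what a word-count mapper can look like. I've written it in R, because any language that reads standard input and writes standard output will do; the script name and the details are just for illustration and aren't taken from the tutorials above.

#!/usr/bin/env Rscript
# mapper.R: a Hadoop Streaming mapper that reads lines of text from standard
# input and writes one "word<TAB>1" line to standard output for each word found.
# Hadoop Streaming sorts these lines by key before handing them to a reducer.
con <- file("stdin", open="r")
while (length(line <- readLines(con, n=1, warn=FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  for (w in words[words != ""]) {
    cat(w, "\t1\n", sep="")
  }
}
close(con)

A companion reducer script would read those sorted key/value lines from standard input and total up the counts for each word; the streaming JAR's -mapper, -reducer, -input, and -output options tell Hadoop which scripts to run and where on the HDFS to read and write.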

Hadoop's native scripting environments

Instead of writing your own applications, you can take advantage of the increasing number of native Hadoop scripting languages that shield you from the lower-level parts. Several popular ones build on HCatalog, a layer built on top of the HDFS. As the Hortonworks Hadoop tutorial Hello World! – An introduction to Hadoop with Hive and Pig puts it, "The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools." You can work with HCatalog directly, but it's more common to use these other tools that are built on top of it, and you'll often see HCatalog mentioned in discussions of those tools. (For example, the same tutorial refers to the need to register a file with HCatalog before Hive or Pig can use it.)

Apache Hive, according to its home page, "facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL." You can start up Hive and enter HiveQL commands at its prompt or you can pass it scripts instead of using it interactively. If you know the basics of SQL, you'll be off and running pretty quickly. The 4:33 video Demonstration of Apache Hive by Rob Kerr gives a nice short introduction to writing and running Hive scripts.

Apache Pig is another Hadoop utility that takes advantage of HCatalog. The "Pig Latin" scripting language is less SQL-like (but straightforward enough) and lets you create data structures on the fly so that you can pipeline data through a series of steps. You can run its commands interactively at its grunt shell or in batch mode from the operating system command line.

When should you use Hive and when should you use Pig? It's a common topic of discussion; a Google search for "pig vs. hive" gets over 2,000 hits. Sometimes it's just a matter of convention at a particular shop. The stackoverflow thread Difference between Pig and Hive? Why have both? has some good points as well as pointers to more detailed discussions, including a Yahoo developer network discussion that doesn't mention Hive by name but has a good description of the basics of Pig and how it compares to an SQL approach.

You know what would be cool? A Hive adapter for D2R.

Hive and Pig are both very big in the Hadoop world, but plenty of other such tools are coming along. The home page of Apache Storm tells us that it "makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing." Apache Spark provides Java, Scala, and Python APIs and promises greater speed and an ability to layer on top of many different classes of data sources as its main advantages. There are other tools, but I mention these two because according to the recent O'Reilly 2014 Data Science Salary Survey, "Storm and Spark users earn the highest median salary" of all the data science tools they surveyed. Neither is restricted to use with Hadoop, but the big players described below advertise support for one or both as advantages of their Hadoop distributions.

Another popular tool in the Hadoop ecosystem is Apache HBase, the best known of the column-oriented NoSQL databases. It can sit on top of HDFS, and its tables can host both input and output for MapReduce jobs.

The big players

The companies Cloudera, Hortonworks, and MapR have gotten famous and made plenty of money selling and supporting packaged Hadoop distributions that include additional tools to make them easier to set up and use than the Apache downloads. After hearing that Hortonworks stayed closer to the open source philosophy than the others, I tried their distribution and found that it includes many additional web-based tools to shield you from the command line. For example, it lets you enter Hive and Pig Latin commands into IDE-ish windows designed around these tools, and it includes a graphical drag-and-drop file browser interface to the HDFS. I found the tutorials in the "Hello World" section of their Tutorials page to be very helpful. I have no experience with the other two companies, but a Google search on cloudera hortonworks mapr finds a lot of discussions out there comparing the three.

Pre-existing big IT names such as IBM and Microsoft have also jumped into the Hadoop market; when you do a Google search for just hadoop, it's interesting to see which companies have paid relatively how much for Google AdWord placement.

Hadoop's future

One of Hadoop's main uses so far has been to batch process large amounts of data (usually data that fits into one giant table, such as server or transaction logs) to harvest summary data that can be handed off to analytics packages. This is why SAS and Pentaho, who do not have their own Hadoop distributions, have paid for good Google AdWord placement when you search for "hadoop"—they want you to use their products for the analytics part.

A hot area of growth seems to be the promise of using Hadoop for more real-time processing, which is driving the escalation in Storm and Spark's popularity. Even in batch processing, there are still plenty of new opportunities in the Hadoop world as people adapt more kinds of data for use with the growing tool set. The "one giant table" representation is usually necessary to ease the splitting up of your data for distribution across multiple nodes; with my RDF hat on, I think there are some interesting possibilities for representing complex data structures in Hadoop using the N-Triples RDF syntax, which will still look like one giant three- (or four-) column table to Hadoop.

Cloudera's Paolo Castagna has done some work in this direction, as described in his presentation "Handling RDF data with tools from the Hadoop ecosystem" (pdf). A more recent presentation, Quadrupling your Elephants: RDF and the Hadoop Ecosystem by YarcData's Rob Vesse, shows some interesting work as well, including the beginnings of some Jena-based tools for processing RDF with Hadoop. There has been some work at the University of Freiburg on SPARQL query processing using Hadoop (pdf), and SPARQL City also offers a SPARQL front end to Hadoop-based storage. (If anyone's looking for a semantic web project idea, you know what would be cool? A Hive adapter for D2R.) I think there's a very bright future for the cross-pollination of all of these tools.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists