13 December 2014

Hadoop

What it is and how people use it: my own summary.

Hadoop logo

The web offers plenty of introductions to what Hadoop is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.

Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is trademarked, apparently) and MapReduce. HDFS lets you distribute storage across multiple systems, and MapReduce lets you distribute processing across those systems: your "Map" logic runs on the distributed nodes, and then the "Reduce" logic gathers up the results of the map processes on the master node that's driving it all.

This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase "Big Data."

Writing Hadoop applications

Hardcore Hadoop usage often means writing the map and reduce tasks as Java programs that must import special Hadoop libraries and play by Hadoop rules; see the source of the Apache Hadoop Wiki's Word Count program for an example. (Word count programs are ubiquitous in Hadoop primers.) Then, once you've started up the Hadoop background processes, you use Hadoop command line utilities to point at the JAR file holding your map and reduce logic and to say where on the HDFS to look for input and where to put output. While your program runs, you can check on its progress through web interfaces to the various background processes.

Instead of coding and compiling your own JAR file, one nice option is to use the hadoop-streaming-*.jar file that comes with the Hadoop distribution to hand off the processing to scripts you've written in just about any language that can read from standard input and write to standard output. There's no need for these scripts to import any special Hadoop libraries. I found it very easy to go through Michael G. Noll's Writing an Hadoop MapReduce Program in Python tutorial (creating yet another word count program) after first doing his Running Hadoop on Ubuntu Linux (Single-Node Cluster) tutorial to set up a small Hadoop environment. (If you try one of the many Hadoop tutorials on the web, make sure to run the same version of Hadoop that the tutorial's author did. The 2.* Hadoop releases are different enough from the 1.* ones that setting up a distributed file system and sharing processing across it with a recent release while following instructions written for a 1.* release invites problems. I had good luck with Hardik Pandya's "How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2," split into Part 1 and Part 2, when I used the same release that he did.)

Hadoop's native scripting environments

Instead of writing your own applications, you can take advantage of the increasing number of native Hadoop scripting languages that shield you from the lower-level parts. Several popular ones build on HCatalog, a layer built on top of the HDFS. As the Hortonworks Hadoop tutorial Hello World! – An introduction to Hadoop with Hive and Pig puts it, "The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools." You can work with HCatalog directly, but it's more common to use these other tools that are built on top of it, and you'll often see HCatalog mentioned in discussions of those tools. (For example, the same tutorial refers to the need to register a file with HCatalog before Hive or Pig can use it.)

Apache Hive, according to its home page, "facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL." You can start up Hive and enter HiveQL commands at its prompt or you can pass it scripts instead of using it interactively. If you know the basics of SQL, you'll be off and running pretty quickly. The 4:33 video Demonstration of Apache Hive by Rob Kerr gives a nice short introduction to writing and running Hive scripts.

Apache Pig is another Hadoop utility that takes advantage of HCatalog. The "Pig Latin" scripting language is less SQL-like (but straightforward enough) and lets you create data structures on the fly so that you can pipeline data through a series of steps. You can run its commands interactively at its grunt shell or in batch mode from the operating system command line.

When should you use Hive and when should you use Pig? It's a common topic of discussion; a Google search for "pig vs. hive" gets over 2,000 hits. Sometimes it's just a matter of convention at a particular shop. The stackoverflow thread Difference between Pig and Hive? Why have both? has some good points as well as pointers to more detailed discussions, including a Yahoo developer network discussion that doesn't mention Hive by name but has a good description of the basics of Pig and how it compares to an SQL approach.

Hive and Pig are both very big in the Hadoop world, but plenty of other such tools are coming along. The home page of Apache Storm tells us that it "makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing." Apache Spark provides Java, Scala, and Python APIs and promises greater speed and an ability to layer on top of many different classes of data sources as its main advantages. There are other tools, but I mention these two because according to the recent O'Reilly 2014 Data Science Salary Survey, "Storm and Spark users earn the highest median salary" of all the data science tools they surveyed. Neither is restricted to use with Hadoop, but the big players described below advertise support for one or both as advantages of their Hadoop distributions.

Another popular tool in the Hadoop ecosystem is Apache HBase, the best known of the column-oriented NoSQL databases. It can sit on top of HDFS, and its tables can host both input and output for MapReduce jobs.

The big players

The companies Cloudera, Hortonworks, and MapR have gotten famous and made plenty of money selling and supporting packaged Hadoop distributions that include additional tools to make them easier to set up and use than the plain Apache downloads. After hearing that Hortonworks stays closer to the open source philosophy than the others, I tried their distribution and found that it includes many additional web-based tools to shield you from the command line. For example, it lets you enter Hive and Pig Latin commands in IDE-ish windows designed around those tools, and it includes a graphical drag-and-drop file browser interface to the HDFS. I found the tutorials in the "Hello World" section of their Tutorials page very helpful. I have no experience with the other two companies, but a Google search on cloudera hortonworks mapr finds plenty of discussions comparing the three.

Pre-existing big IT names such as IBM and Microsoft have also jumped into the Hadoop market; when you do a Google search for just hadoop, it's interesting to see which companies have paid for Google AdWords placement, and how prominently their ads appear.

Hadoop's future

One of Hadoop's main uses so far has been to batch process large amounts of data (usually data that fits into one giant table, such as server or transaction logs) to harvest summary data that can be handed off to analytics packages. This is why SAS and Pentaho, which do not have their own Hadoop distributions, have paid for good Google AdWords placement when you search for "hadoop"—they want you to use their products for the analytics part.

A hot area of growth seems to be the promise of using Hadoop for more real-time processing, which is driving the rise in Storm and Spark's popularity. Even in batch processing, there are still plenty of new opportunities in the Hadoop world as people adapt more kinds of data for use with the growing tool set. The "one giant table" representation is usually necessary to make it easy to split your data up for distribution across multiple nodes; with my RDF hat on, I think there are some interesting possibilities for representing complex data structures in Hadoop using the N-Triples RDF syntax, which will still look like one giant three- (or four-) column table to Hadoop.
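
For example, a nested structure such as an offer that points to a product flattens into N-Triples lines like this made-up snippet (the URIs and values are only illustrations); each line is still just a subject, a predicate, and an object, so Hadoop can treat the file as one big table and split it wherever it likes:

<http://example.com/offer/42> <http://example.com/ns/item> <http://example.com/product/7> .
<http://example.com/offer/42> <http://schema.org/price> "106.99" .
<http://example.com/product/7> <http://schema.org/name> "2TB external USB drive" .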

Cloudera's Paolo Castagna has done some work in this direction, as described in his presentation "Handling RDF data with tools from the Hadoop ecosystem" (pdf). A more recent presentation, Quadrupling your Elephants: RDF and the Hadoop Ecosystem by YarcData's Rob Vesse, shows some interesting work as well, including the beginnings of some Jena-based tools for processing RDF with Hadoop. There has been some work at the University of Freiburg on SPARQL query processing using Hadoop (pdf), and SPARQL City also offers a SPARQL front end to Hadoop-based storage. (If anyone's looking for a semantic web project idea, you know what would be cool? A Hive adapter for D2R.) I think there's a very bright future for the cross-pollination of all of these tools.


Please add any comments to this Google+ post.

9 November 2014

Querying aggregated Walmart and BestBuy data with SPARQL

From structured data in their web pages!

The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I've been learning more about microdata recently, but even before I did, I found that the W3C's Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product's web page, some interesting queries become possible for comparing prices and other information from the two vendors.

I extracted the data describing six external USB drives from both walmart.com and bestbuy.com, limiting myself to models that were available on both websites. (It would have been nice to automate this a bit more instead of pulling it separately from the twelve individual web pages. I did sign up for Walmart's API program, which was easy to try out, but the part of the API that lets you query products by category is "restricted, and is available on a request basis" according to their Data Feed API home page, so I didn't bother. If I were going to pursue this further I would enroll in BestBuy's Developer Program as well.) After using the Distiller form to do this several times, I downloaded its Python script from the pymicrodata github page and found it easy to run locally.

You can see a Turtle file of the aggregated Walmart plus BestBuy data here. Because of some slight differences in how the two sites treated certain bits of data, I was tempted to clean up the aggregated data before querying it, but I really wanted to write queries that would work on the data in its native form, so I put the cleanup steps right in the queries.
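
To give a rough idea of the shape of that data, here is a sketch of the kind of triples the Distiller produces for a single product page; the page URI and values are made up for illustration, and the real output includes more properties, but this is the structure the query below relies on:

@prefix schema: <http://schema.org/> .

<http://www.walmart.com/ip/example-product-page> a schema:Product ;
    schema:name "WD My Book 3TB USB 3.0 External Hard Drive" ;
    schema:model "WDBFJK0030HBK-NESN" ;
    schema:offers [
        a schema:Offer ;
        schema:price "129.99" ;
        # on some pages the seller is a resource with its own schema:name,
        # which is why the query below uses OPTIONAL and coalesce()
        schema:seller "Walmart.com"
    ] .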

The various queries that I wrote led up to this one, which lists all the products by model number and price for easy comparison:

PREFIX schema: <http://schema.org/> 
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#> 

SELECT ?productName ?modelNumber ?price ?sellerName 
WHERE {
   ?product a schema:Product . 
   ?product schema:name ?productNameVal . 
   # str() to strip any language tags
   BIND(str(?productNameVal) AS ?productName)
   ?product schema:model ?modelNumberVal . 
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?product schema:offers ?offer . 
   ?offer a schema:Offer . 
   ?offer schema:price ?priceVal . 
   # Remove $ and cast to decimal
   BIND(xsd:decimal(replace(?priceVal,"\\$","")) AS ?price)
   ?offer schema:seller ?seller. 
   # In case there's a level of indirection for seller name
   OPTIONAL {
    ?seller schema:name ?sellerSchemaName . 
   }
   BIND(str(coalesce(?sellerSchemaName,?seller)) AS ?sellerName )
}
ORDER BY ?modelNumber ?price

Each comment in the query describes how it accounts for some difference between the Walmart microdata and the BestBuy microdata—for example, the BestBuy data included a dollar sign with prices, but the Walmart data did not.

After running the query, requesting XML output, and then running a little XSLT on that output, I ended up with the table shown below.

Product Name | Model Number | Price | Seller Name
Buffalo - DriveStation Axis Velocity 2TB External USB 3.0/2.0 Hard Drive | HD-LX2.0TU3 | 106.99 | BestBuy
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 108.25 | Walmart.com
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 129.45 | pcRUSH
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 143.69 | Tonzof
Toshiba - Canvio Basics 1 TB External Hard Drive | HDTB210XK3BA | 68.60 | Buy.com
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive | HDTB210XK3BA | 73.84 | pcRUSH
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive | HDTB210XK3BA | 99.0 | Walmart.com
Toshiba Canvio Basics 2TB USB 3.0 External Hard Drive | HDTB220XK3CA | 103.14 | Walmart.com
Toshiba - Canvio Basics Hard Drive | HDTB220XK3CA | 108.57 | Buy.com
Seagate - Backup Plus Slim 1TB External USB 3.0/2.0 Portable Hard Drive - Black | STDR1000100 | 69.99 | BestBuy
Seagate Backup Plus 1TB Slim Portable External Hard Drive, Black | STDR1000100 | 89.99 | Walmart.com
WD - My Book 3TB External USB 3.0 Hard Drive - Black | WDBFJK0030HBK-NESN | 128.99 | BestBuy
WD My Book 3TB USB 3.0 External Hard Drive | WDBFJK0030HBK-NESN | 129.99 | Walmart.com
WD - My Book 4TB External USB 3.0 Hard Drive - Black | WDBFJK0040HBK-NESN | 149.99 | BestBuy
WD My Book 4TB USB 3.0 External Hard Drive | WDBFJK0040HBK-NESN | 169.99 | Walmart.com

The vendors on the list other than Walmart and BestBuy, such as pcRUSH and Buy.com, came from the Walmart data.
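
Since the model number is the key that the two vendors share, a follow-up query along these lines (a sketch, assuming the same aggregated data and cleanup steps as above) would group the offers by model number and report the lowest price found for each:

PREFIX schema: <http://schema.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

SELECT ?modelNumber (MIN(?price) AS ?lowestPrice)
WHERE {
   ?product a schema:Product ;
            schema:model ?modelNumberVal ;
            schema:offers ?offer .
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?offer a schema:Offer ;
          schema:price ?priceVal .
   # same cleanup as above: remove any $ and cast to decimal
   BIND(xsd:decimal(replace(?priceVal,"\\$","")) AS ?price)
}
GROUP BY ?modelNumber
ORDER BY ?modelNumber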

Unfortunately, since I pulled the data that I was working with on October 15th, Walmart seems to have changed their web pages so that the W3C Microdata to RDF Distiller doesn't find the data in them anymore. I still see schema.org microdata in the source of a page like this Walmart page for an external hard drive, but I guess it's arranged differently. Perhaps they didn't want people using standards-based technology to automate the process of finding out that BestBuy's external hard drives usually cost less, or at least did in mid-October. A random check of products on other websites showed that the Distiller could pull useful data out of pages on target.com, llbean.com, and marksandspencer.com, so plenty of other major retailers are providing schema.org microdata in their product web pages.

The important thing is that, even before I knew anything about the structure and syntax of microdata, a publicly available open source program let me pull and aggregate data from different big-box stores' web sites so that I could query the combination with SPARQL. With more and more brand-name retailers making their data available this way, some interesting applications will become possible.


Please add any comments to this Google+ post.

6 October 2014

Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries

And retrieving those triples much, much faster.

animals taxonomy

While preparing a demo for the upcoming Taxonomy Boot Camp conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don't need OPTIONAL blocks. As I wrote in the new "Query Efficiency and Debugging" chapter in the second edition of Learning SPARQL, "Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it." My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.

First, let's look at a simple version of the use case. DBpedia has a lot of SKOS taxonomy data in it, and at Taxonomy Boot Camp I'm going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint, and I want to write a query to retrieve all the triples showing preferred labels and "has broader" values up to three levels down from the Mammal concept (written as v:Mammal below, using a prefix for the taxonomy's own namespace), assuming that the taxonomy uses SKOS to represent its structure. The following query asks for all three levels of the taxonomy below Mammal, but it won't get the whole taxonomy:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}

As with any SPARQL query, it's only going to return triples for which all the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern {?level1 skos:broader v:Mammal}, there are no nodes that have Horse as a broader value, so there will be no match for {?level2 skos:broader v:Horse}. So, the Horse triples won't be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.
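
To make that concrete, here is a minimal made-up Turtle version of such a taxonomy that you can load into a local triplestore to follow along; the v: namespace and the concepts below Dog are my own invention, and you'd need to add a matching PREFIX v: declaration to the queries to run them against it:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix v:    <http://example.com/animal-taxonomy#> .

v:Mammal a skos:Concept ; skos:prefLabel "Mammal" .
v:Horse  a skos:Concept ; skos:prefLabel "Horse" ; skos:broader v:Mammal .
v:Cat    a skos:Concept ; skos:prefLabel "Cat" ; skos:broader v:Mammal .
v:Dog    a skos:Concept ; skos:prefLabel "Dog" ; skos:broader v:Mammal .
v:Retriever a skos:Concept ; skos:prefLabel "Retriever" ; skos:broader v:Dog .
v:GoldenRetriever a skos:Concept ; skos:prefLabel "Golden Retriever" ;
                  skos:broader v:Retriever .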

If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts and any descendants they might have, even if they turn out to have none, and OPTIONAL makes this possible. The following will do this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}

The problem: this doesn't scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from <http://dbpedia.org/resource/Category:Mammals>, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.

As a review, let's restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?

By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each one retrieved its data in under a second, and together they returned 2,597 triples representing a taxonomy of 1,107 concepts:

# query 1
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}

# query 2
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}

# query 3
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
  ?level2 skos:broader/skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}

Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant's TopBraid EVN vocabulary manager. At the first level down, you can only see Afrosoricida, Australosphenida, and Bats in the picture; I then drilled down three more levels from there to show that Fictional bats has the single subcategory Silverwing.

As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, Bats as food) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn't retrieve any instance triples here, but it's great to know that they're available, and that this technique for avoiding OPTIONAL graph patterns in CONSTRUCT queries will serve me for much more than SKOS taxonomy work.
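
As a sketch of what retrieving those instance triples could look like (not something I did for this demo): DBpedia connects each article's resource to its categories with dcterms:subject, so a query along these lines would pull the members of the categories one level down from Mammals, along with their English labels:

PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
  ?member dcterms:subject ?category ;
          rdfs:label ?memberLabel .
}
WHERE {
  ?category skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?member dcterms:subject ?category ;
          rdfs:label ?memberLabel .
  FILTER(lang(?memberLabel) = "en")
}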

There has been plenty of talk lately on Twitter and in blogs about how it's not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-headed discussions of this that I've seen in SEMANTiCS 2014 (part 3 of 3): Conversations; in my posting Semantic Web Journal article on DBpedia on this blog I described a great article that lists other options.) There's all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need and store it locally while the public endpoint is up. If you're unsure about the structure and connections of the data you're pulling down, OPTIONAL graph patterns seem like an obvious fix, but this trick of splitting up CONSTRUCT queries to avoid OPTIONAL graph patterns means that you can pull down a lot more data a lot more efficiently.

Stickin' to the UNION

October 16th update: Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn't. Luckily for me, John Walker suggested in the comments for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  

}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;  
    skos:prefLabel ?level2label .  
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}

There are two lessons here:

  • If you've figured out a way to do something better, don't be too satisfied too quickly—keep trying to make it even better.

  • UNION is going to be useful in more situations than I originally thought it would.

animals taxonomy

Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists