Querying aggregated Walmart and BestBuy data with SPARQL

From structured data in their web pages!
some description

The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I've been learning more about microdata recently, but even before I did, I found that the W3C's Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product's web page, this makes some interesting queries possible to compare prices and other information from the two vendors.

I extracted the data describing six external USB drives from both walmart.com and bestbuy.com, limiting myself to models that were available on both websites. (Instead of pulling it separately from the twelve individual web pages, it would have been nice to automate this a bit more. I did sign up for Walmart's API program, which was easy to try out, but the part of the API that lets you query products by category is "restricted, and is available on a request basis" according to their Data Feed API home page, so I didn't bother. If I was going to pursue this further I would enroll in BestBuy's Developer Program as well.) After using the Distiller form to do this several times, I downloaded its Python script from the pymicrodata github page and found it easy to run locally.

You can see a Turtle file of aggregated Walmart plus Bestbuy data here. Because of some slight differences in how they treated certain bits of data, I was tempted to clean up the aggregated data before querying it, but I really wanted to write queries that would work on the data in its native form, so I put the cleanup steps right in the queries.

The various queries that I wrote led up to this one, which lists all the products by model number and price for easy comparison:

PREFIX schema: <http://schema.org/> 
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#> 

SELECT ?productName ?modelNumber ?price ?sellerName 
   ?product a schema:Product . 
   ?product schema:name ?productNameVal . 
   # str() to strip any language tags
   BIND(str(?productNameVal) AS ?productName)
   ?product schema:model ?modelNumberVal . 
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?product schema:offers ?offer . 
   ?offer a schema:Offer . 
   ?offer schema:price ?priceVal . 
   # Remove $ and cast to decimal
   BIND(xsd:decimal(replace(?priceVal,"\\$","")) AS ?price)
   ?offer schema:seller ?seller. 
   # In case there's a level of indirection for seller name
    ?seller schema:name ?sellerSchemaName . 
   BIND(str(coalesce(?sellerSchemaName,?seller)) AS ?sellerName )
ORDER BY ?modelNumber ?price

Each comment in the query describes how it accounts for some difference between the Walmart microdata and the BestBuy microdata—for example, the BestBuy data included a dollar sign with prices, but the Walmart data did not.

After running the query, requesting XML output, and then running a little XSLT on that output, I ended up with the table shown below.

Product Name Model Number Price Seller Name
Buffalo - DriveStation Axis Velocity 2TB External USB 3.0/2.0 Hard Drive HD-LX2.0TU3 106.99 BestBuy
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black HD-LX2.0TU3 108.25 Walmart.com
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black HD-LX2.0TU3 129.45 pcRUSH
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black HD-LX2.0TU3 143.69 Tonzof
Toshiba - Canvio Basics 1 TB External Hard Drive HDTB210XK3BA 68.60 Buy.com
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive HDTB210XK3BA 73.84 pcRUSH
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive HDTB210XK3BA 99.0 Walmart.com
Toshiba Canvio Basics 2TB USB 3.0 External Hard Drive HDTB220XK3CA 103.14 Walmart.com
Toshiba - Canvio Basics Hard Drive HDTB220XK3CA 108.57 Buy.com
Seagate - Backup Plus Slim 1TB External USB 3.0/2.0 Portable Hard Drive - Black STDR1000100 69.99 BestBuy
Seagate Backup Plus 1TB Slim Portable External Hard Drive, Black STDR1000100 89.99 Walmart.com
WD - My Book 3TB External USB 3.0 Hard Drive - Black WDBFJK0030HBK-NESN 128.99 BestBuy
WD My Book 3TB USB 3.0 External Hard Drive WDBFJK0030HBK-NESN 129.99 Walmart.com
WD - My Book 4TB External USB 3.0 Hard Drive - Black WDBFJK0040HBK-NESN 149.99 BestBuy
WD My Book 4TB USB 3.0 External Hard Drive WDBFJK0040HBK-NESN 169.99 Walmart.com

Vendors other than Walmart and BestBuy on the list were included in the Walmart data.

Unfortunately, since I pulled the data that I was working with on October 15th, Walmart seems to have changed their web pages so that the W3C Microdata to RDF Distiller doesn't find the data in them anymore. I still see schema.org microdata in the source of a page like this Walmart page for an external hard drive, but I guess it's arranged differently. Perhaps they didn't want people using standards-based technology to automate the process of finding out that BestBuy's external hard drives usually cost less, or at least did in mid-October. A random check of products on other websites showed that the Distiller could pull useful data out of pages on target.com, llbean.com, and markesandspencer.com, so plenty of other major retailers are providing schema.org microdata in their product web pages.

The important thing is that, even before I knew anything about the structure and syntax of microdata, a publicly available open source program let me pull and aggregate data from different big box stores' web sites so that I could query the combination with SPARQL. With more and more brand name retailers making data available for this, this will definitely make some interesting applications possible in the future.

Please add any comments to this Google+ post.