9 November 2014

Querying aggregated Walmart and BestBuy data with SPARQL

From structured data in their web pages!


The combination of microdata and schema.org seems to have hit a sweet spot that has helped both to get a lot of traction. I've been learning more about microdata recently, but even before I did, I found that the W3C's Microdata to RDF Distiller written by Ivan Herman would convert microdata stored in web pages into RDF triples, making it possible to query this data with SPARQL. With major retailers such as Walmart and BestBuy making such data available on—as far as I can tell—every single product's web page, it becomes possible to write some interesting queries that compare prices and other information from the two vendors.

I extracted the data describing six external USB drives from both walmart.com and bestbuy.com, limiting myself to models that were available on both websites. (It would have been nice to automate this a bit more instead of pulling the data separately from the twelve individual web pages. I did sign up for Walmart's API program, which was easy to try out, but the part of the API that lets you query products by category is "restricted, and is available on a request basis" according to their Data Feed API home page, so I didn't bother. If I were going to pursue this further, I would enroll in BestBuy's Developer Program as well.) After using the Distiller form to do this several times, I downloaded its Python script from the pymicrodata GitHub page and found it easy to run locally.

You can see a Turtle file of aggregated Walmart plus BestBuy data here. Because of some slight differences in how they treated certain bits of data, I was tempted to clean up the aggregated data before querying it, but I really wanted to write queries that would work on the data in its native form, so I put the cleanup steps right in the queries.
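For example, a quick diagnostic query along these lines (just a sketch that lists each distinct raw price literal and its datatype) makes it easy to see how each site serialized its prices:

PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?priceVal (datatype(?priceVal) AS ?priceType)
WHERE {
   # every offer's price, exactly as the Distiller extracted it
   ?offer a schema:Offer ;
          schema:price ?priceVal .
}
ORDER BY ?priceVal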

The various queries that I wrote led up to this one, which lists all the products by model number and price for easy comparison:

PREFIX schema: <http://schema.org/> 
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#> 

SELECT ?productName ?modelNumber ?price ?sellerName 
WHERE {
   ?product a schema:Product . 
   ?product schema:name ?productNameVal . 
   # str() to strip any language tags
   BIND(str(?productNameVal) AS ?productName)
   ?product schema:model ?modelNumberVal . 
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?product schema:offers ?offer . 
   ?offer a schema:Offer . 
   ?offer schema:price ?priceVal . 
   # Remove $ and cast to decimal
   BIND(xsd:decimal(replace(?priceVal,"\\$","")) AS ?price)
   ?offer schema:seller ?seller. 
   # In case there's a level of indirection for seller name
   OPTIONAL {
    ?seller schema:name ?sellerSchemaName . 
   }
   BIND(str(coalesce(?sellerSchemaName,?seller)) AS ?sellerName )
}
ORDER BY ?modelNumber ?price

Each comment in the query describes how it accounts for some difference between the Walmart microdata and the BestBuy microdata—for example, the BestBuy data included a dollar sign with prices, but the Walmart data did not.

After running the query, requesting XML output, and then running a little XSLT on that output, I ended up with the table shown below.

Product Name | Model Number | Price (USD) | Seller Name
Buffalo - DriveStation Axis Velocity 2TB External USB 3.0/2.0 Hard Drive | HD-LX2.0TU3 | 106.99 | BestBuy
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 108.25 | Walmart.com
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 129.45 | pcRUSH
Buffalo Technology DriveStation Axis Velocity 2TB USB 3.0 External Hard Drive with Hardware Encryption, Black | HD-LX2.0TU3 | 143.69 | Tonzof
Toshiba - Canvio Basics 1 TB External Hard Drive | HDTB210XK3BA | 68.60 | Buy.com
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive | HDTB210XK3BA | 73.84 | pcRUSH
Toshiba 1TB Canvio Basics USB 3.0 External Hard Drive | HDTB210XK3BA | 99.0 | Walmart.com
Toshiba Canvio Basics 2TB USB 3.0 External Hard Drive | HDTB220XK3CA | 103.14 | Walmart.com
Toshiba - Canvio Basics Hard Drive | HDTB220XK3CA | 108.57 | Buy.com
Seagate - Backup Plus Slim 1TB External USB 3.0/2.0 Portable Hard Drive - Black | STDR1000100 | 69.99 | BestBuy
Seagate Backup Plus 1TB Slim Portable External Hard Drive, Black | STDR1000100 | 89.99 | Walmart.com
WD - My Book 3TB External USB 3.0 Hard Drive - Black | WDBFJK0030HBK-NESN | 128.99 | BestBuy
WD My Book 3TB USB 3.0 External Hard Drive | WDBFJK0030HBK-NESN | 129.99 | Walmart.com
WD - My Book 4TB External USB 3.0 Hard Drive - Black | WDBFJK0040HBK-NESN | 149.99 | BestBuy
WD My Book 4TB USB 3.0 External Hard Drive | WDBFJK0040HBK-NESN | 169.99 | Walmart.com

Vendors other than Walmart and BestBuy on the list were included in the Walmart data.
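With all of the data in one graph, other comparisons are easy. As a small sketch building on the same cleanup steps, this query finds the lowest price offered for each model number:

PREFIX schema: <http://schema.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

SELECT ?modelNumber (MIN(?price) AS ?lowestPrice)
WHERE {
   ?product a schema:Product ;
            schema:model ?modelNumberVal ;
            schema:offers ?offer .
   BIND(str(?modelNumberVal) AS ?modelNumber)
   ?offer schema:price ?priceVal .
   # str() guards against typed literals; strip any $ before casting
   BIND(xsd:decimal(replace(str(?priceVal),"\\$","")) AS ?price)
}
GROUP BY ?modelNumber
ORDER BY ?modelNumber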

Unfortunately, since I pulled the data that I was working with on October 15th, Walmart seems to have changed their web pages so that the W3C Microdata to RDF Distiller doesn't find the data in them anymore. I still see schema.org microdata in the source of a page like this Walmart page for an external hard drive, but I guess it's arranged differently. Perhaps they didn't want people using standards-based technology to automate the process of finding out that BestBuy's external hard drives usually cost less, or at least did in mid-October. A random check of products on other websites showed that the Distiller could pull useful data out of pages on target.com, llbean.com, and marksandspencer.com, so plenty of other major retailers are providing schema.org microdata in their product web pages.

The important thing is that, even before I knew anything about the structure and syntax of microdata, a publicly available open source program let me pull and aggregate data from different big-box stores' websites so that I could query the combination with SPARQL. With more and more brand-name retailers making this data available, some interesting applications will definitely be possible in the future.


Please add any comments to this Google+ post.

6 October 2014

Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries

And retrieving those triples much, much faster.

[Figure: the animal taxonomy used in the examples below]

While preparing a demo for the upcoming Taxonomy Boot Camp conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don't need OPTIONAL blocks. As I wrote in the new "Query Efficiency and Debugging" chapter in the second edition of Learning SPARQL, "Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it." My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.

First, let's look at a simple version of the use case. DBpedia has a lot of SKOS taxonomy data in it, and at Taxonomy Boot Camp I'm going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint and I want to write a query to retrieve all the triples showing preferred labels and "has broader" values up to three levels down from the Mammal concept, assuming that the taxonomy uses SKOS to represent its structure. The following query asks for all three levels of the taxonomy below Mammal, but it won't get the whole taxonomy:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# v: stands in for the taxonomy's own namespace; substitute the real one
PREFIX v:    <http://example.com/taxonomy#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}

As with any SPARQL query, it's only going to return triples for which all the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern {?level1 skos:broader v:Mammal}, there are no nodes that have Horse as a broader value, so there will be no match for {?level2 skos:broader v:Horse}. So, the Horse triples won't be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.

If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts, along with any descendants they might have, even if they have none, and OPTIONAL makes this possible. The following will do this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# v: stands in for the taxonomy's own namespace; substitute the real one
PREFIX v:    <http://example.com/taxonomy#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}

The problem: this doesn't scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from <http://dbpedia.org/resource/Category:Mammals>, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.

As a review, let's restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?

By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each retrieved its data in under a second, retrieving a total of 2,597 triples representing a taxonomy of 1,107 concepts:

# query 1
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}

# query 2
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}

# query 3
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
  ?level2 skos:broader/skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}

Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant's TopBraid EVN vocabulary manager. At the first level down, you can only see Afrosoricida, Australosphenida, and Bats in the picture; I then drilled down three more levels from there to show that Fictional bats has the single subcategory Silverwing.

As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, Bats as food) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn't retrieve any instance triples here, but it's great to know that they're available, and that this technique for avoiding OPTIONAL graph patterns in CONSTRUCT queries will serve me for much more than SKOS taxonomy work.
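If I ever do want the member triples for one of those categories, DBpedia connects each article to its categories with dcterms:subject, so a rough sketch of such a query might look like this:

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?member dct:subject <http://dbpedia.org/resource/Category:Bats_as_food> ;
          rdfs:label ?memberLabel .
}
WHERE {
  ?member dct:subject <http://dbpedia.org/resource/Category:Bats_as_food> ;
          rdfs:label ?memberLabel .
  FILTER(lang(?memberLabel) = "en")
}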

There has been plenty of talk lately on Twitter and in blogs about how it's not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-headed discussions of this that I've seen in SEMANTiCS 2014 (part 3 of 3): Conversations; in my posting Semantic Web Journal article on DBpedia on this blog I described a great article that lists other options.) There's all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need while the endpoint is up and store it locally. If you're unsure about the structure and connections of the data you're pulling down, OPTIONAL graph patterns seem like an obvious fix, but this trick for splitting up CONSTRUCT queries to avoid the use of OPTIONAL graph patterns means that you can pull down a lot more data a lot more efficiently.
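One way to do that pulling down and storing, sketched here on the assumption that your local store supports SPARQL 1.1 Update and allows a SERVICE clause inside an update's WHERE clause (not all of them do), is an INSERT that reaches out to the public endpoint:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

INSERT {
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .
}
WHERE {
  # the pattern is evaluated at the remote endpoint; the results land in the local store
  SERVICE <http://dbpedia.org/sparql> {
    ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
            skos:prefLabel ?level1label .
  }
}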

Stickin' to the UNION

October 16th update: Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn't. Luckily for me, John Walker suggested in the comments for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  

}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}

There are two lessons here:

  • If you've figured out a way to do something better, don't be too satisfied too quickly—keep trying to make it even better.

  • UNION is going to be useful in more situations than I originally thought it would.


Please add any comments to this Google+ post.

13 September 2014

A schemaless computer database in 1965

To enable flexible metadata aggregation, among other things.

[Figure 3 from the proceedings]

I've been reading up on America's post-war attempt to keep up the accelerated pace of R&D that began during World War II. This effort led to an infrastructure that made accomplishments such as the moon landing and the Internet possible; it also led to some very dry literature, and I'm mostly interested in what new metadata-related techniques were developed to track and share the products of the research as they led to development.

One dry bit of literature is the proceedings of the 1965 Toward a National Information System: Second Annual National Colloquium On Information Retrieval. The conference was sponsored by the American Documentation Institute, which had a big role in the post-war information sharing work, as well as by the University of Pennsylvania's Moore School of Electrical Engineering (where Eckert and Mauchly built ENIAC and its successor EDVAC) and some ACM chapters.

In a chapter on how the North American Aviation company (now part of Boeing) revamped their practices for sharing information among divisions, I came across this description of some very flexible metadata storage:

All bibliographic information contained in both the corporate and divisional Electronic Data Processing (EDP) subsystems is retained permanently on magnetic tape in the form of variable length records containing variable length fields. Each field, with the exception of sort keys, consists of three adjacent field parts: field character count, field identification, and field text (see Figure 3). There are several advantages to this format: it is extremely compact, thereby reducing computer read-write time; it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files; and its flexibility allows conversion of files from other indexing abstracting services.

I especially like that "it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files." This reminds me of one slide in my presentation last month at the Semantic Technology and Business / NoSQL Now! conferences (my talk was on a track shared by both conferences) about how a key advantage of schemaless NoSQL databases is the ability to add a new value for a new property to a data set with no need for the schema evolution steps that can be so painful in a relational database.
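RDF and SPARQL give you the same freedom. As a minimal sketch with made-up resource and property names, recording a brand-new property for an existing resource takes nothing more than inserting the triple:

PREFIX ex: <http://example.com/ns#>

INSERT DATA {
  # ex:warrantyMonths has never been used before; no schema change is needed first
  ex:product42 ex:warrantyMonths 24 .
}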

Moore's law has led to less of a reliance on arranging data in tables to allow efficient retrieval of that data. The various NoSQL options have explored new ways to do this, and it was great to see that one aerospace company was doing it 49 years ago. Of course, retrieving data from magnetic tape is less efficient than modern alternatives, but it was a big step past piles of punched cards, and pretty modern for its time, as you can see from the tape spools in the picture of EDVAC's gleaming successor below. Although tabular representation of data long predates relational databases (hierarchical and network databases also stored sets of entities as tables, but with much less flexibility), I thought it was cool to see that someone had implemented such a flexible model so long ago, especially to represent metadata, with a use case that we often see now with RDF: to allow "conversion of files from other indexing abstracting services"—in other words, to accommodate the aggregation of metadata from other sources that may not have structured their data the same way that yours is structured.

[Photo: Univac 9400]

Univac photo by H. Müller CC-BY-SA-2.5, via Wikimedia Commons


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists