6 October 2014

Dropping OPTIONAL blocks from SPARQL CONSTRUCT queries

And retrieving those triples much, much faster.

animals taxonomy

While preparing a demo for the upcoming Taxonomy Boot Camp conference, I hit upon a trick for revising SPARQL CONSTRUCT queries so that they don't need OPTIONAL blocks. As I wrote in the new "Query Efficiency and Debugging" chapter in the second edition of Learning SPARQL, "Academic papers on SPARQL query optimization agree: OPTIONAL is the guiltiest party in slowing down queries, adding the most complexity to the job that the SPARQL processor must do to find the relevant data and return it." My new trick not only made the retrieval much faster; it also made it possible to retrieve a lot more data from a remote endpoint.

First, let's look at a simple version of the use case. DBpedia has a lot of SKOS taxonomy data in it, and at Taxonomy Boot Camp I'm going to show how you can pull down and use that data. Now, imagine that a little animal taxonomy like the one shown in the illustration here is stored on an endpoint and I want to write a query to retrieve all the triples showing preferred labels and "has broader" values up to three levels down from the Mammal concept, assuming that the taxonomy's structure uses SKOS to represent its structure. The following query asks for all three levels of the taxonomy below Mammal, but it won't get the whole taxonomy:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label .
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}

As with any SPARQL query, it's only going to return triples for which all the triple patterns in the WHERE clause match. While Horse may have a broader value of Mammal and therefore match the triple pattern {?level1 skos:broader v:Mammal}, there are no nodes that have Horse as a broader value, so there will be no match for {?level2 skos:broader v:Horse}. So, the Horse triples won't be in the output. The same thing will happen with the Cat triples; only the Dog ones, which go down three levels below Mammal, will match the graph pattern in the WHERE clause above.

If we want a CONSTRUCT query that retrieves all the triples of the subtree under Mammal, we need a way to retrieve the Horse and Cat concepts and any descendants they have, even if they have no descendants, and OPTIONAL makes this possible. The following will do this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level1 skos:prefLabel ?level1label . 
  ?level2 skos:broader ?level1 ;
          skos:prefLabel ?level2label . 
  ?level3 skos:broader ?level2 ;
          skos:prefLabel ?level3label . 
}
WHERE {
  ?level1 skos:broader v:Mammal ;
          skos:prefLabel ?level1label . 
  OPTIONAL {
    ?level2 skos:broader ?level1 ;
            skos:prefLabel ?level2label .
  }
  OPTIONAL {
    ?level3 skos:broader ?level2 ;
            skos:prefLabel ?level3label . 
  }
}

The problem: this doesn't scale. When I sent a nearly identical query to DBpedia to ask for the triples representing the hierarchy three levels down from <http://dbpedia.org/resource/Category:Mammals>, it timed out after 20 minutes, because the two OPTIONAL graph patterns gave DBpedia too much work to do.

As a review, let's restate the problem: we want the identified concept and the preferred labels and broader values of concepts up to three levels down from that concept, but without using the OPTIONAL keyword. How can we do this?

By asking for each level in a separate query. When I split the DBpedia version of the query above into the following three queries, each retrieved its data in under a second, retrieving a total of 2,597 triples representing a taxonomy of 1,107 concepts:

# query 1
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
}

# query 2
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level2 skos:broader ?level1 ;  
            skos:prefLabel ?level2label .  
}

# query 3
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}
WHERE {
?level2 skos:broader/skos:broader <http://dbpedia.org/resource/Category:Mammals> .
  ?level3 skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  
}

Going from timing out after 20 minutes to successful execution in under 3 seconds is quite a performance improvement. Below, you can see how the beginning of a small piece of this taxonomy looks in TopQuadrant's TopBraid EVN vocabulary manager. At the first level down, you can only see Afrosoricida, Australosphenida, and Bats in the picture; I then drilled down three more levels from there to show that Fictional bats has the single subcategory Silverwing.

As you can tell from the Mammals URI in the queries above, these taxonomy concepts are categories, and each category has at least one member (for example, Bats as food) in Wikipedia and is therefore represented as triples in DBpedia, ready for you to retrieve with SPARQL CONSTRUCT queries. I didn't retrieve any instance triples here, but it's great to know that they're available, and that this technique for avoiding CONSTRUCT graph patterns will serve me for much more than SKOS taxonomy work.

There has been plenty of talk lately on Twitter and in blogs about how it's not a good idea for important applications to have serious dependencies on public SPARQL endpoints such as DBpedia. (Orri Erling has one of the most level-head discussions of this that I've seen in SEMANTiCS 2014 (part 3 of 3): Conversations; in my posting Semantic Web Journal article on DBpedia on this blog I described a great article that lists other options.) There's all this great data to use in DBpedia, and besides spinning up an Amazon Web Services image with your own copy of DBpedia, as Orri suggests, you can pull down the data you need to store locally when it is up. If you're unsure about the structure and connections of the data you're pulling down, OPTIONAL graph patterns seems like an obvious fix, but this trick for splitting up CONSTRUCT queries to avoid the use of OPTIONAL graph patterns means that you can pull down a lot more data lot more efficiently.

Stickin' to the UNION

October 16th update: Once I split out the pieces of the original query into separate files, it should have occurred to me to at least try joining them back up into a single query with UNION instead of OPTIONAL, but it didn't. Luckily for me, John Walker suggested in the comments for this blog entry that I try this, so I did. It worked great, with the benefit of being simpler to read and maintain than using a collection of queries to retrieve a single set of triples. This version only took three seconds to retrieve the triples:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  <http://dbpedia.org/resource/Category:Mammals> a skos:Concept . 
  ?level1 a skos:Concept ;
          skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  ?level2 a skos:Concept ;
          skos:broader ?level1 ;  
          skos:prefLabel ?level2label .  
  ?level3 a skos:Concept ;
          skos:broader ?level2 ;  
          skos:prefLabel ?level3label .  

}
WHERE {
  ?level1 skos:broader <http://dbpedia.org/resource/Category:Mammals> ;
          skos:prefLabel ?level1label .  
  {
    ?level2 skos:broader ?level1 ;  
    skos:prefLabel ?level2label .  
  }
  UNION
  {
    ?level2 skos:broader ?level1 .
    ?level3 skos:broader ?level2 ;  
            skos:prefLabel ?level3label .  
  }
}

There are two lessons here:

  • If you've figured out a way to do something better, don't be too satisfied too quickly—keep trying to make it even better.

  • UNION is going to be useful in more situations than I originally thought it would.

  • animals taxonomy

    Please add any comments to this Google+ post.

13 September 2014

A schemaless computer database in 1965

To enable flexible metadata aggregation, among other things.

figure 3

I've been reading up on America's post-war attempt to keep up the accelerated pace of R&D that began during World War II. This effort led to an infrastructure that made accomplishments such as the moon landing and the Internet possible; it also led to some very dry literature, and I'm mostly interested in what new metadata-related techniques were developed to track and share the products of the research as they led to development.

One dry bit of literature is the proceedings of the 1965 Toward a National Information System: Second Annual National Colloquium On Information Retrieval. The conference was sponsored by the American Documentation Institute, who had a big role in the post-war information sharing work, as well as the University of Pennsylvania's Moore School of Electrical Engineering (where Eckert and Mauchly built ENIAC and its successor EDVAC) and some ACM chapters.

In a chapter on how the North American Aviation company (now part of Boeing) revamped their practices for sharing information among divisions, I came across this description of some very flexible metadata storage:

All bibliographic information contained in both the corporate and divisional Electronic Data Processing (EDP) subsystems is retained permanently on magnetic tape in the form of variable length records containing variable length fields. Each field, with the exception of sort keys, consists of three adjacent field parts: field character count, field identification, and field text (see Figure 3). There are several advantages to this format: it is extremely compact, thereby reducing computer read-write time; it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files; and its flexibility allows conversion of files from other indexing abstracting services.

I especially like that "it provides for definition and consequent addition of new types of fields of bibliographic information without reformatting extant files." This reminds me of one slide in my presentation last month at the Semantic Technology and Business / NoSQL Now! conferences last month, where my talk was on a track shared by both conferences, about how a key advantage of schemaless NoSQL databases is the ability to add a new value for a new property to a data set with no need for the schema evolution steps that can be so painful in a relational database.

Moore's law has led to less of a reliance on arranging data in tables to allow the efficient retrieval of that data. The various NoSQL options have explored new ways to do this, and it was great to see that one aerospace company was doing it 49 years ago. Of course, retrieving data from magnetic tape is less efficient than modern alternatives, but it was a big step past the use of piles of punched cards, and pretty modern for its time, as you can see from the tape spools on the picture of EDVAC's gleaming successor below. I thought it was cool to see that, although tabular representation of data long predates relational databases (hierarchical and network databases also stored sets of entities as tables, but with much less flexibility) that someone had implemented such a flexible model so long ago, especially to represent metadata, with a use case that we often see now with RDF: to allow "conversion of files from other indexing abstracting services"—in other words, to accomodate the aggregation of metadata from other sources that may not have structured their data the same way that yours is structured.

Univac 9400

Univac photo by H. Müller CC-BY-SA-2.5, via Wikimedia Commons


Please add any comments to this Google+ post.

24 August 2014

Exploring a SPARQL endpoint

In this case, semanticweb.org.

graph of ISWC SPARQL papers

In the second edition of my book Learning SPARQL, a new chapter titled "A SPARQL Cookbook" includes a section called "Exploring the Data," which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint http://data.semanticweb.org/sparql, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don't know about the properties used, or whether any schema or schemas were used and how much they was used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org's SNORQL interface.

I started with what is generally my favorite query, listing which predicates are used in the data, because that's the quickest way to get a flavor for what kind of data is available. Several of the predicates that got listed immediately told me some interesting things:

  • rdfs:subClassOf shows me that there's probably some structure worth exploring.

  • dcterms:subject (and dc:subject) shows that things have probably been tagged with keywords.

  • ical properties such as dtstart shows that events are recorded.

  • FOAF properties show that there is probably information about people.

  • dcterms:title, swrc:booktitle, dc:title, src:title, and swrc:subtitle show me that works are covered.

An RDF dataset may or may not have explicit structure, and the use of rdfs:subClassOf in this data showed me that there was, so my next query asked what classes were subclasses of what classes so that I could get an overview of how much structure the dataset included. The result showed me that the ontology seemed to be mostly in the swc namespace, which turns out to be the semanticweb.com conference ontology. The site does include nice documentation for this ontology.

The use of the FOAF vocabulary showed me that there are probably people described, but if the properties foaf:name, foaf:lastName, foaf:familyName, foaf:family_name, and foaf:surname are all in there, which should I try first? A quick ego search showed foaf:family_name being used. It also showed that the URI used to represent me is http://data.semanticweb.org/person/bob-ducharme, and because they've published this data as linked data, sending a browser to that URL showed that it described me as a member of the 2010 ISWC program committee.

It also showed me to be a proud instance of the foaf:Person class, so I did a query to find out how many persons there were in all: 10,982.

Given the domain of the ontology and the reason that I was listed, I guessed that it was all about ISWC conferences, so I listed the dc:title values to see what would show up. The query took long enough that I added a LIMIT keyword to create a politer version of that query. Looking at the complete data for one work showed all kinds of interesting information, including an swrc:year value to indicate the year of this paper's conference. A list of all year values showed a range from 2001 right up to 2014, so it's nice to see that they're keeping the data up to date.

Next, I listed all papers that mention "SPARQL" in their title, with their years. After listing the number of papers with SPARQL in their title each year, I used sgvizler (which I described here last September) to create the chart of these figures shown above.

The use of dcterms:subject and dc:subject was interesting because these add some pretty classic metadata for navigating content. Listing triples that used either, I included LIMIT 100 to be polite to the server in case these properties were used a lot. They are. Doing this with dc:subject shows subjects such as "ontology alignment" and "controlled natural language" assigned to articles. Doing it with dcterms:subject showed it used more the way I might use rdf:type, indicating that something is an instance of a particular class: for example, swc:Chair and swc:Delegate each have dcterms:subject values of http://dbpedia.org/resource/Role.

My interest in taxonomies (spurred by my work with TopQuadrant's TopBraid EVN) led me to look harder at the dc:subject values. They're string values, and not instances of something like skos:Concept, so they have no hierarchical relationship or other metadata themselves. I'm guessing that this is because key phrases assigned to conference papers are more of a folksonomy, in which people can make up their own key phrases as they wish. Either some people must have been aware of other key phrases in use or some were added automatically, because, while counting how many different ones there were came up with 3,594, a query to see which were the most popular showed that "Corpus (creation, annotation, etc.)" was far and away the most used, with 506 papers having that subject.

I could go on. Call me a SPARQL geek, but I really enjoy looking around a data set like this, especially when (as the presence of the papers for ISWC 2014 shows) the data is kept up to date. For people interested in any aspect of semantic web technology, the ability to look around this particular dataset and count up which data falls into which patterns is a great resource.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists