Modeling your data with DBpedia vocabularies

Broad, useful vocabularies with plenty of sample data.

I've known for a while about ways to dig into the vocabularies used in DBpedia's massive collection of triples, and I've used terms from these vocabularies to query for information such as Bart Simpson blackboard messages and US presidents' ages at inauguration. I saw these terms as "field" names to use when querying this body of data.

Reading the W3C RDFa spec recently, though, I was struck by one example:

<div about="http://dbpedia.org/resource/Albert_Einstein">
  <span property="foaf:name">Albert Einstein</span>
  <span property="dbp:dateOfBirth" datatype="xsd:date">1879-03-14</span>
  <div rel="dbp:birthPlace" resource="http://dbpedia.org/resource/Germany">
    <span property="dbp:conventionalLongName">Federal Republic of Germany
   </span>
  </div>
</div>

This particular example demonstrates how to chain statements together with shared resource references, but what caught my eye was the use of the http://dbpedia.org/resource/ namespace to reference Albert Einstein and Germany and the http://dbpedia.org/property/ namespace (here represented as "dbp:") for the factual property "birthPlace". In other words, here were two DBpedia vocabularies being used not to query DBpedia, but to model data completely outside of the context of DBpedia, because they offered straightforward, dereferenceable URIs for these things.
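To make the chaining concrete, here is a rough Python sketch of the four statements that the RDFa example above expresses, with the prefixes expanded to full URIs. The "dbr" prefix label and the function names are my own choices for illustration, and the long-name literal is shown with its whitespace tidied up, so treat this as a reading aid rather than the output of a real RDFa parser.

```python
# Namespace prefixes from the RDFa example. "dbr" is a label picked here
# for the resource namespace; the example spells those URIs out in full.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbp": "http://dbpedia.org/property/",
    "dbr": "http://dbpedia.org/resource/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(curie):
    """Turn a prefixed name such as 'dbp:birthPlace' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def example_triples():
    """Return the statements from the example as (subject, predicate, object)."""
    einstein = expand("dbr:Albert_Einstein")
    germany = expand("dbr:Germany")
    return [
        (einstein, expand("foaf:name"), '"Albert Einstein"'),
        (einstein, expand("dbp:dateOfBirth"),
         '"1879-03-14"^^<' + expand("xsd:date") + ">"),
        # The object of this statement is the subject of the next one;
        # that shared resource is what chains the statements together.
        (einstein, expand("dbp:birthPlace"), "<" + germany + ">"),
        (germany, expand("dbp:conventionalLongName"),
         '"Federal Republic of Germany"'),
    ]


def to_ntriples(triples):
    """Serialize the triples in an N-Triples-like form, one per line."""
    return "\n".join("<%s> <%s> %s ." % t for t in triples)


if __name__ == "__main__":
    print(to_ntriples(example_triples()))
```

Note that none of these statements requires DBpedia to be involved at run time; the URIs are simply being borrowed as identifiers.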

[Image: Comic Book Guy with LC URI]

I'm not saying that these are the first vocabularies to check when you need URIs for people, places, concepts, or properties, but they could be the best second or third places to go if your domain offers no clear choice for a vocabulary that meets your needs. For example, I'd prefer the Linked Movie Database URI of http://data.linkedmdb.org/page/film/2674 for Truffaut's film "Shoot the Piano Player" over DBpedia's http://dbpedia.org/resource/Shoot_the_Piano_Player, despite the latter's greater readability. For one thing, the linkedmdb.org page for Shoot the Piano Player states that this resource is owl:sameAs the resource http://dbpedia.org/resource/Shoot_the_Piano_Player, making it easy for queries about this movie to tie the Linked Movie Database and DBpedia metadata together. The more important reason, though, is that as far as I can tell, the Linked Movie Database project team has worked out a specific property vocabulary as part of their project, while the DBpedia one has grown more organically, leading to many more strange edge cases among the well-chosen terms.

While the Library of Congress Subject Headings provide a solid, professional taxonomy and a set of URIs for a wide variety of subjects and concepts, they don't have them for places or people. (They might have one for London (England)--History, but they don't have one for "London (England)".) So, while they have a URI for the concept of sightings of Elvis Presley since his death, they have no URI for Elvis himself. Nor do they have one for Einstein, and I don't know what well-known vocabulary does, so the RDFa spec's authors went with the DBpedia URI for the famous physicist. (Interestingly, the Library of Congress Subject Headings do cover fictitious characters such as Holden Caulfield and even the Simpsons' Comic Book Guy.)

The FOAF vocabulary includes many good properties for describing a person such as Einstein, but none to identify the day a person was born, so the RDFa spec's authors used the DBpedia http://dbpedia.org/property/dateOfBirth property. It's easy enough to check whether DBpedia has a URI for a person, place, or thing by going to the appropriate Wikipedia page (watch out for redirects) and replacing the http://en.wikipedia.org/wiki/ part of its URI with http://dbpedia.org/page/. I have a bookmarklet called wp -> dbpedia that makes this replacement and takes me from a Wikipedia page to the corresponding DBpedia page with one click. If you drag that link to your bookmarks toolbar, it should work for you.
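The URI replacement described above is simple enough to sketch in a few lines of Python. This is not the bookmarklet itself, just an illustration of the same substitution; the function name is made up, and it does nothing about Wikipedia redirects, which you still have to watch out for yourself.

```python
from urllib.parse import urlsplit

WIKIPEDIA_PATH_PREFIX = "/wiki/"
DBPEDIA_PAGE_PREFIX = "http://dbpedia.org/page/"


def wikipedia_to_dbpedia(url):
    """Map an English Wikipedia article URL to the matching DBpedia page URL.

    Only the host and the /wiki/ path prefix are swapped; the article
    title is carried over unchanged and any #fragment is dropped.
    """
    parts = urlsplit(url)
    if (parts.netloc != "en.wikipedia.org"
            or not parts.path.startswith(WIKIPEDIA_PATH_PREFIX)):
        raise ValueError("not an English Wikipedia article URL: " + url)
    title = parts.path[len(WIKIPEDIA_PATH_PREFIX):]
    return DBPEDIA_PAGE_PREFIX + title


if __name__ == "__main__":
    print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Albert_Einstein"))
```

Whether the resulting page actually exists is something you still need to check by visiting it.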

To look for a property name you might need, you can check a DBpedia page for a resource that may have had that property assigned to it. You can also download an N-Triples or CSV file in your choice of 14 languages from DBpedia's Download Page. The compressed version of infoboxproperties_en.nt, the N-Triples version of the English language properties, was 606K, which decompression expanded to over 13 megs. With two triples per property, as shown in their brief sample of the file, it's pretty verbose, so I wrote a Perl script to trim it down to just one property name per line, without the full URLs, bringing the size of the list down to 49,122 lines and about 879K.
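For anyone who wants to do the same trimming, here is a minimal Python sketch of the idea; it is not the Perl script mentioned above. It assumes that each line of the file starts with the property's URI in angle brackets as the subject of the triple, and the two sample lines in it are made-up illustrations of that shape, not copied from the real file.

```python
PROPERTY_NS = "http://dbpedia.org/property/"


def property_names(lines):
    """Return each distinct property name once, in order of first appearance.

    Assumes every relevant line begins with <http://dbpedia.org/property/NAME>
    as its subject; any other line is skipped.
    """
    seen = set()
    names = []
    for line in lines:
        line = line.strip()
        if not line.startswith("<" + PROPERTY_NS):
            continue
        subject = line[1:line.index(">")]
        name = subject[len(PROPERTY_NS):]
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Hypothetical sample lines, two triples per property.
SAMPLE = [
    '<http://dbpedia.org/property/birthPlace> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .',
    '<http://dbpedia.org/property/birthPlace> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "birth place"@en .',
]

if __name__ == "__main__":
    for name in property_names(SAMPLE):
        print(name)
```

To run it against the downloaded file, you could pass an open file object instead of SAMPLE, since the function only needs something that yields lines.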

The list is fun to skim through. There are a lot of goofy properties in there; worldSnookerChampionshipRoundsProperty99 has 98 more to go with it. So how do you know which ones are worth using? I like metadata that's really about existing data, and it's easy to use DBpedia's SPARQL query form to ask about resources that have a particular property assigned. Entering the following query there showed me that over 50 people have had worldSnookerChampionshipRoundsProperty99 values assigned to them:

SELECT DISTINCT ?s ?o WHERE {
  ?s 
  <http://dbpedia.org/property/worldSnookerChampionshipRoundsProperty99> 
  ?o
}

Seeing examples of how a property was used also gives you good background on whether it's appropriate to your needs.
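If you want to run this kind of check for several candidate properties, the query above is easy to generate. The following Python sketch builds the same query for any property name and wraps it in a request URL using the standard SPARQL protocol "query" parameter. The function names are hypothetical, the endpoint address http://dbpedia.org/sparql is the public one as I understand it, and the sketch only constructs the URL; it does not fetch or parse any results.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"
PROPERTY_NS = "http://dbpedia.org/property/"


def usage_query(property_name, limit=None):
    """Build a query listing the subjects and values that use a property."""
    # Reject names that would break out of the <...> URI in the query.
    if not property_name or any(c in property_name for c in "<> \t\n\"{}"):
        raise ValueError("unexpected characters in property name")
    query = "SELECT DISTINCT ?s ?o WHERE { ?s <%s%s> ?o }" % (
        PROPERTY_NS, property_name)
    if limit is not None:
        query += " LIMIT %d" % limit
    return query


def request_url(query):
    """Return a GET URL that passes the query to the endpoint."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    q = usage_query("worldSnookerChampionshipRoundsProperty99", limit=10)
    print(q)
    print(request_url(q))
```

The optional limit is my addition; it keeps the result set small when all you want is a handful of examples of how a property is used.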

The first place I'd check, though, for appropriate DBpedia property names would be the DBpedia Ontology available from the same download page. It's not huge, defining metadata for about 1200 properties at this point, but it really brings the property vocabulary into ontology territory by defining domains, ranges, subclasses, and other relationships between terms that help you to get more out of them. Outside of that ontology, plenty of other hard work continues to make the DBpedia predicate vocabulary more valuable to all of us, so it's worth keeping an eye on the work going on around this vocabulary.

2 Comments

"While the Library of Congress Subject Headings provide a solid, professional taxonomy and a set of URIs for a wide variety of subjects and concepts, they don't have them for places or people."

While this is true, the Library of Congress does have authority files for those things, and I understand they plan on adding them to id.loc.gov as Linked Data soon.

Einstein: http://errol.oclc.org/laf/n79-22889.html

Great post!


Bob,

Great article (as usual ;) and might be worth it for us to cover this aspect in [1].
Thanks!


[1] http://ld2sd.deri.org/lod-ng-tutorial/#checklist