Normalizing company names with SPARQL and DBpedia

Wikipedia page redirection data, waiting for you to query it.

If you send your browser to http://en.wikipedia.org/wiki/Big_Blue, you'll end up at IBM's page, because Wikipedia knows that this nickname usually refers to this company. (Apparently, it's also a nickname for several high schools and universities.) This data pointing from nicknames to official names is also stored in DBpedia, which means that we can use SPARQL queries to normalize company names. You can use the same technique to normalize other kinds of names (for example, trying to send your browser to http://en.wikipedia.org/wiki/Bobby_Kennedy will actually send it to http://en.wikipedia.org/wiki/Robert_F._Kennedy), but a query that sticks to one domain will have a simpler job. Description Logics and all that.

The query below can be run with any SPARQL client that supports SPARQL 1.1, because it relies on the 1.1 features BIND, SERVICE, and COALESCE. I wanted it to cover these three cases:

  • Run it with an unofficial company name such as Big Blue, Apple Computer, or Kodak, and it should return the official company name.

  • Run it with an official company name such as IBM, Apple, Inc., or Eastman Kodak, and it should return that name.

  • Run it with something that isn't a company, such as Snee, and it shouldn't return anything.

The query's first BIND statement stores the name to check (including a language tag, because DBpedia is pretty consistent about using those) in the ?inputName variable, and the SERVICE keyword sends the graph pattern inside the braces that follow it off to DBpedia's SPARQL endpoint.

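PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT ?name
WHERE {
  # The name to normalize, with a language tag to match DBpedia's labels.
  BIND("Big Blue"@en AS ?inputName)
  SERVICE <http://dbpedia.org/sparql>
  {
    ?s rdfs:label ?inputName .
    {
      # Case 1: ?s redirects to a company; get that company's English label.
      ?s dbpo:wikiPageRedirects ?actualResource .
      ?actualResource a dbpo:Company .
      ?actualResource rdfs:label ?redirectsTo .
      FILTER ( lang(?redirectsTo) = "en" )
    }
    UNION
    # Case 2: ?s is itself a company, so the input name is already official.
    { ?s a dbpo:Company . }
  }
  # Evaluated locally: prefer the redirect target's label, else the input name.
  BIND(STR(COALESCE(?redirectsTo,?inputName)) AS ?name)
}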

After finding a resource (?s) that has the bound value as an rdfs:label value, DBpedia returns the UNION of two graph patterns. The first checks whether this resource is supposed to redirect to another dbpo:Company resource, and if so, stores the English rdfs:label of that resource in the variable ?redirectsTo.

If that graph pattern doesn't return anything because ?s doesn't have a dbpo:wikiPageRedirects property, but DBpedia does know that it's a dbpo:Company, the graph pattern after the UNION keyword will match.
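To make the two branches of the UNION a little more concrete, here is a small Python sketch that mimics the matching over a toy in-memory stand-in for DBpedia. The three dictionaries and the union_matches name are made up for illustration, and real DBpedia data is of course far larger and messier than this.

```python
# Toy stand-in for DBpedia; these entries are illustrative only.
LABEL_OF = {"Big_Blue": "Big Blue", "IBM": "IBM", "Snee": "Snee"}  # rdfs:label
REDIRECTS = {"Big_Blue": "IBM"}   # dbpo:wikiPageRedirects
COMPANIES = {"IBM"}               # resources typed as dbpo:Company


def union_matches(input_name):
    """Return one binding dict per solution, mimicking the UNION."""
    rows = []
    for s, label in LABEL_OF.items():
        if label != input_name:
            continue  # "?s rdfs:label ?inputName" did not match this resource
        # First branch: ?s redirects to a resource that is a company.
        target = REDIRECTS.get(s)
        if target in COMPANIES:
            rows.append({"s": s, "redirectsTo": LABEL_OF[target]})
        # Second branch: ?s is itself a company; ?redirectsTo stays unbound.
        if s in COMPANIES:
            rows.append({"s": s})
    return rows


print(union_matches("Big Blue"))  # [{'s': 'Big_Blue', 'redirectsTo': 'IBM'}]
print(union_matches("IBM"))       # [{'s': 'IBM'}]
print(union_matches("Snee"))      # []
```

A nickname produces a row with redirectsTo filled in, an official name produces a row without it, and a non-company produces no rows at all.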

After DBpedia returns any bound variables, the local client uses the COALESCE function to bind the value of ?redirectsTo to the ?name variable if ?redirectsTo got bound, and otherwise binds the value of ?inputName to it; the STR function strips the language tag from whichever value is used. (Because COALESCE is a new SPARQL 1.1 feature and DBpedia doesn't support any of 1.1 that I know of yet, this part has to be done locally.) If neither graph pattern matched, the query returns no results, which means that no such company is listed in DBpedia.
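The local COALESCE step is simple enough to sketch in a few lines of Python. This is only an approximation under some assumptions: each returned solution is represented as a plain dict, an unbound variable is represented by a missing key, and the coalesce and names_from_rows helpers are hypothetical names of my own rather than anything from a SPARQL library.

```python
def coalesce(*values):
    """Return the first value that is not None, roughly like SPARQL's COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def names_from_rows(rows, input_name):
    """Mimic BIND(STR(COALESCE(?redirectsTo, ?inputName)) AS ?name) per row."""
    # row.get() yields None when ?redirectsTo was left unbound in that row.
    return [str(coalesce(row.get("redirectsTo"), input_name)) for row in rows]


# A redirect row yields the official label; a plain company row falls back
# to the input name; no rows at all means no ?name values.
print(names_from_rows([{"redirectsTo": "IBM"}], "Big Blue"))  # ['IBM']
print(names_from_rows([{}], "IBM"))                           # ['IBM']
print(names_from_rows([], "Snee"))                            # []
```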

I tested this with both ARQ and TopBraid Composer. With TBC (including the free version), it was fun to put the whole query into a SPIN function that I called normalizeCompanyName, so that I could make calls such as normalizeCompanyName("Kodak") or normalizeCompanyName("Apple, Inc.") in the middle of other SPARQL queries.
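If you're not using TopBraid, the same wrap-it-in-a-function idea can be roughed out in Python. In this sketch the helper names (sparql_literal, build_query, normalize_company_name) are my own inventions, and run_query is a placeholder for whatever SPARQL 1.1 client you have on hand: I'm assuming it takes a query string and returns a list of ?name values, which is not the API of any particular library.

```python
from string import Template

# The query from this post, with the input literal turned into a placeholder.
# string.Template is used because the SPARQL braces would confuse str.format.
QUERY = Template("""\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpo: <http://dbpedia.org/ontology/>
SELECT ?name
WHERE {
  BIND($literal AS ?inputName)
  SERVICE <http://dbpedia.org/sparql>
  {
    ?s rdfs:label ?inputName .
    {
      ?s dbpo:wikiPageRedirects ?actualResource .
      ?actualResource a dbpo:Company .
      ?actualResource rdfs:label ?redirectsTo .
      FILTER ( lang(?redirectsTo) = "en" )
    }
    UNION
    { ?s a dbpo:Company . }
  }
  BIND(STR(COALESCE(?redirectsTo,?inputName)) AS ?name)
}
""")


def sparql_literal(text, lang="en"):
    """Quote text as a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def build_query(name):
    """Return the full query text with name as the value of ?inputName."""
    return QUERY.substitute(literal=sparql_literal(name))


def normalize_company_name(name, run_query):
    """Return the first ?name value, or None if the query finds no company.

    run_query is a caller-supplied function (hypothetical): it takes the
    query string and returns a list of ?name values.
    """
    results = run_query(build_query(name))
    return results[0] if results else None
```

Passing run_query in as an argument keeps the sketch independent of any particular client, and it also makes it easy to try the function with a stub before pointing it at a real endpoint.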

It took me a lot of tweaking to get the query above to work the way I wanted it to, and I wouldn't be surprised if it can still be improved. I'd love to hear any suggestions.

