20 January 2019

Querying machine learning distributional semantics with SPARQL

Bringing together my two favorite kinds of semantics.


When I wrote Semantic web semantics vs. vector embedding machine learning semantics, I described how distributional semantics--whose machine learning implementations are very popular in modern natural language processing--are quite different from the kind of semantics that RDF people usually talk about. I recently learned of a fascinating project that brings RDF technology and distributional semantics together, letting our SPARQL query logic take advantage of entity similarity as rated by machine learning models.

To review a little from that blog entry: machine learning implementations of distributional semantics can identify some of the meanings of words by analyzing their relationships with other words in a set of training data. For example, after analyzing the distribution of terms in a large enough text corpus, such a system can answer the question "woman is to man as queen is to what?" Along with the answer of "king", discussions of this technology typically bring up other examples such as the questions "walking is to walked as swimming is to what?" (an especially nice one because "swim" is an irregular verb) and "London is to England as Berlin is to what?"

These examples are a bit oversimplified. Instead of such a straightforward answer, an implementation such as word2vec typically responds with a list of scored words. If the analyzed corpus was large enough, asking word2vec to complete the second pair in "woman man queen" will get you a list of words with "king" having the highest score. In my experiments, this was nice for the "london england berlin" case, because while germany had the highest score, prussia had the second highest, and Berlin was the capital of Prussia for a few centuries.

word2vec doesn't actually compare the strings "london" and "england" and "berlin". It uses cosine similarity to compare vectors that were assigned to each word as a result of the training step done with the input corpus--the machine "learning" part. Then, it looks for vectors whose similarity to the berlin vector is comparable to the similarity between the london and england vectors.
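
To make the arithmetic concrete (this is the standard formulation, nothing specific to the work described below): the similarity of two vectors u and v is the cosine of the angle between them, and the analogy question is answered by finding the vocabulary word whose vector is most similar to a simple combination of the other three vectors:

\[
\mathrm{similarity}(u,v) \;=\; \cos\theta \;=\; \frac{u \cdot v}{\lVert u \rVert\,\lVert v \rVert}
\qquad\qquad
\text{answer} \;=\; \arg\max_{w}\; \mathrm{similarity}\big(v_w,\; v_{\mathrm{berlin}} + v_{\mathrm{england}} - v_{\mathrm{london}}\big)
\]

A similarity near 1 means the two vectors point in nearly the same direction; word2vec-style implementations typically also exclude the query words themselves from the candidate answers, which is why "berlin" doesn't come back as its own best match.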

Some of the most interesting work in machine learning of the past few years has built on the use of vectors to represent entities other than words. The popular doc2vec (originally implemented by my CCRi co-worker Tim Emerick) does it with documents, and others have done it with audio clips and images.

It's one thing to pick out an entity and then ask for a list of entities whose vectors are similar to that of the selected entity. Researchers at King Abdullah University of Science and Technology, the University of Birmingham, and Maastricht University have collaborated to take this further by mixing in some SPARQL. Their paper "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" describes "a general framework for integrating structured data and their vector space representations [that] allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query". They have made their implementation available as a Docker image and also put up a SPARQL endpoint with their sample data and SPARQL extensions.

Vec2SPARQL lets you move beyond simple comparisons of vector similarity scores by combining those comparisons with everything else that SPARQL can do. As the authors write,

For example, once feature vectors are extracted from images, meta-data that is associated with the images (such as geo-locations, image types, author, or similar) could be queried using SPARQL and combined with the semantic queries over the feature vectors extracted from the images themselves. Such a combination would, for example, allow to identify the images authored by person a that are most similar to an image of author b; it can enable similarity- or analogy-based search and retrieval in precisely delineated subsets; or, when feature learning is applied to structured datasets, can combine similarity search and link prediction based on knowledge graph embeddings with structured queries based on SPARQL.

The paper's authors extended Apache Jena ARQ (the open source cross-platform command line SPARQL processor that I recommend in my book Learning SPARQL) with two new functions that make it easier to work with these vectors. The similarity(?x,?y) function lets you compute the similarity of two vectors so that you can use the result in a FILTER, BIND, or SELECT statement. For example, you might use it in a FILTER statement to only retrieve resources whose similarity to a particular resource was above a specified threshold. Their mostSimilar(?x,n) function asks for the n most similar entities to the one passed as the first argument.
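
I haven't run these functions myself, so the following is only a rough sketch of how a query using similarity() might look. Everything here other than the idea of calling the extension function--the v: and ex: prefixes, the class name, and the resource URIs--is made up for illustration; the paper and the project's repository show the real namespaces and sample data.

# A hypothetical query: find resources whose vectors score above 0.8 in
# similarity to a chosen image, listed most similar first. The prefixes
# and names below are placeholders, not the project's own.
PREFIX v:  <http://example.org/vec2sparql/functions#>
PREFIX ex: <http://example.org/imagedata/>

SELECT ?otherImage ?score
WHERE {
  ?otherImage a ex:ChestXrayImage .
  BIND(v:similarity(ex:image42, ?otherImage) AS ?score)
  FILTER(?score > 0.8 && ?otherImage != ex:image42)
}
ORDER BY DESC(?score)
LIMIT 10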

Their paper discusses two applications of Vec2SPARQL, in which they "demonstrate using biomedical, clinical, and bioinformatics use cases how [their] approach can enable new kinds of queries and applications that combine symbolic processing and retrieval of information through sub-symbolic semantic queries within vector spaces". As they described the first of their two examples,

...we can use Vec2SPARQL to perform queries of a knowledge graph of mouse genes, diseases and phenotypes and incorporate Vec2SPARQL similarity functions... Our aim in this use case is to find mouse gene associations with human diseases by prioritizing them using their phenotypic similarity, and simultaneously restrict the similarity comparisons to genes and diseases with specific properties (such as being associated with a particular phenotype).

The paper describes where they got their data and how they prepared it, and it shows a brief but expressive query that let them achieve their goal.

In their second example, after assigning vectors to over 112,000 human chest x-ray images that also included gender, age, and diagnosis metadata, they could query for image similarity and also add filters to these queries such as combinations of age range and gender to find other patterns of similarity.

The paper goes into greater detail on their sample data and the similarity measures that they used. It also points to their source code on GitHub and a "SPARQL endpoint" at http://sparql.bio2vec.net/ that is really more of a SPARQL endpoint query form. (The actual endpoint is at http://sparql.bio2vec.net/patient_embeddings/query, and I successfully sent a query there with curl.)

For an academic paper, "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" is quite readable. (Although I didn't have the right biology background to closely follow all the discussions of their sample query data, I could just about handle the math as shown.) I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.


Please add any comments to this Google+ post.

23 December 2018

Playing with wdtaxonomy

Those queries from my last blog entry? Never mind!

After I wrote about Extracting RDF data models from Wikidata in my blog last month, Ettore Rizza suggested that I check out wdtaxonomy, which extracts taxonomies from Wikidata by retrieving the kinds of data that my blog entry's sample queries retrieved, and then displays the results as a tree. After playing with it, I'm tempted to tell everyone who read that blog entry to ignore the example queries I included, because you can learn a lot more from wdtaxonomy.

The queries in that blog entry might still give you some useful perspective on how SPARQL can retrieve triples from Wikidata that express tree-ish relationships between the concepts of a given domain that have Wikipedia pages--whether you want to call that a taxonomy or an ontology--but I was just dabbling, while wdtaxonomy is a full-featured serious application for this.

Jakob Voss designed wdtaxonomy as both a command line utility and an NPM module that you can reference from applications. I tried the command line version and had a lot of fun. To try it with my periodic table element example that I wrote about last month, I started by entering "wdtaxonomy Q11344" (using the same local name for the Wikidata identifier that I used before), and the results were impressive.

wdtaxonomy typically outputs a text-based tree with various information about the nodes of the tree. Instead of pasting a sample here, I'm showing a screen shot of the beginning of the output so that you can see the nice color coding:

The wdtaxonomy readthedocs.io documentation lists over two dozen command line options that you can use to customize the output. (Entering "wdtaxonomy" alone at the command line gives a good summary.) My favorite is -s, which shows you the SPARQL query that wdtaxonomy would use to retrieve the requested information from Wikidata. Here is what that gives you when added to the Q11344 command line I entered above:

$ wdtaxonomy -s Q11344
  SELECT ?item ?broader ?itemLabel ?instances ?sites WITH {
    SELECT DISTINCT ?item { ?item wdt:P279* wd:Q11344 }
  } AS %items WHERE { 
    INCLUDE %items .
    OPTIONAL { ?item wdt:P279 ?broader } .
    {
      SELECT ?item (count(distinct ?element) as ?instances) {
        INCLUDE %items.
        OPTIONAL { ?element wdt:P31 ?item }
      } GROUP BY ?item
    }
    {
      SELECT ?item (count(distinct ?site) as ?sites) {
        INCLUDE %items.
        OPTIONAL { ?site schema:about ?item }
      } GROUP BY ?item
    }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en"
    }
  }

(The INCLUDE keyword used in this query is a Blazegraph and Anzo extension to the SPARQL standard.) Combining this -s option with other options, such as -i to include instances or -d to include item descriptions, shows what SPARQL query the tool would generate to retrieve this additional information. It's a great opportunity to learn more about SPARQL, about the Wikidata data model, and about their relationship. (I have worried that this data model would scare off people who are new to SPARQL--that if their first data set to query was Wikidata, they might think that the complexity of the necessary queries was because of SPARQL and not because of Wikidata--but when I see all the great activity on Twitter around the use of SPARQL with Wikidata lately, I don't worry so much anymore.)

The ability to get at the generated SPARQL queries is also a huge help to my original goal of retrieving triples that let me store an RDFS/OWL ontology or a SKOS taxonomy about Wikipedia entities. I can change the SELECT part to a CONSTRUCT clause to create triples that use the variables bound in wdtaxonomy's WHERE clauses. wdtaxonomy (or rather, Jakob) has done the difficult work of assembling the necessary query logic and we can just take it and use it.
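
For example, here is one way that substitution might look--a sketch of the approach, not wdtaxonomy's own output--using a simplified version of the generated WHERE clause shown above and the Wikidata label service in its manual mode so that ?itemLabel gets bound for the CONSTRUCT template:

# Build a SKOS version of the subclass tree under chemical element
# (wd:Q11344) from the same bindings that wdtaxonomy's query uses.
CONSTRUCT {
  ?item a skos:Concept ;
        skos:prefLabel ?itemLabel ;
        skos:broader ?broader .
}
WHERE {
  ?item wdt:P279* wd:Q11344 .
  OPTIONAL { ?item wdt:P279 ?broader }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
    ?item rdfs:label ?itemLabel .
  }
}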

Some of the other command line options I liked include -U to get full URIs and -r to get superclasses of the named entity instead of its subclasses. I encourage everyone interested in SPARQL and Wikidata to install wdtaxonomy and start playing with it. Especially with that -s option!


Please add any comments to this Google+ post.

18 November 2018

Extracting RDF data models from Wikidata

That's "models", plural.


Some people complain when an RDF dataset lacks a documented data model. A great thing about RDF and SPARQL is that if you want to know what kind of modeling might have been done for a dataset, you just look, even if they're using non-(W3C-)standard modeling structures. They're still using triples, so you look at the triples.

If I know that there is an entity x:thing23 in a dataset, I'm going to query for {x:thing23 ?p ?o} and see what information there is about that entity. Hopefully I will find an rdf:type triple saying that it's a member of a class. If not, maybe it uses some other home-grown way to indicate class membership; either way, you can then start querying to find out about the class's relationships to properties and other classes, and you've got a data model. What if it doesn't use RDFS to describe these modeling structures and their relationships? A CONSTRUCT query will convert it to a data model that does.
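
A minimal sketch of that kind of conversion, assuming a hypothetical dataset whose home-grown x:isA property indicates class membership instead of rdf:type:

PREFIX x:    <http://example.org/ns/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Convert the dataset's home-grown class membership triples into
# standard rdf:type triples and declare each class as an rdfs:Class.
CONSTRUCT {
  ?instance rdf:type ?class .
  ?class rdf:type rdfs:Class .
}
WHERE {
  ?instance x:isA ?class .
}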

And, if {x:thing23 ?p ?o} triples don't indicate any class membership, just seeing what the ?p values are tells you something about the data model. If certain entities use certain properties for their predicates, and other entities use a list that overlaps with that, you've learned more about relationships between sets of entities in the dataset. All of these things can be investigated with simple queries.
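
For example, a quick census of a dataset's predicates takes one short query:

# Which predicates does the dataset use, and how often? The answer is a
# rough outline of the implicit data model.
SELECT ?p (COUNT(*) AS ?uses)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?uses)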

Wikidata offers tons of great data and modeling for us RDF people, but it wasn't designed for us. They created their own model and then expressed the model and instance data in RDF, and I'm not going to complain; can you imagine how cool it would be if Google did the same with their knowledge graph? (When I tweeted "Handy Wikidata hints for people who have been using RDF and SPARQL since before Wikidata was around: use wdt:P31 instead of rdf:type and wdt:P279 instead of rdfs:subClassOf", Mark Watson replied that he liked my sense of humor. While I hadn't meant to be funny, I do appreciate his sense of humor.) As I've worked at understanding Wikidata's documentation about their mapping to RDF, I've had fun just querying around to understand the structures. Again: this is one of the key reasons that RDF and SPARQL are great! Because we can do that!
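
Those two properties go a long way. For example, the Wikidata version of "list the instances of the chemical element class (wd:Q11344), including instances of its subclasses" looks like this:

# wdt:P31 ("instance of") plays the role of rdf:type, and wdt:P279
# ("subclass of") plays the role of rdfs:subClassOf.
SELECT ?element ?elementLabel
WHERE {
  ?element wdt:P31/wdt:P279* wd:Q11344 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}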

Last month I described how you can find the subclass tree under a given class in Wikidata and since then I've done further exploration of how to pull data models out of Wikidata. Note that I say "models" and not "model". Olivier Rossel recently referred to extracting the data model of Wikidata (my translation from his French), but I worry that looking for "the" grand RDF data model of Wikidata might set someone up for disappointment. I think that looking for data models to suit various projects will be more productive. (Olivier and I discussed this further in the "Handy Wikidata hints" thread mentioned above.)

The following query builds on the one I did last month; it can either get the class tree below a given class or get that class's superclasses instead. It creates triples that express the classes and their relationships using W3C standard properties.

CONSTRUCT {
  ?class a owl:Class . 
  ?class rdfs:subClassOf ?superclass . 
  ?class rdfs:label ?classLabel . 
  ?property rdfs:domain ?class . 
  ?property rdfs:label ?propertyLabel .
}
WHERE {
  BIND(wd:Q11344 AS ?mainClass) .    # Q11344 chemical element; Q1420 automobile
  
  # Pick one or the other of the following two triple patterns. 
  ?class wdt:P279* ?mainClass.     # Find subclasses of the main class. 
  #?mainClass wdt:P279* ?class.     # Find superclasses of the main class. 
  
  ?class wdt:P279 ?superclass .     # So we can create rdfs:subClassOf triples
  ?class rdfs:label ?classLabel.
  OPTIONAL {
    ?class wdt:P1963 ?property.
    ?property rdfs:label ?propertyLabel.
    FILTER((LANG(?propertyLabel)) = "en")
    }
  FILTER((LANG(?classLabel)) = "en")
}
      

(Because the query uses prefixes that Wikidata already understands, I didn't need to declare any.) When I ran the query in the Wikidata query service form, there were too many triples to see at once, so I put it into a subtreeClasses.rq file and ran it with curl from the command line like this:

curl --data-urlencode "query@subtreeClasses.rq" https://query.wikidata.org/sparql -H "Accept: text/turtle"  > chemicalElementSubClasses.ttl
      

Loading the result into TopBraid Composer Free edition (available here; the Free edition is a choice on the Product dropdown list) showed a class tree like this:

(It's tempting to add an entry for Frinkonium as a subclass of "hypothetical chemical element".) I understand that the Wikimedia Foundation had their reasons for not describing their models with the standard vocabularies, but this shows the value of using the standards: interoperability with other tools. It also shows that the Foundation's avoidance of the standard model vocabularies is not a big deal, and that we should be glad that they make this available in RDF at all, because the sheer fact that it's in RDF makes it easy to convert to whatever RDF we want with a CONSTRUCT query. (Again, imagine if Google did this with any portion of their knowledge graph...)

The query above also looks for properties for those classes so that it can express those in the output with the RDFS vocabulary. It didn't find many, but this bears further investigation. This query shows that in addition to the chemical element class having properties, there are constraints on those properties described with triples, so there's a lot more that can be done here to pull richer models out of Wikidata and then express them in more standard vocabularies.
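
As a starting point for that further investigation, a query along these lines (my sketch, not the query linked above) lists the properties that Wikidata associates with the chemical element class and any constraints declared on them. P1963 is "properties for this type", P2302 is "property constraint", and the p: and ps: prefixes reach the statement nodes where the constraint details live:

SELECT ?property ?propertyLabel ?constraintTypeLabel
WHERE {
  wd:Q11344 wdt:P1963 ?property .              # properties for this type
  OPTIONAL {
    ?property p:P2302 ?constraintStatement .   # property constraint statements
    ?constraintStatement ps:P2302 ?constraintType .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}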

And of course there's the possibility of pulling out instance data to go with these models. Queries for that would be easy enough to assemble, but you might end up with so much data that Wikidata times out before giving it to you; you could use the techniques I described in Pipelining SPARQL queries in memory with the rdflib Python library to retrieve instance URIs and then retrieve the additional triples about those instances in batches of queries that use the VALUES keyword.
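
Here's a sketch of what one of those batched queries might look like, with three Wikidata item URIs standing in for whichever batch of instance URIs the first query retrieved:

# Retrieve all the triples about a small batch of instances at a time.
# A script would plug each successive batch of instance URIs into the
# VALUES clause and accumulate the results.
CONSTRUCT { ?instance ?p ?o . }
WHERE {
  VALUES ?instance { wd:Q556 wd:Q560 wd:Q568 }
  ?instance ?p ?o .
}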

Lots of instance data and rich models, all transformed to conform to the W3C standards so that they work with lots of open source and commercial tools--the possibilities are pretty impressive. If anyone pulls datasets like this out of Wikidata for their field, let me know about it!


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Archives

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0