Querying machine learning distributional semantics with SPARQL

Bringing together my two favorite kinds of semantics.
I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.

When I wrote Semantic web semantics vs. vector embedding machine learning semantics, I described how distributional semantics--whose machine learning implementations are very popular in modern natural language processing--are quite different from the kind of semantics that RDF people usually talk about. I recently learned of a fascinating project that brings RDF technology and distributional semantics together, letting our SPARQL query logic take advantage of entity similarity as rated by machine learning models.

To review a little from that blog entry: machine learning implementations of distributional semantics can identify some of the meanings of words by analyzing their relationships with other words in a set of training data. For example, after analyzing the distribution of terms in a large enough text corpus, such a system can answer the question "woman is to man as queen is to what?" Along with the answer of "king", discussions of this technology typically bring up other examples such as the questions "walking is to walked as swimming is to what?" (an especially nice one because "swim" is an irregular verb) and "London is to England as Berlin is to what?"

These examples are a bit oversimplified. Instead of such a straightforward answer, an implementation such as word2vec typically responds with a list of scored words. If the analyzed corpus was large enough, asking word2vec to complete the second pair in "woman man queen" will get you a list of words with "king" having the highest score. In my experiments, this was nice for the "london england berlin" case, because while germany had the highest score, prussia had the second highest, and Berlin was the capital of Prussia for a few centuries.

word2vec doesn't actually compare the strings "london" and "england" and "berlin". It uses cosine similarity to compare vectors that were assigned to each word as a result of the training step done with the input corpus--the machine "learning" part. Then, it looks for vectors whose similarity to the berlin vector is comparable to the similarity between the london and england vectors.

Some of the most interesting work in machine learning of the past few years has built on the use of vectors to represent entities other than words. The popular doc2vec (originally implemented by my CCRi co-worker Tim Emerick) does it with documents, and others have done it with audio clips and images.

It's one thing to pick out an entity and then ask for a list of entities whose vectors are similar to that of the selected entity. Researchers at King Abdullah University of Science and Technology, the University of Birmingham, and Maastricht University have collaborated to take this further by mixing in some SPARQL. Their paper Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings describes "a general framework for integrating structured data and their vector space representations [that] allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query". They have made their implementation available as a Docker image and also put up a SPARQL endpoint with their sample data and SPARQL extensions.

Vec2SPARQL lets you use SPARQL to move beyond simple comparison of vector similarity scores to combine SPARQL's abilities with this. As they write,

For example, once feature vectors are extracted from images, meta-data that is associated with the images (such as geo-locations, image types, author, or similar) could be queried using SPARQL and combined with the semantic queries over the feature vectors extracted from the images themselves. Such a combination would, for example, allow to identify the images authored by person a that are most similar to an image of author b; it can enable similarity- or analogy-based search and retrieval in precisely delineated subsets; or, when feature learning is applied to structured datasets, can combine similarity search and link prediction based on knowledge graph embeddings with structured queries based on SPARQL.

The paper's authors extended Apache Jena ARQ (the open source cross-platform command line SPARQL processor that I recommend in my book Learning SPARQL) with two new functions that make it easier to work with these vectors. The similarity(?x,?y) function lets you compute the similarity of two vectors so that you can use the result in a FILTER, BIND, or SELECT statement. For example, you might use it in a FILTER statement to only retrieve resources whose similarity to a particular resource was above a specified threshold. Their mostSimilar(?x,n) function asks for the n most similar entities to the one passed as the first argument.

Their paper discusses two applications of Vec2SPARQL, in which they "demonstrate using biomedical, clinical, and bioinformatics use cases how [their] approach can enable new kinds of queries and applications that combine symbolic processing and retrieval of information through sub-symbolic semantic queries within vector spaces". As they described the first of their two examples,

...we can use Vec2SPARQL to perform queries of a knowledge graph of mouse genes, diseases and phenotypes and incorporate Vec2SPARQL similarity functions... Our aim in this use case is to find mouse gene associations with human diseases by prioritizing them using their phenotypic similarity, and simultaneously restrict the similarity comparisons to genes and diseases with specific properties (such as being associated with a particular phenotype).

The paper describes where they got their data and how they prepared it, and it shows a brief but expressive query that let them achieve their goal.

In their second example, after assigning vectors to over 112,000 human chest x-ray images that also included gender, age, and diagnosis metadata, they could query for image similarity and also add filters to these queries such as combinations of age range and gender to find other patterns of similarity.

The paper goes into greater detail on the data used for their samples and the similarity measures that they used. It also points to their source code on github and a "SPARQL endpoint" at http://sparql.bio2vec.net/ that is really more of a SPARQL endpoint query form. (The actual endpoint is at http://sparql.bio2vec.net/patient_embeddings/query, and I successfully sent a query there with curl.)

For an academic paper, "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" is quite readable. (Although I didn't have the right biology background to closely follow all the discussions of their sample query data, I could just about handle the math as shown.) I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.


Please add any comments to this Google+ post.