24 March 2019

Changing my blog's domain name and platform

New look, new domain name.

Hugo logo

For too long I've postponed migrating my blog to something more phone-friendly. I had accumulated many notes about doing this, and I also wanted to move more of my online life from the snee.com domain to bobdc.com. When someone recently asked me about changing the stylesheet (I have dug and dug in those notes but can't remember who; I'll add their name here if I ever find it), I thought I'd take a deep breath and follow through. This is the last new blog entry you'll see on the snee.com domain; from now on the blog continues at bobdc.com/blog, where you'll also find this entry and converted versions of all the entries I've posted since starting snee.com/bobdc.blog in 2005.

The conversion of the old entries was most of the work, but with some Perl and XSLT and pandoc and spit and duct tape I got the legacy content into pretty good shape for the new platform.

Of course, the platform choice was a geeky thing to agonize over. I finally went with Hugo, a Go-based static site generator. (I never had to learn the Go programming language, but it looks cool enough.)

It's a bit scary to think of the high percentage of the world's blog entries that are created by data entry into web forms that then use a bunch of PHP to manage that content's storage in relational databases. Having spent much of my career helping people store non-tabular content in standards-based non-tabular storage tools, I definitely wanted to get away from using PHP and relational database managers for narrative content, so I researched various static site generators before settling on Hugo.

Simple web sites like my learningsparql.com and datascienceglossary.org sites are just plain static sites: HTML files that I edit as necessary. A static site generator lets you store content separate from the styling and then generates HTML for your site based on the combination. If you want to change your website's layout or styling, you edit the CSS or whatever and then regenerate the HTML. (The version of MovableType that I used on snee.com actually did static site generation, but all the styling was managed with a mess of old PHP. I haven't upgraded it in ten years because the last time I did it broke so much.) A selling point of Hugo is that it does this very quickly--or, to use the now-clichéd phrase that they prefer, "blazingly fast".

I knew about Jekyll and Sphinx from work because both are used for geomesa.org. After researching alternatives I decided that I liked the available Hugo themes the most. The Hugo documentation isn't very good, but the people on the discussion forum are very helpful, sometimes answering within minutes. If there is any interest I may write a blog entry about the important Hugo techniques I had to track down to customize my blog because they were not written up in an easily findable place.

You store your Hugo content separately from the styling using Hugo's own variation of markdown. As a longstanding XML guy ever since it was a four-letter word, I have ranted about what's wrong with markdown--or, as I should say, "the markdowns"--but it works for what I want to do in my blog, and you can embed just about any sensible HTML you want in places where markdown falls short. I would have preferred a static site generator where the content I wrote for each new blog entry conformed to some simple XHTML profile, but I just couldn't find anything with good themes and the right level of automation.
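For context, a Hugo content file keeps its metadata in a "front matter" block above the markdown body. The sketch below is illustrative only--the field values are invented, Hugo also accepts YAML and JSON front matter, and real entries carry more fields:

```markdown
+++
title = "Changing my blog's domain name and platform"
date = 2019-03-24
+++

Regular markdown goes here, and where markdown falls short you can
embed sensible HTML such as <kbd>this</kbd> directly in the body.
```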

In the lower-right of my snee.com blog you'll see four variations on Atom and RSS feeds. More than one Atom or RSS feed seems to be difficult in Hugo, so my new blog's Atom feed has summaries and links to the original postings and the new blog's RSS feed has the full entries. I will be setting the snee.com ones to redirect to the bobdc.com ones shortly, but you can just subscribe to the new ones now if you like.

So, I apologize for the lack of phone-friendliness of my blog for the last few years and hope you enjoy the new more responsive version of my blog.

24 February 2019

curling SPARQL

A quick reference.

I've been using the curl utility to retrieve data from SPARQL endpoints for years, but I still have trouble remembering some of the important syntax, so I jotted down a quick reference for myself and I thought I'd share it. I also added some background.

Quick reference

Submit a URL-encoded SPARQL query on the operating system command line to the endpoint http://edan.si.edu/saam/sparql:

curl "http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208"

(Quoting the URL isn't always necessary, but won't hurt. Omitting it may hurt if some of the characters mean something special to your operating system's command line interpreter.)

Submit the same query stored in the file query1.rq:

curl --data-urlencode "query@query1.rq" http://edan.si.edu/saam/sparql

There is no need to escape the query in the file, because the --data-urlencode parameter tells curl to do so.

The above queries return the data in whatever format the endpoint's system administrators chose as the default. You can pass a request header to specify that you want a particular format. The following requests comma-separated values:

curl -H "Accept: text/csv" --data-urlencode "query@query1.rq"  http://edan.si.edu/saam/sparql

Other possible content types are application/sparql-results+json, application/sparql-results+xml, and text/tab-separated-values.

The above examples all use a SELECT query. A CONSTRUCT query requests triples, so instead of CSV or one of the other tabular formats you want an RDF serialization such as Turtle:

curl -H "Accept: text/turtle" --data-urlencode "query@query2.rq"  http://edan.si.edu/saam/sparql

Other possible content types for CONSTRUCT queries are application/rdf+xml, application/rdf+json, and, for N-Triples, text/plain--although you're better off requesting application/n-triples. The bio2rdf GitHub page has good long lists of content types for both SELECT and CONSTRUCT queries, although not all endpoints will support all of the listed types.

Background

curl lets you submit many kinds of HTTP requests to HTTP servers. It comes preinstalled on macOS and most Linux distributions, and if you don't have it on your Windows machine, you can download it.

If you enter curl with no parameters other than a URL, like this,

curl http://www.learningsparql.com

it does the same HTTP GET that a browser would do. This has the same effect as doing a browser View Source on that web page.

It gets more interesting when you're not pointing curl at a static web page like http://www.learningsparql.com but at a dynamic resource such as a SPARQL endpoint. A SPARQL endpoint is usually identified with a URL ending with /sparql. I tested everything shown above with these endpoint URLs:

  • https://query.wikidata.org/bigdata/namespace/wdq/sparql, the SPARQL endpoint for Wikidata.

  • http://localhost:3030/myDataset/sparql, the SPARQL endpoint for a local instance of Apache Jena Fuseki. This is the triplestore that I described in the "Updating Data with SPARQL" chapter of my book Learning SPARQL because, for a server that accepts SPARQL UPDATE commands, it's so easy to get up and running. Before running the queries against this endpoint I created a dataset on this running instance with the clever name of myDataset and loaded some triples into it. As you can see, a Fuseki endpoint URL includes the dataset name.

  • http://edan.si.edu/saam/sparql, the SPARQL endpoint for the Smithsonian Institution. I used this one in the examples here because it's the shortest of the three endpoint URLs that I used for testing.

The simplest way to send a query to a SPARQL endpoint is to add query=[your URL-encoded query] to the end of the endpoint's URL as with the very first example above. You can paste the resulting URL into the address bar of a web browser so that the browser will retrieve the query results from the endpoint, but curl lets you retrieve the results from a command line so that you can save the returned data and use it as part of an application.

URL encoding is the process of taking characters that might screw up the parsing of the URL and converting each to a percent sign followed by hexadecimal digits representing that character's encoded value--most often, converting each space to %20. For example, the escaped version of the query SELECT * WHERE {?s ?p ?o} LIMIT 8 that I used in the examples above is SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208. Most programming languages offer built-in functions to do this; I usually paste one of these queries into a form on a website like this one and then copy the result after having the form do the conversion.
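If you'd rather do the conversion at the command line than in a web form, here's one way, assuming python3 is on your path (note that Python's quote() also escapes the asterisk, which SPARQL endpoints accept just as happily as a literal *):

```shell
# URL-encode a SPARQL query with Python's standard urllib library.
python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' \
  'SELECT * WHERE {?s ?p ?o} LIMIT 8'
# prints SELECT%20%2A%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208
```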

When you add the escaped query to a SPARQL endpoint URL such as the Smithsonian one and enter the result as a parameter to curl at your command line, like this,

curl http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208

it should retrieve a SPARQL Query Results JSON Format version of the data requested by that query, because that's the default format for that endpoint.

I actually don't escape queries and add them to a curl command line often. When I'm refining a query by iteratively editing and running it, re-encoding the URL each time can be a pain, so I usually store the query in a text file (query1.rq for the sample SELECT query above and query2.rq for the CONSTRUCT query) and tell curl to URL-encode the file's contents and send the result off to the SPARQL endpoint.

If I keep the file with the query in a text editor, I can refine it, save it, and run the same command over and over without worrying about escaping each revision of the query. (Because my editor is Emacs, I could actually send the query to the endpoint using Emacs SPARQLMode, but today's topic is curl.)

The curl website has plenty of documentation, but you can learn a lot with just this:

  curl --help

Among the many, many options, some useful ones are -o to redirect output to a file and -L for "follow location hints" (that is, if the server has instructions to redirect a request for a given URL to something else, take the hint). Another is -I for "Show document info only": just get information about the requested "document" without actually retrieving a named resource, which is useful for debugging. The classic -v for "verbose" is also handy for debugging.

Take a look at the available options, experiment with some SPARQL endpoints, and soon you'll be using "curl" as a verb (for example, "I tried to curl it but I didn't have the right certs"--see the -E command line option for more on that) and you won't be talking about hairstyling, arm exercises, or sliding round stones across the ice.

(I just learned about Curling SPARQL HTTP Graph Store protocol by @jindrichmynarz, so if you've gotten this far, you'll like that too.)

curling lamp

Curling image by Greg Scheckter via Flickr, CC some rights reserved


Comments? Just tweet to @bobdc for now, because Google+ is shutting down. I will be moving my blog to a new more phone-responsive platform shortly and I'm researching options for hosted comments.

20 January 2019

Querying machine learning distributional semantics with SPARQL

Bringing together my two favorite kinds of semantics.

When I wrote Semantic web semantics vs. vector embedding machine learning semantics, I described how distributional semantics--whose machine learning implementations are very popular in modern natural language processing--are quite different from the kind of semantics that RDF people usually talk about. I recently learned of a fascinating project that brings RDF technology and distributional semantics together, letting our SPARQL query logic take advantage of entity similarity as rated by machine learning models.

To review a little from that blog entry: machine learning implementations of distributional semantics can identify some of the meanings of words by analyzing their relationships with other words in a set of training data. For example, after analyzing the distribution of terms in a large enough text corpus, such a system can answer the question "woman is to man as queen is to what?" Along with the answer of "king", discussions of this technology typically bring up other examples such as the questions "walking is to walked as swimming is to what?" (an especially nice one because "swim" is an irregular verb) and "London is to England as Berlin is to what?"

These examples are a bit oversimplified. Instead of such a straightforward answer, an implementation such as word2vec typically responds with a list of scored words. If the analyzed corpus was large enough, asking word2vec to complete the second pair in "woman man queen" will get you a list of words with "king" having the highest score. In my experiments, this was nice for the "london england berlin" case, because while germany had the highest score, prussia had the second highest, and Berlin was the capital of Prussia for a few centuries.

word2vec doesn't actually compare the strings "london" and "england" and "berlin". It uses cosine similarity to compare vectors that were assigned to each word as a result of the training step done with the input corpus--the machine "learning" part. Then, it looks for vectors whose similarity to the berlin vector is comparable to the similarity between the london and england vectors.
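To make "cosine similarity" concrete, here's a quick self-contained illustration (not from any word2vec implementation, and real word vectors have hundreds of dimensions rather than the toy ones here):

```shell
python3 - <<'EOF'
# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths. 1.0 means pointing in the same direction;
# 0.0 means the vectors are orthogonal.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(round(cosine([1, 2, 3], [2, 4, 6]), 6))  # parallel vectors: 1.0
print(round(cosine([1, 0], [0, 1]), 6))        # orthogonal vectors: 0.0
EOF
```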

Some of the most interesting work in machine learning of the past few years has built on the use of vectors to represent entities other than words. The popular doc2vec (originally implemented by my CCRi co-worker Tim Emerick) does it with documents, and others have done it with audio clips and images.

It's one thing to pick out an entity and then ask for a list of entities whose vectors are similar to that of the selected entity. Researchers at King Abdullah University of Science and Technology, the University of Birmingham, and Maastricht University have collaborated to take this further by mixing in some SPARQL. Their paper Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings describes "a general framework for integrating structured data and their vector space representations [that] allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query". They have made their implementation available as a Docker image and also put up a SPARQL endpoint with their sample data and SPARQL extensions.

Vec2SPARQL lets you use SPARQL to move beyond simply retrieving a list of vector similarity scores: you can combine those scores with everything else that SPARQL queries can do. As they write,

For example, once feature vectors are extracted from images, meta-data that is associated with the images (such as geo-locations, image types, author, or similar) could be queried using SPARQL and combined with the semantic queries over the feature vectors extracted from the images themselves. Such a combination would, for example, allow to identify the images authored by person a that are most similar to an image of author b; it can enable similarity- or analogy-based search and retrieval in precisely delineated subsets; or, when feature learning is applied to structured datasets, can combine similarity search and link prediction based on knowledge graph embeddings with structured queries based on SPARQL.

The paper's authors extended Apache Jena ARQ (the open source cross-platform command line SPARQL processor that I recommend in my book Learning SPARQL) with two new functions that make it easier to work with these vectors. The similarity(?x,?y) function lets you compute the similarity of two vectors so that you can use the result in a FILTER, BIND, or SELECT statement. For example, you might use it in a FILTER statement to only retrieve resources whose similarity to a particular resource was above a specified threshold. Their mostSimilar(?x,n) function asks for the n most similar entities to the one passed as the first argument.
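A query using these functions might look something like the following sketch. Everything here other than the similarity() call is my invention for illustration--the namespace, the vector predicate, and the threshold are not from the paper:

```sparql
PREFIX v: <http://example.org/vec#>              # made-up namespace
SELECT ?other ?sim WHERE {
  <http://example.org/gene/g1> v:hasVector ?vx . # hypothetical resource
  ?other v:hasVector ?vy .
  BIND(similarity(?vx, ?vy) AS ?sim)             # Vec2SPARQL extension function
  FILTER(?sim > 0.8)                             # keep only close matches
}
ORDER BY DESC(?sim)
```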

Their paper discusses two applications of Vec2SPARQL, in which they "demonstrate using biomedical, clinical, and bioinformatics use cases how [their] approach can enable new kinds of queries and applications that combine symbolic processing and retrieval of information through sub-symbolic semantic queries within vector spaces". As they described the first of their two examples,

...we can use Vec2SPARQL to perform queries of a knowledge graph of mouse genes, diseases and phenotypes and incorporate Vec2SPARQL similarity functions... Our aim in this use case is to find mouse gene associations with human diseases by prioritizing them using their phenotypic similarity, and simultaneously restrict the similarity comparisons to genes and diseases with specific properties (such as being associated with a particular phenotype).

The paper describes where they got their data and how they prepared it, and it shows a brief but expressive query that let them achieve their goal.

In their second example, after assigning vectors to over 112,000 human chest x-ray images that also included gender, age, and diagnosis metadata, they could query for image similarity and also add filters to these queries such as combinations of age range and gender to find other patterns of similarity.

The paper goes into greater detail on the data used for their samples and the similarity measures that they used. It also points to their source code on github and a "SPARQL endpoint" at http://sparql.bio2vec.net/ that is really more of a SPARQL endpoint query form. (The actual endpoint is at http://sparql.bio2vec.net/patient_embeddings/query, and I successfully sent a query there with curl.)

For an academic paper, "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" is quite readable. (Although I didn't have the right biology background to closely follow all the discussions of their sample query data, I could just about handle the math as shown.) I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.


Please add any comments to this Google+ post.
