bobdc.blog

Changing my blog's domain name and platform

2019-03-24T13:40:48Z

New look, new domain name.

For too long I've postponed the migration of my blog to something more phone-friendly. I accumulated many notes about doing this, and I also wanted to move more of my online life from the snee.com domain to bobdc.com. When someone recently asked me about changing the stylesheet (I have dug and dug in the aforementioned notes but can't remember who and will add their name here if I ever find it) I thought I'd take a deep breath and follow through with this. This is the last new blog entry you'll see on the snee.com domain; you'll also find it at bobdc.com/blog along with converted versions of all my other blog entries since I started on snee.com/bobdc.blog in 2005. I will continue my blog on bobdc.com/blog after this entry.

The conversion of the old entries was most of the work, but with some Perl and XSLT and pandoc and spit and duct tape I got the legacy content into pretty good shape for the new platform.

Of course, the platform choice was a geeky thing to agonize over. I finally went with Hugo, a Go-based static site generator. (I never had to learn the Go programming language, but it looks cool enough.)

It's a bit scary to think of the high percentage of the world's blog entries that are created by data entry into web forms that then use a bunch of PHP to manage that content's storage in relational databases. Having spent much of my career helping people store non-tabular content in standards-based non-tabular storage tools, I definitely wanted to get away from using PHP and relational database managers for narrative content, so I researched various static site generators before settling on Hugo.

Simple web sites like my learningsparql.com and datascienceglossary.org sites are just plain static sites: HTML files that I edit as necessary. A static site generator lets you store content separate from the styling and then generates HTML for your site based on the combination. If you want to change your website's layout or styling, you edit the CSS or whatever and then regenerate the HTML. (The version of MovableType that I used on snee.com actually did static site generation, but all the styling was managed with a mess of old PHP. I haven't upgraded it in ten years because the last time I did it broke so much.) A selling point of Hugo is that it does this very quickly--or, to use the now-clichéd phrase that they prefer, "blazingly fast".
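As a rough illustration of that separation of content from styling, here is a minimal Python sketch. It is not how Hugo itself works internally; the layout, the stylesheet names, and the `render` helper are all made up for the example, and it skips HTML escaping.

```python
from string import Template

# Hypothetical layout: the styling lives here, separate from any entry's content.
LAYOUT = Template(
    "<html><head><title>$title</title>"
    '<link rel="stylesheet" href="$stylesheet"></head>'
    "<body><h1>$title</h1><p>$body</p></body></html>"
)

def render(entry, stylesheet):
    """Combine one content entry with the shared layout to produce an HTML page."""
    return LAYOUT.substitute(
        title=entry["title"], body=entry["body"], stylesheet=stylesheet
    )

entries = [
    {"title": "curling SPARQL", "body": "A quick reference."},
    {"title": "Playing with wdtaxonomy", "body": "Never mind!"},
]

# Changing the styling means regenerating every page from the unchanged content.
old_pages = [render(entry, "old.css") for entry in entries]
new_pages = [render(entry, "phone-friendly.css") for entry in entries]
```

The content entries never change between the two runs; only the stylesheet argument does, which is the whole point of the approach.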

I knew about Jekyll and Sphinx from work because both are used for geomesa.org. After researching alternatives I decided that I liked the available Hugo themes the most. The Hugo documentation isn't very good, but the people on the discussion forum are very helpful, sometimes answering within minutes. If there is any interest I may write a blog entry about the important Hugo techniques I had to track down to customize my blog because they were not written up in an easily findable place.

You store your Hugo content separately from the styling using Hugo's own variation of markdown. As a longstanding XML guy ever since it was a four-letter word, I have ranted about what's wrong with markdown--or, as I should say, "the markdowns"--but it works for what I want to do in my blog and you can embed just about any sensible HTML you want in places where markdown falls short. I would have preferred a static site generator where the content I wrote for each new blog entry conformed to some simple XHTML profile but I just couldn't find anything with good themes and the right level of automation.

In the lower-right of my snee.com blog you'll see four variations on Atom and RSS feeds. More than one Atom or RSS feed seems to be difficult in Hugo, so my new blog's Atom feed has summaries and links to the original postings and the new blog's RSS feed has the full entries. I will be setting the snee.com ones to redirect to the bobdc.com ones shortly, but you can just subscribe to the new ones now if you like.

So, I apologize for the lack of phone-friendliness of my blog for the last few years and hope you enjoy the new more responsive version of my blog.


curling SPARQL

2019-02-24T15:45:30Z

A quick reference.

I've been using the curl utility to retrieve data from SPARQL endpoints for years, but I still have trouble remembering some of the important syntax, so I jotted down a quick reference for myself and I thought I'd share it. I also added some background.

Quick reference

Submit a URL-encoded SPARQL query on the operating system command line to the endpoint http://edan.si.edu/saam/sparql:

curl "http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208"

(Quoting the URL isn't always necessary, but won't hurt. Omitting it may hurt if some of the characters mean something special to your operating system's command line interpreter.)

Submit the same query stored in the file query1.rq:

curl --data-urlencode "query@query1.rq" http://edan.si.edu/saam/sparql

There is no need to escape the query in the file, because the --data-urlencode parameter tells curl to do so.

The above queries return the data in whatever format the endpoint's system administrators chose as the default. You can pass a request header to specify that you want a particular format. The following requests comma-separated values:

curl -H "Accept: text/csv" --data-urlencode "query@query1.rq"  http://edan.si.edu/saam/sparql

Other possible content types are application/sparql-results+json, application/sparql-results+xml, and text/tab-separated-values.

The above examples all use a SELECT query. A CONSTRUCT query requests triples, so instead of CSV or one of the other tabular formats you want an RDF serialization such as Turtle:

curl -H "Accept: text/turtle" --data-urlencode "query@query2.rq"  http://edan.si.edu/saam/sparql

Other possible content types for CONSTRUCT queries are application/rdf+xml, application/rdf+json, and, for N-Triples, text/plain. The bio2rdf GitHub page has good long lists for both SELECT and CONSTRUCT content types, although not all endpoints will support all of the listed types. (It lists text/plain for N-Triples, but you're better off using application/n-triples.)
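For anyone who would rather script these requests than type them, the following is a minimal Python sketch of roughly what the --data-urlencode and -H parameters do together. The `build_sparql_request` helper is a made-up name, and the CONSTRUCT query is only a placeholder, because the contents of query2.rq are not shown here. The sketch builds the requests without sending them.

```python
import urllib.parse
import urllib.request

def build_sparql_request(endpoint, query, accept):
    """Build (but do not send) a POST request with a form-encoded query and an Accept header."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(endpoint, data=body, headers={"Accept": accept})

ENDPOINT = "http://edan.si.edu/saam/sparql"

# SELECT queries return tabular results, so ask for CSV.
select_request = build_sparql_request(
    ENDPOINT, "SELECT * WHERE {?s ?p ?o} LIMIT 8", "text/csv"
)

# CONSTRUCT queries return triples, so ask for an RDF serialization.
# (Placeholder query; substitute your own.)
construct_request = build_sparql_request(
    ENDPOINT, "CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o} LIMIT 8", "text/turtle"
)

# To actually send one (requires network access and a live endpoint):
# with urllib.request.urlopen(select_request) as response:
#     print(response.read().decode("utf-8"))
```

Note that `urlencode` writes spaces as plus signs rather than %20; both are valid in a form-encoded request body.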

Background

curl lets you submit many kinds of HTTP requests to HTTP servers. It's part of the Linux and MacOS operating systems, and if you don't have it on your Windows machine, you can download it.

If you enter curl with no parameters other than a URL, like this,

curl http://www.learningsparql.com

it does the same HTTP GET that a browser would do. This has the same effect as doing a browser View Source on that web page.

It gets more interesting when you're not pointing curl at a static web page like http://www.learningsparql.com but at a dynamic resource such as a SPARQL endpoint. A SPARQL endpoint is usually identified with a URL ending with /sparql. I tested everything shown above with these endpoint URLs:

https://query.wikidata.org/bigdata/namespace/wdq/sparql, the SPARQL endpoint for Wikidata.
http://localhost:3030/myDataset/sparql, the SPARQL endpoint for a local instance of Apache Jena Fuseki. This is the triplestore that I described in the "Updating Data with SPARQL" chapter of my book Learning SPARQL because, for a server that accepts SPARQL UPDATE commands, it's so easy to get up and running. Before running the queries against this endpoint I created a dataset on this running instance with the clever name of myDataset and loaded some triples into it. As you can see, a Fuseki endpoint URL includes the dataset name.
http://edan.si.edu/saam/sparql, the SPARQL endpoint for the Smithsonian Institution. I used this one in the examples here because it's the shortest of the three endpoint URLs that I used for testing.

The simplest way to send a query to a SPARQL endpoint is to add query=[your URL-encoded query] to the end of the endpoint's URL as with the very first example above. You can paste the resulting URL into the address bar of a web browser so that the browser will retrieve the query results from the endpoint, but curl lets you retrieve the results from a command line so that you can save the returned data and use it as part of an application.

URL encoding is the process of taking characters that might screw up the parsing of the URL and converting each to a percent sign followed by the two hexadecimal digits of its byte value (for non-ASCII characters, one such sequence per byte of the UTF-8 encoding)--most often, converting each space to %20. For example, the escaped version of the query SELECT * WHERE {?s ?p ?o} LIMIT 8 that I used in the examples above is SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208. Most programming languages offer built-in functions to do this; I usually paste one of these queries into a form on a website like this one and then copy the result after having the form do the conversion.
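As one example of such a built-in function, Python's standard library can do the same conversion. This is a small sketch; the `safe="*"` argument is a choice made here so that the output matches the escaped query shown above.

```python
from urllib.parse import quote, unquote

query = "SELECT * WHERE {?s ?p ?o} LIMIT 8"

# safe="*" leaves the asterisk alone; by default quote() would also
# convert "*" to %2A, which is an equally valid encoding.
escaped = quote(query, safe="*")

# Append the escaped query to the endpoint URL as the query parameter.
url = "http://edan.si.edu/saam/sparql?query=" + escaped

print(escaped)
```

Running `unquote(escaped)` reverses the conversion, which is handy for checking what a long escaped URL actually says.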

When you add the escaped query to a SPARQL endpoint URL such as the Smithsonian one and enter the result as a parameter to curl at your command line, like this,

curl http://edan.si.edu/saam/sparql?query=SELECT%20*%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D%20LIMIT%208

it should retrieve a SPARQL Query Results JSON Format version of the data requested by that query, because that's the default format for that endpoint.

I actually don't escape queries and add them to a curl command line often. When I'm refining a query by iteratively editing and running it, re-encoding the URL each time can be a pain, so I usually store the query in a text file (query1.rq for the sample SELECT query above and query2.rq for the CONSTRUCT query) and tell curl to URL-encode the file's contents and send the result off to the SPARQL endpoint.

If I keep the file with the query in a text editor, I can refine it, save it, and run the same command over and over without worrying about escaping each revision of the query. (Because my editor is Emacs, I could actually send the query to the endpoint using Emacs SPARQLMode, but today's topic is curl.)

The curl website has plenty of documentation, but you can learn a lot with just this:

  curl --help

Among the many, many options, some useful ones are -o to redirect output to a file and -L for "follow location hints" (that is, if the server has instructions to redirect a request for a given URL to something else, take the hint). Another is -I for "Show document info only": just get information about the requested "document" without actually retrieving a named resource, which is useful for debugging. The classic -v for "verbose" is also handy for debugging.

Take a look at the available options, experiment with some SPARQL endpoints, and soon you'll be using "curl" as a verb (for example, "I tried to curl it but I didn't have the right certs"--see the -E command line option for more on that) and you won't be talking about hairstyling, arm exercises, or sliding round stones across the ice.

(I just learned about Curling SPARQL HTTP Graph Store protocol by @jindrichmynarz, so if you've gotten this far, you'll like that too.)

Curling image by Greg Scheckter via Flickr, CC some rights reserved

Comments? Just tweet to @bobdc for now, because Google+ is shutting down. I will be moving my blog to a new more phone-responsive platform shortly and I'm researching options for hosted comments.


Querying machine learning distributional semantics with SPARQL

2019-01-20T14:57:40Z

Bringing together my two favorite kinds of semantics.


When I wrote Semantic web semantics vs. vector embedding machine learning semantics, I described how distributional semantics--whose machine learning implementations are very popular in modern natural language processing--are quite different from the kind of semantics that RDF people usually talk about. I recently learned of a fascinating project that brings RDF technology and distributional semantics together, letting our SPARQL query logic take advantage of entity similarity as rated by machine learning models.

To review a little from that blog entry: machine learning implementations of distributional semantics can identify some of the meanings of words by analyzing their relationships with other words in a set of training data. For example, after analyzing the distribution of terms in a large enough text corpus, such a system can answer the question "woman is to man as queen is to what?" Along with the answer of "king", discussions of this technology typically bring up other examples such as the questions "walking is to walked as swimming is to what?" (an especially nice one because "swim" is an irregular verb) and "London is to England as Berlin is to what?"

These examples are a bit oversimplified. Instead of such a straightforward answer, an implementation such as word2vec typically responds with a list of scored words. If the analyzed corpus was large enough, asking word2vec to complete the second pair in "woman man queen" will get you a list of words with "king" having the highest score. In my experiments, this was nice for the "london england berlin" case, because while germany had the highest score, prussia had the second highest, and Berlin was the capital of Prussia for a few centuries.

word2vec doesn't actually compare the strings "london" and "england" and "berlin". It uses cosine similarity to compare vectors that were assigned to each word as a result of the training step done with the input corpus--the machine "learning" part. Then, it looks for vectors whose similarity to the berlin vector is comparable to the similarity between the london and england vectors.
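To make the vector comparison concrete, here is a small Python sketch. The vectors are hand-made toys rather than anything learned from a corpus, and the analogy step uses the common vector-offset formulation (england minus london plus berlin), which is an assumption about how the comparison is typically implemented and not a description of word2vec's actual code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (NOT learned from a corpus). The four dimensions
# loosely stand for city-ness, country-ness, British-ness, and German-ness.
vectors = {
    "london":  [1.0, 0.0, 1.0, 0.0],
    "england": [0.0, 1.0, 1.0, 0.0],
    "berlin":  [1.0, 0.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0, 1.0],
    "prussia": [0.0, 1.0, 0.0, 0.6],
    "france":  [0.0, 1.0, 0.0, 0.0],
}

def complete_analogy(a, b, c, vectors):
    """Rank candidates for 'a is to b as c is to ?' using the vector-offset method."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    scored = [(word, cosine_similarity(target, vectors[word])) for word in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = complete_analogy("london", "england", "berlin", vectors)
for word, score in ranked:
    print(word, round(score, 3))
```

With these toy values the result is a scored list, just as described above, with germany first and prussia second.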

Some of the most interesting work in machine learning of the past few years has built on the use of vectors to represent entities other than words. The popular doc2vec (originally implemented by my CCRi co-worker Tim Emerick) does it with documents, and others have done it with audio clips and images.

It's one thing to pick out an entity and then ask for a list of entities whose vectors are similar to that of the selected entity. Researchers at King Abdullah University of Science and Technology, the University of Birmingham, and Maastricht University have collaborated to take this further by mixing in some SPARQL. Their paper Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings describes "a general framework for integrating structured data and their vector space representations [that] allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query". They have made their implementation available as a Docker image and also put up a SPARQL endpoint with their sample data and SPARQL extensions.

Vec2SPARQL lets you use SPARQL to move beyond simple comparison of vector similarity scores to combine SPARQL's abilities with this. As they write,

For example, once feature vectors are extracted from images, meta-data that is associated with the images (such as geo-locations, image types, author, or similar) could be queried using SPARQL and combined with the semantic queries over the feature vectors extracted from the images themselves. Such a combination would, for example, allow to identify the images authored by person a that are most similar to an image of author b; it can enable similarity- or analogy-based search and retrieval in precisely delineated subsets; or, when feature learning is applied to structured datasets, can combine similarity search and link prediction based on knowledge graph embeddings with structured queries based on SPARQL.

The paper's authors extended Apache Jena ARQ (the open source cross-platform command line SPARQL processor that I recommend in my book Learning SPARQL) with two new functions that make it easier to work with these vectors. The similarity(?x,?y) function lets you compute the similarity of two vectors so that you can use the result in a FILTER, BIND, or SELECT statement. For example, you might use it in a FILTER statement to only retrieve resources whose similarity to a particular resource was above a specified threshold. Their mostSimilar(?x,n) function asks for the n most similar entities to the one passed as the first argument.
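The semantics of those two functions can be sketched in plain Python. This is not the Vec2SPARQL implementation. The entity names and vectors below are invented for illustration, and cosine similarity stands in for whichever measure a query would actually use.

```python
import math

def similarity(x, y):
    """Cosine similarity between two vectors, one of the measures the paper mentions."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def most_similar(entity, n, embeddings):
    """Return the n entities whose vectors are most similar to the given entity's."""
    others = [e for e in embeddings if e != entity]
    others.sort(key=lambda e: similarity(embeddings[entity], embeddings[e]), reverse=True)
    return others[:n]

# Hypothetical entities and vectors, invented for illustration only.
embeddings = {
    "image1": [0.9, 0.1, 0.0],
    "image2": [0.8, 0.2, 0.1],
    "image3": [0.0, 0.1, 0.9],
    "image4": [0.7, 0.7, 0.0],
}

# Analogous to using similarity(?x, ?y) in a FILTER with a 0.9 threshold.
above_threshold = [
    e for e in embeddings
    if e != "image1" and similarity(embeddings["image1"], embeddings[e]) > 0.9
]

# Analogous to mostSimilar(?x, 2).
top_two = most_similar("image1", 2, embeddings)
```

In a real query the candidate set would first be narrowed by ordinary triple patterns, which is where the combination with SPARQL pays off.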

Their paper discusses two applications of Vec2SPARQL, in which they "demonstrate using biomedical, clinical, and bioinformatics use cases how [their] approach can enable new kinds of queries and applications that combine symbolic processing and retrieval of information through sub-symbolic semantic queries within vector spaces". As they described the first of their two examples,

...we can use Vec2SPARQL to perform queries of a knowledge graph of mouse genes, diseases and phenotypes and incorporate Vec2SPARQL similarity functions... Our aim in this use case is to find mouse gene associations with human diseases by prioritizing them using their phenotypic similarity, and simultaneously restrict the similarity comparisons to genes and diseases with specific properties (such as being associated with a particular phenotype).

The paper describes where they got their data and how they prepared it, and it shows a brief but expressive query that let them achieve their goal.

In their second example, after assigning vectors to over 112,000 human chest x-ray images that also included gender, age, and diagnosis metadata, they could query for image similarity and also add filters to these queries such as combinations of age range and gender to find other patterns of similarity.

The paper goes into greater detail on the data used for their samples and the similarity measures that they used. It also points to their source code on GitHub and a "SPARQL endpoint" at http://sparql.bio2vec.net/ that is really more of a SPARQL endpoint query form. (The actual endpoint is at http://sparql.bio2vec.net/patient_embeddings/query, and I successfully sent a query there with curl.)

For an academic paper, "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" is quite readable. (Although I didn't have the right biology background to closely follow all the discussions of their sample query data, I could just about handle the math as shown.) I recommend the paper to anyone interested in SPARQL or the embedding vectors side of machine learning. They seem to have a productive future together.

Please add any comments to this Google+ post.


Playing with wdtaxonomy

2018-12-23T14:51:49Z

Those queries from my last blog entry? Never mind!

After I wrote about Extracting RDF data models from Wikidata in my blog last month, Ettore Rizza suggested that I check out wdtaxonomy, which extracts taxonomies from Wikidata by retrieving the kinds of data that my blog entry's sample queries retrieved, and it then displays the results as a tree. After playing with it, I'm tempted to tell everyone who read that blog entry to ignore the example queries I included, because you can learn a lot more from wdtaxonomy.

The queries in that blog entry might still give you some useful perspective on how SPARQL can retrieve triples from Wikidata that express tree-ish relationships between the concepts of a given domain that have Wikipedia pages--whether you want to call that a taxonomy or an ontology--but I was just dabbling, while wdtaxonomy is a full-featured serious application for this.

Jakob Voss designed wdtaxonomy as both a command line utility and as an NPM module that you can reference from applications. I tried the command line version and had a lot of fun. To try it with my periodic table element example that I wrote about last month, I started by entering "wdtaxonomy Q11344" (using the same local name for the Wikidata identifier that I used before) and the results were impressive.

wdtaxonomy typically outputs a text-based tree with various information about the nodes of the tree. Instead of pasting a sample here, I'm showing a screen shot of the beginning of the output so that you can see the nice color coding:

The wdtaxonomy readthedocs.io documentation lists over two dozen command line options that you can use to customize the output. (Entering "wdtaxonomy" alone at the command line gives a good summary.) My favorite is -s, which tells you the SPARQL query that wdtaxonomy would use to retrieve the requested information from Wikidata. Here is what that gives you when I added it to the Q11344 command line I entered above:

$ wdtaxonomy -s Q11344
  SELECT ?item ?broader ?itemLabel ?instances ?sites WITH {
    SELECT DISTINCT ?item { ?item wdt:P279* wd:Q11344 }
  } AS %items WHERE { 
    INCLUDE %items .
    OPTIONAL { ?item wdt:P279 ?broader } .
    {
      SELECT ?item (count(distinct ?element) as ?instances) {
        INCLUDE %items.
        OPTIONAL { ?element wdt:P31 ?item }
      } GROUP BY ?item
    }
    {
      SELECT ?item (count(distinct ?site) as ?sites) {
        INCLUDE %items.
        OPTIONAL { ?site schema:about ?item }
      } GROUP BY ?item
    }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en"
    }
  }

(The INCLUDE keyword used in this query is a Blazegraph and Anzo extension to the SPARQL standard.) Combining this -s option with other options, such as -i to include instances or -d to include item descriptions, shows what SPARQL query the tool would generate to retrieve this additional information. It's a great opportunity to learn more about SPARQL, about the Wikidata data model, and about their relationship. (I have worried that this data model would scare off people who are new to SPARQL--that if their first data set to query was Wikidata, they might think that the complexity of the necessary queries was because of SPARQL and not because of Wikidata--but when I see all the great activity on Twitter around the use of SPARQL with Wikidata lately, I don't worry so much anymore.)

The ability to get at the generated SPARQL queries is also a huge help to my original goal of retrieving triples that let me store an RDFS/OWL ontology or a SKOS taxonomy about Wikipedia entities. I can change the SELECT part to a CONSTRUCT clause to create triples that use the variables bound in wdtaxonomy's WHERE clauses. wdtaxonomy (or rather, Jakob) has done the difficult work of assembling the necessary query logic and we can just take it and use it.

Some of the other command line options I liked include -U to get full URIs and -r to get superclasses of the named entity instead of its subclasses. I encourage everyone interested in SPARQL and Wikidata to install wdtaxonomy and start playing with it. Especially with that -s option!

Please add any comments to this Google+ post.


Extracting RDF data models from Wikidata

2018-11-18T14:41:46Z

That's "models", plural.


Some people complain when an RDF dataset lacks a documented data model. A great thing about RDF and SPARQL is that if you want to know what kind of modeling might have been done for a dataset, you just look, even if they're using non-(W3C-)standard modeling structures. They're still using triples, so you look at the triples.

If I know that there is an entity x:thing23 in a dataset, I'm going to query for {x:thing23 ?p ?o} and see what information there is about that entity. Hopefully I will find an rdf:type triple saying that it's a member of a class. If not, maybe it uses some other home-grown way to indicate class membership; either way, you can then start querying to find out about the class's relationships to properties and other classes, and you've got a data model. What if it doesn't use RDFS to describe these modeling structures and their relationships? A CONSTRUCT query will convert it to a data model that does.

And, if {x:thing23 ?p ?o} triples don't indicate any class membership, just seeing what the ?p values are tells you something about the data model. If certain entities use certain properties for their predicates, and other entities use a list that overlaps with that, you've learned more about relationships between sets of entities in the dataset. All of these things can be investigated with simple queries.
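The same kind of investigation can be sketched outside of SPARQL. The following Python example works on a handful of hypothetical triples (every x: name in it is invented) and groups entities by the predicates they use, which is roughly what a few exploratory queries would reveal.

```python
from collections import defaultdict

# Hypothetical triples standing in for a dataset with no documented model.
triples = [
    ("x:thing23", "x:name", "Widget"),
    ("x:thing23", "x:price", "9.99"),
    ("x:thing24", "x:name", "Gadget"),
    ("x:thing24", "x:price", "19.99"),
    ("x:person1", "x:name", "Alice"),
    ("x:person1", "x:email", "alice@example.com"),
]

def predicates_by_subject(triples):
    """Map each subject to the set of predicates used with it."""
    usage = defaultdict(set)
    for subject, predicate, _ in triples:
        usage[subject].add(predicate)
    return dict(usage)

def group_by_predicate_set(usage):
    """Group subjects that use exactly the same predicates: candidate classes."""
    groups = defaultdict(list)
    for subject, predicates in usage.items():
        groups[frozenset(predicates)].append(subject)
    return dict(groups)

usage = predicates_by_subject(triples)
groups = group_by_predicate_set(usage)

# Predicates shared by two entities hint at a relationship between their groups.
shared = usage["x:thing23"] & usage["x:person1"]
```

Here x:thing23 and x:thing24 fall into one group, x:person1 into another, and the shared x:name predicate is the overlap between them.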

Wikidata offers tons of great data and modeling for us RDF people, but it wasn't designed for us. They created their own model and then expressed the model and instance data in RDF, and I'm not going to complain; can you imagine how cool it would be if Google did the same with their knowledge graph? (When I tweeted "Handy Wikidata hints for people who have been using RDF and SPARQL since before Wikidata was around: use wdt:P31 instead of rdf:type and wdt:P279 instead of rdfs:subClassOf", Mark Watson replied that he liked my sense of humor. While I hadn't meant to be funny I do appreciate his sense of humor.) As I've worked at understanding Wikidata's documentation about their mapping to RDF I've had fun just querying around to understand the structures. Again: this is one of the key reasons that RDF and SPARQL are great! Because we can do that!

Last month I described how you can find the subclass tree under a given class in Wikidata and since then I've done further exploration of how to pull data models out of Wikidata. Note that I say "models" and not "model". Olivier Rossel recently referred to extracting the data model of Wikidata (my translation from his French), but I worry that looking for "the" grand RDF data model of Wikidata might set someone up for disappointment. I think that looking for data models to suit various projects will be more productive. (Olivier and I discussed this further in the "Handy Wikidata hints" thread mentioned above.)

The following query builds on the one I did last month to either get a class tree below a given one or to get its superclasses instead. It creates triples that express the classes and their relationships using W3C standard properties.

CONSTRUCT {
  ?class a owl:Class . 
  ?class rdfs:subClassOf ?superclass . 
  ?class rdfs:label ?classLabel . 
  ?property rdfs:domain ?class . 
  ?property rdfs:label ?propertyLabel .
}
WHERE {
  BIND(wd:Q11344 AS ?mainClass) .    # Q11344 chemical element; Q1420 automobile
  
  # Pick one or the other of the following two triple patterns. 
  ?class wdt:P279* ?mainClass.     # Find subclasses of the main class. 
  #?mainClass wdt:P279* ?class.     # Find superclasses of the main class. 
  
  ?class wdt:P279 ?superclass .     # So we can create rdfs:subClassOf triples
  ?class rdfs:label ?classLabel.
  OPTIONAL {
    ?class wdt:P1963 ?property.
    ?property rdfs:label ?propertyLabel.
    FILTER((LANG(?propertyLabel)) = "en")
    }
  FILTER((LANG(?classLabel)) = "en")
}

(Because the query uses prefixes that Wikidata already understands, I didn't need to declare any.) When run in the Wikidata query service form, there are too many triples to see at once, so I put the query into a subtreeClasses.rq file and ran it with curl from the command line like this:

curl --data-urlencode "query@subtreeClasses.rq" https://query.wikidata.org/sparql -H "Accept: text/turtle"  > chemicalElementSubClasses.ttl

Loading the result into TopBraid Composer Free edition (available here; the Free edition is a choice on the Product dropdown list) showed a class tree of the result like this:

(It's tempting to add an entry for Frinkonium as a subclass of "hypothetical chemical element".) I understand that the Wikimedia Foundation had their reasons for not describing their models with the standard vocabularies, but this shows the value of using the standards: interoperability with other tools. It also shows that the Foundation's avoidance of the standard model vocabularies is not a big deal, and that we should be glad that they make this available in RDF at all, because the sheer fact that it's in RDF makes it easy to convert to whatever RDF we want with a CONSTRUCT query. (Again, imagine if Google did this with any portion of their knowledge graph...)

The query above also looks for properties for those classes so that it can express those in the output with the RDFS vocabulary. It didn't find many, but this bears further investigation. This query shows that in addition to the chemical element class having properties, there are constraints on those properties described with triples, so there's a lot more that can be done here to pull richer models out of Wikidata and then express them in more standard vocabularies.

And of course there's the possibility of pulling out instance data to go with these models. Queries for that would be easy enough to assemble but you might end up with so much data that Wikidata times out before giving it to you; you could use the techniques I described in Pipelining SPARQL queries in memory with the rdflib Python library to retrieve instance URIs and then retrieve the additional triples about those instances in batches of queries that use the VALUES keywords.
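The batching idea can be sketched in a few lines of Python. This only builds the query strings; the instance URIs are placeholders, and the `batches` and `instance_query` helpers are hypothetical names and not part of rdflib or any other library.

```python
def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def instance_query(uris):
    """Build a CONSTRUCT query that retrieves all triples about the given instance URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "CONSTRUCT { ?instance ?p ?o }\n"
        "WHERE {\n"
        f"  VALUES ?instance {{ {values} }}\n"
        "  ?instance ?p ?o .\n"
        "}"
    )

# Placeholder instance URIs; in practice an earlier query would supply them.
instance_uris = [
    f"http://www.wikidata.org/entity/Q{n}" for n in (556, 560, 568, 569, 623)
]

# One query per batch of two URIs, small enough to avoid timeouts.
queries = [instance_query(batch) for batch in batches(instance_uris, 2)]
print(queries[0])
```

Each generated query could then be sent to the endpoint in turn, with the batch size tuned to whatever the endpoint tolerates.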

Lots of data instances of rich models, all transformed to conform to the W3C standards so that they work with lots of open source and commercial tools--the possibilities are pretty impressive. If anyone pulls datasets like this out of Wikidata for their field, let me know about it!

Please add any comments to this Google+ post.


SPARQL full-text Wikipedia searching and Wikidata subclass inferencing

2018-10-28T16:37:19Z

Wikipedia querying techniques inspired by a recent paper.

I found all kinds of interesting things in the article "Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph" (pdf) by Stanislav Malyshev of the Wikimedia Foundation and four co-authors from the Technical University of Dresden. I wanted to highlight two particular things that I will find useful in the future and then I'll list a few more.

Before I cover them, I wanted to mention that I've really grown to appreciate the little diamond icon in the upper-left of the Wikidata query form. As I refine queries on that form, the queries typically get messier and messier, so the ability to clean it all up with one click is very convenient.

Full text searching of Wikipedia with SPARQL

The paper's "Custom SPARQL Extensions" section describes several extensions, including the MediaWiki Web API. The Wikidata Query Service/User Manual/MWAPI page describes how you can call the MediaWiki API search functions by using special property functions (that is, properties that instruct the query engine to execute certain special functions).

This API is definitely one of those topics where reviewing the examples will get you started more quickly than trying to read the actual documentation. Their first SPARQL query search example, Find all entities with labels "cheese" and get their types, searches Wikipedia for entries that have "cheese" in one of their labels such as the page title or alternative names.

The key difference in the Find articles in Wikipedia example that follows the first cheese example is that its fifth line uses the property function mwapi:srsearch as a predicate instead of mwapi:search, telling the query to search the contents of all of the English (note the ".en" on the fourth line) Wikipedia pages. You can try that example yourself to do a full-text search for "cheese". I did a similar search for Darius Milhaud Burt Bacharach because I've recently been fascinated by the connections between Milhaud, a French composer who rose to prominence in the 1920s as a member of Les Six, and Bacharach, one of the greatest pop songwriters of the 1960s. (Listening to some Milhaud once, it struck me as odd that his use of horns would remind me of some Bacharach songs and arrangements until I found out that the author of "The Look of Love", "Walk on By", and "I Say a Little Prayer" studied with Milhaud in the 1940s at McGill University.) This query certainly doesn't need the "LIMIT 20" at the end like the full-text search for "cheese" does, because these two guys don't get mentioned on the same page as often as cheese gets mentioned, but it is an interesting set of pages.

Subclass inferencing with Wikidata

I'm still surprised at how many people use RDF without adding any schema information, or worse, without using schema information that's already there. Wikidata provides plenty for us, and while the Blazegraph instance used as the back end to its SPARQL engine does not have its RDFS inferencing capabilities turned on--understandably, because queries that take advantage of this ask more of a processor and could therefore hamper scalability--a nice property path trick does let us ask for all the instances of a particular class and of its subclasses. This wasn't even mentioned in the "Getting the Most out of Wikidata" paper, but a mention of how Wikidata uses owl:objectProperty inspired me to dig more into the use of the data modeling, and I came up with this.

The following (try it here) shows that Wikidata currently has data about 125 instances of home computer models:

SELECT (count(*) as ?instances) WHERE  {
  ?instance wdt:P31 wd:Q473708     # Instance has a type of "home computers"
}

This next query (try it here) shows that there are 28 instances of classes that are a direct subclass of "home computers":

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279 wd:Q473708.     # wdt:P279: subclass of 
}

Merely adding the property path asterisk operator to the wdt:P279 in that query tells the query engine to find instances of the home computer class and also instances of any class in the subclass tree below it (try it here) and it finds 154 of them:

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279* wd:Q473708.
}

As with regular expressions, the asterisk means "0 or more steps away," so that instances of wd:Q473708 would be counted along with instances of classes from its subclass tree. Using a plus sign instead would have meant "1 or more steps away," so that query would not have found the direct instances of wd:Q473708.
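To make the difference between the two operators concrete, here is a small pure-Python sketch that imitates them on a toy class tree. The class and instance names are made up for illustration, and the walk assumes each class has at most one parent and that there are no cycles; a real SPARQL engine handles the general case.

```python
# Toy data: these class and instance names are invented for illustration.
subclass_of = {            # child class -> parent class (like wdt:P279)
    "ThomsonMO5": "MOTOGamme",
    "MOTOGamme": "HomeComputer",
}
instance_of = {            # instance -> class (like wdt:P31)
    "computerA": "HomeComputer",
    "computerB": "MOTOGamme",
    "computerC": "ThomsonMO5",
}


def steps_to(cls, target):
    """Return how many subclass steps lead from cls up to target, or None."""
    steps = 0
    while cls is not None:
        if cls == target:
            return steps
        cls = subclass_of.get(cls)   # None once we run off the top of the tree
        steps += 1
    return None


def instances(target, min_steps):
    """Instances whose class is at least min_steps subclass steps below target."""
    found = []
    for inst, cls in instance_of.items():
        steps = steps_to(cls, target)
        if steps is not None and steps >= min_steps:
            found.append(inst)
    return sorted(found)


star = instances("HomeComputer", 0)   # like wdt:P279* (0 or more steps)
plus = instances("HomeComputer", 1)   # like wdt:P279+ (1 or more steps)
print(star)   # ['computerA', 'computerB', 'computerC']
print(plus)   # ['computerB', 'computerC']
```

The direct instance computerA appears only in the "0 or more" result, which is the same reason the asterisk query counts the direct instances of wd:Q473708 and a plus-sign query would not.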

The ability to use class relationships to identify potentially useful data is just one example of how schema metadata adds value to data. And, we get more than just these additional instances; we get additional class names that tell us more about these instances. For example, we can find that the Thomson MO5-CnAM 43737 computer is an instance of the class Thomson MO5, which is a subclass of MOTO Gamme, which is a subclass of home computer.

And more

Some other nice things I learned about in the paper:

The use of wikibase:around and wikibase:box for additional kinds of geographic queries in addition to the ability to search within a city's limits as I described in July.
A list of additional endpoints that you can use in federated queries sent to Wikidata.
Support for Blazegraph's graph traversal features.
Multiple live Grafana dashboards about Wikidata usage such as data about agents and formats requested.

If you're interested in SPARQL, Wikidata, or especially the combination, you'll learn some fascinating things from this paper.

Please add any comments to this Google+ post.

Panic over "superhuman" AI

2018-09-23T15:27:48Z

Robot overlords not on the way.

When someone describes their worries about AI taking over the world, I usually think to myself "I recently bookmarked a good article about why this is silly and I should point this person to it", but in that instant I can't remember what the article was. I recently re-read a few and thought I'd summarize them here in case anyone wants to point their friends to some sensible discussions of why such worries are unfounded.

The impossibility of intelligence explosion by François Chollet

Chollet is an AI researcher at Google and the author of the Keras deep learning framework and the Manning books "Deep Learning with Python" and "Deep Learning with R". Like some of the other articles covered here, his piece takes on the idea that we will someday build an AI system that can build a better one on its own, and then that one will build a better one, and so on until the singularity.

His outline gives you a general idea of his line of reasoning; the bulleted lists in his last two sections are also good:

A flawed reasoning that stems from a misunderstanding of intelligence
Intelligence is situational
Our environment puts a hard limit on our individual intelligence
Most of our intelligence is not in our brain, it is externalized as our civilization
An individual brain cannot implement recursive intelligence augmentation
What we know about recursively self-improving systems
Conclusions

One especially nice paragraph:

In particular, there is no such thing as "general" intelligence. On an abstract level, we know this for a fact via the "no free lunch" theorem -- stating that no problem-solving algorithm can outperform random chance across all possible problems. If intelligence is a problem-solving algorithm, then it can only be understood with respect to a specific problem. In a more concrete way, we can observe this empirically in that all intelligent systems we know are highly specialized. The intelligence of the AIs we build today is hyper specialized in extremely narrow tasks -- like playing Go, or classifying images into 10,000 known categories. The intelligence of an octopus is specialized in the problem of being an octopus. The intelligence of a human is specialized in the problem of being human.

'The discourse is unhinged': how the media gets AI alarmingly wrong by Oscar Schwartz

This Guardian piece focuses on how the media encourages silly thinking about the future of AI. As the article's subtitle tells us,

Social media has allowed self-proclaimed 'AI influencers' who do nothing more than paraphrase Elon Musk to cash in on this hype with low-quality pieces. The result is dangerous.

Much of the article focuses on the efforts of Zachary Lipton, a machine learning assistant professor at Carnegie Mellon, to call out bad journalism on the topic. One example is an article that I was also guilty of taking too seriously: Fast Company's AI Is Inventing Languages Humans Can't Understand. Should We Stop It? The actual "language" was just overly repetitive sentences made possible by recursive grammar rules, which I had experienced myself many years ago doing a LISP-based project for a Natural Language Processing course. Schwartz quotes the Sun article Facebook shuts off AI experiment after two robots begin speaking in their OWN language only they can understand as saying that the incident "closely resembled the plot of The Terminator in which a robot becomes self-aware and starts waging a war on humans". (The Sun article also says "Experts have called the incident exciting but also incredibly scary"; according to the Guardian article, "These findings were considered to be fairly interesting by other experts in the field, but not totally surprising or groundbreaking".)

Schwartz's piece describes how the term "electronic brain" is as old as electronic computers, and how overhyped media coverage of machines that "think" as far back as the 1940s led to inflated expectations about AI that greatly contributed to the several AI winters we've had since then.

Ways to Think About Machine Learning by Benedict Evans

If you're going to read only one of the articles I describe here all the way through, I recommend this one. I don't listen to every episode of the a16z podcast, but I do listen to every one that includes Benedict Evans (this week's episode, on Tesla and the Nature of Disruption, was typically excellent), and I have subscribed to his newsletter for years. He's a sharp guy with sensible attitudes about how technologies and societies fit together and where it may lead.

One theme of many of the articles I describe here is the false notion that intelligence is a single thing that can be measured on a one-dimensional scale. As Evans puts it,

This gets to the heart of the most common misconception that comes up in talking about machine learning - that it is in some way a single, general purpose thing, on a path to HAL 9000, and that Google or Microsoft have each built *one*, or that Google 'has all the data', or that IBM has an actual thing called 'Watson'. Really, this is always the mistake in looking at automation: with each wave of automation, we imagine we're creating something anthropomorphic or something with general intelligence. In the 1920s and 30s we imagined steel men walking around factories holding hammers, and in the 1950s we imagined humanoid robots walking around the kitchen doing the housework. We didn't get robot servants - we got washing machines.

Washing machines are robots, but they're not 'intelligent'. They don't know what water or clothes are. Moreover, they're not general purpose even in the narrow domain of washing - you can't put dishes in a washing machine, nor clothes in a dishwasher (or rather, you can, but you won't get the result you want). They're just another kind of automation, no different conceptually to a conveyor belt or a pick-and-place machine. Equally, machine learning lets us solve classes of problem that computers could not usefully address before, but each of those problems will require a different implementation, and different data, a different route to market, and often a different company. Each of them is a piece of automation. Each of them is a washing machine.

After bringing up relational databases as a point of comparison for what new technology can do ("Relational databases gave us Oracle, but they also gave us SAP, and SAP and its peers gave us global just-in-time supply chains - they gave us Apple and Starbucks"), he asks "What, then, are the washing machines of machine learning, for real companies?" He offers some good suggestions, some of which can be summarized as "AI will allow the automation of more things".

He also discusses low-hanging fruit for what new things AI may automate. As an excellent followup to that, I recommend Kathryn Hume's Harvard Business Review article How to Spot a Machine Learning Opportunity, Even If You Aren't a Data Scientist.

The Myth of a Superhuman AI by Kevin Kelly

In this Wired Magazine article by one of their founders, after a discussion of some of the panicky scenarios out there we read that "buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions which, when examined closely, are not based on any evidence". He lists them, then lists five "heresies [that] have more evidence to support them"; these five provide the structure for the rest of his piece:

Intelligence is not a single dimension, so "smarter than humans" is a meaningless concept.
Humans do not have general purpose minds, and neither will AIs.
Emulation of human thinking in other media will be constrained by cost.
Dimensions of intelligence are not infinite.
Intelligences are only one factor in progress.

A good point about how artificial general intelligence is not something to worry about makes a nice analogy with artificial flight:

When we invented artificial flying we were inspired by biological modes of flying, primarily flapping wings. But the flying we invented -- propellers bolted to a wide fixed wing -- was a new mode of flying unknown in our biological world. It is alien flying. Similarly, we will invent whole new modes of thinking that do not exist in nature. In many cases they will be new, narrow, "small," specific modes for specific jobs -- perhaps a type of reasoning only useful in statistics and probability.

(This reminds me of Evans writing "We didn't get robot servants - we got washing machines".) Another good metaphor is Kelly's comparison of attitudes about superhuman AI with cargo cults:

It is possible that superhuman AI could turn out to be another cargo cult. A century from now, people may look back to this time as the moment when believers began to expect a superhuman AI to appear at any moment and deliver them goods of unimaginable value. Decade after decade they wait for the superhuman AI to appear, certain that it must arrive soon with its cargo.

19 A.I. experts reveal the biggest myths about robots by Guia Marie Del Prado

This Business Insider piece is almost three years old but still relevant. Most of the experts it quotes are actual computer science professors, so you get much more sober assessments than you'll see in the panicky articles out there. Here's a good one from Berkeley computer scientist Stuart Russell:

The most common misconception is that what AI people are working towards is a conscious machine, that until you have a conscious machine there's nothing to worry about. It's really a red herring.

To my knowledge, nobody, no one who is publishing papers in the main field of AI, is even working on consciousness. I think there are some neuroscientists who are trying to understand it, but I'm not aware that they've made any progress.

As far as AI people, nobody is trying to build a conscious machine, because no one has a clue how to do it, at all. We have less clue about how to do that than we have about build a faster-than-light spaceship.

From Pieter Abbeel, another Berkeley computer scientist:

In robotics there is something called Moravec's Paradox: "It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".

This is well appreciated by researchers in robotics and AI, but can be rather counter-intuitive to people not actively engaged in the field.

Replicating the learning capabilities of a toddler could very well be the most challenging problem for AI, even though we might not typically think of a one-year-old as the epitome of intelligence.

I was happy to see the article quote NYU's Ernie Davis, whose AI class I took over 20 years ago while working on my master's degree there. (Reviewing my class notebook I see a lot of LISP and Prolog code, so things have changed a lot.)

This article implicitly has a nice guideline for when to take predictions about the future of AI seriously: are the people making them computer scientists familiar with the actual work going on lately? If they're experts in other fields engaging in science fiction riffing (or as the Guardian article put it more cleverly, paraphrasing Elon Musk), take it all with a big grain of salt.

I don't mean to imply that the progress of technologies labeled as "Artificial Intelligence" has no potential problems to worry about. Just as automobiles and chain saws and a lot of other technology invented over the years can do harm as well as good, the new power brought by advanced processors, storage, and memory can be misused intentionally or accidentally, so it's important to think through all kinds of scenarios when planning for the future. In fact, this is all the more reason not to worry about sentient machines: as the Guardian piece quotes Lipton, "There are policymakers earnestly having meetings to discuss the rights of robots when they should be talking about discrimination in algorithmic decision making. But this issue is terrestrial and sober, so not many people take an interest." Sensible stuff to keep in mind.

Pipelining SPARQL queries in memory with the rdflib Python library

2018-08-27T12:55:23Z

Using retrieved data to make more queries.

Last month in Dividing and conquering SPARQL endpoint retrieval I described how you can avoid timeouts for certain kinds of SPARQL endpoint queries by first querying for the resources that you want to know about and then querying for more data about those resources a subset at a time using the VALUES keyword. (The example query retrieved data, including the latitude and longitude, about points within a specified city.) I built my demo with some shell scripts, some Perl scripts, and a bit of spit and glue.

I started playing with RDFLib's SPARQL capabilities a few years ago as I put together the demo for Driving Hadoop data integration with standards-based models instead of code. I was pleasantly surprised to find out how easily it could run a CONSTRUCT query on triples stored in memory and then pass the result on to one or more additional queries, letting you pipeline a series of such queries with no disk I/O. Applying these techniques to replace my shell scripts and Perl scripts from last month showed me that these same techniques could be used for all kinds of RDF applications.

When I was at TopQuadrant I got to know SPARQLMotion, their (proprietary) drag-and-drop system for pipelining components that can do this sort of thing. RDFLib offers several graph manipulation methods that can extend what I've done here to do many additional SPARQLMotion-ish things. When I recently asked about other pipeline component-based RDF development tools out there, I learned of Linked Pipes ETL, Karma, ld-pipeline, VIVO Harvester, Silk, UnifiedViews, and a PoolParty framework around Unified Views. I hope to check out as many of them as I can in the future, but with the functions I've written for my new Python script, I can now accomplish so much with so little Python code that my motivation to go looking beyond that is diminishing--especially considering that when doing it this way, I have all of Python's abilities to manipulate strings and data structures standing by in case I need them.

For me, the two most basic RDF tasks to augment the general Python capabilities are retrieval of triples from a remote endpoint for local storage and querying of locally stored triples. RDFLib makes the latter easy. For the former I was looking for a library, but Jindřich Mynarz pointed out that no specialized library was necessary; he even showed me the basic code to make it happen. (I swear I had tried a few times before posting the question on Twitter, so the brevity and elegance of his example were a bit embarrassing for me.)
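A minimal sketch of that no-library approach in Python 3, where urlencode lives in urllib.parse, might look like the following. The helper name build_query_url and the placeholder query are my assumptions for illustration; this is not Mynarz's code or the script's exact code.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(endpoint, sparql):
    """Return a GET URL that passes the query in the 'query' parameter.

    build_query_url is a hypothetical helper name.
    """
    # urlencode percent-encodes the query text and turns spaces into '+'.
    return endpoint + "?" + urlencode({"query": sparql})


# Placeholder CONSTRUCT query, only here to show the encoding.
sparql = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"
url = build_query_url(ENDPOINT, sparql)
print(url)
```

With RDFLib installed, a URL built this way is the kind of value that can be handed to a graph's parse() method, which then loads the triples that the endpoint returns.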

You can find my new Python script to replace last month's work on github. More than half of it is made up of the actual SPARQL queries being stored in variables. This is a good thing, because it means that the Python instructions (to retrieve triples from the endpoint, to load up the local graph with retrieved triples, to query that graph, and to build and then run new queries based on those query results) all together take up less than half of the script. In other words, the script is more about the queries than about the code to execute them.

The main part of the script isn't very long:

# 1. Get the qnames for the geotagged entities within the city and store in graph g. 

queryRetrieveGeoPoints = queryRetrieveGeoPoints.replace("CITY-QNAME",cityQname)
url = endpoint + "?" + urllib.urlencode({"query": queryRetrieveGeoPoints})
g.parse(url)
logging.info('Triples in graph g after queryRetrieveGeoPoints: ' + str(len(g)))

# 2. Take the subjects in graph g and create queries with a VALUES clause 
#    of up to maxValues of the subjects. 

subjectQueryResults = g.query(queryListSubjects)
splitAndRunRemoteQuery("querySubjectData",subjectQueryResults,
                       entityDataQueryHeader,entityDataQueryFooter)

# 3. See what classes are used and get their names and those of their superclasses.
classList = g.query(listClassesQuery)
splitAndRunRemoteQuery("queryGetClassInfo",classList,
                       queryGetClassesHeader,queryGetClassesFooter)

# 4. See what objects need labels and get them.
objectsThatNeedLabel = g.query(queryObjectsThatNeedLabel)
splitAndRunRemoteQuery("queryObjectsThatNeedLabel",objectsThatNeedLabel,
                       queryGetObjectLabelsHeader,queryGetObjectLabelsFooter)

print(g.serialize(format = "n3"))   # (Actually Turtle, which is what we want, not n3.)

The splitAndRunRemoteQuery function was one I wrote based on my prototype from last month.
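The script on GitHub has the real function. What follows is only a sketch of the batching idea that it builds on, with hypothetical names (batch_values_queries, max_values), a made-up header and footer, and placeholder entity IDs; it builds the query strings but does not send them anywhere.

```python
def batch_values_queries(subjects, header, footer, max_values=50):
    """Yield one query string per batch of at most max_values subjects.

    subjects is a list of qnames or URIs. Each query is the header, then the
    batch's subjects (one per line), then the footer, so the header should end
    by opening a VALUES block and the footer should begin by closing it.
    """
    if max_values < 1:
        raise ValueError("max_values must be at least 1")
    for start in range(0, len(subjects), max_values):
        batch = subjects[start:start + max_values]
        yield header + "\n".join("    " + s for s in batch) + "\n" + footer


# Hypothetical header and footer; the entity IDs are placeholders.
header = "CONSTRUCT { ?s ?p ?o } WHERE {\n  VALUES ?s {\n"
footer = "  }\n  ?s ?p ?o .\n}"
subjects = ["wd:Q%d" % n for n in range(1, 8)]   # seven placeholder subjects

queries = list(batch_values_queries(subjects, header, footer, max_values=3))
print(len(queries))    # 3 batches: 3 + 3 + 1 subjects
print(queries[-1])
```

Lowering max_values is the knob to turn if the endpoint still times out, at the cost of sending more requests.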

I first used RDFLib over 15 years ago, when SPARQL hadn't even been invented yet. Hardcore RDFLib fans will prefer the greater efficiency of its native functions over the use of SPARQL queries, but my goal here was to have SPARQL 1.1 queries drive all the action, and RDFLib supports this very nicely. Its native functions also offer additional capabilities that bring it closer to some of the pipelining things I remember from SPARQLMotion. For example, the set operations on graphs let you perform actions such as unions, intersections, differences, and XORs of graphs, which can be handy when mixing and matching data from multiple sources to massage that data into a single cleaned-up dataset--just the kind of thing that makes RDF so great in the first place.

Picture by Michael Coghlan on Flickr (CC BY-SA 2.0)

Dividing and conquering SPARQL endpoint retrieval

2018-07-22T15:52:42Z

With the VALUES keyword.

When I first tried SPARQL's VALUES keyword (at which point it was pretty new to SPARQL, having only recently been added to SPARQL 1.1) I demoed it with a fairly artificial example. I later found that it solved one particular problem for me by letting me create a little lookup table. Recently, it gave me huge help in one of the most classic SPARQL development problems of all: how to retrieve so much data from an endpoint that the first attempts at that retrieval resulted in timeouts.

The Wikidata:SPARQL query service/queries page includes an excellent Wikidata query to find latitudes and longitudes for places in Paris. You can easily modify this query to retrieve from places within other cities, and I wanted to build on this query to make it retrieve additional available data about those places as well. While accounting for the indirection in the Wikidata query model made this a little more complicated, it wasn't much trouble to write.

The expanded query worked great for a city like Charlottesville, where I live, but for larger cities, the query was just asking for too much information from the endpoint and timed out. My new idea was to first ask for roughly the same information that the Paris query above does, and to then request additional data about those entities a batch at a time with a series of queries that use the VALUES keyword to specify each batch. (I've pasted a sample query requesting one batch below.)

It worked just fine. I put all the queries and other relevant files in a zip file for people who want to check it out, but it's probably not worth looking at too closely, because in a month or two I'll be replacing it with a Python version that does everything more efficiently. It's still worth explaining the steps in this version's shell script driver file, because the things I worked out for this prototype effort--despite its Perl scripting and extensive disk I/O--mean that the Python version should come together pretty quickly. That's what prototypes are for!

The driver shell script

Before running the shell script, you specify the Wikidata local name of the city to query near the top of the getCityEntities.rq SPARQL query file. (This is easier than it sounds--for example, to do it for Charlottesville, go to its Wikipedia page and click Wikidata item in the menu on the left to find that Q123766 is the local name.)

Once that's done, running the zip file's getCityData.sh shell script executes these main steps:

It uses a curl command to send the getCityEntities.rq CONSTRUCT query to the https://query.wikidata.org/sparql endpoint. The curl command saves the resulting triples in a file called cityEntities.ttl.
It uses ARQ to run the listSubjects.rq query on the new cityEntities.ttl file, specifying that the result should be a TSV file.
The results of listSubjects.rq get piped to a Perl script called makePart2Queries.pl. This creates a series of CONSTRUCT query files that ask Wikidata for data about entities listed in a VALUES section. It puts 50 entries in each file's VALUES section; this figure of 50 is stored in a $maxLines variable in makePart2Queries.pl where it can be reset if the endpoint is still timing out. This step also adds lines to a shell script called callTempQueries.sh, where each line uses curl to call one of the queries that uses VALUES to request a batch of data.
getCityData.sh next runs the callTempQueries.sh shell script to execute all of these new queries, storing the resulting triples in the file tempCityData.ttl.
The tempCityData.ttl file has plenty of good data, but it can be used to get additional relevant data, so the script's next line runs a query that creates a TSV file with a list of all of the classes found in tempCityData.ttl triples of the form {?instance wdt:P31 ?class}. (wdt:P31 is the Wikidata equivalent of rdf:type, indicating that a resource is an instance of a particular class.) That TSV file then drives the creation of a query that gets sent to the SPARQL endpoint to ask about the classes' parent and grandparent classes, and that data gets added to tempCityData.ttl.
Another ARQ call in the script uses a local query to check for triple objects in the http://www.wikidata.org/entity/ namespace that don't have rdfs:label values and get them--or at least, get the English ones, but it's easy to fix if you want labels in different or additional languages.
The script runs one final ARQ query on tempCityData.ttl: the classic SELECT * WHERE {?s ?p ?o}. This request for all the triples actually tidies up the Turtle data a bit, storing all the triples with common subjects together. It puts the result in cityData.ttl.

One running theme of some of the shell script's steps is the retrieval of labels associated with qnames. Wikidata has a lot of triples like {wd:Q69040 wd:P361 wd:Q16950} that are just three qnames, so retrieved data will have more value to applications if people and processes can find out what each qname refers to.

The main shell script has other housekeeping steps such as recording of the start and end times and deletion of the temporary files. I had more ideas for things to add, but I'll save those for the Python version.

The Python version won't just be a more efficient version of my use of VALUES to do batch retrievals of data that might otherwise time out. It will demonstrate, more nicely, something that only gets hinted at in this mess of shell and Perl scripts: the ability to automate the generation of SPARQL queries that build on the results of previously executed queries so that they can all work together as a pipeline to drive increasingly sophisticated RDF application development.

Here is a sample of one of the queries created to request data about one batch of entities within the specified city:

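PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>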

CONSTRUCT
{ ?s ?p ?o. 
  ?s ?p1 ?o1 . 
  ?s wgs84:lat ?lat . 
  ?s wgs84:long ?long .
  ?p rdfs:label ?pname .
  ?s wdt:P31 ?class .   
}
WHERE {
  VALUES ?s {


# about 48 more of those here...
}
  # wdt:P131 means 'located in the administrative territorial entity' .
  ?s wdt:P131+ ?geoEntityWikidataID .  
  ?s p:P625 ?statement . # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .

  # Reduce the indirection used by Wikidata triples. Based on Tommy Potter query
  # at http://www.snee.com/bobdc.blog/2017/04/the-wikidata-data-model-and-yo.html.
  ?s ?directClaimP ?o .                   # Get the truthy triples. 
  ?p wikibase:directClaim ?directClaimP . # Find the wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates.

  # the following VALUES clause is actually faster than just
  # having specific triple patterns for those 3 p1 values.
  ?s ?p1 ?o1 .
  VALUES ?p1 {
    schema:description
    rdfs:label        
    skos:altLabel
  }

  ?s wdt:P31 ?class . # Class membership. Pull this and higher level classes out in later query.
  
  # If only English names desired
  FILTER (isURI(?o1) || lang(?o1) = 'en' )
  # For English + something else, follow this pattern: 
  # FILTER (isURI(?o1) || lang(?o1) = 'en' || lang(?o1) = 'de')

  FILTER(lang(?pname) = 'en')
}

Neon sign picture by Jeremy Brooks on Flickr (CC BY-NC 2.0)

Running and querying my own Wikibase instance

2018-06-17T15:17:14Z

Querying it, of course, with SPARQL.

Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. This dockerized version of Wikibase looks like a big step in this direction.

When Dario Taraborelli tweeted about how quickly he got a local wikibase instance and SPARQL endpoint up and running with wikibase-docker, he inspired me to give it a shot, and it was surprisingly easy and fun.

I have minimal experience with docker. As instructed by wikibase-docker's README page, I installed docker and docker-compose. (When I got to the Test Docker Installation part of the Get Started, Part 1: Orientation and setup page for setting up docker, the hello-world app gave me a "permission denied" problem, but this solution described at Techoverflow solved it. I did have to reboot, as it suggested.)

Continuing along with the wikibase-docker README, when I clicked "http://localhost:8181" under Accessing your Wikibase instance and the Query Service UI it was pretty cool to see my own local running instance of the wiki:

Moving along in the README, I clicked "Create a new item" before I clicked "Create a new property", but when I saw that the new item's property list offered no choices, I realized that I should define some properties before creating any items. Properties and items can have names, aliases, and descriptions in a wide choice of spoken languages, and Wikibase includes a nice choice of data types.

After defining a property and creating items that had a value for that property, the "Query Service UI @ http://localhost:8282" link on the README led to a web form where I could enter a SPARQL query. I entered SELECT * WHERE { ?s ?p ?o} and saw the default triples that were part of the store as well as triples about the items and property that I had created.

The "Get an RDF dump from wikibase" docker command on the README page did just fine. Reviewing the triples in its output, I saw that the created entities fit the Wikidata data model described at Wikibase/DataModel/Primer, which I wrote about at The Wikidata data model and your SPARQL queries.

It took me some time (and a tweet) to realize that the "Query Service Backend (Behind a proxy)" URL listed on the README file was the URL for the SPARQL endpoint. The first query I tried after that worked with no problem:

curl http://localhost:8989/bigdata/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
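The percent-encoded text in that URL is easier to read once decoded. This short Python sketch, which uses only the standard library, recovers the SPARQL query that the curl command sends and then re-encodes it to show where the %20, %3F, %7B, and %7D sequences come from.

```python
from urllib.parse import parse_qs, quote, urlsplit

url = ("http://localhost:8989/bigdata/sparql?query="
       "SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D")

# Decode the query string to recover the SPARQL text that curl sends.
sparql = parse_qs(urlsplit(url).query)["query"][0]
print(sparql)    # SELECT DISTINCT ?p WHERE { ?s ?p ?o }

# Re-encode it: quote() turns spaces into %20, '?' into %3F,
# and the curly braces into %7B and %7D.
encoded = quote(sparql, safe="")
print(encoded)
```

So the query asks for the distinct predicates in the store, which is a quick way to see which properties a new Wikibase instance already uses.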

It was also easy to access this server from my phone across my home wifi when I substituted the machine's name or IP address for "localhost" in the URLs above. The web interface was the same on a phone as on a big screen; the MediaWiki project's Mobiles, tablets and responsive design manual page describes some options for extending the interface. If someone out there is looking for UI work and has some time on their hands, contributing some phone and tablet responsiveness to this open source project would be a great line on your résumé.

And finally, while the docker version of this is quick to get up and running, if you're going far with your own MediaWiki installation, you'll want to look over the Installation instructions for the regular, non-docker version.

After I did these experiments and wrote my first draft of this, I discovered the medium.com posting Wikibase for Research Infrastructure -- Part 1 by Pratt Institute librarian and researcher Matt Miller. His piece describes a nice use case of following through on creating a Wikibase application and points to some handy Python scripts for automating the creation of classes and other structures from spreadsheets. His use case happens to be one of my favorite RDF-related available data sources: the Linked Jazz Project. I look forward to Part 2.

It's great to have such a comprehensive system running on my local machine, complete with a web interface that lets non-RDF people create and edit any data they want and, for the RDF people, a SPARQL interface to let them pull and manipulate that data. For more serious dataset development, the MediaWiki project includes some helpful documentation about how to define your own classes and associated properties and forms. (July 20th note: that page is actually about Semantic MediaWiki, which I played around with a few years ago--apparently I didn't keep my notes on that and Wikibase as organized as I should have.)

Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. The dockerized version of Wikibase looks like a big step in this direction.

Please add any comments to this Google+ post.


RDF* and SPARQL*

2018-05-28T13:36:59Z

Reification can be pretty cool.

After I posted Reification is a red herring (and you don't need property graphs to assign data to individual relationships) last month, I had an amusingly difficult time explaining to my wife how that could generate so much Twitter activity. This month I wanted to make it clear that I'm not opposed to reification in and of itself, and I wanted to describe the fun I've been having playing with Olaf Hartig and Bryan Thompson's RDF* and SPARQL*, extensions to RDF and SPARQL that make reification more elegant.

In that post, I said that in many years of using RDF I've never needed to use reification because, for most use cases where it was a candidate solution, I was better off using RDFS to declare classes and properties that reflected the use case domain instead of going right to the standard reification syntax (awkward in any standardized serialization) that let me create triples about triples. My soapbox ranting in that post focused on the common argument that the property graph approach of systems like Tinkerpop and Neo4j is better than RDF because achieving similar goals in RDF would require reification; as I showed, it doesn't.

But, reification can still be very useful, especially in the world of metadata. (I am slightly jealous of the metadata librarians of the world for having the word "metadata" in their job title--it sounds even cooler in Canada: Bibliothécaire aux métadonnées.) If metadata is data about data, and more and more of the Information Science world is taking advantage of linked data technologies, then triples about triples are bound to be useful in their use of information for provenance, curation, and all kinds of scholarship about datasets.

The conclusion of my blog post mentioned how, just as I was finishing it up, I discovered Olaf Hartig and Bryan Thompson's 2014 paper Foundations of an Alternative Approach to Reification in RDF and Blazegraph's implementation of it. I decided to play with this a bit in Blazegraph in order to get a hands-on appreciation of what was possible, and I like it. (Olaf recently mentioned on Twitter that these capabilities are being added into Apache Jena as well, so this isn't just a Blazegraph thing.)

As I described in Trying out Blazegraph two years ago, it's pretty simple to download the Blazegraph jar, start it up, load RDF data, and query it. For my RDF* experiments, I started up Blazegraph and created a Blazegraph namespace with a mode of rdr and then did my first few experiments there.

I started with the examples in Olaf's slides RDF* and SPARQL*: An Alternative Approach to Statement-Level Metadata in RDF. To make the slides visually cleaner, he left out full URIs and prefixes, so I added some to properly see the querying in action. I loaded his slide 15 data into my new Blazegraph namespace, specifying a format of Turtle-RDR. The double brackets that you see here are the RDF* extension that lets us create triples that are themselves resources that we can use as subjects and objects of other triples:

@prefix d:  .
<> d:significance 0.8 ;
      d:source  .

This data tells us that the triple about Kubrick being influenced by Welles has a significance of 0.8 and a source at an article on nofilmschool.com.

I then executed the following query, based on Olaf's from slide 16, with no problem:

PREFIX d:  
SELECT ?x WHERE {
  <> d:significance ?sig .
  FILTER (?sig > 0.7)
}

In this case, the use of the double angle brackets is the SPARQL* extension that lets us do the same thing that this syntax does in RDF*. This query asks for whoever was named as being influenced by Welles in statements that have a significance greater than 0.7. The query worked just fine in Blazegraph.
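To make the idea of a triple serving as the subject of other triples more concrete, here is a rough sketch that uses plain JavaScript arrays instead of a triplestore. It is only an analogy, not RDF* itself, and the identifiers (d:Kubrick, d:influencedBy, d:Welles) and the source value are assumptions made up for the illustration.

```javascript
// Illustration only: plain JavaScript arrays standing in for RDF* data.
// The identifiers and the source value below are assumptions for this sketch.
var influence = ['d:Kubrick', 'd:influencedBy', 'd:Welles']

// Each statement is [subject, predicate, object]; here the subject is itself a triple.
var statements = [
    [influence, 'd:significance', 0.8],
    [influence, 'd:source', 'an article on nofilmschool.com']
]

// Rough equivalent of the query: who was influenced by Welles according to
// statements with a significance greater than the threshold?
function influencedByWelles(stmts, threshold) {
    return stmts
        .filter(function (st) {
            var s = st[0]
            return Array.isArray(s) &&
                s[1] === 'd:influencedBy' && s[2] === 'd:Welles' &&
                st[1] === 'd:significance' && st[2] > threshold
        })
        .map(function (st) { return st[0][0] })
}

var found = influencedByWelles(statements, 0.7)
console.log(found)
```

The point of the sketch is that the inner triple is handled like any other value, which is what the double angle brackets make possible in the data and in the query.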

SPARQL* also lets you query for the components of triples that are being treated as independent resources. From Olaf's slide 17, this next query asks for whoever was influenced by Welles and the significance and source of any returned statements, and it worked fine with the data above:

PREFIX d:  
SELECT ?x ?sig ?src WHERE {
  <> d:significance ?sig ;
  d:source ?src .
}

His slide 18 query returns the same result as that one, but takes the syntax a bit further by binding the triple pattern about someone being influenced by Welles to a variable and then querying for that:

PREFIX d:  
SELECT ?x ?sig ?src WHERE {
  BIND(<> AS ?t)
  ?t  d:significance ?sig ;
      d:source ?src .
}

Moving on to more easy experiments, I found that all the examples on the Blazegraph page Reification Done Right worked exactly as shown there. That page also provides some nice background for ways to use RDF* and SPARQL* in Blazegraph.

Blazegraph lets you do inferencing, so I couldn't resist mixing that with RDF* and SPARQL*. I had to create a new Blazegraph namespace that not only had a Mode of rdr but also had the "Inference" box checked upon creation, and then I loaded this data:

@prefix d:     .
@prefix rdfs:  .

<> a d:Class2 .
<> a d:Class3 .

d:Class2 rdfs:subClassOf d:Class1 . 
d:Class3 rdfs:subClassOf d:Class1 .

It creates two triples that are themselves resources, with one being an instance of Class2 and the other being an instance of Class3. Two final triples tell us that each of those classes is a subclass of Class1. The following query asked for triples that are instances of Class1, despite the data having no explicit triples about Class1 instances, and Blazegraph did the inferencing and found both of them:

PREFIX d:  
SELECT ?x ?y ?z WHERE {
   <<?x ?y ?z>> a d:Class1 . 
}
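The inference involved here is small enough to sketch in a few lines of JavaScript. This is only an illustration of the rdfs:subClassOf rule over plain arrays, not a description of how Blazegraph implements it; the names t1 and t2 are placeholders for the two triples that are being treated as resources.

```javascript
// Illustration only: a tiny rdfs:subClassOf inference over plain arrays.
// t1 and t2 are placeholder names for the two triples-as-resources.
var data = [
    ['t1', 'rdf:type', 'd:Class2'],
    ['t2', 'rdf:type', 'd:Class3'],
    ['d:Class2', 'rdfs:subClassOf', 'd:Class1'],
    ['d:Class3', 'rdfs:subClassOf', 'd:Class1']
]

// Return every resource that is an instance of cls, either directly or
// through one or more rdfs:subClassOf links.
function instancesOf(triples, cls) {
    var classes = [cls]
    // Collect cls plus all of its direct and indirect subclasses.
    for (var i = 0; i < classes.length; i++) {
        triples.forEach(function (t) {
            if (t[1] === 'rdfs:subClassOf' && t[2] === classes[i] &&
                classes.indexOf(t[0]) === -1) {
                classes.push(t[0])
            }
        })
    }
    return triples
        .filter(function (t) {
            return t[1] === 'rdf:type' && classes.indexOf(t[2]) !== -1
        })
        .map(function (t) { return t[0] })
}

var class1Instances = instancesOf(data, 'd:Class1')
console.log(class1Instances)
```

Neither t1 nor t2 is explicitly declared as a Class1 instance, yet both come back, which mirrors what the query above found.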

After doing this inferencing, I was thinking that OWL metadata and inferencing about such triples should open up a lot of new possibilities, but I realized that none of those possibilities are necessarily new: they'll just be easier to implement than they would have been using the old method of reification that used four triples to represent one. Still, being easier to implement counts for plenty, and I think that metadata librarians and other people doing work to build value around existing triples now have a reasonable syntax and some nice tools to explore this.

Please add any comments to this Google+ post.


Reification is a red herring

2018-04-22T14:14:15Z

And you don't need property graphs to assign data to individual relationships.

RDF's very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better.

I recently tweeted that the ZDNet article Back to the future: Does graph database success hang on query language? was the best overview of the graph database world(s) that I'd seen so far, and I also warned that many such "overviews" were often just Neo4j employees plugging their own product. (The Neo4j company is actually called Neo Technology.) The most extreme example of this is the free O'Reilly book Graph Databases, which is free because it's being given away by its three authors' common employer: Neo Technology! The book would have been more accurately titled "Building Graph Applications with Cypher", the Neo4j query language. This 238-page book on graph databases manages to mention SPARQL and Gremlin only twice each. The ZDNet article above does a much more balanced job of covering RDF and SPARQL, Gremlin and Tinkerpop, and Cypher and Neo4j.

The DZone article RDF Triple Stores vs. Labeled Property Graphs: What's the Difference? is by another Neo employee, field engineer Jesús Barrasa. It doesn't mention Tinkerpop or Gremlin at all, but does a decent job of describing the different approach that property graph databases such as Neo4j and Tinkerpop take in describing graphs of nodes and edges when compared with RDF triplestores. Its straw man arguments about RDF's supposed deficiencies as a data model reminded me of a common theme I've seen over the years.

The fundamental thing that most people don't get about RDF, including many people who are successfully using it to get useful work done, is that RDF's very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better. Just because RDF doesn't require the use of schemas doesn't mean that it can't use them; the RDF Schema Language lets you declare classes, properties, and information about these that you can use to drive user interfaces, to enable more efficient and readable queries, and to do all the other things that people typically use schemas for. Even better, you can develop a schema for the subset of the data you care about (as opposed to being forced to choose between a schema for the whole data set or no schema at all, as with XML), which is great for data integration projects, and then build your schema up from there.

Barrasa writes of property graphs that "[t]he important thing to remember here is that both the nodes and relationships have an internal structure, which differentiates this model from the RDF model. By internal structure, I mean this set of key-value pairs that describe them." This is the first important difference between RDF and property graphs: in the latter, nodes and edges can each have their own separate set (implemented as an array in Neo4j) of key-value pairs. Of course, nodes in RDF don't need this; to say that the node for Jack has an attribute-value pair of (hireDate, "2017-04-12"), we simply make another triple with Jack as the subject and these as the predicate and object.

Describing the other key difference, Barrasa writes that while the nodes of property graphs have unique identifiers, "[i]n the same way, edges, or connections between nodes--which we call relationships--have an ID". Property graph edges are unique at the instance level; if Jane reportsTo Jack and Jack reportsTo Jill, the two reportsTo relationships here each have their own unique identifier and their own set of key-value pairs to store information about each edge.

He writes that in RDF "[t]he predicate will represent an edge--a relationship--and the object will be another node or a literal value. But here, from the point of view of the graph, that's going to be another vertex." Not necessarily, at least for the literal values; these represent the values in RDF's equivalent of the key-value pairs--the non-relationship information being attached to a node such as (hireDate, "2017-04-12") above. This ability is why a node doesn't need its own internal key-value data structure.

He begins his list of differences between property graphs and RDF with the big one mentioned above: "Difference #1: RDF Does Not Uniquely Identify Instances of Relationships of the Same Type," which is certainly true. But, his example, which he describes as "an RDF graph in which Dan cannot like Ann three times", is very artificial.

One of his "RDF workarounds" for using RDF to describe that Dan liked Ann three times is reification, in which we convert each triple to four triples: one saying that a given resource is an RDF statement, a second identifying the resource's subject, a third naming the predicate, and a fourth naming the object. This way, the statement itself has identity, and we can add additional information about it as triples that use the statement's identifier as a subject and additional predicates and objects as key-value pairs such as (time, "2018-03-04T11:43:00") to show when a particular "like" took place. Barrasa writes "This is quite ugly"; I agree, and it can also do bad things to storage requirements.
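As a concrete picture of that four-triple expansion, here is a minimal JavaScript sketch that uses arrays in place of real RDF. The rdf:Statement, rdf:subject, rdf:predicate, and rdf:object terms are the standard reification vocabulary; the statement identifier _:like1 and the m:likes predicate are placeholders made up for the example.

```javascript
// Illustration only: expand one triple into the four triples of standard
// RDF reification. The statement identifier and the d:/m: names are placeholders.
function reify(statementId, triple) {
    return [
        [statementId, 'rdf:type', 'rdf:Statement'],
        [statementId, 'rdf:subject', triple[0]],
        [statementId, 'rdf:predicate', triple[1]],
        [statementId, 'rdf:object', triple[2]]
    ]
}

var reified = reify('_:like1', ['d:Dan', 'm:likes', 'd:Ann'])

// The statement now has an identity, so more triples can describe it.
reified.push(['_:like1', 'm:time', '2018-03-04T11:43:00'])

reified.forEach(function (t) { console.log(t.join(' ')) })
```

One "like" with a timestamp has become five triples, which gives a sense of what this approach does to storage requirements.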

In my 15 years of working with RDF, I have never felt the need to use reification. It's funny how the 2004 RDF Primer 1.0 has a section on reification but the 2014 RDF Primer 1.1 (of which I am proud to be listed in the Acknowledgments) doesn't even mention reification, because simpler modeling techniques are available, so reification was rarely if ever used.

By "modeling techniques" I mean "declaring and then using a model", although in RDF, you don't even have to declare it. If you want to keep track of separate instances of employees, or games, or buildings, you can declare any of these as a class and then create instances of it; similarly, if you want to keep track of separate instances of a particular relationship, declare a class for that relationship and then create instances of it.

How would we apply this to Barrasa's example, where he wants to keep track of information about Likes? We use a class called Like, where each instance identifies who liked who. (When I first wrote that previous sentence, I wrote that we can declare a class called Like, but again, we don't need to declare it to use it. Declaring it is better for serious applications where multiple developers must work together, because part of the point of a schema is to give everyone a common frame of reference about the data they're working with.) The instance could also identify the date and time of the Like, comments associated with it, and anything else you wanted to add as a set of key-value pairs for each Like instance that is implemented as just more triples.

Here's an example. After optional declarations of the relevant class and properties associated with it, the following has four Likes showing who liked who when and a "foo" value to demonstrate the association of arbitrary metadata with that Like.

@prefix d:     .
@prefix m:     .
@prefix rdfs:  . 

# Optional schema.
m:Like  a rdfs:Class .          # A class...
m:liker rdfs:domain m:Like .    # and properties that go with this class.
m:liked rdfs:domain m:Like .
m:foo   rdfs:domain m:Like .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:43:00" ;
   m:foo "bar" .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:58:00" ;
   m:foo "baz" .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T12:04:00" ;
   m:foo "bat" .

[] a m:Like ;
   m:liker d:Ann ;
   m:liked d:Dan ;
   m:time "2018-03-04T12:06:00" ;
   m:foo "bam" .

Instead of making up specific identifiers for each Like, I made them blank nodes so that the RDF processing software will generate identifiers and keep track of them.

As to Barrasa's use case of counting how many times Dan liked Ann, it's pretty easy with SPARQL:

PREFIX d:  
PREFIX m: 

SELECT (count(*) AS ?likeCount) WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann .
}

(This query would actually work with just the m:liker and m:liked triple patterns, but as with the example that I tweeted to Dan Brickley about, declaring your RDF resources as instances of classes can lay the groundwork for more efficient and readable queries.) Here is ARQ's output for this query:

-------------
| likeCount |
=============
| 3         |
-------------

Let's get a little fancier. Instead of counting all of Dan's likes of Ann, we'll just list the ones from before noon on March 4, sorted by their foo values:

PREFIX d:  
PREFIX m: 

SELECT ?fooValue ?time WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time ?time ;
        m:foo ?fooValue .
FILTER (?time < "2018-03-04T12:00")
}
ORDER BY ?fooValue

And here is ARQ's result for this query:

------------------------------------
| fooValue | time                  |
====================================
| "bar"    | "2018-03-04T11:43:00" |
| "baz"    | "2018-03-04T11:58:00" |
------------------------------------
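For readers who think more readily in a general-purpose language, the two queries above amount to a filter, a count, and a sort. Here is a rough JavaScript equivalent with the four Like instances written as plain objects; it is only an analogy to what the SPARQL does, and the property names are simplified versions of the ones in the data.

```javascript
// Illustration only: the four Like instances above as plain JavaScript objects.
var likes = [
    { liker: 'Dan', liked: 'Ann', time: '2018-03-04T11:43:00', foo: 'bar' },
    { liker: 'Dan', liked: 'Ann', time: '2018-03-04T11:58:00', foo: 'baz' },
    { liker: 'Dan', liked: 'Ann', time: '2018-03-04T12:04:00', foo: 'bat' },
    { liker: 'Ann', liked: 'Dan', time: '2018-03-04T12:06:00', foo: 'bam' }
]

// First query: how many times did Dan like Ann?
var danLikesAnn = likes.filter(function (l) {
    return l.liker === 'Dan' && l.liked === 'Ann'
})
var likeCount = danLikesAnn.length

// Second query: Dan's likes of Ann before noon, sorted by foo value.
// Timestamps in this format compare correctly as plain strings.
var beforeNoon = danLikesAnn
    .filter(function (l) { return l.time < '2018-03-04T12:00' })
    .sort(function (a, b) { return a.foo < b.foo ? -1 : a.foo > b.foo ? 1 : 0 })
    .map(function (l) { return l.foo + ' ' + l.time })

console.log(likeCount)
console.log(beforeNoon)
```

Each Like is an ordinary instance with its own properties, so nothing about the counting or filtering depends on reification.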

After working through a similar example for modeling flights between New York and San Francisco, Barrasa begins a sentence "Because we can't create such a simple model in RDF..." This is ironic; the RDF model is simpler than the Labeled Property Graph model, because it's all subject-predicate-object triples without the use of additional data structures attached to the graph nodes and edges. His RDF version would have been much simpler if he had just created instances of a class called Flight, because again, while the base model of RDF is the simple triple, more complex models can easily be created by declaring classes, properties, and information about those classes and properties--which we can do by just creating new triples!

To summarize, complaints about RDF that focus on reification are so 2004, and they are a red herring, because they distract from the greater power that RDF's modeling abilities bring to application development.

A funny thing happened after writing all this, though. As part of my plans to look into Tinkerpop and Gremlin and potential connections to RDF as a next step, I was looking into Stardog and Blazegraph's common support of both. I found a Blazegraph page called Reification Done Right where I learned of Olaf Hartig and Bryan Thompson's 2014 paper Foundations of an Alternative Approach to Reification in RDF. If Blazegraph has implemented their ideas, then there is a lot of potential there. And if the Blazegraph folks brought this with them to Amazon Neptune, that would be even more interesting, although apparently that hasn't shown up yet.

Please add any comments to this Google+ post.


Album "Gin & Heptatonic" by my band The Heptatonic Jazz Quintet

2018-03-25T16:52:28Z

Now available on the big streaming services.

(I promise to go back to writing about RDF and related technology with my next entry, which is tentatively titled "Reification is a red herring: you don't need property graphs to assign data to individual relationships.")

Along with the jazz bass playing that I've been working on since 2003, I've written a few jazz tunes to try with the people I played with, so I recently got together some of my favorite local musicians and recorded an album of these songs. As soon as I told my wife that I planned to call the band "The Heptatonic Jazz Quintet" she suggested calling the album "Gin & Heptatonic", and I couldn't argue with that. (A heptatonic scale is a scale with seven notes, like most scales in Western music. And of course, beginning with "hep" makes it a great name for a jazz band. I was thrilled to grab the domain name heptatonic.com for only $12.) The music is mostly hard bop, swing, and variations on those.

My brother Peter produced the album and did the excellent Prestige and Blue Note-inspired front cover using a picture that I found in a Flickr search for Creative Commons CC BY 2.0 images. I did the back cover myself with a deep dive into GIMP. (On the topic of open source Linux-Windows-Mac software that played a role, I love the MuseScore scoring program and used it for lead sheets, MIDI demos, and horn arrangements.)

Two songs have lyrics. I knew that the album's closing song "Let's" required greater lyrical skills than I was capable of, so for that I called in my old New York music friend Philip Shelley. His illustrious musical career included the production of a demo of the last serious rock band I was in many years ago, and he wrote a song on the other demo. (You can read more about my limited New York rock career in an older blog entry.) Because no one in the quintet had any singing ambitions, for those two songs we got special guest Dick Orange, a popular local singer who specializes in "the great American songbook", which generally means songs made famous by Frank Sinatra.

It was interesting to learn about the current infrastructure of getting music out where people can hear it. A former business partner of my brother's recommended TuneCore, so I had them print a hundred CDs and, more importantly, take care of the music publishing administration and distribute the album to Spotify, Tidal, Amazon, Apple Music, iTunes, and other services. (I can't provide you with Apple Music or iTunes links to the album; just search for "heptatonic" from inside of your favorite Apple walled garden.)

So if you like jazz, please check out the album and "Like" the band's Facebook page. If you're in the Charlottesville, Virginia area on June 1st, come to our CD Release Party at Cville Coffee, which has wine and beer in addition to coffee.

And I promise: next I'll go back to blogging about triples!

Please add any comments to this Google+ post.


Playing jazz bass

2018-02-25T17:15:11Z

A brief crash course.

I enjoy writing short tutorials to get people started on something that may have seemed intimidating to them before, and I thought it might be fun to write up something that isn't related to software but that I have thought a lot about in the last 15 years: jazz bass playing.

A few basic patterns that you can repeat over nearly any chord will get you pretty far. Any rock or classical bass player should be able to pick these up quickly. It should also work for any guitar player, because both electric and upright basses are tuned like the low four strings of a guitar. (Of course, the upright lacks frets, so you have to put your left hand's fingers where the frets would be.) This crash course can be useful to keyboard players as well, who can treat it as a guide to what to play with their left hand for jazz tunes.

You can think of just about all jazz as being composed of 7th chords: major 7th, minor 7th, dominant 7th, and, less often, diminished seventh, or half diminished chords. These each consist of four notes, and the distances between the notes are what make them sound different--for example, the first two notes of a major 7th are a major third apart, and in a minor 7th they're a minor third apart. Jazz musicians who see a three note triad chord like D minor may just add the seventh anyway, treating it as a D minor 7th. For a dominant 7th such as G7 in the key of C, jazz musicians since the advent of bebop in the 1940s sometimes add more notes to the chord such as the 9th, 11th, and 13th notes of the root note's scale. They may even shift some of those added notes up or down a half step so that you see a fancy chord name like G#9. As a bass player, just think of that as a G7. To summarize, it's simplest to think of it all as 7th chords.

There are some classic patterns that bass players typically play over these 7th chords, and if you learn a few of them and the notes of the chords, you can play simple jazz bass lines. Guitar players know that if they play the notes of an A minor 7th chord and then move their left hand one fret up the neck and do the same thing, they'll be playing a Bb minor 7th, so learning how to play all the chords means learning only a few patterns that you can play up and down the neck. The same applies to these jazz bassline patterns.

A walking jazz bass line is nearly all quarter notes, so when you see "1357" below, for a given chord in a given bar played in 4/4 time, you would play these four notes as quarter notes: the root of the chord (the 1), the 3rd, the 5th, and the 7th. For example, over an A minor 7th chord, 1357 would mean playing A C E G.
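Since this blog usually has code in it, here is a small JavaScript sketch of that pattern notation. It is a simplification with a few assumptions: it knows only the chord tones 1, 3, 5, 7 and the octave 8 for three chord qualities, and it names every black key as a flat, so some chords will come out with unconventional spellings.

```javascript
// Illustration only: spell out a pattern such as "1357" over a 7th chord.
// Pitch classes, using flat names for the black keys.
var NOTES = ['C', 'Db', 'D', 'Eb', 'E', 'F', 'Gb', 'G', 'Ab', 'A', 'Bb', 'B']

// Semitones above the root for each pattern digit, per chord quality.
var OFFSETS = {
    major7:    { 1: 0, 3: 4, 5: 7, 7: 11, 8: 12 },
    minor7:    { 1: 0, 3: 3, 5: 7, 7: 10, 8: 12 },
    dominant7: { 1: 0, 3: 4, 5: 7, 7: 10, 8: 12 }
}

// Return the note names for one bar of the given pattern over the given chord.
function spell(root, quality, pattern) {
    var rootIndex = NOTES.indexOf(root)
    return pattern.split('').map(function (digit) {
        return NOTES[(rootIndex + OFFSETS[quality][digit]) % 12]
    })
}

console.log(spell('A', 'minor7', '1357').join(' '))
```

Changing the pattern string to '1353' or '8753' gives the notes for other patterns built from the same chord tones.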

For each of these patterns, we'll look at how you would play them on the first four bars of the jazz standard Autumn Leaves. (Compare Nat King Cole's version with Miles Davis's; Miles' fifty-second intro puts off the actual song a bit.)

1357

This is probably the most important pattern, but not the one you'll use the most. It's just an arpeggio of the chord--that is, the playing of each note of the 7th chord from the root up. It's an important pattern to practice with any given song because it helps you to really understand the song's structure. Over the first four bars of Autumn Leaves, this pattern would look like this on a bass staff (click the play button underneath it to hear the bass line with a piano and drums generated by the excellent open source scoring program MuseScore):

Repeating the same pattern for four bars is not something you'd want to do when playing with other people, but for this pattern it's something worth doing for an entire song while practicing on your own because it helps you to get to know the song's chords better.

1353

This one is so useful that I use it too often when I'm on automatic pilot. You can't go wrong with it. I mentioned above that the main difference between a major seventh chord and a minor seventh chord is the "3" note; this pattern really brings that out while still hitting the most important notes of the chord from a bass player's perspective--the root and the fifth--on the crucial first and third beats of the bar. Here it is over the start of Autumn Leaves:

1155

This seems almost too simple, but it sounds great if you give it a strong swing feel on a song like Duke Ellington's Satin Doll. Here it is over Autumn Leaves:

1231

The 2nd note of the chord's scale is not a chord tone, but here it leads to a chord tone on the crucial third beat. This is the first pattern we've seen that doesn't always have either a 1 or a 5 on the first and third beat; the 3 on the third beat brings out the color of the chord more. In Autumn Leaves:

1235

Similar to the last one, and similarly useful. In Autumn Leaves:

1875

The 8 here really refers to the 1, but an octave higher. This is our first pattern with a 7th in it. In Autumn Leaves:

If you replace each quarter note in that with two swung eighth notes, you'd have a classic Chicago blues bass line, although major seventh chords don't come up in Chicago blues very often:

(John Paul Jones' bass line in Led Zeppelin's How Many More Times is a variation on this: 1 8757 1 8 7 5.)

8753

Going down from the root of the chord through the chord's other notes is also great. Again, you have the 1 (an octave higher this time) and the 5 on the first and third beat. In Autumn Leaves:

Half bars

Jazz songs typically have one chord per bar. There are songs ranging from I Got Rhythm (and the hundreds of songs based on it) to John Coltrane's Giant Steps that are mostly two chords per bar, but in most jazz you'll see one chord per bar with the occasional two-chord bar at the end of a four- or eight-bar phrase. If you play the chord notes 13, 15, or 85 over each half bar, you'll be fine. Here are the first four bars of "I Got Rhythm" using 13 13 15 85 13 85 15 85:

Putting some together

Good bass playing mixes and matches these (and more) patterns. Below I've written out a bass line for the first eight bars of Autumn Leaves, labeling which of the patterns above is used in each bar:

Note how all the patterns listed above start with the root note of the chord. This is a solid, dependable thing to do, and greatly aids the jazz bass player's job of showing the others what chord is being played. A step toward more advanced bass playing is getting away from this--for example, starting on the 3 or the 5 of the chord--while still making it clear to the rest of the group exactly which chord is happening. (They should already know, but still, you and the drummer and the piano or guitar player are providing the cake of which the other players' solos are the frosting.)

Using more non-chord tones, the way 1231 and 1235 do above, is also a way to move past beginner status, as is moving beyond playing four quarter notes for every bar. As a first step to moving beyond the patterns above, try substituting 8 for 1 in more of the patterns, and try coming up with your own combinations of 1, 3, 5, 7, and 8. And, listen to great bass players. My favorites are Paul Chambers and Ray Brown, but if you listen to older, pre-bebop jazz, you'll hear more of these simple patterns come up more often.

Please add any comments to this Google+ post.


JavaScript SPARQL

2018-01-28T14:35:35Z

With rdfstore-js.

... all in the world's most popular programming language.

I finally had a chance to play with rdfstore-js by Antonio Garrote and it was all pretty straightforward. I already had node.js installed, so a simple npm install rdfstore installed his library. Then, I was ready to include the library in a JavaScript script that would read some RDF and query it with SPARQL. I just ran my script from the command line, but node.js fans know that they can take advantage of this library's features in much more interesting application architectures. (Before I go on, I wanted to mention that after I tweeted yesterday that this blog entry was coming, Andy Seaborne reminded me about Apache Jena's ability to load and run JavaScript functions. I tried the example from the feature's home page and it worked great right out of the box.)

My sample script starts with a function I wrote for general-purpose output of SPARQL SELECT queries, then creates an rdfstore object and saves a query that will be used twice later in the script. After loading some RDF data about my book Learning SPARQL from the OCLC's Worldcat online library catalog into the rdfstore, it runs the saved query against the loaded data to list ISBN numbers. The script then loads data about another book, runs the same query, and you can see the additional ISBN numbers in the new output.

// Utility function for outputting SELECT results:
// prints one line per result row, with the bound values separated by spaces.
function outputSPARQLResults(results) {
    for (var row in results) {
        var printedLine = ''
        for (var column in results[row]) {
            printedLine = printedLine + results[row][column].value + ' '
        }
        console.log(printedLine)
    }
}

// Load the rdfstore library; the store itself is created further down.
var rdfstore = require('rdfstore')

// Define a query to execute.
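var listISBNs = 'PREFIX s: <http://schema.org/> \
PREFIX ls: <http://learningsparql.com/ns/data#> \
PREFIX wco: <http://www.worldcat.org/title/-/oclc/> \
PREFIX wci: <http://worldcat.org/isbn/> \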
SELECT ?isbn \
FROM ls:g1 WHERE { ?book s:isbn ?isbn } '

rdfstore.create(function(err, store) {   // no error handling
   
    store.execute(
        // Load data about the book Learning SPARQL into named graph g1 in the rdfstore.
        'LOAD  \
        INTO GRAPH ', function(err) {

            store.setPrefix('s', 'http://schema.org/')
            store.setPrefix('ls', 'http://learningsparql.com/ns/data#')
            store.setPrefix('wco', 'http://www.worldcat.org/title/-/oclc/')
            store.setPrefix('wci', 'http://worldcat.org/isbn/')
           
            store.execute(listISBNs, function(err, results) {
                console.log("=== ISBN value ===")
                outputSPARQLResults(results)
            })
        }
    )

    store.execute(
        // Load data about the book "XML: The Annotated Specification" into the same graph
        'LOAD  \
        INTO GRAPH ', function(err) {
            store.execute(listISBNs, function(err, results) {
                console.log("\n=== ISBN values after adding 2nd book's data ===")
                outputSPARQLResults(results)
            })
        }
    )
    
})

The script produces this output:

=== ISBN value ===
9781449371432 
1449371434 

=== ISBN values after adding 2nd book's data ===
9781449371432 
1449371434 
9780130826763 
0130826766

I loaded the data into a named graph because the library documentation's sample query for loading remote data did. I briefly tried loading the data into the default graph, but had no luck; I'm all for the use of named graphs, anyway. I also tried deleting triples from and inserting them into the g1 named graph and then querying again to see the results, and I didn't have much luck there either (no error messages--I just didn't see the query results I expected after the deletion and insertion), but my minimal understanding of node.js asynchronous behavior was probably to blame. The library's github page shows that it does support INSERT and DELETE queries.

I wouldn't use this library's triplestore for ongoing production maintenance of a set of triples, anyway; I see it as a great lightweight way to grab triples from one or more sources and then perform SPARQL queries on those triples to look for subsets and patterns that can contribute to an application, all in the world's most popular programming language.

The rdfstore-js github page also shows that it offers many ways to query and manipulate the loaded data that, for a JavaScript programmer, would be more direct. If Antonio's ultimate goal was to bring RDF to JavaScript developers, I won't complain; I'm just glad that he brought a useful JavaScript library to RDF (and SPARQL) developers.

Please add any comments to this Google+ post.
