31 July 2016

SPARQL in a Jupyter (a.k.a. IPython) notebook

With just a bit of Python to frame it all.

In a recent blog entry for my employer titled GeoMesa analytics in a Jupyter notebook, I wrote

As described on its home page, “The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.” Once you install the open source Jupyter server on your machine, you can create notebooks, share them with others, and learn from notebooks created by others. (You can also learn from others’ notebooks without installing Jupyter locally if those notebooks are hosted on a shared server.)

An animated GIF below that passage shows a sample mix of formatted text and executable Python code in a short Jupyter notebook, and it also demonstrates how code blocks can be tweaked and rerun in place, building on previous code blocks. The blog entry goes on to describe how we at CCRi embedded Scala code in a Jupyter notebook to demonstrate the use of Apache Spark with the Hadoop-based GeoMesa spatio-temporal database to perform data analysis and visualization.

Jupyter supports over 40 languages besides Scala and Python, but not SPARQL. I realized recently, though, that with a minimum of Python code (Python being the original language for these notebooks; "Jupyter" was originally called "IPython") someone who hardly knows Python can enter and run SPARQL queries in a Jupyter notebook.

I created a Jupyter notebook called JupyterSPARQLFun that you can download and try yourself. If you look at the raw version of the file you'll see a lot of JSON, but if you follow that link you'll see that GitHub renders the notebook the same way that a Jupyter server does, so you can read through the notebook and see all the formatted explanations with the code and the results.

If you download the notebook and run it on a Jupyter server (after installing the rdflib and RDFClosure Python libraries), you can edit the cells that have executable code, rerun them, and see the results, just like in the animated GIF mentioned above. In the case of this notebook, you'd be doing SPARQL manipulation of an RDF graph from your copy of the notebook. (I used the Anaconda Jupyter distribution. It was remarkably difficult to find out from their website how to start up Jupyter, but I did learn from the Jupyter Notebook Beginner Guide that you just enter "jupyter notebook" at the command line. When working with a notebook, you'll also find this list of keyboard shortcuts to be handy.)

I won't go into great detail here about what's in the JupyterSPARQLFun notebook, because much of the point of these notebooks is that their ability to mix formatted text with executable code lets people take explanation of code to a new level. So, to find out how I got SPARQL and inferencing working in the notebook, I recommend that you just read the explanations and code that I put in it.

I mentioned above how you can learn from others’ notebooks; some nice examples accompany the Data School Machine Learning videos on YouTube. These videos demonstrate various concepts by adding and running code within notebooks, adding explanatory text along the way. Because I could download the finished notebooks created in the videos, I could run all the example code myself, in place, with no need to copy it from one place and paste it to another. I could also tweak the code samples to try different variations, which made for much more hands-on learning of the machine learning concepts being demonstrated.

That experience really showed me the power of Jupyter notebooks, and it's great to see that with just a little setup Python code, we can do SPARQL querying and RDF inferencing inside these notebooks as well.

screenshot of SPARQL Jupyter notebook

Please add any comments to this Google+ post.

12 June 2016

Emoji SPARQL😝!

If emojis have Unicode code points, then we can...

I knew that emojis have Unicode code points, but it wasn't until I saw this goofy picture in a chat room at work that I began to wonder about using emojis in RDF data and SPARQL queries. I have since learned that the relevant specs are fine with it, but as with the simple display of emojis on non-mobile devices, the tools you use to work with these characters (and the tools used to build those tools) aren't always as cooperative as you'd hope.

After hunting around a bit among these tools, I did have some fun with this. Black and white emojis, as shown in the Browser column of the unicode.org Emoji Data page, display with no problem in my Ubuntu terminal window and in web page forms, but I wanted the full-color emojis from that page's Sample column. The Emacs Emojify mode did the trick, so what you see below are screen shots from there.

sample RDF with emoji

I started by converting that same unicode.org web page (as opposed to the site's much larger Full Emoji Data page) to a Turtle file called emoji-list.ttl with a short perl script. (You can find both in github at emojirdf.) On the right, you can see triples from that web page's row about the french fries emoji. For the keywords assigned to each character, the Emoji Data web page has links, so it was tempting to use the link destinations as URI values for the lse:annotation values instead of strings, but some of those link destinations have local names like +1, which won't make for nice URIs in RDF triples.
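
The triples shown in that screenshot aren't reproduced as text here, but based on the properties that the queries below use, they had roughly this shape (a hypothetical reconstruction; the subject URI and the exact annotation strings are my own guesses):

lse:e1F35F lse:char "🍟" ;
    rdfs:label "french fries" ;
    lse:annotation "fast food" , "french" , "fries" .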

I thought about augmenting my emoji-list.ttl file to turn it into an emoji ontology. I first dutifully searched for "emoji rdf" on Google (which asked me "did you mean emoji pdf? emoji def?") to avoid the reinvention of any wheels. The most promising search result was an Emoji Ontology that adds some interesting metadata to the emojis, but its Final emoji ontology in OWL/XML format has little to do with OWL or even RDF, and I didn't feel like writing the XSLT to convert its additional metadata to proper RDF.

With no proper emoji ontology already available, I thought more about creating my own by adding triples that would arrange the emojis into a hierarchical ontology or taxonomy. This would let me say that the ant 🐜 and the honeybee 🐝 are both insects, and that the ox 🐂 and the many, many cats are mammals, and then I could query for animals and see them all or query for insects and see just the first two. This would add little, though, because the existing annotation values already serve as a non-hierarchical tagging system that identifies insects, so I could just query for those lse:annotation values.

Some of these annotation values led to some fun queries of the emoji-list.ttl file. I used Dave Beckett's Redland roqet utility as a query processor, telling it to give me CSV data that I redirected to a file. Here's a query asking for the character and label of any emojis that have both "face" and "cold" in their annotation values:

PREFIX lse:  <http://learningsparq.com/emoji/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?char ?label
WHERE {
  ?s lse:annotation 'face', 'cold' ;
     rdfs:label ?label ;
     lse:char ?char .
}

It returned this result, showing that "cold" can refer to both low temperature and wintertime sniffles:

result of first SPARQL emoji query

This next query uses emojis in string data to ask which annotations have tagged both the alien head and one of the moon face emojis:

SPARQL query

(Apparently, Emacs SPARQL mode thinks that the "not" in "annotation" is the SPARQL keyword, because it resets the substring's font color.) Here is the query result; note that, as is typical with many query tools, the first row is the variable name, not a returned value:

SPARQL emoji query result

Emoji Unicode code points sit in Unicode's supplementary planes (the satellite dish, for example, is U+1F4E1), a range that SPARQL spec productions 164 - 166 say is legal for use in variable names. The following query requests the satellite dish character's annotation values and stores them in a variable whose three-character name is three emojis:

SPARQL emoji query

Here is our result:

SPARQL query result

This is actually why I used roqet—the Java-based SPARQL engines that I first tried may have implemented the spec faithfully, but some layer of the Java tooling underneath them couldn't handle the full extent of Unicode in every place where it should.
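
As a side check (my own, not part of the original queries), plain Python confirms that these characters' code points fall within the #x10000-#xEFFFF span that the SPARQL grammar's PN_CHARS_BASE production allows:

```python
# Code points of the satellite dish and a few other emojis from this post;
# SPARQL's PN_CHARS_BASE production (number 164 in the grammar) allows
# any character in #x10000-#xEFFFF in variable and prefixed names.
for ch in ["📡", "🐜", "🐝", "🍟"]:
    cp = ord(ch)
    print(f"U+{cp:X} legal in a SPARQL name: {0x10000 <= cp <= 0xEFFFF}")
```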

Emojis in RDF data are not limited to quoted strings. When I told roqet to run a query against this next Turtle file, which uses emoji characters as prefixes and as subject and predicate local names in its one triple, it had no problem:

Turtle file with emoji properties
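
The file itself appears above only as a screenshot, but a legal Turtle file along those lines looks something like this (a hypothetical reconstruction; the particular emojis and namespace are my guesses, while Turtle's PN_CHARS_BASE production really does allow the #x10000-#xEFFFF range in prefix labels and local names):

@prefix 🍝: <http://learningsparql.com/ns/demo#> .
🍝:🍜 🍝:🍴 "a triple with emoji names" .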

This final query went even further, and roqet had no problem with it: it defines a bowl of spaghetti emoji as a namespace prefix and then, using emojis for the variable names, asks for the subjects and objects of any triples that have the predicate from the one triple in the Turtle file above.

Turtle file with emoji properties

Of course, it's difficult to read, and the fact that running the query and even just displaying it required me to dig around for the right combination of tools doesn't speak well for the use of emojis in queries. Besides being a fun exercise, though, the experience and the result—that it all ultimately worked—provided a nice testament to the design of the Unicode, RDF, and SPARQL standards.

Please add any comments to this Google+ post.

17 May 2016

Trying out Blazegraph

Especially inferencing.

I've been hearing more about the Blazegraph triplestore (well, "graph database with RDF support"), especially its support for running on GPUs, and because they also advertise some degree of RDFS and OWL support, I wanted to see how quickly I could try that after downloading the community edition. It was pretty quick.

Downloading from the main download page with my Ubuntu machine got me an rpm file, but I found it simpler to download the jar file version that I could start as a server from the command line as described on the Nano SPARQL Server page. I found the jar file (and several other download options) on the sourceforge page for release 2.1.

The jar file's startup message tells you the URL for the web-based interface to the Nano SPARQL Server, shown here:

At this point, uploading some RDF on the UPDATE tab and issuing SPARQL queries on the QUERY tab was easy. I was more interested in sending it SPARQL queries that could take advantage of RDFS and OWL inferencing, so after a little help from Blazegraph Chief Scientist Bryan Thompson via their mailing list (with a quick answer on a Saturday) I learned how: I had to first create a namespace on the NAMESPACES tab with the Inference checkbox checked. The same form also offers checkboxes for Isolatable indexes, Full text index, and Enable geospatial when configuring a new namespace. I found this typical of how Blazegraph lets you configure it to take advantage of more powerful features while leaving the out-of-box configuration simple and easy to use.

For finer-grained namespace configuration, after you select checkboxes and click the Create namespace button, a dialog box lets you edit the configuration details, with each of these lines explained in the Blazegraph documentation:

I wanted to check Blazegraph's support for owl:TransitiveProperty, because this is such a basic, useful OWL class, as well as its ability to do subclass inferencing. I created some data about chairs, desks, rooms, and buildings, specifying which chairs and desks were in which rooms and which rooms were in which buildings, and also made dm:locatedIn a transitive property:

@prefix d: <http://learningsparql.com/ns/data#> .
@prefix dm: <http://learningsparql.com/ns/demo#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

dm:Room rdfs:subClassOf owl:Thing .
dm:Building rdfs:subClassOf owl:Thing .
dm:Furniture rdfs:subClassOf owl:Thing .
dm:Chair rdfs:subClassOf dm:Furniture .
dm:Desk rdfs:subClassOf dm:Furniture .

dm:locatedIn a owl:TransitiveProperty. 

d:building100 rdf:type dm:Building .
d:building200 rdf:type dm:Building .
d:room101 rdf:type dm:Room ; dm:locatedIn d:building100 . 
d:room102 rdf:type dm:Room ; dm:locatedIn d:building100 . 
d:room201 rdf:type dm:Room ; dm:locatedIn d:building200 . 
d:room202 rdf:type dm:Room ; dm:locatedIn d:building200 . 

d:chair15 rdf:type dm:Chair ; dm:locatedIn d:room101 . 
d:chair23 rdf:type dm:Chair ; dm:locatedIn d:room101 . 
d:chair35 rdf:type dm:Chair ; dm:locatedIn d:room202 . 
d:desk22 rdf:type dm:Desk ; dm:locatedIn d:room101 . 
d:desk59 rdf:type dm:Desk ; dm:locatedIn d:room202 . 

The following query asks for furniture in building 100. No triples above will match either of the query's two triple patterns, so a SPARQL engine that can't do inferencing won't return anything. I wanted the query engine to infer that if chair 15 is a Chair, and Chair is a subclass of Furniture, then chair 15 is Furniture; also, if that furniture is in room 101 and room 101 is in building 100, then that furniture is in building 100.

PREFIX dm: <http://learningsparql.com/ns/demo#> 
PREFIX d: <http://learningsparql.com/ns/data#> 
SELECT ?furniture
WHERE {
  ?furniture a dm:Furniture .
  ?furniture dm:locatedIn d:building100 .
}

We need the first triple pattern because the data above includes triples saying that rooms 101 and 102 are located in building 100, so those would have bound to ?furniture in the second triple pattern if the first triple pattern wasn't there. This is a nice example of why declaring resources as instances of specific classes, while not necessary in RDF, does a favor to anyone who will query that data—it makes it easier for them to specify more detail about exactly what data they want.

When using this query and data in a namespace (in the Blazegraph sense of the term) configured to do inferencing, Blazegraph executed the query against the original triples plus the inferred triples and listed the furniture in building 100:

Several years ago I backed off from discussions of the "semantic web" as a buzzphrase tying together technology around RDF-related standards because I felt that the phrase was not aging well and that the technology could be sold on its own without the buzzphrase, but the example above really does show semantics at work. Saying that dm:locatedIn is a transitive property stores some semantics about that property, and these extra semantics let me get more out of the data set: they let me query for which furniture is in which building, even though the data has no explicit facts about furniture being in buildings. (Saying that Desk and Chair are subclasses of Furniture also stores semantics about all three terms, but that won't be as interesting to a typical developer with object-oriented experience.)

Blazegraph calls their subset of OWL RDFS+, which was inspired by Jim Hendler and Dean Allemang's RDFS+ superset of RDF that added in OWL's most useful bits. (It's similar but not identical to AllegroGraph's RDFS++ profile, which has the same goal.) Blazegraph's Product description page describes which parts of OWL it supports, and their Inference And Truth Maintenance page describes more.

A few other interesting things about Blazegraph as a triplestore and query engine:

  • The REST interface offers access to a wide range of features.

  • Queries can include Query Hints to optimize how the SPARQL engine executes them, which will be handy if you plan on scaling way up.

  • I saw no direct references to GeoSPARQL in the Blazegraph documentation, but they recently announced support for geospatial SPARQL queries. (I've been learning a lot about working with geospatial data at Hadoop scale with GeoMesa.)

Blazegraph's main selling points seem to be speed and scalability (for example, see its Scaleout Cluster mode), and I didn't play with those at all, but I liked seeing that SPARQL querying with inferencing support can take advantage of such new-hotness technology as GPUs. It will be interesting to see where Blazegraph takes it.

Please add any comments to this Google+ post.
