17 June 2018

Running and querying my own Wikibase instance

Querying it, of course, with SPARQL.

Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. This dockerized version of Wikibase looks like a big step in this direction.

When Dario Taraborelli tweeted about how quickly he got a local Wikibase instance and SPARQL endpoint up and running with wikibase-docker, he inspired me to give it a shot, and it was surprisingly easy and fun.

I have minimal experience with docker. As instructed by wikibase-docker's README page, I installed docker and docker-compose. (When I got to the Test Docker Installation part of docker's Get Started, Part 1: Orientation and setup page, the hello-world app gave me a "permission denied" error, but the solution described at Techoverflow fixed it. I did have to reboot, as it suggested.)

Continuing along with the wikibase-docker README, I clicked "http://localhost:8181" under Accessing your Wikibase instance and the Query Service UI, and it was pretty cool to see my own local instance of the wiki up and running.

Moving along in the README, I clicked "Create a new item" before I clicked "Create a new property", but when I saw that the new item's property list offered no choices, I realized that I should define some properties before creating any items. Properties and items can have names, aliases, and descriptions in a wide choice of spoken languages, and Wikibase includes a nice selection of data types.

After defining a property and creating items that had values for that property, the "Query Service UI @ http://localhost:8282" link on the README led to a web form where I could enter a SPARQL query. I entered SELECT * WHERE { ?s ?p ?o } and saw the default triples that were part of the store as well as triples about the items and property that I had created.
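A slightly more targeted query would be something along these lines--a sketch that assumes only that Wikibase stores entity labels as rdfs:label, as the Wikidata RDF model does:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# List each entity in the local store that has an English label.
SELECT ?entity ?label WHERE {
  ?entity rdfs:label ?label .
  FILTER (lang(?label) = "en")
}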

The "Get an RDF dump from wikibase" docker command on the README page did just fine. Reviewing the triples in its output, I saw that the created entities fit the Wikidata data model described at Wikibase/DataModel/Primer, which I wrote about at The Wikidata data model and your SPARQL queries.

It took me some time (and a tweet) to realize that the "Query Service Backend (Behind a proxy)" URL listed in the README was the URL for the SPARQL endpoint. The first query I tried after that worked with no problem:

curl http://localhost:8989/bigdata/sparql?query=SELECT%20DISTINCT%20%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
      
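URL-decoded, that query parameter is just the following, so this one command pulls back every distinct predicate in the store:

SELECT DISTINCT ?p WHERE { ?s ?p ?o }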

It was also easy to access this server from my phone across my home wifi when I substituted the machine's name or IP address for "localhost" in the URLs above. The web interface was the same on a phone as on a big screen; the MediaWiki project's Mobiles, tablets and responsive design manual page describes some options for extending the interface. If someone out there is looking for UI work and has some time on their hands, contributing some phone and tablet responsiveness to this open source project would be a great line on their résumé.

And finally, while the docker version of this is quick to get up and running, if you're going to go far with your own MediaWiki installation, you'll want to look over the Installation instructions for the regular, non-docker version.

After I did these experiments and wrote my first draft of this, I discovered the medium.com posting Wikibase for Research Infrastructure -- Part 1 by Pratt Institute librarian and researcher Matt Miller. His piece describes a nice use case of following through on creating a Wikibase application and points to some handy Python scripts for automating the creation of classes and other structures from spreadsheets. His use case happens to involve one of my favorite RDF-related data sources: the Linked Jazz Project. I look forward to Part 2.

It's great to have such a comprehensive system running on my local machine, complete with a web interface that lets non-RDF people create and edit any data they want and, for the RDF people, a SPARQL interface to let them pull and manipulate that data. For more serious dataset development, the MediaWiki project includes some helpful documentation about how to define your own classes and associated properties and forms.

Many of us have waited years for an open-source framework that makes the development of web-based RDF applications as easy as Ruby on Rails does for web-based SQL applications. The dockerized version of Wikibase looks like a big step in this direction.


Please add any comments to this Google+ post.

28 May 2018

RDF* and SPARQL*

Reification can be pretty cool.

[image: a triple within a triple]

After I posted Reification is a red herring (and you don't need property graphs to assign data to individual relationships) last month, I had an amusingly difficult time explaining to my wife how that could generate so much Twitter activity. This month I wanted to make it clear that I'm not opposed to reification in and of itself, and I wanted to describe the fun I've been having playing with Olaf Hartig and Bryan Thompson's RDF* and SPARQL* extensions to these standards, which make reification more elegant.

In that post, I said that in many years of using RDF I've never needed to use reification because, for most use cases where it was a candidate solution, I was better off using RDFS to declare classes and properties that reflected the use case domain instead of going right to the standard reification syntax (awkward in any standardized serialization) that let me create triples about triples. My soapbox ranting in that post focused on the common argument that the property graph approach of systems like Tinkerpop and Neo4j is better than RDF because achieving similar goals in RDF would require reification; as I showed, it doesn't.

But reification can still be very useful, especially in the world of metadata. (I am slightly jealous of the metadata librarians of the world for having the word "metadata" in their job title--it sounds even cooler in Canada: Bibliothécaire aux métadonnées.) If metadata is data about data, and more and more of the Information Science world is taking advantage of linked data technologies, then triples about triples are bound to be useful for provenance, curation, and all kinds of other scholarship about datasets.

The conclusion of my blog post mentioned how, just as I was finishing it up, I discovered Olaf Hartig and Bryan Thompson's 2014 paper Foundations of an Alternative Approach to Reification in RDF and Blazegraph's implementation of it. I decided to play with this a bit in Blazegraph in order to get a hands-on appreciation of what was possible, and I like it. (Olaf recently mentioned on Twitter that these capabilities are being added into Apache Jena as well, so this isn't just a Blazegraph thing.)

As I described in Trying out Blazegraph two years ago, it's pretty simple to download the Blazegraph jar, start it up, load RDF data, and query it. For my RDF* experiments, I started up Blazegraph and created a Blazegraph namespace with a mode of rdr and then did my first few experiments there.

I started with the examples in Olaf's slides RDF* and SPARQL*: An Alternative Approach to Statement-Level Metadata in RDF. To make the slides visually cleaner, he left out full URIs and prefixes, so I added some to properly see the querying in action. I loaded his slide 15 data into my new Blazegraph namespace, specifying a format of Turtle-RDR. The double angle brackets that you see here are the RDF* extension that lets us create triples that are themselves resources, which we can then use as subjects and objects of other triples:

@prefix d: <http://www.learningsparql.com/ns/data/> .
<<d:Kubrik d:influencedBy d:Welles>> d:significance 0.8 ;
      d:source <https://nofilmschool.com/2013/08/films-directors-that-influenced-stanley-kubrick> .

This data tells us that the triple about Kubrik being influenced by Welles has a significance of 0.8 and a source at an article on nofilmschool.com.

I then executed the following query, based on the one on Olaf's slide 16:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x WHERE {
  <<?x d:influencedBy d:Welles>> d:significance ?sig .
  FILTER (?sig > 0.7)
}

In this case, the double angle brackets are the SPARQL* extension that lets us do the same thing in a query that this syntax does in RDF*. This query asks for anyone named as being influenced by Welles in statements that have a significance greater than 0.7; it worked just fine in Blazegraph, returning d:Kubrik as expected.

SPARQL* also lets you query for the components of triples that are being treated as independent resources. From Olaf's slide 17, this next query asks for whoever was influenced by Welles and the significance and source of any returned statements, and it worked fine with the data above:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?sig ?src WHERE {
  <<?x d:influencedBy d:Welles>> d:significance ?sig ;
      d:source ?src .
}

His slide 18 query returns the same result as that one, but takes the syntax a bit further by binding the triple pattern about someone being influenced by Welles to a variable and then using that variable in additional triple patterns:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?sig ?src WHERE {
  BIND(<<?x d:influencedBy d:Welles>> AS ?t)
  ?t  d:significance ?sig ;
      d:source ?src .
}

Moving on to other easy experiments, I found that all the examples on the Blazegraph page Reification Done Right worked exactly as shown there. That page also provides some nice background on ways to use RDF* and SPARQL* in Blazegraph.

Blazegraph lets you do inferencing, so I couldn't resist mixing that with RDF* and SPARQL*. I had to create a new Blazegraph namespace that not only had a mode of rdr but also had the "Inference" box checked upon creation, and then I loaded this data:

@prefix d:    <http://www.learningsparql.com/ns/data/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<<d:s1 d:p1 d:o1>> a d:Class2 .
<<d:s2 d:p2 d:o2>> a d:Class3 .

d:Class2 rdfs:subClassOf d:Class1 . 
d:Class3 rdfs:subClassOf d:Class1 . 

This data creates two triples that are themselves resources, with one being an instance of Class2 and the other an instance of Class3. Two final triples tell us that each of those classes is a subclass of Class1. The following query asks for triples that are instances of Class1; despite the data having no explicit triples about Class1 instances, Blazegraph did the inferencing and found both of them:

PREFIX d: <http://www.learningsparql.com/ns/data/> 
SELECT ?x ?y ?z WHERE {
   <<?x ?y ?z>> a d:Class1 . 
}

After doing this inferencing, I was thinking that OWL metadata and inferencing about such triples should open up a lot of new possibilities, but I realized that none of those possibilities are necessarily new: they'll just be easier to implement than they would have been using the old method of reification, which took four triples to represent one. Still, being easier to implement counts for plenty, and I think that metadata librarians and other people doing work to build value around existing triples now have a reasonable syntax and some nice tools with which to explore this.
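To make that comparison concrete, here is a sketch of the Kubrik statement from earlier expressed with that older approach; the rdf: reification terms are standard, and the blank node label is arbitrary:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix d:   <http://www.learningsparql.com/ns/data/> .

# Four triples just to restate "Kubrik was influenced by Welles"...
_:stmt a             rdf:Statement ;
       rdf:subject   d:Kubrik ;
       rdf:predicate d:influencedBy ;
       rdf:object    d:Welles ;
       # ...and only then the statement-level metadata.
       d:significance 0.8 .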


Please add any comments to this Google+ post.

22 April 2018

Reification is a red herring

And you don't need property graphs to assign data to individual relationships.

RDF's very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better.

I recently tweeted that the ZDNet article Back to the future: Does graph database success hang on query language? was the best overview of the graph database world(s) that I'd seen so far, and I also warned that many such "overviews" were often just Neo4j employees plugging their own product. (The Neo4j company is actually called Neo Technology.) The most extreme example of this is the free O'Reilly book Graph Databases, which is free because it's being given away by its three authors' common employer: Neo Technology! The book would have been more accurately titled "Building Graph Applications with Cypher", the Neo4j query language. This 238-page book on graph databases manages to mention SPARQL and Gremlin only twice each. The ZDNet article above does a much more balanced job of covering RDF and SPARQL, Gremlin and Tinkerpop, and Cypher and Neo4j.

The DZone article RDF Triple Stores vs. Labeled Property Graphs: What's the Difference? is by another Neo employee, field engineer Jesús Barrasa. It doesn't mention Tinkerpop or Gremlin at all, but does a decent job of describing the different approach that property graph databases such as Neo4j and Tinkerpop take in describing graphs of nodes and edges when compared with RDF triplestores. Its straw man arguments about RDF's supposed deficiencies as a data model reminded me of a common theme I've seen over the years.

The fundamental thing that most people don't get about RDF, including many people who are successfully using it to get useful work done, is that RDF's very simple subject-predicate-object data model is a building block that you can use to build other models that can make your applications even better. Just because RDF doesn't require the use of schemas doesn't mean that it can't use them; the RDF Schema Language lets you declare classes, properties, and information about these that you can use to drive user interfaces, to enable more efficient and readable queries, and to do all the other things that people typically use schemas for. Even better, you can develop a schema for the subset of the data you care about (as opposed to being forced to choose between a schema for the whole data set or no schema at all, as with XML), which is great for data integration projects, and then build your schema up from there.

Barrasa writes of property graphs that "[t]he important thing to remember here is that both the nodes and relationships have an internal structure, which differentiates this model from the RDF model. By internal structure, I mean this set of key-value pairs that describe them." This is the first important difference between RDF and property graphs: in the latter, nodes and edges can each have their own separate set (implemented as an array in Neo4j) of key-value pairs. Of course, nodes in RDF don't need this; to say that the node for Jack has an attribute-value pair of (hireDate, "2017-04-12"), we simply make another triple with Jack as the subject and these as the predicate and object.
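In Turtle, assuming a d: namespace like the ones in the examples below, that's a single additional triple:

@prefix d: <http://learningsparql.com/ns/data/> .

d:Jack d:hireDate "2017-04-12" .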

Describing the other key difference, Barrasa writes that while the nodes of property graphs have unique identifiers, "[i]n the same way, edges, or connections between nodes--which we call relationships--have an ID". Property graph edges are unique at the instance level; if Jane reportsTo Jack and Jack reportsTo Jill, the two reportsTo relationships here each have their own unique identifier and their own set of key-value pairs to store information about each edge.

He writes that in RDF "[t]he predicate will represent an edge--a relationship--and the object will be another node or a literal value. But here, from the point of view of the graph, that's going to be another vertex." Not necessarily, at least for the literal values; these represent the values in RDF's equivalent of the key-value pairs--the non-relationship information attached to a node, such as (hireDate, "2017-04-12") above. This ability is why a node doesn't need its own internal key-value data structure.

He begins his list of differences between property graphs and RDF with the big one mentioned above: "Difference #1: RDF Does Not Uniquely Identify Instances of Relationships of the Same Type," which is certainly true. But his example, which he describes as "an RDF graph in which Dan cannot like Ann three times", is very artificial.

One of his "RDF workarounds" for using RDF to describe that Dan liked Ann three times is reification, in which we convert each triple to four triples: one saying that a given resource is an RDF statement, a second identifying the resource's subject, a third naming the predicate, and a fourth naming the object. This way, the statement itself has identity, and we can add additional information about it as triples that use the statement's identifier as a subject and additional predicates and objects as key-value pairs such as (time, "2018-03-04T11:43:00") to show when a particular "like" took place. Barrasa writes "This is quite ugly"; I agree, and it can also do bad things to storage requirements.
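For the record, here's roughly what one reified "like" looks like in Turtle; the rdf: reification terms are standard, while d:likes and d:time are names made up for illustration:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix d:   <http://learningsparql.com/ns/data/> .

# One "Dan likes Ann" statement takes four triples...
_:like1 a             rdf:Statement ;
        rdf:subject   d:Dan ;
        rdf:predicate d:likes ;
        rdf:object    d:Ann ;
        # ...plus the metadata that motivated the reification.
        d:time "2018-03-04T11:43:00" .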

In my 15 years of working with RDF, I have never felt the need to use reification. It's funny how the 2004 RDF Primer 1.0 has a section on reification but the 2014 RDF Primer 1.1 (of which I am proud to be listed in the Acknowledgments) doesn't even mention it: simpler modeling techniques are available, so reification was rarely if ever used.

By "modeling techniques" I mean "declaring and then using a model", although in RDF, you don't even have to declare it. If you want to keep track of separate instances of employees, or games, or buildings, you can declare any of these as a class and then create instances of it; similarly, if you want to keep track of separate instances of a particular relationship, declare a class for that relationship and then create instances of it.

How would we apply this to Barrasa's example, where he wants to keep track of information about Likes? We use a class called Like, where each instance identifies who liked who. (When I first wrote that previous sentence, I wrote that we can declare a class called Like, but again, we don't need to declare it to use it. Declaring it is better for serious applications where multiple developers must work together, because part of the point of a schema is to give everyone a common frame of reference about the data they're working with.) The instance could also identify the date and time of the Like, comments associated with it, and anything else you wanted to add--a set of key-value pairs for each Like instance, implemented as just more triples.

Here's an example. After optional declarations of the relevant class and its associated properties, the following has four Likes showing who liked who when, plus a "foo" value to demonstrate the association of arbitrary metadata with each Like.

@prefix d:    <http://learningsparql.com/ns/data/> .
@prefix m:    <http://learningsparql.com/ns/model/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 

# Optional schema.
m:Like  a rdfs:Class .          # A class...
m:liker rdfs:domain m:Like .    # and properties that go with this class.
m:liked rdfs:domain m:Like .
m:foo   rdfs:domain m:Like .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:43:00" ;
   m:foo "bar" .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T11:58:00" ;
   m:foo "baz" .

[] a m:Like ;
   m:liker d:Dan ;
   m:liked d:Ann ;
   m:time "2018-03-04T12:04:00" ;
   m:foo "bat" .

[] a m:Like ;
   m:liker d:Ann ;
   m:liked d:Dan ;
   m:time "2018-03-04T12:06:00" ;
   m:foo "bam" .

Instead of making up specific identifiers for each Like, I made them blank nodes so that the RDF processing software will generate identifiers and keep track of them.
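If you'd rather have stable identifiers that other triples can reference, you could just as easily coin IRIs instead of blank nodes; here's a sketch of the first Like with a made-up d:like1 identifier:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# The same data as the first blank node above, with an explicit identifier.
d:like1 a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time  "2018-03-04T11:43:00" ;
        m:foo   "bar" .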

As to Barrasa's use case of counting how many times Dan liked Ann, it's pretty easy with SPARQL:

PREFIX d: <http://learningsparql.com/ns/data/> 
PREFIX m: <http://learningsparql.com/ns/model/>

SELECT (count(*) AS ?likeCount) WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann .
}

(This query would actually work with just the m:liker and m:liked triple patterns, but as with the example that I tweeted to Dan Brickley about, declaring your RDF resources as instances of classes can lay the groundwork for more efficient and readable queries.) Here is ARQ's output for this query:

-------------
| likeCount |
=============
| 3         |
-------------
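For the record, the trimmed-down query mentioned in that parenthetical would be just this, and it returns the same count:

PREFIX d: <http://learningsparql.com/ns/data/>
PREFIX m: <http://learningsparql.com/ns/model/>

# Same count, without anchoring ?like to the m:Like class.
SELECT (count(*) AS ?likeCount) WHERE {
  ?like m:liker d:Dan ;
        m:liked d:Ann .
}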

Let's get a little fancier. Instead of counting all of Dan's likes of Ann, we'll just list the ones from before noon on March 4, sorted by their foo values:

PREFIX d: <http://learningsparql.com/ns/data/> 
PREFIX m: <http://learningsparql.com/ns/model/>

SELECT ?fooValue ?time WHERE {
  ?like a m:Like ;
        m:liker d:Dan ;
        m:liked d:Ann ;
        m:time ?time ;
        m:foo ?fooValue .
  FILTER (?time < "2018-03-04T12:00")
}
ORDER BY ?fooValue

And here is ARQ's result for this query:

------------------------------------
| fooValue | time                  |
====================================
| "bar"    | "2018-03-04T11:43:00" |
| "baz"    | "2018-03-04T11:58:00" |
------------------------------------

After working through a similar example for modeling flights between New York and San Francisco, Barrasa begins a sentence "Because we can't create such a simple model in RDF..." This is ironic; the RDF model is simpler than the Labeled Property Graph model, because it's all subject-predicate-object triples without the use of additional data structures attached to the graph nodes and edges. His RDF version would have been much simpler if he had just created instances of a class called Flight, because again, while the base model of RDF is the simple triple, more complex models can easily be created by declaring classes, properties, and information about those classes and properties--which we can do by just creating new triples!
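A sketch of what I mean, reusing the m: and d: namespaces from above with made-up flight properties:

@prefix d: <http://learningsparql.com/ns/data/> .
@prefix m: <http://learningsparql.com/ns/model/> .

# One flight instance, structured just like the Like instances above.
[] a m:Flight ;
   m:origin        d:NewYork ;
   m:destination   d:SanFrancisco ;
   m:departureTime "2018-04-20T08:15:00" .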

To summarize, complaints about RDF that focus on reification are so 2004, and they are a red herring, because they distract from the greater power that RDF's modeling abilities bring to application development.

A funny thing happened after writing all this, though. As part of my plans to look into Tinkerpop and Gremlin and potential connections to RDF as a next step, I was looking into Stardog and Blazegraph's common support of both. I found a Blazegraph page called Reification Done Right where I learned of Olaf Hartig and Bryan Thompson's 2014 paper Foundations of an Alternative Approach to Reification in RDF. If Blazegraph has implemented their ideas, then there is a lot of potential there. And if the Blazegraph folks brought this with them to Amazon Neptune, that would be even more interesting, although apparently that hasn't shown up yet.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Archives

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0