24 August 2014

Exploring a SPARQL endpoint

In this case, semanticweb.org.

[Image: graph of ISWC SPARQL papers]

In the second edition of my book Learning SPARQL, a new chapter titled "A SPARQL Cookbook" includes a section called "Exploring the Data," which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint http://data.semanticweb.org/sparql, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don't know which properties are used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org's SNORQL interface.

I started with what is generally my favorite query, listing which predicates are used in the data, because that's the quickest way to get a flavor for what kind of data is available. Several of the predicates that got listed immediately told me some interesting things (the query itself appears after this list):

  • rdfs:subClassOf shows me that there's probably some structure worth exploring.

  • dcterms:subject (and dc:subject) shows that things have probably been tagged with keywords.

  • ical properties such as dtstart show that events are recorded.

  • FOAF properties show that there is probably information about people.

  • dcterms:title, swrc:booktitle, dc:title, swrc:title, and swrc:subtitle show me that works are covered.
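That first query is about as simple as SPARQL gets. A version like this lists each distinct predicate once:

    SELECT DISTINCT ?p
    WHERE { ?s ?p ?o }
    ORDER BY ?p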

An RDF dataset may or may not have explicit structure, and the use of rdfs:subClassOf in this data showed me that this one did, so my next query asked which classes were subclasses of which others so that I could get an overview of how much structure the dataset included. The result showed me that the ontology seemed to be mostly in the swc namespace, which turns out to be the semanticweb.org conference ontology. The site does include nice documentation for this ontology.
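The subclass query needs nothing more than the rdfs:subClassOf property itself; something like this returns every child-parent pair:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?subclass ?superclass
    WHERE { ?subclass rdfs:subClassOf ?superclass }
    ORDER BY ?superclass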

The use of the FOAF vocabulary showed me that there are probably people described, but if the properties foaf:name, foaf:lastName, foaf:familyName, foaf:family_name, and foaf:surname are all in there, which should I try first? A quick ego search showed foaf:family_name being used. It also showed that the URI used to represent me is http://data.semanticweb.org/person/bob-ducharme, and because they've published this data as linked data, sending a browser to that URL showed that it described me as a member of the 2010 ISWC program committee.
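An ego search like that can be a single triple pattern; this sketch assumes that the value is stored as a plain string with exactly this spelling and capitalization:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person
    WHERE { ?person foaf:family_name "DuCharme" }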

It also showed me to be a proud instance of the foaf:Person class, so I did a query to find out how many persons there were in all: 10,982.
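Counting them is a job for SPARQL's COUNT aggregate; a query along these lines does it:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT (COUNT(?person) AS ?personCount)
    WHERE { ?person a foaf:Person }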

Given the domain of the ontology and the reason that I was listed, I guessed that it was all about ISWC conferences, so I listed the dc:title values to see what would show up. The query took long enough that I added a LIMIT keyword to create a politer version. Looking at the complete data for one work showed all kinds of interesting information, including an swrc:year value to indicate the year of that paper's conference. A list of all year values showed a range from 2001 right up to 2014, so it's nice to see that they're keeping the data up to date.
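Both of those are one-triple-pattern queries; these sketches assume the usual dc and swrc namespace URIs:

    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX swrc: <http://swrc.ontoware.org/ontology#>

    # a politer version of the title-listing query
    SELECT ?title
    WHERE { ?paper dc:title ?title }
    LIMIT 50

    # the range of conference years in the data (run as a separate query)
    SELECT DISTINCT ?year
    WHERE { ?paper swrc:year ?year }
    ORDER BY ?year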

Next, I listed all papers that mention "SPARQL" in their title, with their years. After listing the number of papers with SPARQL in their title each year, I used sgvizler (which I described here last September) to create the chart of these figures shown above.
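The per-year counts come from combining a FILTER on the title with a GROUP BY; this sketch (not necessarily the exact query behind the chart) shows the shape of it:

    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX swrc: <http://swrc.ontoware.org/ontology#>
    SELECT ?year (COUNT(?paper) AS ?paperCount)
    WHERE {
      ?paper dc:title ?title ;
             swrc:year ?year .
      FILTER(CONTAINS(?title, "SPARQL"))
    }
    GROUP BY ?year
    ORDER BY ?year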

The use of dcterms:subject and dc:subject was interesting because these add some pretty classic metadata for navigating content. Listing triples that used either, I included LIMIT 100 to be polite to the server in case these properties were used a lot. They are. Doing this with dc:subject shows subjects such as "ontology alignment" and "controlled natural language" assigned to articles. Doing it with dcterms:subject showed it used more the way I might use rdf:type, indicating that something is an instance of a particular class: for example, swc:Chair and swc:Delegate each have dcterms:subject values of http://dbpedia.org/resource/Role.
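A VALUES list makes it easy to ask for triples that use either property in a single query:

    PREFIX dc:      <http://purl.org/dc/elements/1.1/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?s ?p ?o
    WHERE {
      VALUES ?p { dc:subject dcterms:subject }
      ?s ?p ?o .
    }
    LIMIT 100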

My interest in taxonomies (spurred by my work with TopQuadrant's TopBraid EVN) led me to look harder at the dc:subject values. They're string values, not instances of something like skos:Concept, so they have no hierarchical relationships or other metadata themselves. I'm guessing that this is because key phrases assigned to conference papers are more of a folksonomy, in which people can make up their own key phrases as they wish. Still, either some people were aware of key phrases already in use or some were assigned automatically: a count of the distinct values came up with 3,594, but a query to see which were the most popular showed that "Corpus (creation, annotation, etc.)" was far and away the most used, with 506 papers having that subject.
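Both figures come from simple aggregate queries; here are sketches of the two (run separately):

    PREFIX dc: <http://purl.org/dc/elements/1.1/>

    # how many distinct key phrases are there?
    SELECT (COUNT(DISTINCT ?keyPhrase) AS ?distinctKeyPhrases)
    WHERE { ?paper dc:subject ?keyPhrase }

    # which key phrases are the most popular? (run as a separate query)
    SELECT ?keyPhrase (COUNT(?paper) AS ?paperCount)
    WHERE { ?paper dc:subject ?keyPhrase }
    GROUP BY ?keyPhrase
    ORDER BY DESC(?paperCount)
    LIMIT 20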

I could go on. Call me a SPARQL geek, but I really enjoy looking around a data set like this, especially when (as the presence of the papers for ISWC 2014 shows) the data is kept up to date. For people interested in any aspect of semantic web technology, the ability to look around this particular dataset and count up which data falls into which patterns is a great resource.


Please add any comments to this Google+ post.

20 July 2014

When did linking begin?

Pointing somewhere with a dereferenceable address, in the twelfth (or maybe fifth) century.

[Image: University of Bologna woodcut]

As I have once before, I'm republishing an entry from an O'Reilly blog I had from 2003 to 2005 on topics related to linking. I've been reading up on early concepts of metadata lately—I particularly recommend Ann Blair's Too Much to Know: Managing Scholarly Information before the Modern Age—and have recently found another interesting reference to the "Regulae Iuris" book mentioned below. When I wrote this, I was more interested in hypertext issues, and if I were going to change anything to update this piece, I would change the word "traverse" to "dereference," but all the points are still meaningful.

Works about linking often claim that it's been around for thousands of years, and then they give examples that are no more than a few centuries old. I can only find one reference to something more than a thousand years old that qualifies as a link: Peter Stein's 1966 work "Regulae Iuris: from Juristic Rules to Legal Maxims" describes some late fifth-century lecture notes on a commentary by the legal scholar Ulpian. The notes mention that confirmation of a particular point can be found in the Regulae ("Rules") of the third-century Roman jurist (and student of Ulpian) Modestinus, "seventeen regulae from the end, in the regula beginning 'Dotis'...". The citation's explicit identification of the point in the cited work where the material could be found makes it the earliest link that I know of.

Other than Stein's tantalizing example, all of my research points to the 12th century as the beginning of linking. In a 1938 work on the medieval scholars of Bologna, Italy, who studied what remained of ancient Roman law, Hermann Kantorowicz wrote that in "the eleventh century...titles of law books are cited without indicating the passage, books of the Code are numbered, and the name of the law book is considered a sufficient reference." He uses this to build his argument that a particular work described in his essay is from the eleventh century and not the twelfth, as other scholars had argued. Apparently, it was common knowledge in Kantorowicz's field that twelfth century Bolognese scholars would reference a written law using the name of the law book, the rubric heading, and the first few words of the law itself. (Referencing of particular chapters and sections by their first few words was common at the time; the use of chapter, section, and page numbers didn't begin until the following century.)

Italian legal scholars trying to organize and make sense of the massive amounts of accumulated Roman law contributed a great deal to the mechanics of the cross-referencing that provide many of the earliest examples of linking. The medievalist husband and wife team Richard and Mary Rouse also found some in their research into evolving scholarship techniques in the great universities of England and France (that is, Oxford, Cambridge, and the Sorbonne), and they described Gilbert of Poitiers's innovative twelfth-century mechanism for addressing specific parts of his work on the psalms: he added a selection of Greek letters and other symbols down the side of each page to identify concepts such as the Penitential Psalms or the Passion and Resurrection. If you found the symbol for the Passion and Resurrection in the margin of Psalm 2 with a little 8 next to it (actually, a little "viii"—they weren't using Arabic numerals quite yet), it would tell you that the next discussion of this concept appeared in Psalm 8. Once you found the same symbol on one of the eighth psalm's pages, you might find a little "xii" with it to show that the next discussion of the same concept was in Psalm 12. This addressing system made it possible for someone preparing a sermon on the Passion and Resurrection to easily find the relevant material in the Psalms. (In fact, aids to sermon preparation were one of the main forces in the development of new research tools, as clergymen were encouraged to go out and compete with the burgeoning heretic movements for the hearts and minds of the people.)

The use of information addressing systems really got rolling in the thirteenth-century English and French universities, as scholarly monks developed concordances, subject indexes, and page numbers for both Christian religious works and the classic ancient Greek works that they learned about from their contact with the Arabic world. In fact, this is where Arabic numerals started to appear in Europe; page numbering was one of the early drivers of their adoption.

Quoting of one work by another was certainly around long before the twelfth century, but if an author doesn't identify an address for his source, his reference can't be traversed, so it's not really a link. Before the twelfth century, religious works had a long tradition of quoting and discussing other works, but in many traditions (for example, Islam, Theravada Buddhism, and Vedic Hinduism) memorization of complete religious works was so common that telling someone where to look within a work was unnecessary. If one Muslim scholar said to another "In the words of the Prophet..." he didn't need to name the sura of the Qur'an that the quoted words came from; he could assume that his listener already knew. Describing such allusions as "links" adds heft to claims that linking is thousands of years old, but a link that doesn't provide an address for its destination can't be traversed, and a link that can't be traversed isn't much of a link. Such claims also diminish the tremendous achievements of the 12th-century scholars who developed new techniques to navigate the accumulating amounts of recorded information they were studying.


Please add any comments to this Google+ post.

10 June 2014

Integrating hiphop vocabulary scores with other relevant data—then querying it

With a little JSON + DBpedia integration.

[Image: rapper vocabularies chart]

About a month ago, media outlets ranging from NPR to Rolling Stone to Britain's Daily Mail reported on how a "designer, coder, and data scientist" named Matt Daniels had analyzed the number of unique words in samples of work by Shakespeare, Herman Melville, and 85 rappers. He then published a chart and article about how their scores related to each other. The highest score went to Aesop Rock, who I thought I'd heard of but hadn't—I was confusing him with A$AP Rocky, who was not included in the survey.

The chart and discussion were interesting, but what I really wanted to see was the complete list of subjects with their scores, and after searching around the web a bit I found that it was under my nose the whole time—the chart is dynamically generated from JSON embedded in his web page. So, I converted that JSON to RDF and used some SPARQL to retrieve additional data about each rapper from DBpedia, such as their record labels, the years their careers began, any subject keywords assigned to them, and the abstracts, or summaries of their careers. (You'll find more details on the procedure for doing this below; the resulting integrated data is available for you to query here as a Turtle file.) Combining this additional data with the vocabulary scores let me run some interesting queries and provided an excellent example of how RDF and SPARQL let you perform ad hoc data integration, combining different data sets into aggregates that reveal new patterns and other information.

For example, of all record labels with more than four rappers associated with them, I found that MCA's roster had the highest average vocabulary score at 5472.5, well above the overall average of 4624. Who are these artists? Another simple query showed their names and scores:

GZA 6426
The Roots 5803
Killah Priest 5737
Blackalicious 5480
Big Daddy Kane 4768
Rakim 4621

(As Daniels pointed out, members of the Wu-Tang Clan tend to have higher scores, so GZA and Killah Priest are a big help to MCA's average score.)
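The label averages came from a GROUP BY query with a HAVING clause to enforce the more-than-four-rappers rule. In this sketch, vocab:score is a made-up property name standing in for whatever my conversion script actually used for the Daniels scores, and dbo:recordLabel is DBpedia's property for this:

    PREFIX dbo:   <http://dbpedia.org/ontology/>
    PREFIX vocab: <http://example.com/rappervocab#>   # hypothetical namespace

    SELECT ?label (AVG(?score) AS ?avgScore)
    WHERE {
      ?rapper vocab:score ?score ;    # made-up property name for the Daniels score
              dbo:recordLabel ?label .
    }
    GROUP BY ?label
    HAVING (COUNT(?rapper) > 4)
    ORDER BY DESC(?avgScore)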

The dcterms:subject values assigned to the rappers in DBpedia provide the most interesting opportunities for exploration. In fact, it turned out that I didn't even need to pull down the record label values, because they each have corresponding dcterms:subject values. For example, each of the artists listed above have a dcterms:subject value of http://dbpedia.org/resource/Category:MCA_Records_artists along with their other dcterms:subject values.

Of the subject categories with more than four rappers, here are several interesting ones with high average scores, ranked by number of members in the category:

category                                     count   avg score
Members of the Nation of Gods and Earths        13        5117
Underground rappers                              8        5849
People from Brooklyn                             7        5323
MCA Records artists                              7        5401
Rappers from Long Island                         6        5160
Alternative hip hop groups                       5        5286
Wu-Tang Clan members                             5        5611

I hadn't heard of the Nation of Gods and Earths, also known as the Five-Percent Nation; again, we have Wu-Tang skewing the numbers up. After I saw the high averages for "People from Brooklyn" and "Rappers from Long Island" but no mention of Staten Island, I clicked around and found out that only about half of Wu-Tang came from the borough in which they were based, which I never knew before.

Here are some interesting low scoring categories. Again, remember that the overall average score is 4624:

category                                             count   avg score
Participants in American reality television series       8        4108
People convicted of drug offenses                        7        3741
American philanthropists                                 6        4022
American shooting survivors                              5        4025
American fashion businesspeople                          5        4110

Of course, the data collection itself isn't very scientific; what constitutes an "alternative" rapper? A less successful artist popular with music nerds? "People convicted of drug offenses" seems like a more cut and dried category, but remember that data from a Wikipedia page is not an authoritative source for such facts.

As with the list of MCA artists above, a simple query of the data can tell you who falls in each of these categories, so pull down the data from the link above and have fun querying it.
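For example, this sketch lists the artists in one such category, reusing the MCA category URI shown earlier; vocab:score is again a made-up stand-in for whatever property the integrated file uses for the Daniels scores:

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX vocab:   <http://example.com/rappervocab#>   # hypothetical namespace

    SELECT ?rapper ?score
    WHERE {
      ?rapper dcterms:subject
                <http://dbpedia.org/resource/Category:MCA_Records_artists> ;
              vocab:score ?score .   # made-up property name
    }
    ORDER BY DESC(?score)

If you're interested in how I did the integration, read on.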

Integrating the data

Because Daniels includes a score for Ghostface Killah, it's easy to ask DBpedia for all the { <http://dbpedia.org/resource/Ghostface_Killah> ?p ?o } triples. It's not as simple for many other artists, though, for several reasons:

  • Some rappers use stage names that are common phrases and words, so putting that name at the end of "http://dbpedia.org/resource/" won't necessarily get you data about them.

  • Tricky spellings and punctuation are pretty common in hiphop names. For example, Jay Z originally spelled his name with a hyphen but later dropped it, much as LexisNexis did twelve years earlier.

  • Daniels sometimes included qualifications in names ("GZA (only solo albums)"), included or didn't include the word "The" that was in the DBpedia name ("Roots" vs. "The Roots") or just spelled their names wrong, such as omitting the final "t" from "Missy Elliott."

Dropping parenthesized qualifications was easy enough. Even better, DBpedia often has the data necessary to find the page based on a slightly wrong name, and the techniques I described in Normalizing company names with SPARQL and DBpedia worked for most of them. This is not a minor point: even when the names aren't quite right, sending the right SPARQL queries to DBpedia can still retrieve valuable data about them. This has applications in all kinds of domains.
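The heart of that trick is a UNION that checks whether a name is a resource's own rdfs:label or the label of a Wikipedia redirect page that points at it. A rough sketch, using DBpedia's dbo:wikiPageRedirects property and the misspelling mentioned above:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    SELECT ?artist
    WHERE {
      { ?artist rdfs:label "Missy Elliot"@en }          # exact label match
      UNION
      { ?redirectPage rdfs:label "Missy Elliot"@en ;    # label of a redirect page...
                      dbo:wikiPageRedirects ?artist }   # ...that points at the real resource
    }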

You can find the scripts and queries mentioned below in rapperrdf.zip. The rapperdata.js file is taken directly from the source of Daniels' web page, and loads his data into an array. Another JavaScript file, rappervocab.js, loads rapperdata.js and outputs Turtle RDF of the rappers' scores and the Daniels versions of their names. (If you're using the TopBraid platform and working with JSON, there's an excellent SPARQLMotion module to automate the conversion of any JSON to RDF.) I used Rhino to run the JavaScript, as I described in Javascript from the command line.

Another short script called rapperValuesList.js reads the same data and creates the list of names that I inserted as a VALUES list into the retrieveRapperData.rq SPARQL query that actually retrieves the relevant data from DBpedia. (VALUES is a great SPARQL technique for saying "I need data about this list of specific things," as I've written here before.) This SPARQL query uses the SERVICE keyword to send the request off to DBpedia and does a CONSTRUCT to save the triples. It uses the "Normalizing company names" trick mentioned above to see if the Daniels name with the parenthesized part stripped out is either the "official" rdfs:label value for a resource or otherwise attached to something that gets redirected to that.
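Here's a reduced sketch of that pattern; it's not the actual retrieveRapperData.rq, which also pulls the start year, record label, and abstract and has a much longer VALUES list:

    PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dcterms: <http://purl.org/dc/terms/>

    CONSTRUCT { ?artist dcterms:subject ?subject }
    WHERE {
      SERVICE <http://dbpedia.org/sparql> {
        VALUES ?name { "GZA"@en "Big Daddy Kane"@en }   # the real list is much longer
        ?artist rdfs:label ?name ;
                dcterms:subject ?subject .
      }
    }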

Of the 81 artists in Daniels' list, there were 12 whose names couldn't be looked up even with the redirect trick in retrieveRapperData.rq. To account for these, I created extraRapperDanielsNames.ttl with a text editor to link Daniels' names for these 12 extra rappers to their DBpedia resource URIs such as http://dbpedia.org/resource/Common_(entertainer), which I had to look up manually. The retrieveExtraRapperData.rq query then uses that to retrieve the same data about those 12.

The queries only retrieve the start year, record label, abstract, and subjects about the artists because they all had those values. Retrieving data that only some of them have (such as the birth year, which you don't have for bands like The Roots) would mean using the OPTIONAL keyword, and DBpedia said that my query would take too long when I tried that—I'm sure the big VALUES part has a lot to do with that.
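The syntax itself is simple enough; this sketch shows the shape of such a query using DBpedia's dbo:birthYear property (the problem was the query's cost once the big VALUES list was added, not the OPTIONAL itself):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    SELECT ?artist ?birthYear
    WHERE {
      ?artist rdfs:label "Rakim"@en .
      OPTIONAL { ?artist dbo:birthYear ?birthYear }   # bound only when DBpedia has it
    }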

The integrateRapperData.rq query reads the extraRapperDanielsNames.ttl data and the data created by rappervocab.js, retrieveRapperData.rq, and retrieveExtraRapperData.rq, and then creates the final product: rapperDataIntegrated.ttl.

Querying the data

Next was the fun part: executing queries to explore that integrated data. The zip file includes queries to find the following information from rapperDataIntegrated.ttl:

averageScore.rq         overall average Daniels score
averageScoreByLabel.rq  average score by record label for labels with more than four artists associated with them
subjectReport.rq        average score by subject associated with the rappers for all subjects (like "Underground rappers" and "American philanthropists")
MCAArtists.rq           MCA artists
JamaicanDescent.rq      the name, Daniels score, and abstract of "American rappers of Jamaican descent"

That last one can provide a template for the creation of other queries about who falls into which subject categories.

Linking this data with other data about the artists from some of the blue parts of the Linked Data Cloud such as DBTune or the BBC would provide some even more interesting possibilities. As one taste, this link has a SPARQL query that retrieves all the MusicBrainz data about Missy Elliott.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists