15 July 2015

Visualizing DBpedia geographic data

With some help from SPARQL.

US astronaut birth places

I've been learning about Geographical Information System (GIS) data lately. More and more projects and businesses are doing interesting things by associating new kinds of data with specific latitude/longitude pairs; this data might be about air quality, real estate prices, or the make and model of the nearest Uber car.

DBpedia has a lot of latitude and longitude data, and SPARQL queries let you associate it with other data. Because you can retrieve these query results as CSV files, and many GIS packages can read CSV data, you can do a lot of similar interesting things yourself.

A query of DBpedia data about American astronauts shows that the oldest one was born in 1918 and the youngest one was born in 1979. I wondered whether, over time, there were any patterns in what part of the country they came from, and I managed to combine a DBpedia SPARQL query with an open-source GIS visualization package to create the map shown here.

The following query asks for the birth year and latitude and longitude of the birthplace of each American astronaut:

SELECT (MAX(?latitude) AS ?maxlat) (MAX(?longitude) AS ?maxlong)
       ?astronaut (substr(str(MAX(?birthYear)),1,4) AS ?by)
WHERE {
  # Find American astronauts and when and where they were born.
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace ;
             dbpedia-owl:birthYear ?birthYear ;
             dbpedia2:nationality :United_States .
  # Get the coordinates of each birthplace.
  ?birthPlace geo:lat ?latitude ;
              geo:long ?longitude .
}
GROUP BY ?astronaut

(The query has no prefix declarations because it uses the ones built into DBpedia. Also, because some places have more than one pair of geo:lat and geo:long values, I found it simplest to just take the maximum value of each to get one pair for each astronaut.) The following shows the first few lines of the result when I asked for CSV:

"maxlat","maxlong","astronaut","by"
37.195,-93.2861,"http://dbpedia.org/resource/Janet_L._Kavandi","1959"
42.6461,-83.2925,"http://dbpedia.org/resource/Brent_W._Jett,_Jr.","1958"
40.1,-75.0997,"http://dbpedia.org/resource/John-David_F._Bartoe","1944"
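
If you're curious which birthplaces actually have more than one pair of coordinate values, a quick aggregate query (again relying on DBpedia's built-in prefixes) can list them. This is just a diagnostic sketch that I haven't run against the endpoint, not something the map needed:

SELECT ?birthPlace (COUNT(DISTINCT ?latitude) AS ?latCount)
WHERE {
  # Restrict the check to the birthplaces of American astronauts.
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace .
  ?birthPlace geo:lat ?latitude .
}
GROUP BY ?birthPlace
HAVING (COUNT(DISTINCT ?latitude) > 1)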

QGIS Desktop is an open-source tool for working with GIS data that, among other things, lets you visualize that data. The data can come from disk files or from several other sources, including the PostGIS add-on to the PostgreSQL database, which lets you scale up pretty far in the amount of data you can work with.

Using QGIS to create the image above, I first loaded the shapefile (actually a collection of files, including an old-fashioned dBase dbf file) of the outlines of the individual states of the United States from the US Census website.

GIS visualization is often about layering data such as state boundaries, altitude data, and roads to see the combined effects; those little cars in your phone's Uber app would look kind of silly if the roads and your current location weren't shown with them. For my experiment, the census shapefile was my first layer, and QGIS Desktop's "Add Delimited Text Layer" feature let me add the results of my SPARQL query about astronaut data as another layer. One tricky bit for us GIS novices is that these tools usually ask you to specify a Coordinate Reference System for any set of data, typically as an EPSG number, and there are a lot of those out there. I used EPSG 4269.

At first, QGIS added in all the astronaut birthplace locations as little black-outlined circles filled with the same shade of green. It had also set the default fill color of the US map to green, so I reset that to white in the dialog box for configuring that layer's properties. Then, in the astronaut data layer's properties, I found that instead of using identical symbols to represent each point on the map, I could pick "Graduated" and specify a "color ramp" that QGIS would use to assign color values according to the values in the property that I selected for this: by, or birth year, which you'll recognize from the fourth column of the sample CSV output above. QGIS looked at the lowest and highest of these values and offered to assign the following colors to by values in the ranges shown, and I just accepted the default:

QGIS color configuration

(While the earlier query showed a few astronauts born in 1978 and 1979, the range here only goes up to 1977 because I now see that some geographic coordinates in DBpedia are specified with dbpprop:latitude and dbpprop:longitude instead of geo:lat and geo:long, so if I were redoing this I'd revise the query to take those into account, as sketched below.)
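
Here's an untested sketch of that revision, using SPARQL 1.1 property path alternatives so that coordinates expressed with either pair of properties will match:

SELECT (MAX(?latitude) AS ?maxlat) (MAX(?longitude) AS ?maxlong)
       ?astronaut (substr(str(MAX(?birthYear)),1,4) AS ?by)
WHERE {
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace ;
             dbpedia-owl:birthYear ?birthYear ;
             dbpedia2:nationality :United_States .
  # Accept coordinates specified with either vocabulary.
  ?birthPlace geo:lat|dbpprop:latitude ?latitude ;
              geo:long|dbpprop:longitude ?longitude .
}
GROUP BY ?astronaut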

If you click on the map above to see the larger image, you'll see that many early astronauts came from the Midwest, and then over time, they gradually came from the four corners of the continental US. Why so many from the New York City area and none from Wyoming? Is there something in New York more conducive to producing astronauts than the wide-open spaces of Wyoming? Yes: there are more people there, so the odds are that more astronauts will come from there. See this excellent xkcd cartoon for more on this principle.

I only scratched the surface of what QGIS can do. I found this video from the Vermont Center for Geographic Info to be an excellent introduction. I learned from it and the book PostGIS in Action that an important set of features that GIS systems such as QGIS add is the automation of some of the math involved in computing distances and areas, which is not simple geometry because it takes place on the curved surface of the earth. A package like PostGIS adds specialized datatypes and functions to a general-purpose database like PostgreSQL to do the more difficult parts of the geography math. This lets your SQL queries do proximity analysis and other GIS tasks as well as hand such data off to a visualization tool such as QGIS. (The open-source GeoMesa database adds similar features to Apache Accumulo and Google BigTable for more Hadoop-scale applications.)

The great news for SPARQL users is that a GIS extension called GeoSPARQL does something similar. You can try it out at the geosparql.org website. For example, entering the following query there will list all the airports within 10 miles of New York City:

PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gn: <http://www.geonames.org/ontology#>

SELECT ?name
WHERE {
  # Find resources within 10 miles of New York City's coordinates.
  ?object spatial:nearby (40.712700 -74.005898 10 'mi') .
  ?object a <http://www.lotico.com/ontology/Airport> ;
          gn:name ?name .
}

(The data uses a fairly broad definition of "airport," including heliports and seaplane bases.) I have not played with any GeoSPARQL implementations outside of geosparql.org, but the Parliament one mentioned on the GeoSPARQL Wikipedia page looks interesting. I have not played much with the Linked Open Street Map SPARQL endpoint, but it also looks great for people who are interested in GIS and SPARQL.
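
GeoSPARQL also standardizes filter functions for the geography math I mentioned above. As a sketch of what those look like (I haven't tried this on geosparql.org, and it assumes an engine that implements the spec's function vocabulary), the following would compute the distance in meters between two points expressed as WKT literals:

PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX gsp:  <http://www.opengis.net/ont/geosparql#>
PREFIX uom:  <http://www.opengis.net/def/uom/OGC/1.0/>

SELECT ?distance
WHERE {
  # New York City and Philadelphia as WKT points (longitude first).
  BIND("POINT(-74.005898 40.712700)"^^gsp:wktLiteral AS ?nyc)
  BIND("POINT(-75.163526 39.952335)"^^gsp:wktLiteral AS ?philly)
  # geof:distance comes from the GeoSPARQL spec; its third argument
  # is a unit-of-measure URI.
  BIND(geof:distance(?nyc, ?philly, uom:metre) AS ?distance)
}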

Whether you try out GeoSPARQL or not, when you take DBpedia's ability to associate such a broad range of data with geographic coordinates and combine it with the ability of GIS visualization tools like QGIS to work with that data (especially the ability to visualize the associated data, such as my color coding of astronaut birth years), you have a vast new category of cool things you can do with SPARQL.


Please add any comments to this Google+ post.

20 June 2015

Artificial Intelligence, then (1960) and now

Especially machine learning.

It's fascinating how relevant much of [this 1960 paper] still is today, especially considering the limited computing power available 55 years ago.

Earlier this month I tweeted "When people write about AI like it's this brand new thing, should I be amused, feel old, or both?" The tweet linked to a recent Harvard Business Review article called Data Scientists Don't Scale about the things that Artificial Intelligence is currently doing, which just happen to be the things that the article author's automated prose-generation company does.

The article provided absolutely no historical context to this phrase that has thrilled, annoyed, and fascinated people since John McCarthy coined the term in 1955. (For a little historical context, this was two years after Dwight Eisenhower succeeded Harry Truman as President of the United States. Three years later, McCarthy invented Lisp—a programming language that, besides providing the basis of other popular languages such as Scheme and the currently very hot Clojure, is still used today.) I recently came across a link to the seminal 1960 paper Steps Toward Artificial Intelligence by AI pioneer Marvin Minsky, who was there at the beginning in 1955, and so I read it on a long plane ride. It's fascinating how relevant much of it still is today, especially when you take into account the limited computing power available 55 years ago.

After enumerating the five basic categories of "making computers solve really difficult problems" (search, pattern-recognition, learning, planning, and induction), the paper mentions several algorithms that are still considered to be basic tools in Machine Learning toolboxes: hill climbing, naive Bayesian classification, perceptrons, reinforcement learning, and neural nets. He notes that one part of Bayesian classification "can be made by a simple network device" that he illustrates with this diagram:

Minsky's diagram of a network device for Bayesian classification

It's wild to consider that the software possibilities were so limited at the time that it was easier to implement some of these ideas by just building specialized hardware. Minsky also describes the implementation of a certain math game by a network of resistors as designed by Claude Shannon (who I was happy to hear mentioned in the season 1 finale of Silicon Valley):

Shannon's resistor network design for the math game

Minsky's paper also references the work of B.F. Skinner, of Skinner box fame, when describing reinforcement learning, and it cites Noam Chomsky when describing inductive learning. I mention these two together because this past week I also read an interview that took place just three years ago titled Noam Chomsky on Where Artificial Intelligence Went Wrong. Describing those early days of AI research, the interview's introduction tells us how

Some of McCarthy's colleagues in neighboring departments, however, were more interested in how intelligence is implemented in humans (and other animals) first. Noam Chomsky and others worked on what became cognitive science, a field aimed at uncovering the mental representations and rules that underlie our perceptual and cognitive abilities. Chomsky and his colleagues had to overthrow the then-dominant paradigm of behaviorism, championed by Harvard psychologist B.F. Skinner, where animal behavior was reduced to a simple set of associations between an action and its subsequent reward or punishment. The undoing of Skinner's grip on psychology is commonly marked by Chomsky's 1959 critical review of Skinner's book Verbal Behavior, a book in which Skinner attempted to explain linguistic ability using behaviorist principles.

The introduction goes on to describe a 2011 symposium at MIT on "Brains, Minds and Machines," which "was meant to inspire multidisciplinary enthusiasm for the revival of the scientific question from which the field of artificial intelligence originated: how does intelligence work?"

Noam Chomsky, speaking in the symposium, wasn't so enthused. Chomsky critiqued the field of AI for adopting an approach reminiscent of behaviorism, except in more modern, computationally sophisticated form. Chomsky argued that the field's heavy use of statistical techniques to pick out regularities in masses of data is unlikely to yield the explanatory insight that science ought to offer. For Chomsky, the "new AI" — focused on using statistical learning techniques to better mine and predict data — is unlikely to yield general principles about the nature of intelligent beings or about cognition.

The whole interview is worth reading. I'm not saying that I completely agree with Chomsky or completely disagree (as Google's Peter Norvig has in an essay that has the excellent URL http://norvig.com/chomsky.html but gets a little ad hominem when he starts comparing Chomsky to Bill O'Reilly), only that Minsky's 1960 paper and Chomsky's 2012 interview, taken together, provide a good perspective on where AI came from and the path it took to the roles it plays today.

I'll close with this nice quote from a discussion in Minsky's paper of what exactly "intelligence" is and whether machines are capable of it:

Programmers, too, know that there is never any "heart" in a program. There are high-level routines in each program, but all they do is dictate that "if such-and-such, then transfer to such-and-such a subroutine." And when we look at the low-level subroutines, which "actually do the work," we find senseless loops and sequences of trivial operations, merely carrying out the dictates of their superiors. The intelligence in such a system seems to be as intangible as becomes the meaning of a single common word when it is thoughtfully pronounced over and over again.

Please add any comments to this Google+ post.

3 May 2015

SPARQL: the video

Well, a video, but a lot of important SPARQL basics in a short period of time.

SPARQL in 11 minutes

While doing training for a TopQuadrant customer recently, the schedule led to my having ten minutes to explain the basics of writing SPARQL queries. I think I did OK, but on the plane home I thought harder about what to put in those ten minutes, which led to my making the video SPARQL in 11 minutes. While the video is 11 minutes and 14 seconds long, between the opening part about RDF and the plug for Learning SPARQL at the end, the SPARQL introduction is less than eight minutes.

After explaining what RDF triples are and how they're represented in Turtle, the video walks through some simple SELECT queries and how they work with the data. This leads up to a CONSTRUCT query and a list of other things that people will find useful if they learn more about SPARQL. I had a lot of fun making the video's SPARQL engine noise with my Korg Monotron synthesizer and also making more traditional music for the introduction and ending.
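The flavor of that progression, though not the video's exact queries, looks something like the following minimal sketch. The ab: address book namespace is the one the Learning SPARQL examples use, and ab:contactPoint is a property I made up here:

# A SELECT query pulls out values that match a graph pattern:
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
SELECT ?person ?email
WHERE { ?person ab:email ?email . }

# A CONSTRUCT query uses the matched values to create new triples:
PREFIX ab: <http://learningsparql.com/ns/addressbook#>
CONSTRUCT { ?person ab:contactPoint ?email }
WHERE { ?person ab:email ?email . }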

I hope this video is helpful for people who are new to SPARQL. The other SPARQL videos on YouTube are mostly real-time classroom lectures. My favorite is an ad for what seems like a Dutch cable TV provider that has nothing to do with the query language but has the excellent domain name sparql.nl. If you skip ahead to 1:03 of this ad for the company, you'll see a finger snap turn into a swirl of flames and then their shining "sparql" logo, all with the most dramatic music possible. My production values were not quite that high, but higher than most of the other SPARQL videos you'll find on YouTube.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists