22 August 2015

Querying machine learning movie ratings data with SPARQL

Well, movie ratings data popular with machine learning people.


While watching an excellent video about the pandas Python data analysis library recently, I learned how the University of Minnesota's GroupLens project has made a large amount of movie rating data from the MovieLens website available. Their download page lets you pull down 100,000, one million, ten million, or 100 million ratings, including data about the people doing the rating and the movies they rated.

This dataset is popular in the machine learning world: a Google search on "movielens 'machine learning'" gets over 33 thousand hits, with over ten percent being in scholarly articles. I thought it would be fun to query this data with SPARQL, so I downloaded the 1 million rating set, wrote some short perl scripts to convert the ratings, users, and movies "tables" to turtle RDF, and was off and running.

The data

I put "tables" in quotes above because while most people like to think of data in terms of tables, the data about the movies themselves was not strictly a normalized table. As the README file tells us, each line has the structure "MovieID::Title::Genres", in which Genres is a pipe-delimited list of one or more genres selected from the list in the README file. Here's one example:

3932::Invisible Man, The (1933)::Horror|Sci-Fi

The potential presence of more than one genre value in that last column means that this table's data is not fully normalized, but speaking as an RDF guy, we don't need no stinkin' normalization. A short perl script converted that line into the following turtle:

gldm:i3932 rdfs:label "The Invisible Man" ;
   a schema:Movie ;
   dcterms:type "Horror" ;
   dcterms:type "Sci-Fi" ;
   schema:datePublished "1933" .

As you can see, my perl script also moved the word "The" in the film's title back where it belonged and pulled the release date out into its own triple, which let me query for things like the effect of a movie's age on its popularity among viewers. Although the 3,883 movies listed went back to 1919, most were from the 1990s.

Something else from the 1990s was the movie file's Latin 1 encoding, so I used the iconv utility to convert it to UTF-8 before running the script that turned it into turtle so that a title such as "Not Love, Just Frenzy (Más que amor, frenesí)" wouldn't get mangled along the way.
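For readers who would rather not read perl, here is a rough sketch of the same conversion logic in Python. This is an illustrative stand-in, not my actual script; the gldm: prefix and output shape are the ones shown in the Turtle above.

```python
import re

def movie_to_turtle(line):
    """Convert one MovieID::Title::Genres line to Turtle (illustrative sketch)."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    # Pull the release year off the end of the title: "Invisible Man, The (1933)"
    match = re.match(r"(.*) \((\d{4})\)$", title)
    title, year = match.group(1), match.group(2)
    # Move a trailing ", The" (or ", A"/", An") back to the front of the title
    article = re.match(r"(.*), (The|A|An)$", title)
    if article:
        title = article.group(2) + " " + article.group(1)
    lines = ['gldm:i%s rdfs:label "%s" ;' % (movie_id, title),
             '   a schema:Movie ;']
    # One dcterms:type triple per pipe-delimited genre
    for genre in genres.split("|"):
        lines.append('   dcterms:type "%s" ;' % genre)
    lines.append('   schema:datePublished "%s" .' % year)
    return "\n".join(lines)

print(movie_to_turtle("3932::Invisible Man, The (1933)::Horror|Sci-Fi"))
```

Running this on the sample line above produces the same Turtle shown earlier.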

A simpler perl script converted user descriptions of the format "UserID::Gender::Age::Occupation::Zip-code" to triples like this:

gldu:i48 a schema:Person ;
   schema:gender "M" ;
   glschema:age glda:i25 ;
   schema:jobTitle gldo:i4 ;
   schema:postalCode "92107" .

I created a ratingsSchemaAndCodeLists.ttl file to assign the age range and job title labels shown in the README file to the age and jobTitle values with triples like this:

glda:i25 rdfs:label "25-34" . 
gldo:i4 a schema:jobTitle ;
   rdfs:label "college/grad student" . 

Finally, a third perl script converted ratings lines of the format "UserID::MovieID::Rating::Timestamp" to triples grouped together with blank nodes like this:

[ a schema:Review ;
  schema:author gldu:i1 ;
  schema:about gldm:i661 ;
  schema:reviewRating 3 ;
  dcterms:date "2000-12-31" 
] .
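The only non-obvious step there is turning the epoch-seconds Timestamp into a calendar date. Here's a Python sketch of the idea (again an illustrative stand-in for the perl, run on a made-up example line):

```python
import datetime

def rating_to_turtle(line):
    """Convert one UserID::MovieID::Rating::Timestamp line to a Turtle blank node."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    # MovieLens timestamps are seconds since the Unix epoch; interpret them as UTC
    when = datetime.datetime.fromtimestamp(int(timestamp), tz=datetime.timezone.utc)
    return ("[ a schema:Review ;\n"
            "  schema:author gldu:i" + user_id + " ;\n"
            "  schema:about gldm:i" + movie_id + " ;\n"
            "  schema:reviewRating " + rating + " ;\n"
            '  dcterms:date "' + when.strftime("%Y-%m-%d") + '"\n'
            "] .")

print(rating_to_turtle("1::661::3::978302109"))
```

Whether you interpret the timestamps as UTC or local time can shift a date by a day, so pick one convention and stick with it.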

The scripts and the ratingsSchemaAndCodeLists.ttl file are available in the file movieLensScripts.zip, and you can see the queries described below and their results at movieLensQueries.html.

The queries

I mentioned that most of the movies were from the 1990s; the results of query 1 show the actual number of rated movies by release year.

Query 2 listed the movie genres sorted by the average ratings they received. The results put Film-Noir, Documentary, War, and Drama in the top four spots. Does that make these four genres the most popular? Perhaps, if you measure popularity by assigned ratings, but if you measure it by the movies that people actually choose to see (or, more accurately, to rate), as query 3 does, the results reveal that the four most popular genres to see are Comedy, Drama, Action, and Thriller, with Film-Noir and Documentary ranking in the bottom two spots.

Breaking ratings down by age group makes things more interesting. Query 4 asks for average ratings by age group, and the results show a strong correlation between age and ratings: while movie viewers aged 18-24 give slightly lower ratings than those under 18--it is a cynical age to be--from there on up, the older the viewers, the higher the average ratings.

What are each age group's favorite genres by rating and by attendance? Query 5 asks for attendance figures and average ratings broken down by age group and genre. In the first version of these results, sorted by rating, we see that most age groups give the highest average ratings to Film-Noir, Documentary, and War movies, in that order, except the two oldest groups, who rate War movies higher than Documentaries, and the youngest group, whose average rating for Documentary films puts them behind Film-Noir, War, and Drama.

With the same results sorted by attendance within each age group, we see that the three age groups under 35 prefer to watch Comedy, Drama, and Action movies, in that order. Most people 35 and older would rather watch Drama than Comedies, with Action in third place for them as well.

I was curious whether a movie's age affected viewers' choices of what to see and their ratings--for example, when watching a movie that you've heard about for a few years, are you more likely to assume that it's good because it hasn't faded away? Query 6 lists the average ratings given to movies by movie type if the movie was seen more than five years after release. In these results, Film-Noir is once again at the top, but the average rating of War movies puts them above Documentaries, and Mysteries climb from seventh to fourth place.

Query 7 asks the same thing about movies that were ten years old when viewed. These results show Mysteries climbing to third place and pushing Documentaries down to fourth, so it appears that Mysteries age better than Documentaries. (Nothing ages better than Film-Noir, whose average ratings go up with age, but remember that they're not nearly as popular to watch as the other genres; people who like them just like them more.)

Finally, Query 8 asks for the average ratings and total attendance by age group for the movies that were more than ten years old when viewed. Comparing the results sorted by rating with the same figures calculated for all movies (the first query 5 results), we see that it's the older movie viewers driving the higher ratings of older Mysteries over Documentaries--the ratings of the 199 movie viewers aged 18-24 actually put Documentaries at the top of their list of older movies. The same results sorted by attendance were remarkably similar to the query 5 version that took all the movies into account.

And more queries

It's easy to think of more questions to ask; we haven't even asked about specific movies and their roles in the ratings. For example, what were these older Documentaries that the 18-24 year-old viewers liked so much? Perhaps there was some breakout hit that skewed the averages by being more popular than Documentaries typically are. Do viewers' genders or job titles affect their choice of movies to see or the ratings they give them? If you're wondering, or thinking of new queries, you can download the data from the grouplens link above, convert it to turtle with my perl scripts, and query away.

With more recent ratings and movies, these kinds of explorations of the data could be used to plan advertising budgets or a film festival program. I mostly found it fun as a way to use SPARQL to explore a set of data that was not designed to be represented in RDF, but was very easy to convert, and I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a great addition SPARQL can be to their tool box.

Please add any comments to this Google+ post.

15 July 2015

Visualizing DBpedia geographic data

With some help from SPARQL.

US astronaut birth places

I've been learning about Geographical Information System (GIS) data lately. More and more projects and businesses are doing interesting things by associating new kinds of data with specific latitude/longitude pairs; this data might be about air quality, real estate prices, or the make and model of the nearest Uber car.

DBpedia has a lot of latitude and longitude data, and SPARQL queries let you associate it with other data. Because you can retrieve these query results as CSV files, and many GIS packages can read CSV data, you can do a lot of similar interesting things yourself.
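For example, here's a sketch of requesting CSV results from the DBpedia endpoint with nothing but the Python standard library. The endpoint URL is DBpedia's public one, and the Accept header follows the standard SPARQL protocol's content negotiation; adjust both as needed.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

def csv_request(query):
    """Build a SPARQL protocol GET request whose results will come back as CSV."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    # Content negotiation: ask the endpoint to serialize the results as text/csv
    return urllib.request.Request(url, headers={"Accept": "text/csv"})

req = csv_request(
    "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Astronaut> } LIMIT 3")
# urllib.request.urlopen(req).read().decode("utf-8") would then return the CSV text
```

The same request URL can be pasted into a browser or handed to curl, which makes it easy to script the query-to-GIS pipeline described below.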

A query of DBpedia data about American astronauts shows that the oldest one was born in 1918 and the youngest one was born in 1979. I wondered whether, over time, there were any patterns in what part of the country they came from, and I managed to combine a DBpedia SPARQL query with an open-source GIS visualization package to create the map shown here.

The following query asks for the birth year and latitude and longitude of the birthplace of each American astronaut:

SELECT (MAX(?latitude) AS ?maxlat) (MAX(?longitude) AS ?maxlong) 
       ?astronaut (substr(str(MAX(?birthYear)),1,4) AS ?by) 
WHERE {
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace ;
             dbpedia-owl:birthYear ?birthYear ; 
             dbpedia2:nationality :United_States .
  ?birthPlace geo:lat ?latitude ;
              geo:long ?longitude .
}
GROUP BY ?astronaut

(The query has no prefix declarations because it uses the ones built into DBpedia. Also, because some places have more than one pair of geo:lat and geo:long values, I found it simplest to just take the maximum value of each to get one pair for each astronaut.) The following shows the first few lines of the result when I asked for CSV:


QGIS Desktop is an open-source tool for working with GIS data that, among other things, lets you visualize data. The data can come from disk files or from several other sources, including the PostGIS add-on to the PostgreSQL database, which lets you scale up pretty far in the amount of data you can work with.

Using QGIS to create the image above, I first loaded the shapefile (actually a collection of files, including an old-fashioned dBase dbf file) from the US Census website with outlines of the individual states of the United States.

GIS visualization is often about layering of data such as state boundaries, altitude data, and roads to see the combined effects; those little cars in your phone's Uber app would look kind of silly if the roads and your current location weren't shown with them. For my experiment, the census shapefile was my first layer, and QGIS Desktop's "Add Delimited Text Layer" feature let me add the results of my SPARQL query about astronaut data as another layer. One tricky bit for us GIS novices is that these tools usually ask you to specify a Coordinate Reference System for any set of data, typically as an EPSG number, and there are a lot of those out there. I used EPSG 4269.

At first, QGIS added in all the astronaut birthplace locations as little black circles filled with the same shade of green. It had also set the default fill color of the US map to green, so I reset that to white in the dialog box for configuring that layer's properties. Then, in the astronaut data layer's properties, I found that instead of using identical symbols to represent each point on the map, I could pick "Graduated" and specify a "color ramp" that QGIS would use to assign color values according to the values in the property that I selected for this: by, or birth year, which you'll recognize from the fourth column of the sample CSV output above. QGIS looked at the lowest and highest of these values and offered to assign the following colors to by values in the ranges shown, and I just accepted the default:

QGIS color configuration

(While the earlier query showed a few astronauts born in 1978 and 1979, the range here only goes up to 1977 because I now see that some geographic coordinates in DBpedia are specified with dbpprop:latitude and dbpprop:longitude instead of geo:lat and geo:long, so if I was redoing this I'd revise the query to take those into account.)

If you click on the map above to see the larger image, you'll see that many early astronauts came from the midwest, and then over time, they gradually came from the four corners of the continental US. Why so many from the New York City area and none from Wyoming? Is there something in New York more conducive to producing astronauts than the wide-open spaces of Wyoming? Yes: there are more people there, so the odds are that more astronauts will come from there. See this excellent xkcd cartoon for more on this principle.

I only scratched the surface of what QGIS can do. I found this video from the Vermont Center for Geographic Info to be an excellent introduction. I learned from it and the book PostGIS in Action that an important set of features that GIS systems such as QGIS add is the automation of some of the math involved in computing distances and areas, which is not simple geometry because it takes place on the curved surface of the earth. A package like PostGIS adds specialized datatypes and functions to a general-purpose database like PostgreSQL to do the more difficult parts of the geography math. This lets your SQL queries do proximity analysis and other GIS tasks as well as hand off such data to a visualization tool such as QGIS. (The open-source GeoMesa database adds similar features to Apache Accumulo and Google BigTable for more Hadoop-scale applications.)
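To get a feel for why that math is harder than flat-plane geometry, here's the classic haversine great-circle formula in Python. It treats the earth as a perfect sphere, which is only an approximation; systems like PostGIS use more accurate ellipsoidal models.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometers."""
    # Convert to radians and apply the haversine formula on a sphere
    # with the earth's mean radius of 6371 km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Roughly 5,570 km from New York City to London
print(round(haversine_km(40.7127, -74.0059, 51.5074, -0.1278)))
```

A naive Euclidean distance on the raw latitude/longitude numbers would be badly wrong at this scale, which is exactly the kind of mistake these GIS packages save you from.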

The great news for SPARQL users is that a GIS extension called GeoSPARQL does something similar. You can try it out at the geosparql.org website. For example, entering the following query there will list all the airports within 10 miles of New York City:

PREFIX spatial:<http://jena.apache.org/spatial#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:<http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gn:<http://www.geonames.org/ontology#>

SELECT ?name
WHERE {
  ?object spatial:nearby(40.712700 -74.005898 10 'mi') .
  ?object a <http://www.lotico.com/ontology/Airport> ;
          gn:name ?name .
}

(The data uses a fairly broad definition of "airport," including heliports and seaplane bases.) I have not played with any GeoSPARQL implementations outside of geosparql.org, but the Parliament one mentioned on the GeoSPARQL Wikipedia page looks interesting. I have not played much with the Linked Open Street Map SPARQL endpoint, but it also looks great for people who are interested in GIS and SPARQL.

Whether you try out GeoSPARQL or not, when you take DBpedia's ability to associate such a broad range of data with geographic coordinates, and you combine that with the ability of GIS visualization tools like QGIS to work with that data (especially the ability to visualize the associated data—in my case, the color coding of astronaut birth years), you have a vast new category of cool things you can do with SPARQL.

Please add any comments to this Google+ post.

20 June 2015

Artificial Intelligence, then (1960) and now

Especially machine learning.


Earlier this month I tweeted "When people write about AI like it's this brand new thing, should I be amused, feel old, or both?" The tweet linked to a recent Harvard Business Review article called Data Scientists Don't Scale about the things that Artificial Intelligence is currently doing, which just happened to be the things that the author of the article's automated prose-generation company is doing.

The article provided absolutely no historical context to this phrase that has thrilled, annoyed, and fascinated people since the term was first coined by John McCarthy in 1955. (For a little historical context, this was two years after Dwight Eisenhower succeeded Harry Truman as President of the United States. Three years later, McCarthy invented Lisp—a programming language that, besides providing the basis of other popular languages such as Scheme and the currently very hot Clojure, is still used today.) I recently came across a link to the seminal 1960 paper Steps Toward Artificial Intelligence by AI pioneer Marvin Minsky, who was there at the beginning in 1955, and so I read it on a long plane ride. It's fascinating how relevant much of it still is today, especially when you take into account the limited computing power available 55 years ago.

After enumerating the five basic categories of "making computers solve really difficult problems" (search, pattern-recognition, learning, planning, and induction), the paper mentions several algorithms that are still considered to be basic tools in Machine Learning toolboxes: hill climbing, naive Bayesian classification, perceptrons, reinforcement learning, and neural nets. Minsky notes that one part of Bayesian classification "can be made by a simple network device" that he illustrates with this diagram:

[Minsky's diagram of a simple network device for Bayesian classification]

It's wild to consider that the software possibilities were so limited at the time that it was easier to implement some of these ideas by just building specialized hardware. Minsky also describes the implementation of a certain math game by a network of resistors designed by Claude Shannon (who I was happy to hear mentioned in the season 1 finale of Silicon Valley):

[Diagram of Shannon's resistor network implementing the game]

Minsky's paper also references the work of B.F. Skinner, of Skinner box fame, when describing reinforcement learning, and it cites Noam Chomsky when describing inductive learning. I mention these two together because this past week I also read an interview that took place just three years ago titled Noam Chomsky on Where Artificial Intelligence Went Wrong. Describing those early days of AI research, the interview's introduction tells us how

Some of McCarthy's colleagues in neighboring departments, however, were more interested in how intelligence is implemented in humans (and other animals) first. Noam Chomsky and others worked on what became cognitive science, a field aimed at uncovering the mental representations and rules that underlie our perceptual and cognitive abilities. Chomsky and his colleagues had to overthrow the then-dominant paradigm of behaviorism, championed by Harvard psychologist B.F. Skinner, where animal behavior was reduced to a simple set of associations between an action and its subsequent reward or punishment. The undoing of Skinner's grip on psychology is commonly marked by Chomsky's 1959 critical review of Skinner's book Verbal Behavior, a book in which Skinner attempted to explain linguistic ability using behaviorist principles.

The introduction goes on to describe a 2011 symposium at MIT on "Brains, Minds and Machines," which "was meant to inspire multidisciplinary enthusiasm for the revival of the scientific question from which the field of artificial intelligence originated: how does intelligence work?"

Noam Chomsky, speaking in the symposium, wasn't so enthused. Chomsky critiqued the field of AI for adopting an approach reminiscent of behaviorism, except in more modern, computationally sophisticated form. Chomsky argued that the field's heavy use of statistical techniques to pick regularities in masses of data is unlikely to yield the explanatory insight that science ought to offer. For Chomsky, the "new AI" — focused on using statistical learning techniques to better mine and predict data — is unlikely to yield general principles about the nature of intelligent beings or about cognition.

The whole interview is worth reading. I'm not saying that I completely agree with Chomsky or completely disagree (as Google's Peter Norvig has in an essay that has the excellent URL http://norvig.com/chomsky.html but gets a little ad hominem when he starts comparing Chomsky to Bill O'Reilly), only that Minsky's 1960 paper and Chomsky's 2012 interview, taken together, provide a good perspective on where AI came from and the path it took to the roles it plays today.

I'll close with this nice quote from a discussion in Minsky's paper of what exactly "intelligence" is and whether machines are capable of it:

Programmers, too, know that there is never any "heart" in a program. There are high-level routines in each program, but all they do is dictate that "if such-and-such, then transfer to such-and-such a subroutine." And when we look at the low-level subroutines, which "actually do the work," we find senseless loops and sequences of trivial operations, merely carrying out the dictates of their superiors. The intelligence in such a system seems to be as intangible as becomes the meaning of a single common word when it is thoughtfully pronounced over and over again.

Please add any comments to this Google+ post.
