Querying machine learning movie ratings data with SPARQL

Well, movie ratings data popular with machine learning people.

While watching an excellent video about the pandas Python data analysis library recently, I learned that the University of Minnesota's GroupLens project has made a large amount of movie rating data from the MovieLens website available. Their download page lets you pull down 100,000, one million, ten million, or 100 million ratings, including data about the people doing the rating and the movies they rated.

This dataset is popular in the machine learning world: a Google search on "movielens 'machine learning'" gets over 33 thousand hits, with over ten percent being in scholarly articles. I thought it would be fun to query this data with SPARQL, so I downloaded the 1 million rating set, wrote some short perl scripts to convert the ratings, users, and movies "tables" to turtle RDF, and was off and running.

The data

I put "tables" in quotes above because while most people like to think of data in terms of tables, the data about the movies themselves was not strictly a normalized table. As the README file tells us, each line has the structure "MovieID::Title::Genres", in which Genres is a pipe-delimited list of one or more genres selected from the list in the README file. Here's one example:

3932::Invisible Man, The (1933)::Horror|Sci-Fi

The potential presence of more than one genre value in that last column means that this table's data is not fully normalized, but speaking as an RDF guy, we don't need no stinkin' normalization. A short perl script converted that line into the following turtle:

gldm:i3932 rdfs:label "The Invisible Man" ;
   a schema:Movie ;
   dcterms:type "Horror" ;
   dcterms:type "Sci-Fi" ;
   schema:datePublished "1933" .

As you can see, my perl script also moved the word "The" in the film's title back where it belonged and pulled the release date out into its own triple, which let me query for things like the effect of a movie's age on its popularity among viewers. Although the 3,883 movies listed went back to 1919, most were from the 1990s.
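
For readers who would rather not read perl, here is a minimal Python sketch of the same kind of conversion. It is not the author's script: the function name `movie_to_turtle` and the regular expressions are assumptions made for illustration, and prefix declarations are left out just as they are in the snippet above.

```python
import re

def movie_to_turtle(line):
    """Turn one 'MovieID::Title::Genres' line into Turtle (illustrative sketch)."""
    movie_id, title, genres = line.strip().split("::")

    # Pull a trailing "(1933)"-style release year out of the title, if present.
    year = None
    m = re.search(r"\s*\((\d{4})\)\s*$", title)
    if m:
        year = m.group(1)
        title = title[:m.start()]

    # Move a trailing article back to the front:
    # "Invisible Man, The" -> "The Invisible Man".
    m = re.match(r"^(.*), (The|A|An)$", title)
    if m:
        title = f"{m.group(2)} {m.group(1)}"

    # Escape characters that would break a Turtle string literal.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')

    parts = [f'rdfs:label "{escaped}"', "a schema:Movie"]
    # One dcterms:type triple per pipe-delimited genre.
    parts += [f'dcterms:type "{genre}"' for genre in genres.split("|")]
    if year:
        parts.append(f'schema:datePublished "{year}"')
    return f"gldm:i{movie_id} " + " ;\n   ".join(parts) + " ."

print(movie_to_turtle("3932::Invisible Man, The (1933)::Horror|Sci-Fi"))
```

Run on the sample line, this prints the same five-line Turtle shown above. Each genre simply becomes another triple about the same subject, which is why the multi-valued column needs no extra table.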

Something else from the 1990s was the movie file's Latin 1 encoding, so I used the iconv utility to convert it to UTF-8 before running the script that turned it into turtle so that a title such as "Not Love, Just Frenzy (Más que amor, frenesí)" wouldn't get mangled along the way.
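
If iconv isn't handy, the same re-encoding can be sketched in a few lines of Python. The function name `latin1_to_utf8` and the file paths passed to it are hypothetical; the only thing taken from the text above is that the input is Latin 1 and the output should be UTF-8.

```python
def latin1_to_utf8(src_path, dst_path):
    """Re-encode a Latin-1 text file as UTF-8, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="latin-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Because every byte value is a valid Latin 1 character, the read side never fails; the accented characters are simply written back out as multi-byte UTF-8 sequences.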

A simpler perl script converted user descriptions of the format "UserID::Gender::Age::Occupation::Zip-code" to triples like this:

gldu:i48 a schema:Person ;
   schema:gender "M" ;
   glschema:age glda:i25 ;
   schema:jobTitle gldo:i4 ;
   schema:postalCode "92107" .
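
A Python sketch of this simpler conversion might look like the following. Again, this is an illustration rather than the author's perl; the function name `user_to_turtle` is made up, and the input line is reconstructed from the triples above.

```python
def user_to_turtle(line):
    """Turn one 'UserID::Gender::Age::Occupation::Zip-code' line into Turtle."""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    return (
        f"gldu:i{user_id} a schema:Person ;\n"
        f'   schema:gender "{gender}" ;\n'
        # Age and occupation are codes, so they become resources, not strings.
        f"   glschema:age glda:i{age} ;\n"
        f"   schema:jobTitle gldo:i{occupation} ;\n"
        f'   schema:postalCode "{zip_code}" .'
    )

print(user_to_turtle("48::M::25::4::92107"))
```

Turning the age and occupation codes into resources rather than string literals is what lets a separate file attach labels to them.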

I created a ratingsSchemaAndCodeLists.ttl file to assign the age range and job title labels shown in the README file to the age and jobTitle values with triples like this:

glda:i25 rdfs:label "25-34" . 
gldo:i4 a schema:jobTitle ;
   rdfs:label "college/grad student" . 

Finally, a third perl script converted ratings lines of the format "UserID::MovieID::Rating::Timestamp" to triples grouped together with blank nodes like this:

[
  a schema:Review ;
  schema:author gldu:i1 ;
  schema:about gldm:i661 ;
  schema:reviewRating 3 ;
  dcterms:date "2000-12-31"
] .
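
Here is a rough Python equivalent of that third conversion. It assumes the Timestamp field is seconds since the Unix epoch and interprets it as UTC, which may not match what the perl script does; the function name `rating_to_turtle` and the sample timestamp value are illustrative.

```python
from datetime import datetime, timezone

def rating_to_turtle(line):
    """Turn one 'UserID::MovieID::Rating::Timestamp' line into a blank-node review."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    # Assumption: the timestamp is epoch seconds, read here as UTC.
    date = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date().isoformat()
    return (
        "[\n"
        "  a schema:Review ;\n"
        f"  schema:author gldu:i{user_id} ;\n"
        f"  schema:about gldm:i{movie_id} ;\n"
        f"  schema:reviewRating {int(rating)} ;\n"
        f'  dcterms:date "{date}"\n'
        "] ."
    )

print(rating_to_turtle("1::661::3::978302109"))
```

The square brackets give each review an anonymous subject, so the four facts about one rating stay grouped without anyone having to invent an identifier for it.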

The scripts and the ratingsSchemaAndCodeLists.ttl file are available on github, and you can see the queries described below and their results at movieLensQueries.html.

The queries

I mentioned that most of the movies were from the 1990s; the results of query 1 show the actual number of rated movies by release year.

Query 2 lists the movie genres sorted by the average ratings they received. The results put Film-Noir, Documentary, War, and Drama in the top four spots. Does that make these four genres the most popular? Perhaps, if you measure popularity by assigned ratings. If you instead measure it by the movies that people actually choose to see (or, more accurately, to rate), as query 3 does, the four most popular genres are Comedy, Drama, Action, and Thriller, with Film-Noir and Documentary ranking in the bottom two spots.
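
To make the difference between those two measures concrete, here is a hedged sketch. The SPARQL text is only a guess at the shape of such a query given the triples shown earlier (the actual queries are on the linked results page), and it assumes the usual schema.org and Dublin Core namespace URIs. The function `genre_stats` and the toy data are made up, and do the same grouping in plain Python so the logic can be checked without a SPARQL engine.

```python
from collections import defaultdict

# A guess at the shape of the query; the author's actual query may differ.
GENRE_RATINGS_QUERY = """
PREFIX schema: <http://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?genre (AVG(?rating) AS ?avgRating) (COUNT(?review) AS ?ratingCount)
WHERE {
  ?review a schema:Review ;
          schema:about ?movie ;
          schema:reviewRating ?rating .
  ?movie dcterms:type ?genre .
}
GROUP BY ?genre
ORDER BY DESC(?avgRating)
"""

def genre_stats(movie_genres, ratings):
    """Plain-Python version of the GROUP BY above.

    movie_genres: dict mapping a movie id to its list of genres
    ratings: iterable of (movie_id, rating) pairs
    Returns (genre, average rating, rating count) rows, best-rated first.
    """
    totals = defaultdict(lambda: [0, 0])  # genre -> [sum of ratings, count]
    for movie_id, rating in ratings:
        # A movie with two genres contributes its rating to both groups.
        for genre in movie_genres.get(movie_id, []):
            totals[genre][0] += rating
            totals[genre][1] += 1
    rows = [(genre, total / count, count) for genre, (total, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy data, invented purely for illustration.
toy_genres = {"i1": ["Drama"], "i2": ["Comedy", "Drama"]}
toy_ratings = [("i1", 5), ("i2", 3), ("i2", 4)]
print(genre_stats(toy_genres, toy_ratings))
# [('Drama', 4.0, 3), ('Comedy', 3.5, 2)]
```

Sorting the same rows by the count column instead of the average (in SPARQL, `ORDER BY DESC(?ratingCount)`) gives the "what do people actually watch" ranking.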

Breaking ratings down by age group makes things more interesting. Query 4 asks for average ratings by age group, and the results show a strong correlation between age and ratings: while movie viewers aged 18-24 give slightly lower ratings than those under 18--it is a cynical age to be--from there on up, the older the viewers, the higher the average ratings.

What are each age group's favorite genres by rating and by attendance? Query 5 asks for attendance figures and average ratings broken down by age group and genre. In the first version of these results, sorted by rating, most age groups give the highest average ratings to Film-Noir, Documentary, and War movies, in that order. The exceptions are the two oldest groups, who rate War movies higher than Documentaries, and the youngest group, whose average rating for Documentaries puts them behind Film-Noir, War, and Drama.

With the same results sorted by attendance within each age group, we see that the three age groups under 35 prefer to watch Comedy, Drama, and Action movies, in that order. Most people 35 and older would rather watch Drama than Comedies, with Action in third place for them as well.

I was curious whether a movie's age affected viewers' choices of what to see and their ratings--for example, when watching a movie that you've heard about for a few years, are you more likely to assume that it's good because it hasn't faded away? Query 6 lists the average ratings given to movies by genre if the movie was seen more than five years after release. In these results, Film-Noir is once again at the top, but the average rating of War movies puts them above Documentaries, and Mysteries climb from seventh to fourth place.
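
The "more than five years after release" condition is worth a closer look, because both the review date and the release year were stored as plain strings in the Turtle shown earlier. The sketch below shows one way such a test could be written; the SPARQL fragment is an assumption about how a filter might look, not the author's query, and it would need the `xsd:`, `schema:`, and `dcterms:` prefixes declared. The helper `seen_long_after_release` is a made-up name for the equivalent check in Python.

```python
# Hypothetical WHERE-clause fragment: compare the review year with the release year.
AGE_FILTER_FRAGMENT = """
  ?review dcterms:date ?reviewDate ;
          schema:about ?movie .
  ?movie schema:datePublished ?released .
  FILTER (xsd:integer(SUBSTR(?reviewDate, 1, 4)) - xsd:integer(?released) > 5)
"""

def seen_long_after_release(review_date, release_year, min_years=5):
    """True if the review year is more than min_years after the release year.

    review_date: string such as "2000-12-31"
    release_year: string such as "1933"
    """
    return int(review_date[:4]) - int(release_year) > min_years

print(seen_long_after_release("2000-12-31", "1933"))  # True
print(seen_long_after_release("2000-12-31", "1996"))  # False
```

Changing `min_years` to 10 gives the stricter cutoff for older movies.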

Query 7 asks the same thing about movies that were ten years old when viewed. These results show Mysteries climbing to third place and pushing Documentaries down to fourth, so it appears that Mysteries age better than Documentaries. (Nothing ages better than Film-Noir, whose average ratings go up with age, but remember that they're not nearly as popular to watch as the other genres; people who like them just like them more.)

Finally, Query 8 asks for the average ratings and total attendance by age group for the movies that were more than ten years old when viewed. Comparing the results sorted by rating with the same figures calculated for all movies (the first query 5 results), we see that it's the older movie viewers driving the higher ratings of older Mysteries over Documentaries--the ratings of the 199 movie viewers aged 18-24 actually put Documentaries at the top of their list of older movies. The same results sorted by attendance were remarkably similar to the query 5 version that took all the movies into account.

And more queries

It's easy to think of more questions to ask; we haven't even asked about specific movies and their roles in the ratings. For example, what were these older Documentaries that the 18-24 year-old viewers liked so much? Perhaps there was some breakout hit that skewed the averages by being more popular than Documentaries typically are. Do viewers' genders or job titles affect their choice of movies to see or the ratings they gave them? If you're wondering, or thinking of new queries, you can download the data from the grouplens link above, convert it to turtle with my perl scripts, and query away.

With more recent ratings and movies, these kinds of explorations of the data could be used to plan advertising budgets or a film festival program. I mostly found it fun as a way to use SPARQL to explore a set of data that was not designed to be represented in RDF, but was very easy to convert, and I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a great addition SPARQL can be to their tool box.
