19 September 2015

My data science glossary

Complete with a dot org domain name.

glossary in dictionary

Lately I've been studying up on the math and technology associated with data science because there are so many interesting things going on. Despite taking many notes, I found myself learning certain important terms, seeing them again later, and then thinking "What was that again? P-values? Huh?"

So, I turned a portion of my notes into a glossary to make these things easy to look up when I wanted to remember them. I decided that I may as well publish this glossary in case others found it helpful, or if they had suggestions or corrections. And, when I found that the domain name datascienceglossary.org wasn't taken, I couldn't resist grabbing it.

Now it's up and ready for the world: datascienceglossary.org. I also took the opportunity to try out Bootstrap to see how easily it might make my new little website look presentable on Android and Apple phones and tablets in addition to bigger screens. It was pretty easy, especially after I found their documentation page. (In the past, I've found that many CSS frameworks that are supposed to make your life easier have horrible documentation, if any--"just look at our fabulous examples" isn't enough; if the class values that we're supposed to assign to our HTML elements are packed with cryptic little abbreviations, then tell us what all the abbreviations stand for.)

I hope my data science glossary is useful to some people. I know it will be useful to me, especially the next time I forget what "P-value" means.

Please add any comments to this Google+ post.

22 August 2015

Querying machine learning movie ratings data with SPARQL

Well, movie ratings data popular with machine learning people.

I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a nice addition SPARQL can be to their tool box.

While watching an excellent video about the pandas python data analysis library recently, I learned about how the University of Minnesota's grouplens project has made a large amount of movie rating data from the movielens website available. Their download page lets you pull down 100,000, one million, ten million, or 100 million ratings, including data about the people doing the rating and the movies they rated.

This dataset is popular in the machine learning world: a Google search on "movielens 'machine learning'" gets over 33 thousand hits, with over ten percent being in scholarly articles. I thought it would be fun to query this data with SPARQL, so I downloaded the 1 million rating set, wrote some short perl scripts to convert the ratings, users, and movies "tables" to turtle RDF, and was off and running.

The data

I put "tables" in quotes above because while most people like to think of data in terms of tables, the data about the movies themselves was not strictly a normalized table. As the README file tells us, each line has the structure "MovieID::Title::Genres", in which Genres is a pipe-delimited list of one or more genres selected from the list in the README file. Here's one example:

3932::Invisible Man, The (1933)::Horror|Sci-Fi

The potential presence of more than one genre value in that last column means that this table's data is not fully normalized, but speaking as an RDF guy, we don't need no stinkin' normalization. A short perl script converted that line into the following turtle:

gldm:i3932 rdfs:label "The Invisible Man" ;
   a schema:Movie ;
   dcterms:type "Horror" ;
   dcterms:type "Sci-Fi" ;
   schema:datePublished "1933" .

As you can see, my perl script also moved the word "The" in the film's title back where it belonged and pulled the release date out into its own triple, which let me query for things like the effect of a movie's age on its popularity among viewers. Although the 3,883 movies listed went back to 1919, most were from the 1990s.
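The original conversion was done in Perl, and that script is not reproduced here. As a rough illustration of the same idea, the following Python sketch turns one "MovieID::Title::Genres" line into Turtle. The function name and regular expressions are hypothetical, and the sketch only handles a trailing English article ("The", "A", "An") before the year, so titles with a parenthesized alternate title would need more work.

```python
import re

# Hypothetical patterns for this sketch, not taken from the original Perl script.
ARTICLE_SUFFIX = re.compile(r"^(?P<rest>.+), (?P<article>The|A|An)$")
TITLE_YEAR = re.compile(r"^(?P<title>.+) \((?P<year>\d{4})\)$")


def movie_line_to_turtle(line: str) -> str:
    """Convert one 'MovieID::Title::Genres' line to a Turtle description."""
    movie_id, raw_title, genres = line.rstrip("\n").split("::")

    # Split "Invisible Man, The (1933)" into a title and a release year.
    match = TITLE_YEAR.match(raw_title)
    if match:
        title, year = match.group("title"), match.group("year")
    else:
        title, year = raw_title, None

    # Move a trailing article back to the front of the title.
    article = ARTICLE_SUFFIX.match(title)
    if article:
        title = f"{article.group('article')} {article.group('rest')}"

    # Escape backslashes and double quotes so the label is a valid Turtle string.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')

    statements = [f'rdfs:label "{escaped}"', "a schema:Movie"]
    # One dcterms:type triple per pipe-delimited genre.
    statements += [f'dcterms:type "{genre}"' for genre in genres.split("|")]
    if year:
        statements.append(f'schema:datePublished "{year}"')
    return f"gldm:i{movie_id} " + " ;\n   ".join(statements) + " ."


print(movie_line_to_turtle("3932::Invisible Man, The (1933)::Horror|Sci-Fi"))
```

Emitting one dcterms:type triple per genre is what makes the non-normalized Genres column a non-issue on the RDF side.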

Something else from the 1990s was the movie file's Latin 1 encoding, so I used the iconv utility to convert it to UTF-8 before running the script that turned it into turtle so that a title such as "Not Love, Just Frenzy (Más que amor, frenesí)" wouldn't get mangled along the way.
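For anyone without iconv handy, the same re-encoding step can be sketched in a few lines of Python. The function name is made up for this example; the logic is roughly what `iconv -f ISO-8859-1 -t UTF-8` does.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 (ISO-8859-1) bytes as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


# Simulate a title as it would be stored in the Latin-1 encoded movies file.
latin1_title = "Not Love, Just Frenzy (Más que amor, frenesí)".encode("latin-1")
utf8_title = latin1_to_utf8(latin1_title)
print(utf8_title.decode("utf-8"))
```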

A simpler perl script converted user descriptions of the format "UserID::Gender::Age::Occupation::Zip-code" to triples like this:

gldu:i48 a schema:Person ;
   schema:gender "M" ;
   glschema:age glda:i25 ;
   schema:jobTitle gldo:i4 ;
   schema:postalCode "92107" .

I created a ratingsSchemaAndCodeLists.ttl file to assign the age range and job title labels shown in the README file to the age and jobTitle values with triples like this:

glda:i25 rdfs:label "25-34" . 
gldo:i4 a schema:jobTitle ;
   rdfs:label "college/grad student" . 

Finally, a third perl script converted ratings lines of the format "UserID::MovieID::Rating::Timestamp" to triples grouped together with blank nodes like this:

[
  a schema:Review ;
  schema:author gldu:i1 ;
  schema:about gldm:i661 ;
  schema:reviewRating 3 ;
  dcterms:date "2000-12-31"
] .
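The ratings conversion can be sketched in Python along the same lines. This is not the original Perl script: the function name is hypothetical, the timestamp in the usage line is only illustrative, and reading the timestamp as UTC seconds since the Unix epoch is an assumption that can shift a date by a day compared with a local-time conversion.

```python
from datetime import datetime, timezone


def rating_line_to_turtle(line: str) -> str:
    """Convert one 'UserID::MovieID::Rating::Timestamp' line to a blank-node review."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")

    # Assumption: the timestamp is seconds since the Unix epoch, read here as UTC.
    date = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date().isoformat()

    return (
        "[\n"
        "  a schema:Review ;\n"
        f"  schema:author gldu:i{user_id} ;\n"
        f"  schema:about gldm:i{movie_id} ;\n"
        f"  schema:reviewRating {int(rating)} ;\n"
        f'  dcterms:date "{date}"\n'
        "] ."
    )


# Illustrative input line; the timestamp value is an example, not quoted from the post.
print(rating_line_to_turtle("1::661::3::978302109"))
```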

The scripts and the ratingsSchemaAndCodeLists.ttl file are available on github, and you can see the queries described below and their results at movieLensQueries.html.

The queries

I mentioned that most of the movies were from the 1990s; the results of query 1 show the actual number of rated movies by release year.

Query 2 listed the movie genres sorted by the average ratings they received. The results put Film-Noir, Documentary, War, and Drama in the top four spots. Does that make these four genres the most popular? Perhaps, if you measure popularity by assigned ratings, but if you measure it by the movies that people actually choose to see (or, more accurately, to rate), as query 3 does, the results reveal that the four most popular genres to see are Comedy, Drama, Action, and Thrillers, with Film-Noir and Documentary ranking in the bottom two spots.
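The difference between those two notions of popularity is easy to see outside of SPARQL as well. The following Python sketch uses a tiny made-up sample (not MovieLens data) to compute both a rating count and an average rating per genre, then sorts each way. The function name is hypothetical.

```python
from collections import defaultdict


def genre_stats(ratings):
    """Return {genre: (rating_count, average_rating)} for (genres, rating) pairs."""
    totals = defaultdict(lambda: [0, 0])  # genre -> [count, sum of ratings]
    for genres, rating in ratings:
        # A movie with several genres contributes its rating to each of them.
        for genre in genres:
            totals[genre][0] += 1
            totals[genre][1] += rating
    return {genre: (count, total / count) for genre, (count, total) in totals.items()}


# Made-up sample ratings, only to show how the two sort orders can disagree.
sample = [
    (["Comedy"], 3),
    (["Comedy"], 4),
    (["Comedy", "Drama"], 3),
    (["Drama"], 4),
    (["Film-Noir"], 5),
]

stats = genre_stats(sample)
by_average = sorted(stats, key=lambda g: stats[g][1], reverse=True)
by_count = sorted(stats, key=lambda g: stats[g][0], reverse=True)
print(by_average)  # genres ordered by average rating
print(by_count)    # genres ordered by number of ratings
```

In the SPARQL versions, the same split comes from using AVG() for one ordering and COUNT() for the other over the same grouped pattern.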

Breaking ratings down by age group makes things more interesting. Query 4 asks for average ratings by age group, and the results show a strong correlation between age and ratings: while movie viewers aged 18-24 give slightly lower ratings than those under 18--it is a cynical age to be--from there on up, the older the viewers, the higher the average ratings.

What are each age group's favorite genres by rating and by attendance? Query 5 asks for attendance figures and average ratings broken down by age group and genres. In the first version of these results, sorted by rating, we see that most age groups give the highest average ratings to Film-Noir, Documentary, and War movies, in that order, except the two oldest groups, who rate War movies higher than Documentaries, and the youngest group, whose average rating for Documentary films puts them behind Film-Noir, War, and Drama.

With the same results sorted by attendance within each age group, we see that the three age groups under 35 prefer to watch Comedy, Drama, and Action movies, in that order. Most people 35 and older would rather watch Drama than Comedies, with Action in third place for them as well.

I was curious whether a movie's age affected viewers' choices of what to see and their ratings--for example, when watching a movie that you've heard about for a few years, are you more likely to assume that it's good because it hasn't faded away? Query 6 lists the average ratings given to movies by movie type if the movie was seen more than five years after release. In these results, Film-Noir is once again at the top, but the average rating of War movies puts them above Documentaries, and Mysteries climb from seventh to fourth place.

Query 7 asks the same thing about movies that were ten years old when viewed. These results show Mysteries climbing to third place and pushing Documentaries down to fourth, so it appears that Mysteries age better than Documentaries. (Nothing ages better than Film-Noir, whose average ratings go up with age, but remember that they're not nearly as popular to watch as the other genres; people who like them just like them more.)

Finally, Query 8 asks for the average ratings and total attendance by age group for the movies that were more than ten years old when viewed. Comparing the results sorted by rating with the same figures calculated for all movies (the first query 5 results), we see that it's the older movie viewers driving the higher ratings of older Mysteries over Documentaries--the ratings of the 199 movie viewers aged 18-24 actually put Documentaries at the top of their list of older movies. The same results sorted by attendance were remarkably similar to the query 5 version that took all the movies into account.

And more queries

It's easy to think of more questions to ask; we haven't even asked about specific movies and their roles in the ratings. For example, what were these older Documentaries that the 18-24 year-old viewers liked so much? Perhaps there was some breakout hit that skewed the averages by being more popular than Documentaries typically are. Do viewers' genders or job titles affect their choice of movies to see or the ratings they gave them? If you're wondering, or thinking of new queries, you can download the data from the grouplens link above, convert it to turtle with my perl scripts, and query away.

With more recent ratings and movies, these kinds of explorations of the data could be used to plan advertising budgets or a film festival program. I mostly found it fun as a way to use SPARQL to explore a set of data that was not designed to be represented in RDF, but was very easy to convert, and I hope that more people using R, pandas, and other popular tools associated with data science projects appreciate what a great addition SPARQL can be to their tool box.

Please add any comments to this Google+ post.

15 July 2015

Visualizing DBpedia geographic data

With some help from SPARQL.

US astronaut birth places

I've been learning about Geographical Information System (GIS) data lately. More and more projects and businesses are doing interesting things by associating new kinds of data with specific latitude/longitude pairs; this data might be about air quality, real estate prices, or the make and model of the nearest Uber car.

DBpedia has a lot of latitude and longitude data, and SPARQL queries let you associate it with other data. Because you can retrieve these query results as CSV files, and many GIS packages can read CSV data, you can do a lot of similar interesting things yourself.
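As a minimal sketch of how such a CSV request might be put together, the Python below builds a URL for a SPARQL endpoint. The endpoint address and the `format=text/csv` parameter are assumptions about DBpedia's public endpoint rather than anything stated in this post, and the function name is hypothetical. The sketch only builds the URL; actually fetching it is left commented out.

```python
from urllib.parse import urlencode

# Assumed address of DBpedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def csv_request_url(query: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    """Build a GET URL that asks a SPARQL endpoint for CSV results."""
    # 'format' is assumed to be how this endpoint selects the result format;
    # other endpoints may expect an HTTP Accept header of text/csv instead.
    return endpoint + "?" + urlencode({"query": query, "format": "text/csv"})


url = csv_request_url("SELECT ?s ?lat WHERE { ?s geo:lat ?lat } LIMIT 5")
print(url)

# To fetch the results, something like the following should work:
# from urllib.request import urlopen
# with urlopen(url) as response:
#     print(response.read().decode("utf-8"))
```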

A query of DBpedia data about American astronauts shows that the oldest one was born in 1918 and the youngest one was born in 1979. I wondered whether, over time, there were any patterns in what part of the country they came from, and I managed to combine a DBpedia SPARQL query with an open-source GIS visualization package to create the map shown here.

The following query asks for the birth year and latitude and longitude of the birthplace of each American astronaut:

SELECT (MAX(?latitude) AS ?maxlat) (MAX(?longitude) AS ?maxlong)
       ?astronaut (substr(str(MAX(?birthYear)),1,4) AS ?by)
WHERE {
  ?astronaut dcterms:subject category:American_astronauts ;
             dbpedia-owl:birthPlace ?birthPlace ;
             dbpedia-owl:birthYear ?birthYear ;
             dbpedia2:nationality :United_States .
  ?birthPlace geo:lat ?latitude ;
              geo:long ?longitude .
}
GROUP BY ?astronaut

(The query has no prefix declarations because it uses the ones built into DBpedia. Also, because some places have more than one pair of geo:lat and geo:long values, I found it simplest to just take the maximum value of each to get one pair for each astronaut.) The following shows the first few lines of the result when I asked for CSV:


QGIS Desktop is an open-source tool for working with GIS data that, among other things, lets you visualize data. The data can come from disk files or from several other sources, including the PostGIS add-on to the PostgreSQL database, which lets you scale up pretty far in the amount of data you can work with.

Using QGIS to create the image above, I first loaded the shapefile (actually a collection of files, including an old-fashioned dBase dbf file) from the US Census website with outlines of the individual states of the United States.

GIS visualization is often about layering of data such as state boundaries, altitude data, and roads to see the combined effects; those little cars in your phone's Uber app would look kind of silly if the roads and your current location weren't shown with them. For my experiment, the census shapefile was my first layer, and QGIS Desktop's "Add Delimited Text Layer" feature let me add the results of my SPARQL query about astronaut data as another layer. One tricky bit for us GIS novices is that these tools usually ask you to specify a Coordinate Reference System for any set of data, typically as an EPSG number, and there are a lot of those out there. I used EPSG 4269.

At first, QGIS added in all the astronaut birthplace locations as little black circles filled with the same shade of green. It had also set the default fill color of the US map to green, so I reset that to white in the dialog box for configuring that layer's properties. Then, in the astronaut data layer's properties, I found that instead of using identical symbols to represent each point on the map, I could pick "Graduated" and specify a "color ramp" that QGIS would use to assign color values according to the values in the property that I selected for this: by, or birth year, which you'll recognize from the fourth column of the sample CSV output above. QGIS looked at the lowest and highest of these values and offered to assign the following colors to by values in the ranges shown, and I just accepted the default:

QGIS color configuration

(While the earlier query showed a few astronauts born in 1978 and 1979, the range here only goes up to 1977 because I now see that some geographic coordinates in DBpedia are specified with dbpprop:latitude and dbpprop:longitude instead of geo:lat and geo:long, so if I was redoing this I'd revise the query to take those into account.)
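The graduated classification described above amounts to splitting the range of birth years into classes and giving each class its own color. The Python sketch below assumes an equal-interval split into five classes, which is an assumption about the defaults I accepted rather than something confirmed by the QGIS documentation. The function names are hypothetical.

```python
def equal_interval_classes(low, high, class_count=5):
    """Split [low, high] into class_count ranges of equal width."""
    if high <= low or class_count < 1:
        raise ValueError("need high > low and at least one class")
    width = (high - low) / class_count
    return [(low + i * width, low + (i + 1) * width) for i in range(class_count)]


def class_index(value, low, high, class_count=5):
    """Return the zero-based class that value falls into."""
    if high <= low or class_count < 1:
        raise ValueError("need high > low and at least one class")
    if not low <= value <= high:
        raise ValueError("value outside the classified range")
    width = (high - low) / class_count
    # The top of the range belongs to the last class, not to a class past the end.
    return min(int((value - low) / width), class_count - 1)


# Birth years in the mapped data ran from 1918 to 1977.
for lower, upper in equal_interval_classes(1918, 1977):
    print(f"{lower:.1f} - {upper:.1f}")
print(class_index(1950, 1918, 1977))
```

Each class index would then be mapped to one color along the ramp, which is the part that QGIS handles for you.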

If you click on the map above to see the larger image, you'll see that many early astronauts came from the midwest, and then over time, they gradually came from the four corners of the continental US. Why so many from the New York City area and none from Wyoming? Is there something in New York more conducive to producing astronauts than the wide-open spaces of Wyoming? Yes: there are more people there, so the odds are that more astronauts will come from there. See this excellent xkcd cartoon for more on this principle.

I only scratched the surface of what QGIS can do. I found this video from the Vermont Center for Geographic Info to be an excellent introduction. I learned from it and the book PostGIS in Action that an important set of features that GIS systems such as QGIS add is the automation of some of the math involved in computing distances and areas, which is not simple geometry because it takes place on the curved surface of the earth. A package like PostGIS adds specialized datatypes and functions to a general-purpose database like PostgreSQL to do the more difficult parts of the geography math. This lets your SQL queries do proximity analysis and other GIS tasks as well as hand off such data to a visualization tool such as QGIS. (The open-source GeoMesa database adds similar features to Apache Accumulo and Google BigTable for more Hadoop-scale applications.)
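To give a feel for why distance on a curved surface is not simple geometry, here is a Python sketch of the haversine formula, a common way to approximate great-circle distance. It treats the earth as a perfect sphere with a mean radius of 6371 km, which is a simplification. It is not a description of how PostGIS or QGIS compute distances internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; the earth is treated as a sphere here


def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point error for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude along a meridian, measured from the equator.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))
```

A proximity filter like "everything within 10 miles of a point" then becomes a comparison of this distance against a threshold, which is the kind of work that spatial extensions do for you, typically with indexing and more accurate earth models.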

The great news for SPARQL users is that a GIS extension called GeoSPARQL does something similar. You can try it out at the geosparql.org website. For example, entering the following query there will list all the airports within 10 miles of New York City:

PREFIX spatial:<http://jena.apache.org/spatial#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:<http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gn:<http://www.geonames.org/ontology#>

SELECT ?name
WHERE {
  ?object spatial:nearby (40.712700 -74.005898 10 'mi') .
  ?object a <http://www.lotico.com/ontology/Airport> ;
          gn:name ?name .
}

(The data uses a fairly broad definition of "airport," including heliports and seaplane bases.) I have not played with any GeoSPARQL implementations outside of geosparql.org, but the Parliament one mentioned on the GeoSPARQL Wikipedia page looks interesting. I have not played much with the Linked Open Street Map SPARQL endpoint, but it also looks great for people who are interested in GIS and SPARQL.

Whether you try out GeoSPARQL or not, when you take DBpedia's ability to associate such a broad range of data with geographic coordinates, and you combine that with the ability of GIS visualization tools like QGIS to work with that data (especially the ability to visualize the associated data—in my case, the color coding of astronaut birth years), you have a vast new category of cool things you can do with SPARQL.

Please add any comments to this Google+ post.
