22 January 2017

Brand-name companies using SPARQL: the sparql.club

Disney! Apple! Amazon! MasterCard!

Since I wrote "Experience in SPARQL a plus" about SPARQL appearances in job postings almost three years ago, I still find myself pointing people to it to show them that SPARQL is not some academic theoretical thing but a popular tool in production use at well-known companies.

On the job listing site indeed.com, I have a saved search for SPARQL mentions. The daily email of new search hits that this sends me typically lists a few entries for companies that I have heard of and some for companies that I haven't. Every now and then I'll pick out one to tweet about on @learningsparql, although I don't do it nearly as often as I could.

Between this ongoing stream of new job postings, the increasing age of that blog posting, and my ownership (inspired by Paul Ford's tilde.club) of the domain name sparql.club, I thought it would be fun to keep an updated list there so that I can point the SPARQL haters at it.

So the next time you see someone making ridiculous claims about SPARQL not catching on, tell them to check out the members of the sparql.club!

Please add any comments to this Google+ post.

22 December 2016

A modern neural network in 11 lines of Python

And a great learning tool for understanding neural nets.

The Mark I Perceptron

When you learn new technology, it's common to hear "don't worry about the low-level details--use the tools!" That's a good long-term strategy, but when you learn the lower-level details of how the tools work, it gives you a fuller understanding of what they can do for you. I decided to go through Andrew Trask's A Neural Network in 11 lines of Python to really learn how every line worked, and it's been very helpful. I had to review some matrix math and look up several numpy function calls that he uses, but it was worth it.

My title here refers to it as a "modern neural network" because while neural nets have been around since the 1950s, the use of backpropagation, a sigmoid function, and the sigmoid's derivative in Andrew's script highlights the advances that have made neural nets so popular in machine learning today. For some excellent background on how we got from Frank Rosenblatt's 1957 hard-wired Mark I Perceptron (pictured here) to how derivatives and backpropagation addressed the limitations of these early neural nets, see Andrey Kurenkov's A 'Brief' History of Neural Nets and Deep Learning, Part 1. The story includes a bit more drama than you might expect, with early AI pioneers Marvin Minsky and Seymour Papert convincing the community that limitations in the perceptron model would prevent neural nets from getting very far. I also recommend Michael Nielsen's Using neural nets to recognize handwritten digits, in particular the part on perceptrons, which gives further background on that part of Kurenkov's "Brief History," and then Nielsen's sigmoid neurons part that follows it and describes how these limitations were addressed.

Andrew's 11-line neural network, with its lack of comments and whitespace, is more for show; the 42-line version that follows it is easier to follow and includes a great line-by-line explanation. Below are some of my own additional notes that I made as I dissected and played with his code. Often, I'm just restating something he already wrote, but in my own words, to try to understand it better. Hereafter, when I refer to his script, I mean the 42-line one.

I took his advice of trying the script in an IPython (Jupyter) notebook, where it was a lot easier to change some numbers (for example, the number of iterations in the main for loop) and to add print statements that told me more about what was happening to the variables through the training step iterations. After playing with this a bit and reviewing his piece again, I realized that many of my experiments were things that he suggests in his bulleted list that begins with "Compare l1 after the first iteration and after the last iteration." That whole list is good advice for learning more about how the script works.

Beneath his script and above his line-by-line description he includes a chart explaining each variable's role. As you read through the line-by-line description, I encourage you to refer back to that chart often.

I have minimal experience with the numpy library, but based on the functions from Andrew's script that I looked up, it seems typical that if you take a numpy function that does something to a number and pass it a data structure such as an array or matrix filled with numbers, it will do that thing to all the numbers and return the data structure.
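
A quick interpreter session confirms this elementwise behavior (a minimal sketch of my own, not from Andrew's script; np.exp() is one of the functions his script uses):

```python
import numpy as np

# np.exp() on a single number returns a single number...
print(np.exp(0))   # 1.0

# ...but passed an array, it exponentiates every element
# and returns an array of the same shape
print(np.exp(np.array([[0.0, 1.0], [2.0, 3.0]])))
```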

Line 23 of Andrew's script initializes the weights that tell the neural net how much attention to pay to the input at each neuron. Ultimately, a neural net's job is to tune these weights based on what it sees in how input (in this script's case, the rows of X) corresponds to output (the values of y) so that when it later sees new input it will hopefully output the right things. When this script starts, it has no idea what values to use as weights, so it puts random values in, but not completely random--as Andrew writes, they should have a mean of 0. The np.random.random((x,y)) function returns a matrix of x rows of y random numbers between 0 and 1, so 2*np.random.random((3,1)) returns 3 rows with 1 number each between 0 and 2, and the "- 1" added to that makes them random numbers between -1 and 1.
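
A quick way to convince yourself of that arithmetic (a minimal sketch; the seed call is my addition to make the "random" numbers repeatable, and isn't part of the point):

```python
import numpy as np

np.random.seed(1)
syn0 = 2 * np.random.random((3, 1)) - 1   # scale [0,1) values to [-1,1)

print(syn0.shape)                         # (3, 1): 3 rows of 1 number each
print(((syn0 > -1) & (syn0 < 1)).all())   # True: every weight is between -1 and 1
```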

np.dot() returns dot products. I found the web page How to multiply matrices (that is, how to find their dot product) helpful in reviewing something I hadn't thought about in a while. You can reproduce that page's "Multiplying a Matrix by a Matrix" example using numpy with this:

import numpy as np
matrix1 = np.array([[1,2,3],[4,5,6]])
matrix2 = np.array([[7,8],[9,10],[11,12]])
print(np.dot(matrix1, matrix2))   # [[ 58  64] [139 154]]

The four lines of code in Andrew's main loop perform three tasks:

  1. predict the output based on the input (l0) and the current set of weights (syn0)

  2. check how far off the predictions were

  3. use that information to update the weights before proceeding to the next iteration

If you increase the number of iterations, you'll see that first step get closer and closer to predicting an output of [[0][0][1][1]] in its final passes.

Line 29 does its prediction by calculating the dot product of the input and the weights and then passing the result (a 4 x 1 matrix like [[-4.98467345] [-5.19108471] [ 5.39603866] [ 5.1896274 ]], as I learned from one of those extra print statements I mentioned) to the sigmoid function named nonlin() that is defined at the beginning of the script. If you graphed the values potentially returned by this function, they would not fall in a line (it's "nonlinear") but along an S (sigmoid) curve. Looking at the Sigmoid function Wikipedia page shows that the expression 1/(1+np.exp(-x)) that Andrew's nonlin() function uses to calculate the function's return value (if the optional deriv parameter has a value of False) corresponds to the formula shown near the top of the Wikipedia page. This nonlin() function takes any number and returns a number between 0 and 1; as Andrew writes, "We use it to convert numbers to probabilities." For example, if you pass a 0 to the function (or look at an S curve graph) you'll see that the function returns .5; if you pass it a 4 or higher it returns a number very close to 1, and if you pass it a -4 or lower it returns a number very close to 0. The np.exp() function used within that expression calculates the exponential of the passed value--or all the values in an array or matrix, returning the same data structure. For example, np.exp(1) returns e, the base of the natural logarithm, which is about 2.718.
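
Putting that description into code, here is a sketch of the nonlin() function: the sigmoid branch is the exact expression quoted above, and the deriv branch uses the standard shortcut that the sigmoid's slope at an output value s is s*(1-s) (which is why the script later passes it values that have already been run through the sigmoid):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        # x here is assumed to already be a sigmoid output,
        # so x * (1 - x) is the slope of the curve at that point
        return x * (1 - x)
    # the sigmoid itself: squashes any number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

print(nonlin(0))    # 0.5, the middle of the S curve
print(nonlin(4))    # about 0.982, very close to 1
print(nonlin(-4))   # about 0.018, very close to 0
```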

Line 29 calls that function and stores the returned matrix in the l1 variable. Reviewing the variable chart, this is the "Second Layer of the Network, otherwise known as the hidden layer." Line 32 then subtracts the l1 matrix from y (the array of answers that it was hoping to get) and stores the difference in l1_error. (Subtracting matrices follows the basic pattern of np.array([[5],[4],[3]]) - np.array([[1],[1],[1]]) = np.array([[4],[3],[2]]).)

Remember how line 23 assigned random values to the weights? After line 32 executes, the l1_error matrix has clues about how to tune those weights, so as the comments in lines 34 and 35 say, the script multiplies how much it missed (l1_error) by the slope of the sigmoid at the values in l1. We find that slope by passing l1 to the same nonlin() function, but this time, setting the deriv parameter to True to get that slope. (See "using the derivatives" in Kurenkov's A 'Brief' History for an explanation of why derivatives played such a big role in helping neural nets move beyond the simple perceptron models.) As Andrew writes, "When we multiply the 'slopes' by the error, we are reducing the error of high confidence predictions" (his emphasis). In other words, we're putting more faith in those high confidence predictions when we create the data that will be used to update the weights.

The script stores the result of multiplying the error by the slope in the l1_delta variable and then uses the dot product of that and l0 (from the variable table: "First Layer of the Network, specified by the input data") to update the weights stored in syn0.
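
Putting those pieces together, the training that this section walks through can be sketched like this (essentially Andrew's 42-line script condensed, using his variable names; the l0.T transpose is there so that the matrix dimensions line up when updating the 3 x 1 syn0):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)   # slope of the sigmoid at an already-sigmoided value
    return 1 / (1 + np.exp(-x))

X = np.array([[0,0,1], [0,1,1], [1,0,1], [1,1,1]])  # input: one row per training case
y = np.array([[0,0,1,1]]).T                         # expected output for each row
np.random.seed(1)
syn0 = 2 * np.random.random((3,1)) - 1              # random starting weights, mean 0

for iter in range(10000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))                 # 1. predict from input and weights
    l1_error = y - l1                             # 2. how far off were we?
    l1_delta = l1_error * nonlin(l1, deriv=True)  #    weight the error by the slope
    syn0 += np.dot(l0.T, l1_delta)                # 3. update the weights

print(l1.round(2))   # close to [[0] [0] [1] [1]]
```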

Per Harald Borgen's Learning How To Code Neural Networks (which begins with an excellent description of the relationship of a neuron's inputs to its weights and goes on to talk about how useful Andrew's "A Neural Network in 11 lines of Python" is) says that backpropagation "essentially means that you look at how wrong the network guessed, and then adjust the networks weights accordingly." When someone on Quora asked Yann LeCun (director of AI research at Facebook and one of the Three Kings of Deep Learning) "Which is your favorite Machine Learning Algorithm?" his answer was a single eight-letter word: "backprop." Backpropagation is that important to why neural nets have become so fundamental in so many modern computer applications, so the updating of syn0 in line 39 is crucial here.

And that's it for the neural net training code. After the first iteration, the weighting values in syn0 will be a bit less random, and after 9,999 more iterations, they'll be a lot closer to where you want them. I found that adding the following lines after line 29 gave me a better idea of what was happening in the l1 variable at the beginning and end of the script's execution:

    if (iter < 4 or iter > 9997):
        print("np.dot(l0,syn0) at iteration " + str(iter) + ": " + str(np.dot(l0,syn0)))
        print("l1 = " + str(l1))

(One note for people using Python 3, like I did: in addition to adding the parentheses in calls to the print function, the main for loop had to say just "range" instead of "xrange". More on this at stackoverflow.)

These new lines showed that after the second iteration, l1 had these values, rounded to two decimal places here: [[ 0.26] [ 0.36] [ 0.23] [ 0.32]]. As Andrew's output shows, at the very end, l1 equals [[ 0.00966449] [ 0.00786506] [ 0.99358898] [ 0.99211957]], so it got a lot closer to the [0,0,1,1] that it was shooting for. How can you make it get even closer? By increasing the iteration count to be greater than 10,000.

For some real fun, I added the following after the script's last line, because if you're going to train a neural net on some data, why not then try the trained network (that is, the set of tuned weights) on some other data to see how well it performs? After all, Andrew does write "All of the learning is stored in the syn0 matrix."

X1 = np.array([ [0,1,1], [1,1,0], [1,0,1],[1,1,1] ])  
x1prediction = nonlin(np.dot(X1,syn0))

The first two rows of my new input are different from those in the training data. The x1prediction variable ended up as [[ 0.00786466] [ 0.9999225 ] [ 0.99358931] [ 0.99211997]], which was great to see. Rounded, these are 0, 1, 1, and 1, so the neural net knew that for those first two rows of data--which it hadn't seen before--the output should be the first value from each row.

Everything I describe here is from part 1 of Andrew's exposition, "A Tiny Toy Network." Part 2, "A Slightly Harder Problem," has a script that is eight lines longer (four lines if you don't count white space and comments), and I plan to dig into that next, because among other things, it has a more explicit demo of backpropagation.

Please add any comments to this Google+ post.

Image courtesy of Wikipedia.

13 November 2016

Pulling RDF out of MySQL

With a command line option and a very short stylesheet.

MySQL and RDF logos

When I wrote the blog posting My SQL quick reference last month, I showed how you can pass an SQL query to MySQL from the operating system command line when starting up MySQL, and also how adding a -B switch requests a tab-separated version of the data. I did not mention that -X requests it in XML, and that this XML is simple enough that a fifteen-line XSLT 1.0 stylesheet can convert any such output to RDF.

I've written before about how tools like the open source D2RQ and Capsenta's Ultrawrap provide middleware layers that let you send SPARQL queries to relational databases--and to combinations of relational databases from different vendors, which is where the real fun begins. This command line stylesheet trick gives you a simpler, more lightweight way to pull the relational data you want into an RDF file where you can use it with SPARQL or any other RDF tool.

If you have MySQL and xsltproc installed, you can do it all with a single command at the operating system prompt:

mysql -u someuser --password=someuserpw -X -e 'USE employees; SELECT * FROM employees LIMIT 5' | xsltproc mysql2ttl.xsl -

(Two notes about that command line: 1. don't miss that hyphen at the very end, which tells xsltproc to read from standard in. 2. I added the LIMIT part for faster testing because the employees table has 30,024 rows. To come up with that number of 30,024, I had to look at my last blog entry to remember how to count the table's rows, so writing out that quick reference has already paid off for me.) The XML returned by MySQL looks like this, with data from subsequent rows following a similar pattern:

  <resultset statement="SELECT * FROM employees LIMIT 5">
  <row>
	<field name="emp_no">10001</field>
	<field name="first_name">Georgi</field>
	<field name="last_name">Facello</field>
	<field name="birth_date">1953-09-02</field>
	<field name="gender">M</field>
	<field name="hire_date">1986-06-26</field>
	<field name="department">Development</field>
  </row>

I thought the inclusion of the query as an attribute of the resultset element was a nice touch. The following XSLT stylesheet converts any such XML to Turtle RDF; you'll want to adjust the prefix declarations to use URIs more appropriate to your data:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="resultset">
  @prefix v: &lt;http://learningsparql.com/ns/myVocabURI/> . 
  @prefix d: &lt;http://learningsparql.com/ns/myDataURI/> . 
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="row">
d:<xsl:value-of select="count(preceding-sibling::row) + 1"/> 
<xsl:apply-templates/> . 
</xsl:template>

<xsl:template match="field">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>" ;
</xsl:template>

</xsl:stylesheet>


The result includes some extra blank lines that I could suppress with xsl:text elements wrapping certain bits of the stylesheet, but a Turtle parser doesn't care, so neither do I:

d:1 
      v:emp_no "10001" ;
      v:first_name "Georgi" ;
      v:last_name "Facello" ;
      v:birth_date "1953-09-02" ;
      v:gender "M" ;
      v:hire_date "1986-06-26" ;
      v:department "Development" ;
 . 

You can customize the stylesheet for specific input data. For example, the URIs in your triple subjects could build on an ID value selected from the data instead of building on the position of the XML row element, as I did. As another customization, instead of outputting all triple objects as strings, you could insert this template rule into the XSLT stylesheet to output the two date fields typed as actual dates, as long as you remembered to also add an xsd prefix declaration at the top of the stylesheet:

    <xsl:template match="field[@name='birth_date' or @name='hire_date']">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>"^^xsd:date ;
    </xsl:template>

Or, you could leave the XSLT stylesheet in its generic form and convert the data types using a SPARQL query further down your processing pipeline with something like this:

PREFIX v: <http://learningsparql.com/ns/myVocabURI/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?row v:birth_date ?bdate ;
       v:hire_date ?hdate . 
}
WHERE {
  ?row v:birth_date ?bdateString ;
       v:hire_date ?hdateString . 
  BIND(xsd:date(?bdateString) AS ?bdate)
  BIND(xsd:date(?hdateString) AS ?hdate)
}

However you choose to do it, the nice thing is that you have lots of options for grabbing the massive amounts of data stored in the many MySQL databases out there and then using that data as triples with a variety of lightweight, open source software.

Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets



    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists