13 November 2016

Pulling RDF out of MySQL

With a command line option and a very short stylesheet.

MySQL and RDF logos

When I wrote the blog posting My SQL quick reference last month, I showed how you can pass an SQL query to MySQL from the operating system command line when starting up MySQL, and also how adding a -B switch requests a tab-separated version of the data. I did not mention that -X requests it in XML, and that this XML is simple enough that a fifteen-line XSLT 1.0 spreadsheet can convert any such output to RDF.

I've written before about how tools like the open source D2RQ and Capsenta's Ultrawrap provide middleware layers that let you send SPARQL queries to relational databases--and to combinations of relational databases from different vendors, which is where the real fun begins. This command line stylesheet trick gives you a simpler, more lightweight way to pull the relational data you want into an RDF file where you can use it with SPARQL or any other RDF tool.

If you have MySQL and xsltproc installed, you can do it all with a single command at the operating system prompt:

mysql -u someuser --password=someuserpw -X -e 'USE employees; SELECT * FROM employees LIMIT 5' | xsltproc mysql2ttl.xsl -

(Two notes about that command line: 1. don't miss that hyphen at the very end, which tells xsltproc to read from standard in. 2. I added the LIMIT part for faster testing because the employees table has 30,024 rows. To come up with that number of 30,024, I had to look at my last blog entry to remember how to count the table's rows, so writing out that quick reference has already paid off for me.) The XML returned by MySQL looks like this, with data from subsequent rows following a similar pattern:

  <resultset statement="SELECT * FROM employees LIMIT 5"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <row>
	<field name="emp_no">10001</field>
	<field name="first_name">Georgi</field>
	<field name="last_name">Facello</field>
	<field name="birth_date">1953-09-02</field>
	<field name="gender">M</field>
	<field name="hire_date">1986-06-26</field>
	<field name="department">Development</field>
  </row>

I thought the inclusion of the query as an attribute of the resultset attribute was a nice touch. The following XSLT stylesheet converts any such XML to Turtle RDF; you'll want to adjust the prefix declarations to use URIs more appropriate to your data:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="resultset">
  @prefix v: &lt;http://learningsparql.com/ns/myVocabURI/> . 
  @prefix d: &lt;http://learningsparql.com/ns/myDataURI/> . 
      <xsl:apply-templates/>
    </xsl:template>

        <xsl:template match="row">
d:<xsl:value-of select="count(preceding-sibling::row) + 1"/> 
          <xsl:apply-templates/> . 
        </xsl:template>

    <xsl:template match="field">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>" ;
    </xsl:template>

</xsl:stylesheet>

The result includes some extra blank lines that I could suppress with xsl:text elements wrapping certain bits of the stylesheet, but a Turtle parser doesn't care, so neither do I:

  d:1
	
      v:emp_no "10001" ;
    
	
      v:first_name "Georgi" ;
    
	
      v:last_name "Facello" ;
    
	
      v:birth_date "1953-09-02" ;
    
	
      v:gender "M" ;
    
	
      v:hire_date "1986-06-26" ;
    
	
      v:department "Development" ;
    
   . 

You can customize the stylesheet for specific input data. For example, the URIs in your triple subjects could build on an ID value selected from the data instead of building on the position of the XML row element, as I did. As another customization, instead outputting all triple objects as strings, you could insert this template rule into the XSLT stylesheet to output the two date fields typed as actual dates, as long as you remembered to also add an xsd prefix declaration at the top of the spreadsheet:

    <xsl:template match="field[@name='birth_date' or @name='birth_date']">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>"^^xsd:date ;
    </xsl:template>

Or, you could leave the XSLT stylesheet in its generic form and convert the data types using a SPARQL query further down your processing pipeline with something like this:

PREFIX v: <http://learningsparql.com/ns/myVocabURI/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?row v:birth_date ?bdate ;
       v:hire_date ?hdate . 
}
WHERE {
  ?row v:birth_date ?bdateString ;
  v:hire_date ?hdateString . 
  BIND(xsd:date(?bdateString) AS ?bdate)
  BIND(xsd:date(?hdateString) AS ?hdate)
}

However you choose to do it, the nice thing is that you have lots of options for grabbing the massive amounts of data stored in the many MySQL databases out there and then using that data as triples with a variety of lightweight, open source software.


Please add any comments to this Google+ post.

30 October 2016

My SQL quick reference

Pun intended.

SQL graffiti

I sometimes go many months with no need to use SQL, so over the years I've developed my own quick reference to remind me how to do basic tasks when necessary. Most SQL quick reference sheets out there try to pack as much different syntax as they can in a small space, but mine focuses on what the basic tasks are and how to do them. I hope that someone finds it useful.

Most of my SQL experience has been with MySQL, and I separated what I believe are the standard SQL parts below from the MySQL-specific ones. Corrections welcome. If you really want to know where SQL implementations differ from the standard, Comparison of different SQL implementations is an excellent, detailed reference on what's different from one implementation to another.

I tested all the SELECT commands shown with the MySQL employee sample database that I downloaded from github.

Standard SQL

Enter these at the SQL command line. I don't think semicolons are necessary after every one of these commands, but I find it simplest to just always add them. SQL is not case-sensitive about keywords, and I tend to enter them in lower-case, but I'm showing them in the conventional upper-case here because it makes it easier to distinguish them from database, table, and column names.

quit to return to the operating system command linequit;
list available databases # comments start with a pound sign
SHOW DATABASES;
select the database named employees to useUSE employees;
create a new database called someDatabase # database and table names are case-sensitive
CREATE DATABASE someDatabase;
delete database someDatabaseDROP DATABASE someDatabase;
create a table called tablename1, with the two columns shown, for the database currently in use # lots of other options available for how you specify the columns...
CREATE TABLE tablename1 (`fname` VARCHAR(20),`lname` VARCHAR(20));
insert a row of data into the table tablename1INSERT INTO tablename1 VALUES('Richard','Mutt');
delete the table tablename1DROP TABLE tablename1;
show all data in the departments tableSELECT * FROM departments;
show just the dept_no and dept_name columns from the departments tableSELECT dept_no, dept_name FROM departments;
just get the first 5 rows from table departmentsSELECT * FROM departments LIMIT 5;
show dept_name column values in table departments where dept_name has the substring "en"SELECT dept_name FROM departments WHERE dept_name LIKE "%en%";
show all columns from table departments where the dept_name column starts with an "S" and has exactly 4 characters after itSELECT * FROM departments WHERE dept_name LIKE "S____";
Select title values from the titles table but don't show duplicatesSELECT DISTINCT title FROM titles;
Same as above, but sorted (case-sensitive) by the title valuesSELECT DISTINCT title FROM titles ORDER BY title;
Count the rows in the departments tableSELECT count(*) FROM departments;
Count the rows in the departments table that have "en" as a substring of the dept_name valueSELECT count(*) FROM departments WHERE dept_name LIKE "%en%";
In tablename1, change the fname value to "John" for all rows that have an lname value of "Mutt"UPDATE tablename1 SET fname="John" WHERE lname="Mutt";
delete all rows from the tablename1 tableDELETE FROM tablename1;
delete rows from the tablename1 table where the lname value begins with "M"DELETE FROM tablename1 WHERE lname like "M%";

MySQL-specific SQL prompt commands

list the tables in the currently selected databaseSHOW TABLES;
Describe the columns in table departments (handy before doing SELECT statements to see column names and types)DESCRIBE departments;
run the SQL commands stored in the file myscript.sqlSOURCE myscript.sql;
Load a local csv file (enabling this may require --local-infile with the mysql startup command or the adjustment of a config file) # Enter the following as one command
LOAD DATA LOCAL INFILE '/some/path/names.csv' INTO TABLE tablename1 COLUMNS TERMINATED BY ',';
Create new user jane with password janepw, then grant her access to everythingCREATE USER 'jane' IDENTIFIED BY 'janepw';
GRANT ALL ON *.* TO 'jane';

Handy MySQL commands from the operating system prompt

There are often multiple ways to execute some of the following tasks, but these work for me. Treat all as single-line commands.

start up MySQL with a single command (which includes the plain text password, which is not a good idea for any kind of production system)mysql -u someuser --password=somepassword
Run a script of SQL commands from the operating system command line and then return to the command line; output of the SELECT statements will be tab-delimitedmysql -u someuser --password=somepassword -t < employees.sql
create a file of SQL commands to recreate the database employees (with the employees demo database, this created a 168MB file)mysqldump -u someuser --password=somepassword employees > makeemployees.sql
Run a SQL command (or more than one using a semicolon to separate them) from the operating system promptmysql -u someuser --password=somepassword -e 'USE employees; SELECT * FROM departments'
Same as above, but getting output as tab-separated values--only difference is to add -B for "batch" modemysql -u someuser --password=somepassword -B -e 'USE employees; SELECT * FROM departments'

Other handy tricks, as covered in the MySQL documentation

The MySQL documentation's Examples of Common Queries covers many additional useful tasks:

  • The Maximum Value for a Column
  • The Row Holding the Maximum of a Certain Column
  • Maximum of Column per Group
  • The Rows Holding the Group-wise Maximum of a Certain Column
  • Using User-Defined Variables
  • Using Foreign Keys
  • Searching on Two Keys
  • Calculating Visits Per Day
  • Using AUTO_INCREMENT

Note that my URL for the link to this information doesn't include a version number, but gets redirected by mysql.com to the URL for the latest release's version of this documentation, as documentation URLs should do.

CC BY-NC photo by duncan


Please add any comments to this Google+ post.

25 September 2016

Semantic web semantics vs. vector embedding machine learning semantics

It's all semantics.

Home and semantics

When I presented "intro to the semantic web" slides in TopQuadrant product training classes, I described how people talking about "semantics" in the context of semantic web technology mean something specific, but that other claims for computerized semantics (especially, in many cases, "semantic search") were often vague attempts to use the word as a marketing term. Since joining CCRi, though, I've learned plenty about machine learning applications that use semantics to get real work done (often, "semantic search"), and they can do some great things.

Semantic Web semantics

To review the semantic web sense of "semantics": RDF gives us a way to state facts using {subject, predicate, object} triples. RDFS and OWL give us vocabularies to describe the resources referenced in these triples, and the descriptions can record semantics about those resources that let us get more out of the data. Of course, the descriptions themselves are triples, letting us say things like {ex:Employee rdfs:subClassOf ex:Person}, which tells us that any instance of the ex:Employee class is also an instance of ex:Person.

That example indicates some of the semantics of what it means to be an employee, but people familiar with object-oriented development take that ability for granted. OWL can take the recording of semantics well beyond that. For example, because properties themselves are resources, when I say {dm:locatedIn rdf:type owl:TransitiveProperty}, I'm encoding some of the meaning of the dm:locatedIn property in a machine-readable way: I'm saying that it's transitive, so that if {x:resource1 dm:locatedIn x:resource2} and {x:resource2 dm:locatedIn x:resource3}, we can infer that {x:resource1 dm:locatedIn x:resource3}.

A tool that understands what owl:TransitiveProperty means will let me get more out of my data. My blog entry Trying Out Blazegraph from earlier this year showed how I took advantage of OWL metadata to query for all the furniture in a particular building even though the dataset had no explicit data about any resources being furniture or any resources being in that building other than some rooms.

This is all built on very explicit semantics: we use triples to say things about resources so that people and applications can understand and do more with those resources. The interesting semantics work in the machine learning world is more about inferring semantic relationships.

Semantics and embedded vector spaces

(All suggestions for corrections to this section are welcome.) Machine learning is essentially the use of data-driven algorithms that perform better as they have more data to work with, "learning" from this additional data. For example, Netflix can make better recommendations to you now than they could ten years ago because the additional accumulated data about what you like to watch and what other people with similar tastes have also watched gives Netflix more to go on when making these recommendations.

The world of distributional semantics shows that analysis of what words appear with what other words, in what order, can tell us a lot about these words and their relationships--if you analyze enough text. Let's say we begin by using a neural network to assign a vector of numbers to each word. This creates a collection of vectors known as a "vector space"; adding vectors to this space is known as "embedding" them. Performing linear algebra on these vectors can provide insight about the relationships between the words that the vectors represent. In the most popular example, the mathematical relationship between the vectors for the words "king" and "queen" is very similar to the relationship between the vectors for "man" and "woman". This diagram from the TensorFlow tutorial Vector Representations of Words shows that other identified relationships include grammatical and geographical ones:

TensorFlow diagram about inferred word relationships

The popular open source word2vec implementation of this developed at Google includes a script that lets you do analogy queries. (The TensorFlow tutorial mentioned above uses word2vec; another great way to get hands-on experience with word vectors is Radim Rehurek's gensim tutorial.) I installed word2vec on an Ubuntu machine easily enough, started up the demo-analogy.sh script, and it prompted me to enter three words. I entered "king queen father" to ask it "king is to queen as father is to what?" It gave me a list of 40 word-score pairs with these at the top:

     mother    0.698822
    husband    0.553576
     sister    0.552917
        her    0.548955
grandmother    0.529910
       wife    0.526212
    parents    0.512507
   daughter    0.509455

Entering "london england berlin" produced a list that began with this:

   germany     0.522487
   prussia     0.482481
   austria     0.447184
    saxony     0.435668
   bohemia     0.429096
westphalia     0.407746
     italy     0.406134

I entered "run ran walk" in the hope of seeing "walked" but got a list that began like this:

   hooray      0.446358
    rides      0.445045
ninotchka      0.444158
searchers      0.442369
   destry      0.435961

It did a pretty good job with most of these, but obviously not a great job throughout. The past tense of walk is definitely not "hooray", but these inferences were based on a training data set of 96 megabytes, which isn't very large. A Google search on phrases from the text8 input file included with word2vec for this demo shows that it's probably part of a 2006 Wikipedia dump used for text compression tests and other processes that need a non-trivial text collection. More serious applications of word2vec often read much larger Wikipedia subsets as training data, and of course you're not limited to using Wikipedia data: the exploration of other datasets that use a variety of spoken languages and scripts is one of the most interesting aspects of these early days of the use of this technology.

The one-to-one relationships shown in the TensorFlow diagrams above make the inferred relationships look more magical than they are. As you can see from the results of my queries, word2vec finds the words that are closest to what you asked for and lists them with their scores, and you may have several with good scores or none. Your application can just pick the result with the highest score, but you might want to first set an acceptable cutoff value so that you don't take the "hooray" inference too seriously.

On the other hand, if you just pick the single result with the highest score, you might miss some good inferences, because while Berlin is the capital of Germany, it was also the capital of Prussia for over 200 years, so I was happy to see that get the second-highest score there--although, if we put too much faith in a score of 0.482481 (or even of 0.522487) we're going to get some "king queen father" answers that we don't want. Again, a bigger training data set would help there.

If you look at the demo-analogy.sh script itself, you'll see various parameters that you can tweak when creating the vector data. The use of larger training sets is not the only thing that can improve the results above, and machine learning expertise means not only getting to know the algorithms that are available but also learning how to tune parameters like these.

The script is simple enough that I saw that I could easily revise it to make it read some other file instead of the text8 one included with it. I set it to read the Summa Thelogica, in which St. Thomas Aquinas laid out all the theology of the Catholic Church, as I made grand plans for Big Question analogy queries like "man is to soul as God is to what?" My eventual query results were a lot more like the "run ran walk hooray" results above than anything sensible, with low scores for what it did find. With my text file of the complete Summa Thelogica weighing in at 17 megabytes, I was clearly hoping for too much from it. I do have ideas for other input to try and I encourage you to try it for yourself.

An especially exciting thing about the use of embedding vectors to identify potentially previously unknown relationships is that it's not limited to use on text. You can use it with images, video, audio, and any other machine readable data, and at CCRi, we have. (I'm using the marketing "we" here; if you've read this far you're familiar with all of my hands-on experience with embedding vectors.)

Embedding vector space semantics and semantic web semantics

Can there be any connection between these two "semantic" technologies? RDF-based models are designed to take advantage of explicit semantics, and a program like word2vec can infer semantic relationships and make them explicit. Modifications to the scripts included with word2vec could output OWL or SKOS triples that enumerate relationships between identified resources, making a nice contribution to the many systems using SKOS taxonomies and thesauruses. Another possibility is that if you can train a machine learning model with instances (for example, labeled pictures of dogs and cats) that are identified with declared classes in an ontology, then running the model on new data can do classifications that take advantage of the ontology--for example, after identifying new cat and dog pictures, a query for mammals can find them.

Going the other way, machine learning systems designed around unstructured text can often do even more with structured text, where it's easier to find what you want, and I've learned at CCRi that RDF (if not RDFS or OWL) is much more popular among such applications than I realized. Large taxonomies such as those of the Library of Congress, DBpedia, and Wikidata have lots of synonyms, explicit subclass relationships, and sometimes even definitions, and they can contribute a great deal to these applications.

A well-known success story in combining the two technologies is IBM's Watson. The paper Semantic Technologies in IBM Watson describes the technologies used in Watson and how these technologies formed the basis of a seminar course given at Columbia University; distributional semantics, semantic web technology, and DBpedia all play a role. Frederick Giasson and Mike Bergman's Cognonto also looks like an interesting project to connect machine learning to large collections of triples. I'm sure that other interesting combinations are happening around the world, especially considering the amount of open source software available in both areas.


Please add any comments to this Google+ post.

"Learning SPARQL" cover

Recent Tweets

    Archives

    Feeds

    [What are these?]
    Atom 1.0 (summarized entries)
    Atom 1.0 (full entries)
    RSS 1.0
    RSS 2.0
    Gawker Artists