25 September 2016

Semantic web semantics vs. vector embedding machine learning semantics

It's all semantics.

Home and semantics

When I presented "intro to the semantic web" slides in TopQuadrant product training classes, I described how people talking about "semantics" in the context of semantic web technology mean something specific, but that other claims for computerized semantics (especially, in many cases, "semantic search") were often vague attempts to use the word as a marketing term. Since joining CCRi, though, I've learned plenty about machine learning applications that use semantics to get real work done (often, "semantic search"), and they can do some great things.

Semantic Web semantics

To review the semantic web sense of "semantics": RDF gives us a way to state facts using {subject, predicate, object} triples. RDFS and OWL give us vocabularies to describe the resources referenced in these triples, and the descriptions can record semantics about those resources that let us get more out of the data. Of course, the descriptions themselves are triples, letting us say things like {ex:Employee rdfs:subClassOf ex:Person}, which tells us that any instance of the ex:Employee class is also an instance of ex:Person.

That example indicates some of the semantics of what it means to be an employee, but people familiar with object-oriented development take that ability for granted. OWL can take the recording of semantics well beyond that. For example, because properties themselves are resources, when I say {dm:locatedIn rdf:type owl:TransitiveProperty}, I'm encoding some of the meaning of the dm:locatedIn property in a machine-readable way: I'm saying that it's transitive, so that if {x:resource1 dm:locatedIn x:resource2} and {x:resource2 dm:locatedIn x:resource3}, we can infer that {x:resource1 dm:locatedIn x:resource3}.
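A tool that understands owl:TransitiveProperty is essentially computing a transitive closure over the relevant triples. As a rough sketch of that inference in plain Python (the triples and resource names are the made-up ones from above, and this is not any particular RDF library's implementation):

```python
# Toy illustration of the inference that owl:TransitiveProperty licenses:
# repeatedly join dm:locatedIn triples until no new ones appear.
triples = {
    ("x:resource1", "dm:locatedIn", "x:resource2"),
    ("x:resource2", "dm:locatedIn", "x:resource3"),
}

def transitive_closure(triples, predicate):
    """Add every triple implied by treating predicate as transitive."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(inferred):
            for (s2, p2, o2) in list(inferred):
                if p1 == p2 == predicate and o1 == s2:
                    new = (s1, predicate, o2)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred

closed = transitive_closure(triples, "dm:locatedIn")
# closed now also contains {x:resource1 dm:locatedIn x:resource3}
```

A real reasoner does this (far more efficiently) for every property declared transitive, alongside the other OWL inference rules.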

A tool that understands what owl:TransitiveProperty means will let me get more out of my data. My blog entry Trying Out Blazegraph from earlier this year showed how I took advantage of OWL metadata to query for all the furniture in a particular building even though the dataset had no explicit data about any resources being furniture or any resources being in that building other than some rooms.

This is all built on very explicit semantics: we use triples to say things about resources so that people and applications can understand and do more with those resources. The interesting semantics work in the machine learning world is more about inferring semantic relationships.

Semantics and embedded vector spaces

(All suggestions for corrections to this section are welcome.) Machine learning is essentially the use of data-driven algorithms that perform better as they have more data to work with, "learning" from this additional data. For example, Netflix can make better recommendations to you now than they could ten years ago because the additional accumulated data about what you like to watch and what other people with similar tastes have also watched gives Netflix more to go on when making these recommendations.

The world of distributional semantics shows that analysis of what words appear with what other words, in what order, can tell us a lot about these words and their relationships--if you analyze enough text. Let's say we begin by using a neural network to assign a vector of numbers to each word. This creates a collection of vectors known as a "vector space"; adding vectors to this space is known as "embedding" them. Performing linear algebra on these vectors can provide insight about the relationships between the words that the vectors represent. In the most popular example, the mathematical relationship between the vectors for the words "king" and "queen" is very similar to the relationship between the vectors for "man" and "woman". This diagram from the TensorFlow tutorial Vector Representations of Words shows that other identified relationships include grammatical and geographical ones:
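The arithmetic behind such analogies is just vector addition and subtraction plus a similarity measure. Here is a toy sketch in plain Python; the three-dimensional vectors are fabricated for illustration, while real word2vec vectors have hundreds of dimensions learned from a corpus:

```python
import math

# Hand-made toy vectors, fabricated purely for illustration.
vecs = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.9, 0.1, 0.8],
    "man":     [0.1, 0.9, 0.1],
    "woman":   [0.1, 0.1, 0.9],
    "husband": [0.85, 0.85, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "king - man + woman" should land nearest the vector for "queen".
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]
candidates = [w for w in vecs if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, vecs[w]))
# best == "queen"
```

Tools like word2vec do this same nearest-neighbor search, only over vocabularies of hundreds of thousands of learned vectors.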

TensorFlow diagram about inferred word relationships

The popular open source word2vec implementation of this approach, developed at Google, includes a script that lets you do analogy queries. (The TensorFlow tutorial mentioned above uses word2vec; another great way to get hands-on experience with word vectors is Radim Rehurek's gensim tutorial.) I installed word2vec on an Ubuntu machine easily enough, started up the demo-analogy.sh script, and it prompted me to enter three words. I entered "king queen father" to ask it "king is to queen as father is to what?" It gave me a list of 40 word-score pairs with these at the top:

     mother    0.698822
    husband    0.553576
     sister    0.552917
        her    0.548955
grandmother    0.529910
       wife    0.526212
    parents    0.512507
   daughter    0.509455

Entering "london england berlin" produced a list that began with this:

   germany     0.522487
   prussia     0.482481
   austria     0.447184
    saxony     0.435668
   bohemia     0.429096
westphalia     0.407746
     italy     0.406134

I entered "run ran walk" in the hope of seeing "walked" but got a list that began like this:

   hooray      0.446358
    rides      0.445045
ninotchka      0.444158
searchers      0.442369
   destry      0.435961

It did a pretty good job with most of these, but obviously not a great job throughout. The past tense of walk is definitely not "hooray", but these inferences were based on a training data set of 96 megabytes, which isn't very large. A Google search on phrases from the text8 input file included with word2vec for this demo shows that it's probably part of a 2006 Wikipedia dump used for text compression tests and other processes that need a non-trivial text collection. More serious applications of word2vec often read much larger Wikipedia subsets as training data, and of course you're not limited to using Wikipedia data: the exploration of other datasets that use a variety of spoken languages and scripts is one of the most interesting aspects of these early days of the use of this technology.

The one-to-one relationships shown in the TensorFlow diagrams above make the inferred relationships look more magical than they are. As you can see from the results of my queries, word2vec finds the words that are closest to what you asked for and lists them with their scores, and you may have several with good scores or none. Your application can just pick the result with the highest score, but you might want to first set an acceptable cutoff value so that you don't take the "hooray" inference too seriously.
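Applying such a cutoff takes only a line or two. A sketch using the scores from my "run ran walk" query (the 0.5 cutoff value is an arbitrary choice for illustration):

```python
# Filter analogy results by a minimum score before trusting them.
results = [("hooray", 0.446358), ("rides", 0.445045),
           ("ninotchka", 0.444158), ("searchers", 0.442369)]
CUTOFF = 0.5  # arbitrary threshold picked for this example
accepted = [(word, score) for word, score in results if score >= CUTOFF]
# With this cutoff, none of the "run ran walk" answers survive,
# which is the right outcome for that query.
```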

On the other hand, if you just pick the single result with the highest score, you might miss some good inferences, because while Berlin is the capital of Germany, it was also the capital of Prussia for over 200 years, so I was happy to see that get the second-highest score there--although, if we put too much faith in a score of 0.482481 (or even of 0.522487) we're going to get some "king queen father" answers that we don't want. Again, a bigger training data set would help there.

If you look at the demo-analogy.sh script itself, you'll see various parameters that you can tweak when creating the vector data. The use of larger training sets is not the only thing that can improve the results above, and machine learning expertise means not only getting to know the algorithms that are available but also learning how to tune parameters like these.

The script is simple enough that I could easily revise it to read some other file instead of the text8 one included with it. I set it to read the Summa Theologica, in which St. Thomas Aquinas laid out the theology of the Catholic Church, as I made grand plans for Big Question analogy queries like "man is to soul as God is to what?" My eventual query results were a lot more like the "run ran walk hooray" results above than anything sensible, with low scores for what it did find. With my text file of the complete Summa Theologica weighing in at 17 megabytes, I was clearly hoping for too much from it. I do have ideas for other input to try, and I encourage you to try it for yourself.

An especially exciting thing about the use of embedding vectors to identify potentially previously unknown relationships is that it's not limited to use on text. You can use it with images, video, audio, and any other machine readable data, and at CCRi, we have. (I'm using the marketing "we" here; if you've read this far you're familiar with all of my hands-on experience with embedding vectors.)

Embedding vector space semantics and semantic web semantics

Can there be any connection between these two "semantic" technologies? RDF-based models are designed to take advantage of explicit semantics, and a program like word2vec can infer semantic relationships and make them explicit. Modifications to the scripts included with word2vec could output OWL or SKOS triples that enumerate relationships between identified resources, making a nice contribution to the many systems using SKOS taxonomies and thesauruses. Another possibility is that if you can train a machine learning model with instances (for example, labeled pictures of dogs and cats) that are identified with declared classes in an ontology, then running the model on new data can do classifications that take advantage of the ontology--for example, after identifying new cat and dog pictures, a query for mammals can find them.
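Such a script might look something like this sketch, which turns scored word pairs into skos:related triples. The ex: namespace, the cutoff, and the choice of skos:related over other SKOS properties are all my own invention here, not anything word2vec ships with:

```python
# Emit skos:related Turtle triples for word pairs scoring above a cutoff.
# The ex: namespace and the 0.5 cutoff are invented for this sketch.
pairs = [("berlin", "germany", 0.522487),
         ("berlin", "prussia", 0.482481)]

def to_turtle(pairs, cutoff=0.5):
    lines = [
        "@prefix ex: <http://example.org/term/> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
    ]
    for a, b, score in pairs:
        if score >= cutoff:
            lines.append(f"ex:{a} skos:related ex:{b} .")
    return "\n".join(lines)

turtle = to_turtle(pairs)
# Only the berlin/germany pair clears the cutoff and becomes a triple.
```

A real version would need to mint better URIs (or look them up in an existing taxonomy), but even this much would produce triples that a SKOS-aware tool could load directly.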

Going the other way, machine learning systems designed around unstructured text can often do even more with structured text, where it's easier to find what you want, and I've learned at CCRi that RDF (if not RDFS or OWL) is much more popular among such applications than I realized. Large taxonomies such as those of the Library of Congress, DBpedia, and Wikidata have lots of synonyms, explicit subclass relationships, and sometimes even definitions, and they can contribute a great deal to these applications.

A well-known success story in combining the two technologies is IBM's Watson. The paper Semantic Technologies in IBM Watson describes the technologies used in Watson and how these technologies formed the basis of a seminar course given at Columbia University; distributional semantics, semantic web technology, and DBpedia all play a role. Frederick Giasson and Mike Bergman's Cognonto also looks like an interesting project to connect machine learning to large collections of triples. I'm sure that other interesting combinations are happening around the world, especially considering the amount of open source software available in both areas.

Please add any comments to this Google+ post.

28 August 2016

Converting between MIDI and RDF: readable MIDI and more fun with RDF

Listen to my fun!

MIDI and RDF logos

When I first heard about Albert Meroño-Peñuela and Rinke Hoekstra's midi2rdf project, which converts back and forth between the venerable Musical Instrument Digital Interface binary format and RDF, I thought it seemed like an interesting academic exercise. Thinking about it more, I realized that it makes a great contribution both to the MIDI world and to musical RDF geeks.

MIDI has been the standard protocol for integrating synthesizers and related musical equipment together since the 1980s. I've only recently thrown out a book with the MIDI specs that I've owned for nearly that long because, as with so many other technical specifications, they're now available online.

Meroño-Peñuela and Hoekstra's midi2rdf lets you convert between MIDI files and Turtle RDF. I love the title of their ESWC 2016 paper on it, "The Song Remains the Same" (pdf)--I was pretty young when Led Zeppelin's Houses of the Holy album came out, but I remember it vividly. The song remains the same because the project's midi2rdf and rdf2midi scripts provide lossless round-trip conversion between the two formats, which makes the package a very valuable tool: it gives us a text file serialization of MIDI based on a published standard, making MIDI downright readable. Looking at these RDF files and spending no serious time with the MIDI spec, I worked out which resources and properties were doing what and used this to create my own MIDI files.

As a somewhat musical RDF geek, I had a lot of fun with this. I wrote Python scripts to generate Turtle files of different kinds of random music, then converted them to MIDI so that I could listen to them. (You can find it all on GitHub.) The use of random functions means that running the same script several times creates different variations on the music. Below you will find links to MP3 versions of what I called fakeBebop and two versions of some whole-tone piano music that I generated, along with the MIDI and RDF files that go with them.

Each MIDI file (and its RDF equivalent) starts with some setup data to identify information such as the sounds that it will play and the tempo. Instead of learning all those setup details for my program to generate, I used the excellent Linux/Mac/Windows open source MuseScore music scoring program to generate a MIDI file with just a few notes of whatever instruments I wanted and then converted that to RDF. (This ability to convert in both directions is an important part of the value of the midi2rdf package.) Then, keeping the setup part of that RDF, I deleted the actual notes and had my script copy the setup part and then generate new notes that it appended to it.

In RDF terms, the note generation meant two things: adding a pair of mid:NoteOnEvent resources (one to start playing a note and one to stop) and then adding references to those events onto a musical track listing the events to execute. So, for example, the first mid:NoteOnEvent in the following pair defines the start of a note at pitch 69, which is A above middle C on a piano. The mid:channel of 0 had been defined in the setup part, and the mid:tick value specifies how long the note will play until the next mid:NoteOnEvent. (I was too lazy to look up how the mid:tick values relate to elapsed time and picked some through trial and error.) The mid:velocity values essentially turn the note on and off.

p2:event0104 a mid:NoteOnEvent ;
    mid:channel 0 ;
    mid:pitch 69 ;
    mid:tick 400 ;
    mid:velocity 80 .

p2:event0105 a mid:NoteOnEvent ;
    mid:channel 0 ;
    mid:pitch 69 ;
    mid:tick 500 ;
    mid:velocity 0 .

As my script outputs noteOn events after the setup part, it appends references to them onto a string in memory that begins like this:

mid:pianoHeadertrack01 a mid:Track ;
    mid:hasEvent p2:event0000,
        # etc. until you finish with a period

After outputting all the mid:NoteOnEvent events, the script outputs this string. (While the triples in this resource are technically unordered, rdf2midi seemed to assume that the event names are "event" followed by a zero-padded number. When an early version of my first script didn't do this, the notes got played in an odd order. Maybe it's just playing them in alphabetic sort order.)
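A simplified sketch of that note-generating step, based on the example triples above (my actual scripts differ in the details, but the zero-padded event numbering and the velocity on/off convention are the important parts):

```python
# Generate an on/off pair of mid:NoteOnEvent resources as Turtle text,
# plus the event names to append to the track's mid:hasEvent list.
# Simplified sketch; tick and velocity values follow the example triples.
def note_events(start_num, pitch, on_tick=400, off_tick=500):
    on_name = f"p2:event{start_num:04d}"       # zero-padded, as rdf2midi expects
    off_name = f"p2:event{start_num + 1:04d}"
    template = ("{name} a mid:NoteOnEvent ;\n"
                "    mid:channel 0 ;\n"
                "    mid:pitch {pitch} ;\n"
                "    mid:tick {tick} ;\n"
                "    mid:velocity {velocity} .\n")
    on = template.format(name=on_name, pitch=pitch,
                         tick=on_tick, velocity=80)   # velocity 80: note on
    off = template.format(name=off_name, pitch=pitch,
                          tick=off_tick, velocity=0)  # velocity 0: note off
    return on + "\n" + off, [on_name, off_name]

turtle, refs = note_events(104, 69)  # pitch 69 is A above middle C
```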

That's all for just one track. My fakeBebop script does this for three tracks: a bass track playing fairly random quarter notes in the range of an upright bass, a muted trumpet track playing fairly random triplet-feel eighth notes (sometimes with a rest substituted), and a percussion track repeating a standard bebop ride cymbal pattern. You can see some generated Turtle RDF at fakeBebop.ttl, see the MIDI file generated from the Turtle file by rdf2midi at fakeBebop.mid, and listen to what it sounds like at fakeBebop.mp3.

By "fairly random" I mean a random note within 5 half steps (a major third) of the previous note. Without any melodies beyond this random selection of notes, I think it still sounds a bit beboppy because, as the early bebop pioneers added more complex scales to the simple major and minor scales played by earlier jazz musicians, it all got more chromatic.
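That "fairly random" note choice is a random walk over pitches. A rough sketch of it: the pitch range 28-60 here is my own approximation of an upright bass's range in MIDI note numbers, and the seed just makes the example reproducible:

```python
import random

# Random walk over pitches: each note lands within 5 half steps of the
# previous one, clamped to a range. 28-60 roughly covers an upright bass.
def random_walk_pitches(start, count, low=28, high=60, seed=42):
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    pitches = [start]
    for _ in range(count - 1):
        step = rng.randint(-5, 5)
        pitches.append(max(low, min(high, pitches[-1] + step)))
    return pitches

bass_line = random_walk_pitches(40, 8)  # 8 quarter notes starting at E1's area
```

Each pitch in the result could then be handed to the note-event generation step to become a pair of mid:NoteOnEvent triples.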

I have joked with my brother about how if you quietly play random notes on a piano with both hands using the same whole tone scale, it can sound a bit like Debussy, who was one of the early users of this scale. My wholeTonePianoQuarterNotes.py script follows logic similar to the fakeBebop script but outputs two piano tracks that correspond to a piano player's left and right hands and use the same whole tone scale. You can see some generated Turtle RDF at wholeTonePianoQuarterNotes.ttl, the MIDI file generated from that by rdf2midi at wholeTonePianoQuarterNotes.mid, and hear what it sounds like at wholeTonePianoQuarterNotes.mp3.

Before doing the whole tone piano quarter notes script I did one with random note durations, so it sounds like something from a bit later in the twentieth century. Generated Turtle RDF: wholeTonePiano.ttl; MIDI file generated by rdf2midi: wholeTonePiano.mid; MP3: wholeTonePiano.mp3.

I can think of all kinds of ideas for additional experiments, such as redoing the two piano experiments with the four voices of a string quartet or having the fakeBebop one generate common jazz chord progressions and typical licks over them. (Speaking of string quartets and Debussy, I love that Apple iPad Pro ad that NBC showed so often during the recent Olympics.) It would also be interesting to try some experiments with Black MIDI (or perhaps "Black RDF"!). If I had pursued these ideas, I wouldn't be writing this blog entry right now, because I had to cut myself off at some point.

I recently learned about SuperCollider, an open source Windows/Mac/Linux IDE with its own programming language that several serious electronic music composers use for generating music, and I could easily picture spending all of my free time playing with that. At least midi2rdf's RDF basis gave me the excuse of having a work-related angle as I wrote scripts to generate odd music. Although I was just slapping together some demo code for fun, I do think that midi2rdf's ability to provide lossless round-trip conversion between a popular old binary music format and a readable standardized format has a lot of potential to help people doing music with computers.

Please add any comments to this Google+ post.

31 July 2016

SPARQL in a Jupyter (a.k.a. IPython) notebook

With just a bit of Python to frame it all.

In a recent blog entry for my employer titled GeoMesa analytics in a Jupyter notebook, I wrote

As described on its home page, “The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.” Once you install the open source Jupyter server on your machine, you can create notebooks, share them with others, and learn from notebooks created by others. (You can also learn from others’ notebooks without installing Jupyter locally if those notebooks are hosted on a shared server.)

An animated GIF below that passage shows a sample mix of formatted text and executable Python code in a short Jupyter notebook, and it also demonstrates how code blocks can be tweaked, run in place, and build on previous code blocks. The blog entry goes on to describe how we at CCRi embedded Scala code in a Jupyter notebook to demonstrate the use of Apache Spark with the Hadoop-based GeoMesa spatio-temporal database to perform data analysis and visualization.

Jupyter supports over 40 languages besides Scala and Python, but not SPARQL. I realized recently, though, that with a minimum of Python code (Python being the original language for these notebooks; "Jupyter" was originally called "IPython") someone who hardly knows Python can enter and run SPARQL queries in a Jupyter notebook.

I created a Jupyter notebook that you can download and try yourself called JupyterSPARQLFun. If you look at the raw version of the file you'll see a lot of JSON, but if you follow that link you'll see that GitHub renders the notebook the same way that a Jupyter server does, so you can read through the notebook and see all the formatted explanations with the code and the results.

If you did download the notebook and run it on a Jupyter server (and installed the rdflib and RDFClosure Python libraries), you could edit the cells that have executable code, rerun them, and see the results, just like in the animated GIF mentioned above. In the case of this notebook, you'd be doing SPARQL manipulation of an RDF graph from your copy of the notebook. (I used the Anaconda Jupyter distribution. It was remarkably difficult to find out from their website how to start up Jupyter, but I did learn from the Jupyter Notebook Beginner Guide that you just enter "jupyter notebook" at the command line. When working with a notebook, you'll also find this list of keyboard shortcuts to be handy.)

I won't go into great detail here about what's in the JupyterSPARQLFun notebook, because much of the point of these notebooks is that their ability to mix formatted text with executable code lets people take explanation of code to a new level. So, to find out how I got SPARQL and inferencing working in the notebook, I recommend that you just read the explanations and code that I put in it.

I mentioned above how you can learn from others’ notebooks; some nice examples accompany the Data School Machine Learning videos on YouTube. These videos demonstrate various concepts by adding and running code within notebooks, adding explanatory text as well along the way. Because I could download the finished notebooks created in the videos, I could run all the example code myself, in place, with no need to copy it from one place and paste it to another. I could also tweak the code samples to try different variations, which made for some much more hands-on learning of the machine learning concepts being demonstrated.

That experience really showed me the power of Jupyter notebooks, and it's great to see that with just a little setup Python code, we can do SPARQL querying and RDF inferencing inside these notebooks as well.

screenshot of SPARQL Jupyter notebook

Please add any comments to this Google+ post.
