18 November 2018

Extracting RDF data models from Wikidata

That's "models", plural.

Their avoidance of the standard model vocabularies is not a big deal, and we should be glad that they make this available in RDF at all.

Some people complain when an RDF dataset lacks a documented data model. A great thing about RDF and SPARQL is that if you want to know what kind of modeling might have been done for a dataset, you just look, even if they're using non-(W3C-)standard modeling structures. They're still using triples, so you look at the triples.

If I know that there is an entity x:thing23 in a dataset, I'm going to query for {x:thing23 ?p ?o} and see what information there is about that entity. Hopefully I will find an rdf:type triple saying that it's a member of a class. If not, maybe it uses some other home-grown way to indicate class membership; either way, you can then start querying to find out about the class's relationships to properties and other classes, and you've got a data model. What if it doesn't use RDFS to describe these modeling structures and their relationships? A CONSTRUCT query will convert it to a data model that does.

And, if {x:thing23 ?p ?o} triples don't indicate any class membership, just seeing what the ?p values are tells you something about the data model. If certain entities use certain properties for their predicates, and other entities use a list that overlaps with that, you've learned more about relationships between sets of entities in the dataset. All of these things can be investigated with simple queries.
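The predicate comparison described above can be sketched in a few lines of Python. This is only an illustration: the triples, the `x:` names other than `x:thing23`, and the helper functions are all hypothetical stand-ins for what a `{?s ?p ?o}` query would return from a real dataset.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); not real data.
TRIPLES = [
    ("x:thing23", "x:name", "Widget"),
    ("x:thing23", "x:weight", "12"),
    ("x:thing24", "x:name", "Gadget"),
    ("x:thing24", "x:weight", "7"),
    ("x:thing24", "x:color", "red"),
    ("x:person5", "x:name", "Pat"),
    ("x:person5", "x:email", "pat@example.org"),
]

def predicates_by_subject(triples):
    """Map each subject to the set of predicates used with it."""
    profile = defaultdict(set)
    for subject, predicate, _ in triples:
        profile[subject].add(predicate)
    return dict(profile)

def shared_predicates(profile, a, b):
    """Return the predicates that two subjects have in common."""
    return profile[a] & profile[b]

profile = predicates_by_subject(TRIPLES)
print(sorted(profile["x:thing23"]))
print(sorted(shared_predicates(profile, "x:thing23", "x:thing24")))
print(sorted(shared_predicates(profile, "x:thing23", "x:person5")))
```

Entities whose predicate sets overlap heavily are good candidates for belonging to the same implicit class, even when no class membership triple says so.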

Wikidata offers tons of great data and modeling for us RDF people, but it wasn't designed for us. They created their own model and then expressed the model and instance data in RDF, and I'm not going to complain; can you imagine how cool it would be if Google did the same with their knowledge graph? (When I tweeted "Handy Wikidata hints for people who have been using RDF and SPARQL since before Wikidata was around: use wdt:P31 instead of rdf:type and wdt:P279 instead of rdfs:subClassOf", Mark Watson replied that he liked my sense of humor. While I hadn't meant to be funny I do appreciate his sense of humor.) As I've worked at understanding Wikidata's documentation about their mapping to RDF I've had fun just querying around to understand the structures. Again: this is one of the key reasons that RDF and SPARQL are great! Because we can do that!
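The "use wdt:P31 instead of rdf:type" hint boils down to a predicate substitution, which is also what a CONSTRUCT query does when it rewrites Wikidata triples with standard vocabulary. Here is a minimal Python sketch of that idea; the sample triples and `ex:` names are made up, and prefixed names stand in for full URIs.

```python
# Wikidata direct properties and the W3C properties they correspond to
# for modeling purposes, as described in the hint above.
STANDARD_EQUIVALENTS = {
    "wdt:P31": "rdf:type",
    "wdt:P279": "rdfs:subClassOf",
}

def to_standard(triples):
    """Swap Wikidata modeling predicates for their W3C counterparts."""
    return [(s, STANDARD_EQUIVALENTS.get(p, p), o) for s, p, o in triples]

# Hypothetical input triples; ex:item1 and ex:classA are placeholders.
sample = [
    ("ex:item1", "wdt:P31", "wd:Q11344"),
    ("ex:classA", "wdt:P279", "wd:Q11344"),
    ("ex:item1", "rdfs:label", "item one"),
]

for triple in to_standard(sample):
    print(triple)
```

Predicates without an entry in the mapping pass through unchanged, just as a CONSTRUCT template only rewrites the patterns it names.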

Last month I described how you can find the subclass tree under a given class in Wikidata and since then I've done further exploration of how to pull data models out of Wikidata. Note that I say "models" and not "model". Olivier Rossel recently referred to extracting the data model of Wikidata (my translation from his French), but I worry that looking for "the" grand RDF data model of Wikidata might set someone up for disappointment. I think that looking for data models to suit various projects will be more productive. (Olivier and I discussed this further in the "Handy Wikidata hints" thread mentioned above.)

The following query builds on the one I did last month to either get a class tree below a given one or to get its superclasses instead. It creates triples that express the classes and their relationships using W3C standard properties.

CONSTRUCT {
  ?class a owl:Class . 
  ?class rdfs:subClassOf ?superclass . 
  ?class rdfs:label ?classLabel . 
  ?property rdfs:domain ?class . 
  ?property rdfs:label ?propertyLabel .
}
WHERE {
  BIND(wd:Q11344 AS ?mainClass) .    # Q11344 chemical element; Q1420 automobile
  
  # Pick one or the other of the following two triple patterns. 
  ?class wdt:P279* ?mainClass.     # Find subclasses of the main class. 
  #?mainClass wdt:P279* ?class.     # Find superclasses of the main class. 
  
  ?class wdt:P279 ?superclass .     # So we can create rdfs:subClassOf triples
  ?class rdfs:label ?classLabel.
  OPTIONAL {
    ?class wdt:P1963 ?property.
    ?property rdfs:label ?propertyLabel.
    FILTER((LANG(?propertyLabel)) = "en")
    }
  FILTER((LANG(?classLabel)) = "en")
}
      

(Because the query uses prefixes that Wikidata already understands, I didn't need to declare any.) When run in the Wikidata query service form, there are too many triples to see at once, so I put the query into a subtreeClasses.rq file and ran it with curl from the command line like this:

curl --data-urlencode "query@subtreeClasses.rq" https://query.wikidata.org/sparql -H "Accept: text/turtle"  > chemicalElementSubClasses.ttl
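If you would rather script this than use curl, a rough Python equivalent using only the standard library might look like the following. It is a sketch, not a tested client: `build_request` is a name I made up, the query string is a stand-in for the contents of subtreeClasses.rq, and the part that actually sends the request is commented out because it needs network access.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

def build_request(query, accept="text/turtle"):
    """Build (but do not send) a POST request like the curl call above."""
    # Form-encode the query, as curl's --data-urlencode does.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(ENDPOINT, data=body, headers={"Accept": accept})

# Stand-in query; in practice this would be read from subtreeClasses.rq.
request = build_request("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 1")
print(request.get_method())
print(request.get_header("Accept"))

# Sending the request and saving the Turtle result (requires network access):
# with urllib.request.urlopen(request) as response:
#     with open("chemicalElementSubClasses.ttl", "wb") as out:
#         out.write(response.read())
```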
      

Loading the result into TopBraid Composer Free edition (available here; the Free edition is a choice on the Product dropdown list) showed a class tree of the result like this:

(It's tempting to add an entry for Frinkonium as a subclass of "hypothetical chemical element".) I understand that the Wikimedia Foundation had their reasons for not describing their models with the standard vocabularies, but this shows the value of using the standards: interoperability with other tools. It also shows that the Foundation's avoidance of the standard model vocabularies is not a big deal, and that we should be glad that they make this available in RDF at all, because the sheer fact that it's in RDF makes it easy to convert to whatever RDF we want with a CONSTRUCT query. (Again, imagine if Google did this with any portion of their knowledge graph...)

The query above also looks for properties for those classes so that it can express those in the output with the RDFS vocabulary. It didn't find many, but this bears further investigation. This query shows that in addition to the chemical element class having properties, there are constraints on those properties described with triples, so there's a lot more that can be done here to pull richer models out of Wikidata and then express them in more standard vocabularies.

And of course there's the possibility of pulling out instance data to go with these models. Queries for that would be easy enough to assemble, but you might end up with so much data that Wikidata times out before giving it to you. You could use the techniques I described in Pipelining SPARQL queries in memory with the rdflib Python library to retrieve instance URIs and then retrieve the additional triples about those instances in batches of queries that use the VALUES keyword.
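The batching idea can be sketched as follows. This only builds the query strings and sends nothing; the function names, the batch size, and the example.org URIs are all placeholders of mine, and a real run would substitute instance URIs retrieved from Wikidata.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def values_query(uris):
    """Build a CONSTRUCT query that fetches all triples about the given URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "CONSTRUCT { ?instance ?p ?o }\n"
        "WHERE {\n"
        f"  VALUES ?instance {{ {values} }}\n"
        "  ?instance ?p ?o .\n"
        "}"
    )

def batched_queries(uris, batch_size=50):
    """Return one query per batch of instance URIs."""
    return [values_query(batch) for batch in chunked(uris, batch_size)]

# Placeholder URIs standing in for instance URIs found by an earlier query.
uris = [f"http://example.org/item/{i}" for i in range(5)]
queries = batched_queries(uris, batch_size=2)
print(len(queries))
print(queries[0])
```

Keeping each batch small means each query asks for a modest amount of data, which should make timeouts less likely; the right batch size is something to find by experiment.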

Lots of data instances of rich models, all transformed to conform to the W3C standards so that they work with lots of open source and commercial tools--the possibilities are pretty impressive. If anyone pulls datasets like this out of Wikidata for their field, let me know about it!


Please add any comments to this Google+ post.

28 October 2018

SPARQL full-text Wikipedia searching and Wikidata subclass inferencing

Wikipedia querying techniques inspired by a recent paper.

Milhaud and Bacharach

I found all kinds of interesting things in the article "Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia's Knowledge Graph"(pdf) by Stanislav Malyshev of the Wikimedia Foundation and four co-authors from the Technical University of Dresden. I wanted to highlight two particular things that I will find useful in the future and then I'll list a few more.

Before I cover them, I wanted to mention that I've really grown to appreciate the little diamond icon in the upper-left of the Wikidata query form. As I refine queries on that form, the queries typically get messier and messier, so the ability to clean it all up with one click is very convenient.

Full text searching of Wikipedia with SPARQL

The paper's "Custom SPARQL Extensions" section describes several extensions, including the MediaWiki Web API. The Wikidata Query Service/User Manual/MWAPI page describes how you can call the MediaWiki API search functions by using special property functions (that is, properties that instruct the query engine to execute certain special functions).

This API is definitely one of those topics where reviewing the examples will get you started more quickly than trying to read the actual documentation. Their first SPARQL query search example, Find all entities with labels "cheese" and get their types, searches Wikipedia for entries that have "cheese" in one of their labels such as the page title or alternative names.

The key difference in the Find articles in Wikipedia example that follows the first cheese example is that its fifth line uses the property function mwapi:srsearch as a predicate instead of mwapi:search, telling the query to search the contents of all of the English (note the ".en" on the fourth line) Wikipedia pages. You can try that example yourself to do a full-text search for "cheese". I did a similar search for Darius Milhaud Burt Bacharach because I've recently been fascinated by the connections between Milhaud, a French composer who rose to prominence in the 1920s as a member of Les Six, and Bacharach, one of the greatest pop songwriters of the 1960s. (Listening to some Milhaud once, it struck me as odd that his use of horns would remind me of some Bacharach songs and arrangements until I found out that the author of "The Look of Love", "Walk on By", and "I Say a Little Prayer" studied with Milhaud in the 1940s at McGill University.) This query certainly doesn't need the "LIMIT 20" at the end like the full-text search for "cheese" does, because these two guys don't get mentioned on the same page as often as cheese gets mentioned, but it is an interesting set of pages.

Subclass inferencing with Wikidata

I'm still surprised at how many people use RDF without adding any schema information, or worse, without using schema information that's already there. Wikidata provides plenty for us, and while the Blazegraph instance used as the back end to its SPARQL engine does not have its RDFS inferencing capabilities turned on--understandably, because queries that take advantage of this ask more of a processor and could therefore hamper scalability--a nice property path trick does let us ask for all the instances of a particular class and of its subclasses. This wasn't even mentioned in the "Getting the Most out of Wikidata" paper, but a mention of how Wikidata uses owl:ObjectProperty inspired me to dig more into the use of the data modeling, and I came up with this.

The following (try it here) shows that Wikidata currently has data about 125 instances of home computer models:

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 wd:Q473708     # Instance has a type of "home computers"
}

This next query (try it here) shows that there are 28 instances of classes that are a direct subclass of "home computers":

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279 wd:Q473708.     # wdt:P279: subclass of 
}

Merely adding the property path asterisk operator to wdt:P279 tells the query engine to find instances of the home computer class and also instances of any class in the subclass tree below it (try it here), and it finds 154 of them:

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279* wd:Q473708.
}

As with regular expressions, the asterisk means "0 or more steps away," so instances of wd:Q473708 itself are counted along with instances of classes from its subclass tree. Using a plus sign instead would have meant "1 or more steps away," so that query would not have found direct instances of wd:Q473708.
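To make the difference between the two operators concrete, here is a small Python sketch that walks a toy subclass graph the way a wdt:P279* or wdt:P279+ path would. The class names echo the ones mentioned in this post, but the graph, the instance names, and the function names are all illustrative assumptions, and each toy instance has exactly one class (a real COUNT(*) counts instance/class pairs, so an instance with several matching classes would be counted more than once).

```python
# Toy "subclass of" edges (child -> parents), standing in for wdt:P279 triples.
SUBCLASS_OF = {
    "ThomsonMO5": ["MOTOGamme"],
    "MOTOGamme": ["HomeComputer"],
    "HomeComputer": ["Computer"],
}

# Toy "instance of" edges (instance -> class), standing in for wdt:P31 triples.
INSTANCE_OF = {
    "computerA": "HomeComputer",
    "computerB": "MOTOGamme",
    "computerC": "ThomsonMO5",
    "computerD": "Computer",
}

def classes_below(target, subclass_of, include_self):
    """Classes reachable from target by following subclass edges downward.

    include_self=True mimics the * operator (zero or more steps);
    include_self=False mimics the + operator (one or more steps).
    """
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    found = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    if include_self:
        found.add(target)
    return found

def count_instances(target, include_self):
    """Count instances whose class is in the selected part of the tree."""
    classes = classes_below(target, SUBCLASS_OF, include_self)
    return sum(1 for cls in INSTANCE_OF.values() if cls in classes)

print(count_instances("HomeComputer", include_self=True))   # like wdt:P279*
print(count_instances("HomeComputer", include_self=False))  # like wdt:P279+
```

With the asterisk-style walk, computerA (a direct instance of the target class) is included; with the plus-style walk it is not, and computerD is excluded either way because its class sits above the target.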

The ability to use class relationships to identify potentially useful data is just one example of how schema metadata adds value to data. And, we get more than just these additional instances; we get additional class names that tell us more about these instances. For example, we can find that the Thomson MO5-CnAM 43737 computer is an instance of the class Thomson MO5, which is a subclass of MOTO Gamme, which is a subclass of home computer.

And more

Some other nice things I learned about in the paper:

  • The use of wikibase:around and wikibase:box for additional kinds of geographic queries in addition to the ability to search within a city's limits as I described in July.

  • A list of additional endpoints that you can use in federated queries sent to Wikidata.

  • Support for Blazegraph's graph traversal features.

  • Multiple live Grafana dashboards about Wikidata usage such as data about agents and formats requested.

If you're interested in SPARQL, Wikidata, or especially the combination, you'll learn some fascinating things from this paper.



23 September 2018

Panic over "superhuman" AI

Robot overlords not on the way.

Robot Overlords movie poster

When someone describes their worries about AI taking over the world, I usually think to myself "I recently bookmarked a good article about why this is silly and I should point this person to it", but in that instant I can't remember what the article was. I recently re-read a few and thought I'd summarize them here in case anyone wants to point their friends to some sensible discussions of why such worries are unfounded.

The impossibility of intelligence explosion by François Chollet

Chollet is an AI researcher at Google and the author of the Keras deep learning framework and the Manning books "Deep Learning with Python" and "Deep Learning with R". Like some of the other articles covered here, his piece takes on the idea that we will someday build an AI system that can build a better one on its own, and then that one will build a better one, and so on until the singularity.

His outline gives you a general idea of his line of reasoning; the bulleted lists in his last two sections are also good:

  • A flawed reasoning that stems from a misunderstanding of intelligence

  • Intelligence is situational

  • Our environment puts a hard limit on our individual intelligence

  • Most of our intelligence is not in our brain, it is externalized as our civilization

  • An individual brain cannot implement recursive intelligence augmentation

  • What we know about recursively self-improving systems

  • Conclusions

One especially nice paragraph:

In particular, there is no such thing as "general" intelligence. On an abstract level, we know this for a fact via the "no free lunch" theorem -- stating that no problem-solving algorithm can outperform random chance across all possible problems. If intelligence is a problem-solving algorithm, then it can only be understood with respect to a specific problem. In a more concrete way, we can observe this empirically in that all intelligent systems we know are highly specialized. The intelligence of the AIs we build today is hyper specialized in extremely narrow tasks -- like playing Go, or classifying images into 10,000 known categories. The intelligence of an octopus is specialized in the problem of being an octopus. The intelligence of a human is specialized in the problem of being human.

'The discourse is unhinged': how the media gets AI alarmingly wrong by Oscar Schwartz

This Guardian piece focuses on how the media encourages silly thinking about the future of AI. As the article's subtitle tells us,

Social media has allowed self-proclaimed 'AI influencers' who do nothing more than paraphrase Elon Musk to cash in on this hype with low-quality pieces. The result is dangerous.

Much of the article focuses on the efforts of Zachary Lipton, a machine learning assistant professor at Carnegie Mellon, to call out bad journalism on the topic. One example is an article that I was also guilty of taking too seriously: Fast Company's AI Is Inventing Languages Humans Can't Understand. Should We Stop It? The actual "language" was just overly repetitive sentences made possible by recursive grammar rules, which I had experienced myself many years ago doing a LISP-based project for a Natural Language Processing course. Schwartz quotes the Sun article Facebook shuts off AI experiment after two robots begin speaking in their OWN language only they can understand as saying that the incident "closely resembled the plot of The Terminator in which a robot becomes self-aware and starts waging a war on humans". (The Sun article also says "Experts have called the incident exciting but also incredibly scary"; according to the Guardian article, "These findings were considered to be fairly interesting by other experts in the field, but not totally surprising or groundbreaking".)

Schwartz's piece describes how the term "electronic brain" is as old as electronic computers, and how overhyped media coverage of machines that "think" as far back as the 1940s led to inflated expectations about AI that greatly contributed to the several AI winters we've had since then.

Ways to Think About Machine Learning by Benedict Evans

If you're going to read only one of the articles I describe here all the way through, I recommend this one. I don't listen to every episode of the a16z podcast, but I do listen to every one that includes Benedict Evans (this week's episode, on Tesla and the Nature of Disruption, was typically excellent), and I have subscribed to his newsletter for years. He's a sharp guy with sensible attitudes about how technologies and societies fit together and where they may lead.

One theme of many of the articles I describe here is the false notion that intelligence is a single thing that can be measured on a one-dimensional scale. As Evans puts it,

This gets to the heart of the most common misconception that comes up in talking about machine learning - that it is in some way a single, general purpose thing, on a path to HAL 9000, and that Google or Microsoft have each built *one*, or that Google 'has all the data', or that IBM has an actual thing called 'Watson'. Really, this is always the mistake in looking at automation: with each wave of automation, we imagine we're creating something anthropomorphic or something with general intelligence. In the 1920s and 30s we imagined steel men walking around factories holding hammers, and in the 1950s we imagined humanoid robots walking around the kitchen doing the housework. We didn't get robot servants - we got washing machines.

Washing machines are robots, but they're not 'intelligent'. They don't know what water or clothes are. Moreover, they're not general purpose even in the narrow domain of washing - you can't put dishes in a washing machine, nor clothes in a dishwasher (or rather, you can, but you won't get the result you want). They're just another kind of automation, no different conceptually to a conveyor belt or a pick-and-place machine. Equally, machine learning lets us solve classes of problem that computers could not usefully address before, but each of those problems will require a different implementation, and different data, a different route to market, and often a different company. Each of them is a piece of automation. Each of them is a washing machine.

After bringing up relational databases as a point of comparison for what new technology can do ("Relational databases gave us Oracle, but they also gave us SAP, and SAP and its peers gave us global just-in-time supply chains - they gave us Apple and Starbucks"), he asks "What, then, are the washing machines of machine learning, for real companies?" He offers some good suggestions, some of which can be summarized as "AI will allow the automation of more things".

He also discusses low-hanging fruit for what new things AI may automate. As an excellent followup to that, I recommend Kathryn Hume's Harvard Business Review article How to Spot a Machine Learning Opportunity, Even If You Aren't a Data Scientist.

The Myth of a Superhuman AI by Kevin Kelly

In this Wired Magazine article by one of their founders, after a discussion of some of the panicky scenarios out there we read that "buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions which, when examined closely, are not based on any evidence". He lists them, then lists five "heresies [that] have more evidence to support them"; these five provide the structure for the rest of his piece:

  • Intelligence is not a single dimension, so "smarter than humans" is a meaningless concept.

  • Humans do not have general purpose minds, and neither will AIs.

  • Emulation of human thinking in other media will be constrained by cost.

  • Dimensions of intelligence are not infinite.

  • Intelligences are only one factor in progress.

A good point about how artificial general intelligence is not something to worry about makes a nice analogy with artificial flight:

When we invented artificial flying we were inspired by biological modes of flying, primarily flapping wings. But the flying we invented -- propellers bolted to a wide fixed wing -- was a new mode of flying unknown in our biological world. It is alien flying. Similarly, we will invent whole new modes of thinking that do not exist in nature. In many cases they will be new, narrow, "small," specific modes for specific jobs -- perhaps a type of reasoning only useful in statistics and probability.

(This reminds me of Evans writing "We didn't get robot servants - we got washing machines".) Another good metaphor is Kelly's comparison of attitudes about superhuman AI with cargo cults:

It is possible that superhuman AI could turn out to be another cargo cult. A century from now, people may look back to this time as the moment when believers began to expect a superhuman AI to appear at any moment and deliver them goods of unimaginable value. Decade after decade they wait for the superhuman AI to appear, certain that it must arrive soon with its cargo.

19 A.I. experts reveal the biggest myths about robots by Guia Marie Del Prado

This Business Insider piece is almost three years old but still relevant. Most of the experts it quotes are actual computer scientist professors, so you get much more sober assessments than you'll see in the panicky articles out there. Here's a good one from Berkeley computer scientist Stuart Russell:

The most common misconception is that what AI people are working towards is a conscious machine, that until you have a conscious machine there's nothing to worry about. It's really a red herring.

To my knowledge, nobody, no one who is publishing papers in the main field of AI, is even working on consciousness. I think there are some neuroscientists who are trying to understand it, but I'm not aware that they've made any progress.

As far as AI people, nobody is trying to build a conscious machine, because no one has a clue how to do it, at all. We have less clue about how to do that than we have about build a faster-than-light spaceship.

From Pieter Abbeel, another Berkeley computer scientist:

In robotics there is something called Moravec's Paradox: "It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".

This is well appreciated by researchers in robotics and AI, but can be rather counter-intuitive to people not actively engaged in the field.

Replicating the learning capabilities of a toddler could very well be the most challenging problem for AI, even though we might not typically think of a one-year-old as the epitome of intelligence.

I was happy to see the article quote NYU's Ernie Davis, whose AI class I took over 20 years ago while working on my master's degree there. (Reviewing my class notebook I see a lot of LISP and Prolog code, so things have changed a lot.)

This article implicitly has a nice guideline for when to take predictions about the future of AI seriously: are the people making them computer scientists familiar with the actual work going on lately? If they're experts in other fields engaging in science fiction riffing (or as the Guardian article put it more cleverly, paraphrasing Elon Musk), take it all with a big grain of salt.

I don't mean to imply that the progress of technologies labeled as "Artificial Intelligence" has no potential problems to worry about. Just as automobiles and chain saws and a lot of other technology invented over the years can do harm as well as good, the new power brought by advanced processors, storage, and memory can be misused intentionally or accidentally, so it's important to think through all kinds of scenarios when planning for the future. In fact, this is all the more reason not to worry about sentient machines: as the Guardian piece quotes Lipton, "There are policymakers earnestly having meetings to discuss the rights of robots when they should be talking about discrimination in algorithmic decision making. But this issue is terrestrial and sober, so not many people take an interest." Sensible stuff to keep in mind.

