23 September 2018

Panic over "superhuman" AI

Robot overlords not on the way.

Robot Overlords movie poster

When someone describes their worries about AI taking over the world, I usually think to myself "I recently bookmarked a good article about why this is silly and I should point this person to it", but in that instant I can't remember what the article was. I recently re-read a few and thought I'd summarize them here in case anyone wants to point their friends to some sensible discussions of why such worries are unfounded.

The impossibility of intelligence explosion by François Chollet

Chollet is an AI researcher at Google and the author of the Keras deep learning framework and the Manning books "Deep Learning with Python" and "Deep Learning with R". Like some of the other articles covered here, his piece takes on the idea that we will someday build an AI system that can build a better one on its own, and then that one will build a better one, and so on until the singularity.

His outline gives you a general idea of his line of reasoning; the bulleted lists in his last two sections are also good:

  • A flawed reasoning that stems from a misunderstanding of intelligence

  • Intelligence is situational

  • Our environment puts a hard limit on our individual intelligence

  • Most of our intelligence is not in our brain, it is externalized as our civilization

  • An individual brain cannot implement recursive intelligence augmentation

  • What we know about recursively self-improving systems

  • Conclusions

One especially nice paragraph:

In particular, there is no such thing as "general" intelligence. On an abstract level, we know this for a fact via the "no free lunch" theorem -- stating that no problem-solving algorithm can outperform random chance across all possible problems. If intelligence is a problem-solving algorithm, then it can only be understood with respect to a specific problem. In a more concrete way, we can observe this empirically in that all intelligent systems we know are highly specialized. The intelligence of the AIs we build today is hyper specialized in extremely narrow tasks -- like playing Go, or classifying images into 10,000 known categories. The intelligence of an octopus is specialized in the problem of being an octopus. The intelligence of a human is specialized in the problem of being human.

'The discourse is unhinged': how the media gets AI alarmingly wrong by Oscar Schwartz

This Guardian piece focuses on how the media encourages silly thinking about the future of AI. As the article's subtitle tells us,

Social media has allowed self-proclaimed 'AI influencers' who do nothing more than paraphrase Elon Musk to cash in on this hype with low-quality pieces. The result is dangerous.

Much of the article focuses on the efforts of Zachary Lipton, a machine learning assistant professor at Carnegie Mellon, to call out bad journalism on the topic. One example is an article that I was also guilty of taking too seriously: Fast Company's AI Is Inventing Languages Humans Can't Understand. Should We Stop It? The actual "language" was just overly repetitive sentences made possible by recursive grammar rules, which I had experienced myself many years ago doing a LISP-based project for a Natural Language Processing course. Schwartz quotes the Sun article Facebook shuts off AI experiment after two robots begin speaking in their OWN language only they can understand as saying that the incident "closely resembled the plot of The Terminator in which a robot becomes self-aware and starts waging a war on humans". (The Sun article also says "Experts have called the incident exciting but also incredibly scary"; according to the Guardian article, "These findings were considered to be fairly interesting by other experts in the field, but not totally surprising or groundbreaking".)

Schwartz's piece describes how the term "electronic brain" is as old as electronic computers, and how overhyped media coverage of machines that "think" as far back as the 1940s led to inflated expectations about AI that greatly contributed to the several AI winters we've had since then.

Ways to Think About Machine Learning by Benedict Evans

If you're going to read only one of the articles I describe here all the way through, I recommend this one. I don't listen to every episode of the a16z podcast, but I do listen to every one that includes Benedict Evans (this week's episode, on Tesla and the Nature of Disruption, was typically excellent), and I have subscribed to his newsletter for years. He's a sharp guy with sensible attitudes about how technologies and societies fit together and where they may lead.

One theme of many of the articles I describe here is the false notion that intelligence is a single thing that can be measured on a one-dimensional scale. As Evans puts it,

This gets to the heart of the most common misconception that comes up in talking about machine learning - that it is in some way a single, general purpose thing, on a path to HAL 9000, and that Google or Microsoft have each built *one*, or that Google 'has all the data', or that IBM has an actual thing called 'Watson'. Really, this is always the mistake in looking at automation: with each wave of automation, we imagine we're creating something anthropomorphic or something with general intelligence. In the 1920s and 30s we imagined steel men walking around factories holding hammers, and in the 1950s we imagined humanoid robots walking around the kitchen doing the housework. We didn't get robot servants - we got washing machines.

Washing machines are robots, but they're not 'intelligent'. They don't know what water or clothes are. Moreover, they're not general purpose even in the narrow domain of washing - you can't put dishes in a washing machine, nor clothes in a dishwasher (or rather, you can, but you won't get the result you want). They're just another kind of automation, no different conceptually to a conveyor belt or a pick-and-place machine. Equally, machine learning lets us solve classes of problem that computers could not usefully address before, but each of those problems will require a different implementation, and different data, a different route to market, and often a different company. Each of them is a piece of automation. Each of them is a washing machine.

After bringing up relational databases as a point of comparison for what new technology can do ("Relational databases gave us Oracle, but they also gave us SAP, and SAP and its peers gave us global just-in-time supply chains - they gave us Apple and Starbucks"), he asks "What, then, are the washing machines of machine learning, for real companies?" He offers some good suggestions, some of which can be summarized as "AI will allow the automation of more things".

He also discusses low-hanging fruit for what new things AI may automate. As an excellent followup to that, I recommend Kathryn Hume's Harvard Business Review article How to Spot a Machine Learning Opportunity, Even If You Aren't a Data Scientist.

The Myth of a Superhuman AI by Kevin Kelly

In this Wired Magazine article by one of their founders, after a discussion of some of the panicky scenarios out there we read that "buried in this scenario of a takeover of superhuman artificial intelligence are five assumptions which, when examined closely, are not based on any evidence". He lists them, then lists five "heresies [that] have more evidence to support them"; these five provide the structure for the rest of his piece:

  • Intelligence is not a single dimension, so "smarter than humans" is a meaningless concept.

  • Humans do not have general purpose minds, and neither will AIs.

  • Emulation of human thinking in other media will be constrained by cost.

  • Dimensions of intelligence are not infinite.

  • Intelligences are only one factor in progress.

A good point about how artificial general intelligence is not something to worry about makes a nice analogy with artificial flight:

When we invented artificial flying we were inspired by biological modes of flying, primarily flapping wings. But the flying we invented -- propellers bolted to a wide fixed wing -- was a new mode of flying unknown in our biological world. It is alien flying. Similarly, we will invent whole new modes of thinking that do not exist in nature. In many cases they will be new, narrow, "small," specific modes for specific jobs -- perhaps a type of reasoning only useful in statistics and probability.

(This reminds me of Evans writing "We didn't get robot servants - we got washing machines".) Another good metaphor is Kelly's comparison of attitudes about superhuman AI with cargo cults:

It is possible that superhuman AI could turn out to be another cargo cult. A century from now, people may look back to this time as the moment when believers began to expect a superhuman AI to appear at any moment and deliver them goods of unimaginable value. Decade after decade they wait for the superhuman AI to appear, certain that it must arrive soon with its cargo.

19 A.I. experts reveal the biggest myths about robots by Guia Marie Del Prado

This Business Insider piece is almost three years old but still relevant. Most of the experts it quotes are actual computer science professors, so you get much more sober assessments than you'll see in the panicky articles out there. Here's a good one from Berkeley computer scientist Stuart Russell:

The most common misconception is that what AI people are working towards is a conscious machine, that until you have a conscious machine there's nothing to worry about. It's really a red herring.

To my knowledge, nobody, no one who is publishing papers in the main field of AI, is even working on consciousness. I think there are some neuroscientists who are trying to understand it, but I'm not aware that they've made any progress.

As far as AI people, nobody is trying to build a conscious machine, because no one has a clue how to do it, at all. We have less clue about how to do that than we have about build a faster-than-light spaceship.

From Pieter Abbeel, another Berkeley computer scientist:

In robotics there is something called Moravec's Paradox: "It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".

This is well appreciated by researchers in robotics and AI, but can be rather counter-intuitive to people not actively engaged in the field.

Replicating the learning capabilities of a toddler could very well be the most challenging problem for AI, even though we might not typically think of a one-year-old as the epitome of intelligence.

I was happy to see the article quote NYU's Ernie Davis, whose AI class I took over 20 years ago while working on my master's degree there. (Reviewing my class notebook I see a lot of LISP and Prolog code, so things have changed a lot.)

This article implicitly offers a nice guideline for when to take predictions about the future of AI seriously: are the people making them computer scientists familiar with the actual work going on lately? If they're experts in other fields engaging in science fiction riffing (or as the Guardian article put it more cleverly, paraphrasing Elon Musk), take it all with a big grain of salt.

I don't mean to imply that the progress of technologies labeled as "Artificial Intelligence" has no potential problems to worry about. Just as automobiles and chain saws and a lot of other technology invented over the years can do harm as well as good, the new power brought by advanced processors, storage, and memory can be misused intentionally or accidentally, so it's important to think through all kinds of scenarios when planning for the future. In fact, this is all the more reason not to worry about sentient machines: as the Guardian piece quotes Lipton, "There are policymakers earnestly having meetings to discuss the rights of robots when they should be talking about discrimination in algorithmic decision making. But this issue is terrestrial and sober, so not many people take an interest." Sensible stuff to keep in mind.

Please add any comments to this Google+ post.

27 August 2018

Pipelining SPARQL queries in memory with the rdflib Python library

Using retrieved data to make more queries.

Last month in Dividing and conquering SPARQL endpoint retrieval I described how you can avoid timeouts for certain kinds of SPARQL endpoint queries by first querying for the resources that you want to know about and then querying for more data about those resources a subset at a time using the VALUES keyword. (The example query retrieved data, including the latitude and longitude, about points within a specified city.) I built my demo with some shell scripts, some Perl scripts, and a bit of spit and glue.

I started playing with RDFLib's SPARQL capabilities a few years ago as I put together the demo for Driving Hadoop data integration with standards-based models instead of code. I was pleasantly surprised to find out how easily it could run a CONSTRUCT query on triples stored in memory and then pass the result on to one or more additional queries, letting you pipeline a series of such queries with no disk I/O. Applying these techniques to replace my shell scripts and Perl scripts from last month showed me that these same techniques could be used for all kinds of RDF applications.

When I was at TopQuadrant I got to know SPARQLMotion, their (proprietary) drag-and-drop system for pipelining components that can do this sort of thing. RDFLib offers several graph manipulation methods that can extend what I've done here to do many additional SPARQLMotion-ish things. When I recently asked about other pipeline component-based RDF development tools out there, I learned of Linked Pipes ETL, Karma, ld-pipeline, VIVO Harvester, Silk, UnifiedViews, and a PoolParty framework around Unified Views. I hope to check out as many of them as I can in the future, but with the functions I've written for my new Python script, I can now accomplish so much with so little Python code that my motivation to go looking beyond that is diminishing--especially considering that when doing it this way, I have all of Python's abilities to manipulate strings and data structures standing by in case I need them.

For me, the two most basic RDF tasks to augment the general Python capabilities are retrieval of triples from a remote endpoint for local storage and querying of locally stored triples. RDFLib makes the latter easy. For the former I was looking for a library, but Jindřich Mynarz pointed out that no specialized library was necessary; he even showed me the basic code to make it happen. (I swear I had tried a few times before posting the question on Twitter, so the brevity and elegance of his example were a bit embarrassing for me.)
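As a rough illustration of the retrieval half, here is a sketch in Python 3 that builds the request URL with the standard library and hands it to a graph's `parse()` method. The function names are hypothetical, the sample query is only a placeholder, and depending on the endpoint you may need to pass a `format` argument to `parse()`:

```python
from urllib.parse import urlencode

def build_query_url(endpoint, query):
    """Return a GET URL that asks a SPARQL endpoint to run the given query."""
    return endpoint + "?" + urlencode({"query": query})

def load_remote_triples(graph, endpoint, construct_query):
    """Parse the triples returned by a CONSTRUCT query into the given graph.

    `graph` is assumed to be an rdflib.Graph, whose parse() method
    accepts a URL as its source.
    """
    graph.parse(build_query_url(endpoint, construct_query))
    return graph

# Building the URL needs no network access, so it is easy to inspect.
url = build_query_url(
    "https://query.wikidata.org/sparql",
    "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 5",
)
print(url)
```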

You can find my new Python script to replace last month's work on GitHub. More than half of it is made up of the actual SPARQL queries being stored in variables. This is a good thing, because it means that the Python instructions (to retrieve triples from the endpoint, to load up the local graph with retrieved triples, to query that graph, and to build and then run new queries based on those query results) all together take up less than half of the script. In other words, the script is more about the queries than about the code to execute them.

The main part of the script isn't very long:

# 1. Get the qnames for the geotagged entities within the city and store in graph g. 

queryRetrieveGeoPoints = queryRetrieveGeoPoints.replace("CITY-QNAME",cityQname)
url = endpoint + "?" + urllib.urlencode({"query": queryRetrieveGeoPoints})
g.parse(url)   # Load the triples returned by the endpoint into graph g.
logging.info('Triples in graph g after queryRetrieveGeoPoints: ' + str(len(g)))

# 2. Take the subjects in graph g and create queries with a VALUES clause 
#    of up to maxValues of the subjects. 

subjectQueryResults = g.query(queryListSubjects)

# 3. See what classes are used and get their names and those of their superclasses.
classList = g.query(listClassesQuery)

# 4. See what objects need labels and get them.
objectsThatNeedLabel = g.query(queryObjectsThatNeedLabel)

print(g.serialize(format = "n3"))   # (Actually Turtle, which is what we want, not n3.)

The splitAndRunRemoteQuery function was one I wrote based on my prototype from last month.
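The function itself isn't reproduced here. A minimal sketch of the general idea, with hypothetical names (`chunks`, `split_and_run`, `run_query`) and a stand-in for the remote call, might look like this:

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def split_and_run(query_header, uris, query_end, run_query, max_values=50):
    """Build one query per batch of URIs and pass each one to run_query.

    Each query is query_header + a list of URIs + query_end, so query_header
    is assumed to end with something like "VALUES ?s {" and query_end to
    begin with the closing "}".
    """
    results = []
    for batch in chunks(list(uris), max_values):
        values = "\n".join("<" + uri + ">" for uri in batch)
        results.append(run_query(query_header + "\n" + values + "\n" + query_end))
    return results

# Usage with a stand-in for the remote call: collect the queries
# instead of sending them to an endpoint.
uris = ["http://www.wikidata.org/entity/Q%d" % n for n in range(1, 6)]
queries = split_and_run(
    "CONSTRUCT { ?s ?p ?o } WHERE { VALUES ?s {",
    uris,
    "} ?s ?p ?o }",
    run_query=lambda q: q,
    max_values=2,
)
print(len(queries))
```

Passing the remote call in as `run_query` keeps the batching logic testable without a network connection. In a real run it would be a function that sends the query to the endpoint and parses the returned triples into the local graph.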

I first used RDFLib over 15 years ago, when SPARQL hadn't even been invented yet. Hardcore RDFLib fans will prefer the greater efficiency of its native functions over the use of SPARQL queries, but my goal here was to have SPARQL 1.1 queries drive all the action, and RDFLib supports this very nicely. Its native functions also offer additional capabilities that bring it closer to some of the pipelining things I remember from SPARQLMotion. For example, the set operations on graphs let you perform actions such as unions, intersections, differences, and XORs of graphs, which can be handy when mixing and matching data from multiple sources to massage that data into a single cleaned-up dataset--just the kind of thing that makes RDF so great in the first place.

Picture by Michael Coghlan on Flickr (CC BY-SA 2.0)


22 July 2018

Dividing and conquering SPARQL endpoint retrieval

With the VALUES keyword.

VALUES neon sign

When I first tried SPARQL's VALUES keyword (at which point it was pretty new to SPARQL, having only recently been added to SPARQL 1.1) I demoed it with a fairly artificial example. I later found that it solved one particular problem for me by letting me create a little lookup table. Recently, it gave me huge help in one of the most classic SPARQL development problems of all: how to retrieve so much data from an endpoint that the first attempts at that retrieval resulted in timeouts.

The Wikidata:SPARQL query service/queries page includes an excellent Wikidata query to find latitudes and longitudes for places in Paris. You can easily modify this query to retrieve from places within other cities, and I wanted to build on this query to make it retrieve additional available data about those places as well. While accounting for the indirection in the Wikidata query model made this a little more complicated, it wasn't much trouble to write.

The expanded query worked great for a city like Charlottesville, where I live, but for larger cities, the query was just asking for too much information from the endpoint and timed out. My new idea was to first ask for roughly the same information that the Paris query above does, and to then request additional data about those entities a batch at a time with a series of queries that use the VALUES keyword to specify each batch. (I've pasted a sample query requesting one batch below.)

It worked just fine. I put all the queries and other relevant files in a zip file for people who want to check it out, but it's probably not worth looking at too closely, because in a month or two I'll be replacing it with a Python version that does everything more efficiently. It's still worth explaining the steps in this version's shell script driver file, because the things I worked out for this prototype effort--despite its Perl scripting and extensive disk I/O--mean that the Python version should come together pretty quickly. That's what prototypes are for!

The driver shell script

Before running the shell script, you specify the Wikidata local name of the city to query near the top of the getCityEntities.rq SPARQL query file. (This is easier than it sounds--for example, to do it for Charlottesville, go to its Wikipedia page and click Wikidata item in the menu on the left to find that Q123766 is the local name.)

Once that's done, running the zip file's getCityData.sh shell script executes these main steps:

  1. It uses a curl command to send the getCityEntities.rq CONSTRUCT query to the https://query.wikidata.org/sparql endpoint. The curl command saves the resulting triples in a file called cityEntities.ttl.

  2. It uses ARQ to run the listSubjects.rq query on the new cityEntities.ttl file, specifying that the result should be a TSV file.

  3. The results of listSubjects.rq get piped to a Perl script called makePart2Queries.pl. This creates a series of CONSTRUCT query files that ask Wikidata for data about entities listed in a VALUES section. It puts 50 entries in each file's VALUES section; this figure of 50 is stored in a $maxLines variable in makePart2Queries.pl where it can be reset if the endpoint is still timing out. This step also adds lines to a shell script called callTempQueries.sh, where each line uses curl to call one of the queries that uses VALUES to request a batch of data.

  4. getCityData.sh next runs the callTempQueries.sh shell script to execute all of these new queries, storing the resulting triples in the file tempCityData.ttl.

  5. The tempCityData.ttl file has plenty of good data, but it can be used to get additional relevant data, so the script's next line runs a query that creates a TSV file with a list of all of the classes found in tempCityData.ttl triples of the form {?instance wdt:P31 ?class}. (wdt:P31 is the Wikidata equivalent of rdf:type, indicating that a resource is an instance of a particular class.) That TSV file then drives the creation of a query that gets sent to the SPARQL endpoint to ask about the classes' parent and grandparent classes, and that data gets added to tempCityData.ttl.

  6. Another ARQ call in the script uses a local query to check for triple objects in the http://www.wikidata.org/entity/ namespace that don't have rdfs:label values and then retrieves those labels--or at least the English ones, although it's easy to adjust if you want labels in different or additional languages.

  7. The script runs one final ARQ query on tempCityData.ttl: the classic SELECT * WHERE {?s ?p ?o}. This request for all the triples actually tidies up the Turtle data a bit, storing all the triples with common subjects together. It puts the result in cityData.ttl.

One running theme of some of the shell script's steps is the retrieval of labels associated with qnames. Wikidata has a lot of triples like {wd:Q69040 wd:P361 wd:Q16950} that are just three qnames, so retrieved data will have more value to applications if people and processes can find out what each qname refers to.

The main shell script has other housekeeping steps such as recording of the start and end times and deletion of the temporary files. I had more ideas for things to add, but I'll save those for the Python version.

The Python version won't just be a more efficient version of my use of VALUES to do batch retrievals of data that might otherwise time out. It will demonstrate, more nicely, something that only gets hinted at in this mess of shell and Perl scripts: the ability to automate the generation of SPARQL queries that build on the results of previously executed queries so that they can all work together as a pipeline to drive increasingly sophisticated RDF application development.

Here is a sample of one of the queries created to request data about one batch of entities within the specified city:

PREFIX p: <http://www.wikidata.org/prop/> 
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 

CONSTRUCT
{ ?s ?p ?o . 
  ?s ?p1 ?o1 . 
  ?s wgs84:lat ?lat . 
  ?s wgs84:long ?long .
  ?p rdfs:label ?pname .
  ?s wdt:P31 ?class .   
}
WHERE {
  VALUES ?s {
    # about 48 more of those here...
  }
  # wdt:P131 means 'located in the administrative territorial entity' .
  ?s wdt:P131+ ?geoEntityWikidataID .  
  ?s p:P625 ?statement . # coordinate-location statement
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .

  # Reduce the indirection used by Wikidata triples. Based on Tommy Potter query
  # at http://www.snee.com/bobdc.blog/2017/04/the-wikidata-data-model-and-yo.html.
  ?s ?directClaimP ?o .                   # Get the truthy triples. 
  ?p wikibase:directClaim ?directClaimP . # Find the wikibase properties linked
  ?p rdfs:label ?pname .                  # to the truthy triples' predicates.

  # the following VALUES clause is actually faster than just
  # having specific triple patterns for those 3 p1 values.
  ?s ?p1 ?o1 .
  VALUES ?p1 {
  }

  ?s wdt:P31 ?class . # Class membership. Pull this and higher level classes out in later query.
  # If only English names desired
  FILTER (isURI(?o1) || lang(?o1) = 'en' )
  # For English + something else, follow this pattern: 
  # FILTER (isURI(?o1) || lang(?o1) = 'en' || lang(?o1) = 'de')

  FILTER(lang(?pname) = 'en')
}

Neon sign picture by Jeremy Brooks on Flickr (CC BY-NC 2.0)
