Having fun with Reuters Calais

Feeding it Renaissance art history.

Calais is the Reuters Clearforest product that, according to their homepage, "automatically annotates your content with rich semantic metadata". Give it text, and it returns the text marked up with RDF that identifies entities and various semantic information about those entities.

I'm looking forward to Reuters Clearforest CEO Barak Pridor's talk "Enabling Semantic Applications Through Calais" at the Linked Data Planet conference (his Talking with Talis podcast interview is definitely worth listening to), and I thought I'd get more out of his talk if I played with the software a bit first. It was easy and straightforward, but before I describe my first experiment, I want to mention two important points:

  • For work-related projects, I've been researching machine-aided indexing tools and similar software on and off, and they usually look complex and expensive. GATE and UIMA look tantalizing, but these are not tools; they're frameworks into which you can plug such tools. Their included sample tools are either so simple that they do little more than a Perl script with a few regular expressions would, or complex enough to appear difficult to set up and get running (the term "training corpus" comes up a lot). Still, an admirable goal of both frameworks is that tool vendors doing more complex text processing should make their products compatible with these frameworks, letting their customers mix and match tools as necessary instead of forcing them to choose between expensive packages of tools that often do more than they need. (I look forward to discussing these with Jeni Tennison, who's also been researching this, the next time we're in the same city. Unfortunately, there won't be an Oxford XML Summer School this year, but we're all very hopeful for 2009.) After a few years of sporadic research into automated entity recognition tools, I was very happy to see a free, RESTful web service for it come along.

  • Kingsley Idehen recently wrote the following about the Linked Data On the Web workshop in Beijing: "As the sessions progressed, it became clear during a number of accompanying Q&A sessions that a new Linked Data exploitation frontier is emerging. The frontier in question takes the form of a Linked Data substrate capable of addressing the taxonomic needs of solutions aimed at automated Named Entity Extraction, Disambiguation, Subject matter Concept alignment, transparently integrated with existing Web Content." As you'll see, I pushed Calais a little further than it was supposed to go for named entity extraction, but it still did an admirable job and helped me to focus more on the kinds of applications where it can shine.


The quickest way to start playing with Calais is to use Calais Viewer, where you just paste some text and click the "submit" button to see what kinds of entities Calais can find in that text. To write your own applications that use it, you need to become a Calais developer.

To write my own application that passes content to Calais and then uses the results for something, I like python-calais. It handles the web service interactions when you want to send text or a URL identifying content to Calais, and it loads the returned RDF triples into a triplestore where you can play with them. (It uses RDFLib, a Python RDF library I hadn't been able to get working for a while, but an RDFLib comment page explained what I had to do to get a new version working on a Windows machine.)

Because of a talk I'll be giving tomorrow at the University of Virginia on Semantic web technologies, RDF, and OWL, and because the audience will have a large representation of people using technology for humanities research, I chose Giorgio Vasari's "Lives of the Painters" for Calais to analyze. Vasari was a late-Renaissance Italian painter and architect whose writings about other painters are his main claim to modern fame. His biographies are considered a founding work of art history, and while several editions still sell well, Project Gutenberg has a public domain version of Volume 1 available for free.

Being more than 100,000 characters long, this work was too big to send off to Calais, so I broke it up into pieces, sent those off, and then aggregated the returned metadata. (Being in RDF makes the metadata very easy to aggregate.) The following is an excerpt, with some chunks removed and some carriage returns added for readability, of what Calais returned for one section of the book:

<!--Use of the Calais Web Service is governed by the Terms of
Service located at http://www.opencalais.com. By using this
service or the results of the service you agree to these terms of
<!--Relations: PersonPolitical, Quotation

Facility: castle of S. Angelo, portico of S. Peter, Doge's palace
Organization: Holy Church
NaturalFeature: Celian Hill
Continent: Africa
Country: France, Greece, Italy
Person: Maria Maggiore, Paul, Giovanni Evangelista, Valentinian,
Giovanni Battista, Ser Brunnellesco, St Gregory,
Giustiniano, John , St Hilarion, Luit, Hugh, Giovanni Morosini
City: Florence, Milan, Alexandria, Rome, Venice, Pistoia-->
  <rdf:Description c:allowDistribution="false" c:allowSearch="false" 
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/DocInfo"/>
    <Date>2008-04-15</Date><Body>figures and
    some marble candelabra exquisitely carved with leaves,
and some children in bas-relief of extraordinary beauty? In short, by
these and many other signs, it is clear that sculpture was in
decadence in the time of Constantine, and with it the other superior
arts. If anything was required to complete their ruin it was supplied
by the departure of Constantine from Rome when he transferred the
seat of government to Byzantium, as he took with him to Greece not
only all the best sculptors and other artists of the age, such as
they were, but also a quantity of statues and other beautiful works
of sculpture.

<!-- lots of plain text content deleted -->

I must not forget to mention either, how in the course of time the

<!-- Various header information removed -->

  <rdf:Description rdf:about="http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d">
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Person"/>
    <c:name>Ser Brunnellesco</c:name>
  <rdf:Description rdf:about="http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d/Instance/1">
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
    <c:docId rdf:resource="http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d"/>
    <c:subject rdf:resource="http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d"/>
<!--Person: Ser Brunnellesco-->
    <c:detection>[of this church is such that Pippo di
]Ser Brunnellesco[ did not disdain to make use of it as his model]</c:detection>

  <rdf:Description rdf:about="http://d.opencalais.com/genericHasher-1/4fd6dc07-5b9d-3356-8f83-0e735dfa9910">
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/em/e/Country"/>
  <rdf:Description rdf:about="http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d/Instance/16">
    <rdf:type rdf:resource="http://s.opencalais.com/1/type/sys/InstanceInfo"/>
    <c:docId rdf:resource="http://d.opencalais.com/dochash-1/77406f3f-26bb-3adb-9364-8b15cc0f756d"/>
    <c:subject rdf:resource="http://d.opencalais.com/genericHasher-1/4fd6dc07-5b9d-3356-8f83-0e735dfa9910"/>
<!--Country: Greece-->
    <c:detection>[government to Byzantium, as he took with him to ]Greece[ not
only all the best sculptors and other]</c:detection>

After a comment with some legalese, the document starts with an XML comment listing the identified entities by classes such as Facility, Organization, and Person. Next, in a c:document element, is the text passed to Calais to analyze, followed by RDF/XML about the entities that Calais found with length and offset figures to show where in the original text it found these entities. As you can see, it assigns a URL identifier to each one and lists metadata about it; for example, the thing with an identifier of http://d.opencalais.com/pershash-1/f87fd977-7cd7-348d-972b-8f77716da77d is a Person, has a name of "Ser Brunnellesco", and is 16 characters long starting at position 14713 of the CDATA in the c:document element within the XML returned by Calais.
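Because the book went to Calais in pieces, an offset such as 14713 is relative to the piece that was submitted, not to the whole book. The following is a minimal sketch of one way to split a long text while remembering where each piece started, so that a reported offset and length can be mapped back to the full text. The function names and the `max_chars` parameter are illustrative assumptions, not part of Calais or python-calais, and this is not necessarily how the pieces in this experiment were cut.

```python
def split_text(text, max_chars):
    """Yield (start_offset, piece) pairs covering the whole text.

    Pieces are at most max_chars long and end just after a blank line
    when one is available inside the window.
    """
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last paragraph break in the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        yield start, text[start:end]
        start = end


def entity_text(full_text, piece_start, offset, length):
    """Return the entity string, given an offset relative to one piece."""
    begin = piece_start + offset
    return full_text[begin:begin + length]
```

Each piece would be submitted separately; keeping `start_offset` alongside the returned metadata is what makes the per-piece positions usable against the original file.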

I tried writing something that used this metadata to wrap start- and end-tags around the entities at the identified points (for example, <c:country>Greece</c:country>) but quickly found out why so much text analysis software adds metadata out-of-line: identified entities can overlap, so my added tags would usually make the result ill-formed XML. (I guess I should have read the fourth paragraph of Jeni's post referenced above more closely.) I did write something to insert empty elements marking the beginning of an identified entity and its length (for example, <c:Country length="6"/>Greece) but I haven't used it for anything yet.
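To make the overlap problem and the empty-element workaround concrete, here is a small sketch. It is not the script described above; the function names and the `(offset, length, type_name)` tuple layout are assumptions chosen for illustration, and it does no XML escaping of the text itself.

```python
def partially_overlap(a, b):
    """True if two (offset, length) spans cross without one nesting in the other.

    Wrapping both spans in start- and end-tags would then be ill-formed.
    """
    (s1, l1), (s2, l2) = sorted([a, b])
    e1, e2 = s1 + l1, s2 + l2
    return s1 < s2 < e1 < e2


def insert_markers(text, entities):
    """Insert an empty marker element before each entity.

    entities is an iterable of (offset, length, type_name) tuples with
    offsets relative to text.
    """
    out = text
    # Work from the end of the text so earlier offsets stay valid
    # as the string grows.
    for offset, length, type_name in sorted(entities, reverse=True):
        marker = f'<c:{type_name} length="{length}"/>'
        out = out[:offset] + marker + out[offset:]
    return out
```

Because the markers are empty elements, overlapping entities cannot produce mismatched tags; the length attribute carries the information that an end-tag otherwise would.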

Once this new metadata is loaded into a triplestore, you can do some interesting things with it. For example, a little SPARQL let me create a report on the types of entities found. The following shows the beginning of the report after sorting by number of occurrences:

    323  Person           "Jesus Christ"
    191  City             "Florence"
    136  Person           "After Giotto"
    116  Person           "Jesus Christ"
    108  Person           "Franco Sacchetti"
     87  Person           "Agnolo Gaddi"
     79  Person           "Lorenzo di Bicci"
     72  Person           "Diocletian"
     69  Person           "Giovanni Cimabue"
     67  Person           "Tuscany Niccola"
     65  Person           "St Paul"
     64  Person           "After Andrea"
     58  Person           "Giovanni Evangelista"
     58  City             "Rome"
     57  Person           "Antonio Vite"
     56  Person           "St Francis"
     55  Person           "Andrea Tafi"
     54  Person           "Francesco Petrarch"
     54  Person           "Francesco di Giorgio"
     53  Person           "Jacopo di Casentino"
     48  Person           "Bernardo Orcagna"
     47  Person           "Guglielmo da Forli"
     47  Person           "Giovanni Boccaccio"
     46  Country          "Italy"

In the world of Italian Renaissance art, Jesus Christ is a big name. Calais didn't really find the phrase "After Giotto" 136 times; it's in there twice, once at the beginning of a sentence where "After" has a capital "A". I assume that this led Calais to believe that "After" was Giotto's first name, and that all other references to Giotto referred to the same guy.
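The report above came from a SPARQL query against a triplestore, and that query isn't reproduced here. As a rough stand-in, the following sketch tallies the same kind of counts directly from RDF/XML using only the Python standard library. It assumes that each entity is an rdf:Description with an rdf:type and a c:name, that each mention is an InstanceInfo resource whose c:subject points at the entity (as in the excerpt earlier), and that the c: prefix is bound to http://s.opencalais.com/1/pred/. That namespace is an assumption worth checking against real Calais output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
C = "http://s.opencalais.com/1/pred/"  # assumed binding of the c: prefix
INSTANCE_INFO = "http://s.opencalais.com/1/type/sys/InstanceInfo"


def entity_report(rdf_xml_docs):
    """Return (count, type, name) rows, most frequent first.

    rdf_xml_docs is an iterable of well-formed RDF/XML strings, one per
    piece of text that was analyzed.
    """
    names = {}        # entity URI -> (type label, name)
    hits = Counter()  # entity URI -> number of InstanceInfo mentions
    for doc in rdf_xml_docs:
        root = ET.fromstring(doc)
        for desc in root.iter(f"{{{RDF}}}Description"):
            rdf_type = desc.find(f"{{{RDF}}}type")
            if rdf_type is None:
                continue
            type_uri = rdf_type.get(f"{{{RDF}}}resource")
            if type_uri == INSTANCE_INFO:
                subject = desc.find(f"{{{C}}}subject")
                if subject is not None:
                    hits[subject.get(f"{{{RDF}}}resource")] += 1
            else:
                name = desc.find(f"{{{C}}}name")
                if name is not None:
                    about = desc.get(f"{{{RDF}}}about")
                    names[about] = (type_uri.rsplit("/", 1)[-1], name.text)
    rows = [(n, *names[uri]) for uri, n in hits.items() if uri in names]
    rows.sort(key=lambda row: (-row[0], row[2]))
    return rows
```

Counts for the same entity add up across pieces because they are keyed on the entity's URI, which is also why an error like "After Giotto" absorbs every mention that Calais resolves to that one identifier.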

Public domain English translations of 16th-century biographies of 15th-century Italian painters are not the class of content that Calais was optimized for. Along with entities such as Person, Country, and City, a look at the entities, events, and facts that it searches for shows that business news—the second most popular (that is, second most easily funded) domain in computational linguistics research after terrorism—is the most important domain for the Calais folk. Potentially identified "Events/Facts" include Acquisition, AnalystEarningsEstimate, and StockSplit. This makes it a little clearer why Reuters would buy Clearforest and the technology behind Calais.

I'm working on another project using Calais that comes much closer to the kind of things that it is optimized for, and it's looking really cool. It might even be a worthwhile enough application to park a domain name to host it. I hope to have some demos to show within a few weeks.


Tom Tague from Calais here.

I love this stuff. It never fails to amaze me what uses & experiments people try with Calais. Who knows - maybe feeding a whole collection of similar works from various eras might provide a useful research tool for looking at artistic trends by time or geography?

One of the interesting experiments a number of Calais users have been playing with is document level co-occurrence. For example, which people are most frequently mentioned together in news articles, etc.

Might be a little tough @ the book level - but perhaps a book could be decomposed into chapters (perhaps representing time periods) to simulate the same thing.

Thanks for the experiment!