15 June 2009

Big legal publishers and semantic web technology

Which one will see the good fit first?

A recent @TopQuadrant tweet about legal knowledge and RDF/XML led me to Dr. Adam Wyner's piece Legal Ontologies Spin a Semantic Web on law.com. After reading it, I wanted to leave a comment, but this required registering on law.com and telling them lots of details about the law firm I work for. I don't work for a law firm, so I'm just putting my comments here and expanding on them a bit.

Before discussing the value that ontologies can bring to the practice of law, Dr. Wyner writes:

Reading a case such as Manhattan Loft v. Mercury Liquors, there are elementary questions that can be answered by any legal professional, but not by a computer:

  • Where was the case decided?
  • Who were the participants and what roles did they play?
  • Was it a case of first instance or on appeal?
  • What was the basis of the appeal?
  • What were the legal issues at stake?
  • What were the facts?
  • What factors were relevant in making the decision?
  • What was the decision?
  • What legislation or case law was cited?

Legal information service providers such as LexisNexis index some of the information...

Actually, they identify and index most of the information in this list, as do Westlaw and the Wolters Kluwer legal publishers, because they store the majority of their content in XML. (As early adopters of this technology, these companies sometimes store it using XML's predecessor, SGML.) A case's venue, its participants and their roles, the facts of the case, and the judge's decision are typical pieces of information that a legal publisher identifies with XML markup and stores in a system that can use this information for specialized queries.
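To make the idea of markup-driven queries concrete, here is a minimal sketch in Python. The element names, attribute names, and party names are all invented for illustration; real publishers' schemas are proprietary, and nothing here is taken from one.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name below is an
# assumption made for this example, not part of any real publisher schema.
CASE_XML = """
<case>
  <court>Example Court of Appeal</court>
  <party role="appellant">Example Tenant Ltd</party>
  <party role="respondent">Example Landlord Ltd</party>
  <decision>appeal dismissed</decision>
</case>
"""

def venue(xml_text):
    """Return the text of the (hypothetical) court element."""
    return ET.fromstring(xml_text).findtext("court")

def parties_by_role(xml_text):
    """Return a dict mapping each party's role to the party's name."""
    root = ET.fromstring(xml_text)
    return {p.get("role"): p.text for p in root.findall("party")}

if __name__ == "__main__":
    print(venue(CASE_XML))
    print(parties_by_role(CASE_XML))
```

Once the venue and the parties' roles are marked up like this, "where was the case decided?" and "who played which role?" become simple lookups rather than text-mining problems.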

Ontologies can add a lot to this, and the schemas for this XML will be a great head start for any semantic web-oriented system for getting more out of this data. This won't happen outside of the publishers' firewalls soon, though: the schemas for their legal content are so central to the extra value that these publishers add and charge for that none of them would share them. (They don't worry about open source efforts to reproduce their work nearly as much as they worry about competitive advantages over each other.)

Two other resources that these publishers can build on are their existing taxonomies and their databases of citation relationships. Taxonomies such as West's Key Number system are divided by practice areas (for example, asbestos construction issues vs. child custody) and not document roles or purposes, and therefore make a nice complement to the XML schemas. Legal publishers have sold databases of citation relationships (for example, which case overruled another one) since the nineteenth century, and this data is all in clean, well-organized databases.

Kingsley Idehen likes to discuss how relational databases added a level of abstraction over previous models, XML provided an additional layer of flexibility by enabling people to store and use structured data whose structure wasn't necessarily tables, and the RDF data model and associated technology add another layer of abstraction and therefore more possibilities. Behind their firewalls, it's a logical next step for the big legal publishers to build ontologies that define new kinds of relationships among the XML content, the relational citation information, and the taxonomy data that they currently store so that they can get more value out of this data.
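As a rough sketch of what combining these three resources might look like, the Python below puts facts from each into one set of triples and answers a question that no single source could answer alone. Every identifier (case:A, cites:overrules, tax:topic, and so on) is made up for illustration, and plain tuples stand in for a real RDF store.

```python
# Hypothetical data: all names below are assumptions for this example.
TRIPLES = {
    ("case:A", "doc:decidedIn", "court:X"),      # from XML content
    ("case:B", "doc:decidedIn", "court:X"),      # from XML content
    ("case:B", "cites:overrules", "case:A"),     # from the citation database
    ("case:A", "tax:topic", "topic:custody"),    # from the taxonomy
    ("case:B", "tax:topic", "topic:custody"),    # from the taxonomy
}

def subjects(triples, predicate, obj):
    """Return every subject that has the given predicate and object."""
    return {s for s, p, o in triples if p == predicate and o == obj}

def overruled_cases_on(triples, topic):
    """Return cases filed under `topic` that some other case overrules."""
    on_topic = subjects(triples, "tax:topic", topic)
    overruled = {o for s, p, o in triples if p == "cites:overrules"}
    return on_topic & overruled

if __name__ == "__main__":
    print(overruled_cases_on(TRIPLES, "topic:custody"))
```

The query joins taxonomy data with citation data through shared case identifiers, which is the kind of cross-source relationship an ontology would let a publisher define explicitly.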

While there are cool things to do with this technology using content such as ancient literature, it's much easier to see a business model in a domain such as legal publishing where customers have a bigger budget to spend on information that can help them do their jobs. Making a case for the return on semantic web technology investment for legal publishing will be an interesting challenge, but not too difficult, because these technologies can build incrementally on so many existing information resources such as relational databases and the XML content infrastructure that Dr. Wyner forgot to mention. It will be interesting to see which of the big legal publishers moves ahead with this first, although they may choose not to publicize it.

For work outside of the big legal publishers, in a 2006 posting titled Law metadata on the web I wrote about how legal-rdf.org looked like a good start, but apparently there's been little enough activity there that they let their domain name ownership lapse, and now it's just parked by a speculator. (That posting also mentions the OASIS LegalXML work, which hasn't gotten to defining schemas for court decisions and kind of petered out in defining schemas for legislation, the other main document type for legal publishing.)

Can anyone tell me of other public standards for legal metadata in development that could provide input to semantic web projects?

2 June 2009

SearchMonkey and RDFa

What am I missing?

Yahoo! SearchMonkey is one of those interesting, RDF-related technologies that I'd been meaning to check out for a while. When I saw how much of the reaction to Google's Rich Snippets consisted of people like Ryan Smith, or Peter Mika in the May Semantic Web Gang podcast, saying that Google was just doing what SearchMonkey had already done, I knew that it was time to look more closely at SearchMonkey.

I wanted to see support for RDFa embedded in HTML, and to be honest, I only see it in SearchMonkey if I squint while I'm looking and tilt my head slightly sideways. Perhaps I'm missing something, and I hope someone points it out to me.

According to the Site Owner Overview, there are two ways to take advantage of SearchMonkey: Standard Enhanced Results or Custom SearchMonkey Applications.

Standard Enhanced Results

The Site Owner Overview page says this is "Currently available for certain content types such as Video, Games, and Documents". Sounds good to me; I'm very interested in adding metadata to documents. According to the Documents page, though, "the Yahoo! Search document reader currently supports Flash documents only". If you want to use RDFa to identify specialized metadata for Yahoo to use when they return your document in a search result list, your document must be a Flash document, and you embed your metadata in the attributes of an object element that points at that document.

I think it's great that this lets us use RDFa to assign metadata to SlideShare and Scribd documents, but if this has such a strong dependency on a binary format controlled by a single software company, I'm not that interested.

Custom SearchMonkey Applications

OK, so I don't want to see a shared web publishing infrastructure have such dependencies on this proprietary binary format. The SearchMonkey Getting Started page tells us: "Don't have Flash objects? Or want to build an app to display custom enhanced results? Head on over to the SearchMonkey Developer Tool to build an app where you can display a custom image, extract structured data from your site, [or] link to pages within your site". This sounded a bit better.

According to the SearchMonkey Application Dashboard page, "Presentation Applications are small PHP apps that display enhanced search results using data services. You can use an existing data service or create a custom service below". When I went through the steps of building a Custom Data Service based on an existing one, it asked me for a URL pattern to specify pages where it should look for data and URLs that fit that pattern to use for testing. Then, it showed the XSLT that it would use to extract data, displayed in an edit box where I could customize it.

You use this stylesheet to "specify XSLT code for extracting information from the page and representing that information as DataRSS". Despite the admonition to "avoid using namespaces in your XPATH expressions, as SearchMonkey strips these out", this looked like something I could work with once I got to know the DataRSS format. (There's a schema on that page to use for testing your stylesheet output.)

So if I point Yahoo at some documents and write a stylesheet that goes through those documents and returns DataRSS, SearchMonkey can use this. I could put RDFa in those documents and have my stylesheet get DataRSS data out of that... but I could also make up my own BobFooBar format to embed in the HTML and have my stylesheet get DataRSS out of that as well, so I don't really see how this counts as RDFa support.
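The point that a custom extractor is indifferent to what it extracts can be sketched in a few lines. This is Python rather than XSLT, it is not SearchMonkey's actual pipeline, and it returns plain name/value pairs rather than DataRSS; the class name, function name, and sample markup are all assumptions for illustration. The same code pulls values out of RDFa's property attribute or out of a made-up bobfoobar attribute equally well.

```python
from html.parser import HTMLParser

class AttrExtractor(HTMLParser):
    """Collect (name, value) pairs from elements that carry `attr`.

    The value is the element's content attribute if present, otherwise
    the first non-blank text that follows the start tag.
    """

    def __init__(self, attr):
        super().__init__()
        self.attr = attr
        self.pairs = []
        self._pending = None  # name still waiting for its text value

    def handle_starttag(self, tag, attrs):
        found = dict(attrs)
        if self.attr in found:
            if "content" in found:
                self.pairs.append((found[self.attr], found["content"]))
            else:
                self._pending = found[self.attr]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.pairs.append((self._pending, data.strip()))
            self._pending = None

def extract(html, attr="property"):
    """Return the (name, value) pairs found via `attr` in `html`."""
    parser = AttrExtractor(attr)
    parser.feed(html)
    parser.close()
    return parser.pairs

RDFA_PAGE = ('<p><span property="dc:title">My Post</span> '
             '<meta property="dc:creator" content="Bob"/></p>')
CUSTOM_PAGE = '<p><span bobfoobar="title">My Post</span></p>'

if __name__ == "__main__":
    print(extract(RDFA_PAGE))                 # RDFa-style attributes
    print(extract(CUSTOM_PAGE, "bobfoobar"))  # invented attribute
```

Nothing in the extractor knows or cares that one of these inputs follows a W3C standard, which is why writing your own extraction step doesn't amount to the platform supporting RDFa.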

The Semantic Web community is still trying to piece together the nature of Google's support of RDFa in HTML documents, and there are things to complain about, but we know that Google's crawlers will look for some sort of RDFa in HTML documents. That looks like a real step forward for support of standards-based metadata on the web by a major search engine. Perhaps my review of the SearchMonkey options is missing something, but so far I haven't seen anything to show me that what Yahoo offers is something for people interested in open web standards to get excited about.

Again, if I'm wrong about any of this, I'd be happy to be corrected.

27 May 2009

"Semantic Web for the Working Ontologist"

And for anyone interested in working with ontologies.

I recently finished Dean Allemang and Jim Hendler's book Semantic Web for the Working Ontologist, and I strongly recommend it to anyone interested in OWL, RDF, or the Semantic Web. I'm surprised that their publishers even agreed to the title; there may be some people who look at the book's title and say "Hey, I'm a working ontologist, so I need that book!", but I think that it would benefit a much wider audience: not just people who consider themselves working ontologists, but anyone who needs to work with standards-based ontologies or with people who do.

The book describes many modeling issues and then shows how to work through them using concrete examples that are explained well enough to generalize them to other domains. Anyone who reads this book and then works with ontologies will come back to it saying to themselves "I know I saw something in here about how to handle this particular information relationship..." Examples are not presented as working code per se, but there are many examples showing a set of triples, a few RDFS and/or OWL statements, and the resulting new triples implied by the combination. Many of these examples made me want to type them into a text editor, run them through Pellet, and then start modifying the examples to see what happened, because to me, those implied triples are the coolest part of OWL: the new facts that you get out of an existing set of facts by adding metadata.
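For a taste of that pattern (some triples, a few RDFS statements, and the new triples they imply), here is a minimal sketch in plain Python. It is not an example from the book and not a substitute for a real reasoner such as Pellet: it implements only the rule that membership in a class implies membership in its superclasses, and the ex: names are invented.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def infer_types(triples):
    """Return only the new rdf:type triples implied by rdfs:subClassOf.

    Repeats the single rule "x is a C, C is a subclass of D, so x is a D"
    until no new triples appear.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(known):
            if p != RDF_TYPE:
                continue
            for cls, p2, parent in list(known):
                if p2 == SUBCLASS and cls == o:
                    new = (s, RDF_TYPE, parent)
                    if new not in known:
                        known.add(new)
                        changed = True
    return known - set(triples)

# Hypothetical facts: the ex: names are assumptions for this example.
FACTS = {
    ("ex:Opinion", SUBCLASS, "ex:CourtDocument"),
    ("ex:CourtDocument", SUBCLASS, "ex:Document"),
    ("ex:doc1", RDF_TYPE, "ex:Opinion"),
}

if __name__ == "__main__":
    for triple in sorted(infer_types(FACTS)):
        print(triple)
```

One asserted fact about ex:doc1 plus two pieces of metadata yields two facts that nobody typed in, which is the effect described above.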

I've wondered before about what good RDFS was without OWL. I started to get a better appreciation for the possibilities when I played a bit with Sesame, and Dean and Jim's book gave me a much better idea of what you can do with RDFS when you don't have OWL support, so there's a reason for Sesame developers to get the book.

In addition to showing people who are dabbling with Semantic Web technologies how to get deeper into the technology, the book does an especially good job of showing experienced software developers which aspects of Semantic Web development are different from what they're used to and why these differences open up new possibilities instead of limiting them. For example:

The ability in OWL to infer class relationships is a severe departure from Object Oriented modeling. In OO modeling the class structure forms the backbone of the model's organization. All instances are created as members of some class, and their behavior is specified by the class structure. Changes to the class structure have far-reaching impact on the behavior of the system. In OWL, it is possible for the class structure to change as more information is learned about classes or individuals.

And this is a Good Thing! Got that, OO folks? If not, there's plenty more in the book to demonstrate this to you. For example, an early chapter in the book asks "How can we accommodate variation of sources if we can't structure the entities they are describing into a class model? The Semantic Web provides an elegant solution to this problem... any model can be built up from contributions from multiple sources". Or this: "it is never accurate in the Semantic Web to say that a property is 'defined for a class.' A property is defined independently of any class, and the RDFS relations specify which inferences can be correctly made about it in particular contexts."
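That last point about properties can be sketched in Python as well. In the snippet below, which uses invented ex: names and covers only the rdfs:domain and rdfs:range rules, the property is declared on its own, and class membership is something inferred from its use rather than a precondition for using it.

```python
def infer_from_domain_range(triples):
    """Return new rdf:type triples implied by rdfs:domain and rdfs:range.

    rdfs:domain types the subject of each use of the property;
    rdfs:range types the object.
    """
    inferred = set()
    for prop, rel, cls in triples:
        if rel == "rdfs:domain":
            inferred |= {(s, "rdf:type", cls)
                         for s, p, o in triples if p == prop}
        elif rel == "rdfs:range":
            inferred |= {(o, "rdf:type", cls)
                         for s, p, o in triples if p == prop}
    return inferred - set(triples)

# Hypothetical facts: the ex: names are assumptions for this example.
PROPERTY_FACTS = {
    ("ex:decidedBy", "rdfs:domain", "ex:Case"),
    ("ex:decidedBy", "rdfs:range", "ex:Judge"),
    ("ex:case42", "ex:decidedBy", "ex:smith"),
}

if __name__ == "__main__":
    for triple in sorted(infer_from_domain_range(PROPERTY_FACTS)):
        print(triple)
```

Nobody declared ex:case42 to be a member of any class before using ex:decidedBy on it. An object-oriented model would reject that, while here it simply produces two new facts.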

Some great advice for all software developers:

...you might think that modeling for reuse is best done by anticipating everything that someone might want to use your model for, and thus the more you include the better. This is a mistake because the more you put in, the more you restrict someone else's ability to extend your model instead of just use it as is. Reuse is best done, as in other systems, by designing to maximize future combination with other things, not to restrict it.

Closing the book with chapters such as "Using OWL in the Wild" and "Good and Bad Modeling Practices" and a "Frequently Asked Questions" appendix helps even more to connect the theory to the practice, and the final chapter's "Beyond OWL 1.0" section shows what deficiencies the experts currently see in OWL and what kind of new features a future release might offer us. All in all, for people who are strongly interested in OWL and the Semantic Web, or even just a little curious, this book provides a solid grounding in both the theory and practice of what the technology can bring to new applications that they might be working with.
