Scraping and linked data

Wired Magazine gives scraping the buzzword treatment but remains clueless about the semantic web and linked data.

The latest issue of Wired has an article with the provocative title of The Data Wars about web sites built around data retrieved by "bots" doing "scraping". I quote these because the article twists the terms a bit to make them and their subjects seem more dramatic, more cutting edge, and—you guessed it—more "Web 2.0".

The dramatic tension of the article is the conflict between, on the one hand, craigslist and other large sites with lots of valuable public data, and on the other hand, the sites that pull some of this data, "remix" it to be more useful to their own audience, and then put ads around their "mashups". (Somehow, code monkeys surrounded by earth-toned cubicle fabric think that it makes them resemble DJs surrounded by crates of vinyl if they use musical buzzwords to refer to the act of combining multiple things into a new one. If I wrote a Turbo Pascal program twenty years ago that included a few existing libraries that I collected from different sources, was that a mashup, or was it a remix?)

According to the article, scraping

refers to the act of automatically harvesting information from another site and using the results for sometimes nefarious activities. (Some scrapers, for instance, collect email addresses from public web sites and sell them to spammers)... Scrapers write software robots using script languages like Perl, PHP, or Java. They direct the bots to go out (either from a Web server or a computer of their own) to the target site and, if necessary, log in. Then the bots copy and bring back the requested payload, be it images, lists of contact information, or a price catalog.

So using wget or curl to pull down a text file and then feeding that file to a Perl script that looks for and extracts strings that match certain patterns is now a command to a robot (excuse me, "bot") army to go forth and retrieve payloads. I suppose that if we do this after the sun goes down, we can refer to our scripts as an "unholy army of the night" for added excitement.

I see three historical phases for this kind of data retrieval, and Wired still doesn't know about the third. The first, which people have been doing since late in the last century, involves retrieving files and then running scripts to find and pull out useful information as described above. (By the way, John Cowan has just put out a new release of TagSoup, a parser that converts HTML retrieved from "in the wild" to well-formed XML. My own base tool set for scraping is wget, TagSoup, and XSLT; lately I've been using these to get Project Gutenberg metadata about public domain children's books such as Little Bo Peep.)
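To make the first phase concrete, here is a minimal sketch of pattern-based scraping. It uses Python's `re` module as a stand-in for the Perl script or the TagSoup and XSLT steps mentioned above, and it runs against an inline string rather than a file fetched with wget or curl. The sample markup, the `/ebooks/` paths, and the names `SAMPLE_HTML`, `BOOK_LINK`, and `scrape_books` are all hypothetical, invented for illustration.

```python
import re

# Hypothetical sample standing in for a page already pulled down with wget or curl.
SAMPLE_HTML = """
<ul>
  <li><a href="/ebooks/101">Little Bo Peep</a> <span>Public domain</span></li>
  <li><a href="/ebooks/102">Mother Goose Rhymes</a> <span>Public domain</span></li>
  <li><a href="/about">About this site</a></li>
</ul>
"""

# The pattern encodes guesses about the page layout: book links are assumed
# to have an href of /ebooks/ followed by digits. Nothing on the page says so.
BOOK_LINK = re.compile(r'<a\s+href="(/ebooks/\d+)"\s*>([^<]+)</a>')


def scrape_books(html):
    """Return (path, title) pairs for every link that matches BOOK_LINK."""
    return [(path, title.strip()) for path, title in BOOK_LINK.findall(html)]


if __name__ == "__main__":
    for path, title in scrape_books(SAMPLE_HTML):
        print(path, title)
```

The weak point is the pattern itself: it is a guess about someone else's markup, so the script quietly returns nothing useful as soon as the site changes its layout.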

The second phase, implemented by sites that are willing to share some data but want to control that sharing, is the use of APIs like those provided by Amazon and Google, which the article covers. Since I first drafted this posting, the difference between scraping and API use became a big story in the data geek world after Facebook disabled Robert Scoble's account because he was beta testing Plaxo. This online address book service scrapes Facebook instead of using its API because Facebook's API doesn't provide address book information, and Scoble has enough Facebook friends that trying to scrape all that data violated Facebook's terms of service, or something. (If you think that this is a really big deal, Dan Brickley brings some much-needed perspective to it.)

The third phase of web-based data retrieval is the pulling down of data that was intentionally put into web pages for retrieval by automated processes. Unlike the data retrieved in the first phase of web data retrieval, this data goes into the web pages in a format that conforms to simple rules so that it's immediately usable, with no requirement for pattern matching and rearranging. Unlike the APIs of the second phase, the new data is retrieved with a simple HTTP request (perhaps wget or curl) with no need to provide a login or developer token or to make calls to specific processes that will then hand you the data if you make the calls correctly.
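As a rough illustration of data that "conforms to simple rules", the sketch below reads an event marked up with class names in the style of the hCalendar microformat (`vevent`, `summary`, `dtstart`, `location`). This is only one of several possible conventions, and the sample event, the `EventParser` class, and the `extract_events` function are hypothetical names made up for this example. It uses only Python's standard `html.parser` module and assumes the marked-up block contains no void elements such as `<br>`, which would throw off the simple depth counting.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the publisher has marked up an event with agreed class names.
SAMPLE_HTML = """
<div class="vevent">
  <span class="summary">Saturday matinee</span>
  <abbr class="dtstart" title="2008-01-12T14:00">January 12, 2pm</abbr>
  <span class="location">Main Street Cinema</span>
</div>
"""

# The only "rules" the consumer needs to know: which class names carry which fields.
FIELDS = {"summary", "dtstart", "location"}


class EventParser(HTMLParser):
    """Collects one dict per element whose class includes 'vevent'."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None   # dict for the event currently being read
        self._depth = 0      # nesting depth inside the current vevent
        self._field = None   # field whose text content is being collected

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if self._event is None:
            if "vevent" in classes:
                self._event = {}
                self._depth = 1
            return
        self._depth += 1
        names = classes & FIELDS
        if names:
            name = names.pop()
            if attr.get("title"):
                # A machine-readable value in the title attribute wins over the text.
                self._event[name] = attr["title"]
                self._field = None
            else:
                self._event[name] = ""
                self._field = name

    def handle_data(self, data):
        if self._event is not None and self._field:
            self._event[self._field] += data

    def handle_endtag(self, tag):
        if self._event is None:
            return
        self._field = None
        self._depth -= 1
        if self._depth == 0:
            self.events.append(self._event)
            self._event = None


def extract_events(html):
    """Return a list of event dicts found in the given HTML string."""
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print(extract_events(SAMPLE_HTML))
```

The consumer here never guesses at layout. It looks only for the agreed class names, so the publisher can redesign everything around them without breaking the retrieval.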

There are multiple efforts working in this area. The Linked Data, Semantic Web, and microformats movements all overlap to some extent, but I don't know of any single term that encompasses them all, unless an especially passionate advocate of one insists that the others are subsets of their work. The key difference between this work and the scraping described in the Wired article is that this third phase is about people putting up data that they want others to retrieve and use. I don't want you pulling my data and running it next to Google AdSense ads unless it helps me in some way. If the data consists of schedules for events that I charge money for, such as plane flights or movie showings, then I'm happy to let you drive more business to me. If I'm craigslist or Facebook, I just see you building a business model around my data with no benefit to me, and I don't like it.

Of course, it's not purely about sharing data to make more money; the academic world also has plenty of research efforts with good reasons to share their data. I'll be writing soon here about the value of seeking out organizations with strong motivations to share their data.


(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)


In the Linked Data community we see Linked Data as ground zero (foundation layer) within the Semantic Web stack. It's the part of the Semantic Web that deals with injecting structured data into the Web with Meshing/Mixing/Joining in mind from the outset (courtesy of HTTP based Data Object Identifiers or URIs, RDF Data Model, and HTTP Content Negotiation).

Hopefully, we might be able to use the term: Meshing as a constructive distinguishing mechanism between Web 2.0 and Web.vNext (3.0, Semantic Web, Semantic Data web, Linked Data Web, Giant Global Graph or whatever label eventually sticks).

Happy New Year!


As a big fan of Dan Bricklin's various work, I was disappointed to find that your intriguing link above points to my blog instead. Shome mishtake shurely?

Oops. Sorry. Corrected.

You danbris confuse me.