Publishing academic research data

My geeky perspective and some broader perspectives.
David Shotton's 5 stars of academic publishing

Along with Jo Rabin's talk that I mentioned here earlier this month, another inspirational talk in the recent XML Summer School Trends and Transients track was "Applying XML and semantic technologies to liberate infectious disease data" by Oxford University zoology professor David Shotton. He described how, while assembling a paper on leptospira infection in urban slums, he used data and metadata from the project to create the version described in a separate paper, Semantically enhanced version of a research article from PLoS Neglected Tropical Diseases. (Note the bottom of that page, where it lets you pull down bibliographic data in your choice of RDF serializations. Also, don't miss the semantically enhanced paper itself, and make sure to click around in it.)

After his presentation, one audience member asked how an academic department with limited resources and technical background could move in this same direction without attempting to reproduce the full infrastructure, and Professor Shotton suggested that they start by putting their research data on the web along with some metadata about it. This got me thinking about Tim Berners-Lee's Linked Data 5 Stars, a series of incremental steps toward publishing open linked data in machine-readable standardized formats. I raised my hand and suggested to Shotton that, building on his answer to that question, an alternative version of the five stars for academic researchers could provide a valuable guideline for others interested in following in his footsteps.

And he's done it! He just published "The Five Stars of Online Journal Articles" on his blog, which points to a longer version of the article that he's submitted to Nature. My original idea was more of a revision of Berners-Lee's five stars, but Shotton drew on his extensive academic publishing experience to bring in a lot of bigger-picture issues such as peer review and specific repositories that could host such data.
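To make Shotton's suggested first step (putting research data on the web along with some metadata about it) a little more concrete, here is a minimal Python sketch that writes a few Dublin Core statements about a dataset in Turtle. It is only an illustration, not anything Shotton or Berners-Lee prescribed: the function names, the example.org URLs, the creator name, and the choice of a CC0 license are all my assumptions, and plain string literals stand in for richer descriptions of the creator and file format.

```python
# A rough sketch, not a definitive approach: describe one dataset in Turtle
# using Dublin Core terms. All URLs and values passed in below are
# hypothetical placeholders chosen for illustration.

DCTERMS = "http://purl.org/dc/terms/"  # the Dublin Core terms namespace


def escape_literal(text):
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def describe_dataset(dataset_url, title, creator, license_url, file_format):
    """Return a small Turtle document with basic metadata about a dataset."""
    lines = [
        f"@prefix dcterms: <{DCTERMS}> .",
        "",
        f"<{dataset_url}>",
        f'    dcterms:title "{escape_literal(title)}" ;',
        # Simplification: creator and format are given as plain literals.
        f'    dcterms:creator "{escape_literal(creator)}" ;',
        f"    dcterms:license <{license_url}> ;",
        f'    dcterms:format "{escape_literal(file_format)}" .',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical dataset: none of these values come from a real project.
    print(describe_dataset(
        "http://example.org/data/field-survey.csv",
        "Field survey results",
        "A. Researcher",
        "http://creativecommons.org/publicdomain/zero/1.0/",
        "text/csv",
    ))
```

Saving that output next to the data file on a department web server would give other people, and their software, a machine-readable statement of what the file is, who produced it, and what they may do with it.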

I had been thinking about the potential of academic researchers publishing data using Linked Data principles before this year's XML Summer School; one reason I started the Charlottesville Semantic Web Meetup was to find people at the University of Virginia who were interested in pursuing this. I recently learned about someone else who's been thinking hard about issues around the publication of research data: UCLA's Christine Borgman, whose paper "The Conundrum of Sharing Research Data" appeared in the June issue of the Journal of the American Society for Information Science and Technology. (Click "One-Click Download" on that page to retrieve the paper itself.)

As I realized when I read David Shotton's article, I've been focused on the technical issues, but there are many others to consider. Here are a few quotes from Borgman's abstract:

This article explores the complexities of data, research practices, innovation, incentives, economics, intellectual property, and public policy associated with the data sharing conundrum.
Rationales for sharing data vary along two dimensions: whether motivated by research concerns or by leveraging public investments, and whether intended to serve the interests of researchers who produce data or the interests of potential re-users of data.
Four rationales for sharing research data are identified and positioned on these dimensions. Researchers’ incentives to share their data depend not only on these rationales, but on characteristics of their data and research practices, funding agency policies, and resources for data management. Much more is understood about why researchers do not share data than about when, why, and how researchers do share data, or about when, how, and why researchers or the public reuse data. The model and research agenda are illustrated with examples from the sciences, social sciences, and humanities.

Here's one quote from the main body of the article:

If the rewards of big data are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others. Underlying this simple statement are thick layers of complexity about the nature of data, research, innovation, and scholarship, incentives and rewards, economics and intellectual property, and public policy.

Her paper goes on to describe these layers. And I have to love any academic paper that refers to a "dirty little secret." I'll let you find that part yourself. While Borgman's paper doesn't get down to the level of data models and serializations for sharing data, if you're at all interested in how Linked Data may benefit the academic research world, her paper is really worth reading.
