What is Shared Data?

Wherein I reiterate what others have said and express frustration with The Powers That Be

Of all of the problems in the world of data sharing, the most frustrating to me is data that are shared but are completely unusable to others. As I've argued before (and of course, I'm not alone in this argument), the point of all of this data sharing and management business is to facilitate access and reuse. Now, access seems to be moving along at a not unreasonable pace. More and more I'm seeing data deposits in our local repository and, even more so, data sharing through journals in the form of supplementary data. General purpose repository services (e.g. figshare, Dryad, etc.) seem to be doing pretty well for themselves (right?), and a number of high-profile journals have implemented requirements of one sort or another for data sharing (cf. PLOS).

But the usability part seems to be a real sticking point for some. Poking through journals looking for shared data, I regularly (more times than not?) find data shared in the following ways (note, these are real examples, and I've seen all of them many, many times):

  • Data are shared as PDFs (or other document type) - for instance, a table of numbers in a PDF, a set of figures in a PDF
  • Claims that "all the data are in the paper" when, in fact, it is clear that this is not the case
  • Data are shared, but do not include any documentation, code books, data dictionary, or metadata
  • Shared data are supplementary figures
  • Links to the source of the raw data (e.g. a database of some sort) with insufficient documentation for how to query the database to get ahold of the same dataset used in the research (e.g. a link to lpdaac.usgs.gov/data_access)

These are just a few of the usability issues I've seen with shared data, and what this amounts to is actually not sharing. It's almost worse to share this stuff than to share nothing at all.

But what should shared data look like? Lucky for us, there are loads of resources out there offering best practices for data sharing (cf. White et al. 2013, Kervin et al. 2015). But I think a lot of this just comes down to common sense and, as I've mentioned, remembering the point of all of this sharing business: can someone take action on your data (e.g. in a statistical software package) without an inordinate amount of work such as reformatting, and without sending you any emails or calling you on the phone or knocking on your door? If the answer is "yes", you've shared your data; if the answer is "no" or "meh", you ain't shared nothing.
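
To make that test a little more concrete, here is a rough sketch in Python of what "taking action" on shared data might look like. Everything in it is made up for illustration: the tiny dataset, the data dictionary, the column names, and the two helper functions (undocumented_columns and column_mean) are all hypothetical, and the standard library's csv module is just standing in for whatever statistical package you prefer. The idea is simply that a plain CSV plus a data dictionary can be loaded, checked, and summarized without reformatting anything or emailing anyone.

```python
import csv
import io

# Hypothetical shared dataset, inlined so the sketch is self-contained.
DATA_CSV = """site_id,sample_date,temp_c
A1,2014-06-01,18.4
A2,2014-06-01,19.1
"""

# Hypothetical data dictionary describing each column of the dataset.
DICTIONARY_CSV = """column,description,units
site_id,Sampling site identifier,
sample_date,Date of sample (ISO 8601),
temp_c,Water temperature,degrees Celsius
"""


def undocumented_columns(data_text, dictionary_text):
    """Return the data columns that have no entry in the data dictionary."""
    data_reader = csv.DictReader(io.StringIO(data_text))
    dictionary_reader = csv.DictReader(io.StringIO(dictionary_text))
    documented = {row["column"] for row in dictionary_reader}
    return [name for name in (data_reader.fieldnames or []) if name not in documented]


def column_mean(data_text, column):
    """Average a numeric column straight from the shared file, no reformatting."""
    rows = csv.DictReader(io.StringIO(data_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)


if __name__ == "__main__":
    # An empty list means every column is documented.
    print(undocumented_columns(DATA_CSV, DICTIONARY_CSV))
    # A number here means the data could be analyzed as shared.
    print(column_mean(DATA_CSV, "temp_c"))
```

A table of numbers locked in a PDF, or a file with no data dictionary, would fail this kind of check at the very first step, which is more or less the point.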

Another part of this discussion - related but separate - is why this is happening. One obvious answer is that data sharing is not normal practice for researchers in many fields, and it will take time for folks to get used to doing it in the right way. I'm totally okay with that - I've been on both the giving and receiving end of the sharing question, and I fully understand the overhead of preparing data and the lack of clarity and training in this area. So, I'm not really going to pick a fight on that front.

However, the lack of institutional (universities, funding agencies, journals, etc.) scrutiny of the data sharing question seems pretty problematic to me. I'm going to pick on journals here because this is where I am personally seeing a lot of problems with sharing (I'll spare funding agencies and universities for now...). My issue with what I'm seeing in journals is this: if a journal doesn't have policies or guidelines for data sharing, but researchers are sharing data through it anyway, it needs a policy and guidelines - full stop. I'll bet my dollars that, without guidance, sharing is being done poorly in many if not most cases. And if a journal has a policy calling for data sharing, then it had better be prepared to vet the data being shared in the journal and to enforce that policy by determining whether the data are being shared in a meaningful way.

If a journal (or agency, or university) policy or mandate isn't being met, something should be done. Does that thing need to be drastic? No, but I think it does need to enforce some standard for quality data sharing, instead of letting researchers think their data have been shared adequately when, in fact, they have not. I'm always going to come back to the fact that if the shared data aren't comprehensible, then the data haven't been shared.

The excuse of "we're all new at this" is really only applicable, in my opinion, to the nuances and vagaries of domain-specific data sharing practices or data formats. There are many basics of data sharing, however, that I think we can mostly agree on (or if we can't, we're in serious trouble), and I think these basics are where the bar should be set (for now). Can we all do this better? I think we can.