hard-assing the repo

I've just returned (a couple of weeks ago now) from the annual Research Data Access and Preservation (RDAP) Summit in Minneapolis. Apart from being a great conference this year (as always), I've come away with what at first I thought was a bad attitude, but I now realize was a fire in my belly for changing the way we talk about ingesting data into our repository. What does that even mean? It means I'm going to start being a hard-ass about data deposits. 

It's probably important to back up a bit and explain. The repository in question is a typical (though pretty big) institutional repository at a research university. Over the past few years we've started accepting datasets into the repository and building out our services around research data management. As those services were growing, the focus was, understandably, on driving data deposit traffic to the repository and sorting out the details of how best to gather sufficient documentation to make data usable. In some ways this approach was necessarily vague and unformed, but the result is that it empowered depositors to submit datasets that were not documented or formatted well enough to share properly.

Fast forward to about two months ago. We received a large supplementary data file with a dissertation that included scads of poorly documented data - it appeared to me that the student had zipped their working directory and sent it along with very little thought given to whether the data were actually usable. How prevalent was this problem? I decided to take a look - and the results were disappointing, though not unexpected. Zip files associated with scholarship (e.g. dissertations) in our repository were replete with problems ranging from unreadable formats to spreadsheets with macros that crashed my machine. Format and obsolescence problems aside, almost none of the data was documented in any way whatsoever. None - no readme files, no metadata, not even field descriptions in spreadsheets.
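If you want to run a similar audit on your own repository, the first pass doesn't have to be fancy. Here's a minimal sketch in Python - the `deposits/` directory and the loose definition of "readme" are my assumptions, not how our repository actually works:

```python
import zipfile
from pathlib import Path

def has_readme(zip_path):
    """Return True if the archive contains any readme-ish file at any depth."""
    with zipfile.ZipFile(zip_path) as zf:
        return any("readme" in Path(name).name.lower() for name in zf.namelist())

def audit(deposit_dir):
    """List the zip deposits in deposit_dir that ship with no readme at all."""
    return [p.name for p in sorted(Path(deposit_dir).glob("*.zip"))
            if not has_readme(p)]
```

A real audit would also want to flag proprietary or obsolete formats inside each archive, but even this crude readme check surfaces the worst offenders quickly.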

I laughed. I cried. I considered a different career path. 

In the end, though, this experience made me realize a few things:

  1. The point of all of this data management stuff, really, is to facilitate open science and the scientific process. Opening up science is the reason we're seeing an increase in funder mandates for data management and sharing, and it can be an effective route to better and more impactful science. So that's the goal - opening up science - and that is going to be the goal from this point forward.
  2. File formats are important. Against my better judgement, I'd long thought that if we could just get data IN the repository, that would be a huge step in the right direction. But if that data isn't usable because the format is funked-out, then what's the point? Flatten those files! Make them non-proprietary! Assume the worst!
  3. Documentation is important and even a tiny bit can make a huge difference. Those zip files with a simple readme, a spreadsheet with a readme tab, any documentation will help. 
  4. We might be setting the bar too low for minimum participation. The "please deposit, we'll take anything" approach can be useful for getting services up and running, but at some point that insanity has to stop. Proper formatting and documentation need to happen in order for any of this stuff to be usable in the future, and that's the point, right?
  5. It's time to be a hard-ass about this. I'm ready to reject content and/or make it unavailable if depositors can't meet basic data sharing requirements.
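For what it's worth, the "tiny bit" of documentation in point 3 doesn't have to be elaborate. Something along these lines - the fields and contact details are invented for illustration - would put a deposit ahead of nearly everything I found in those zip files:

```
README.txt

Title:        Stream temperature measurements, 2010-2012 (example)
Creator:      Jane Doe, jdoe@example.edu
Description:  What the data are, how and when they were collected.
Files:        data.csv - one row per observation
Fields:       site_id; date (YYYY-MM-DD); temp_c (degrees Celsius)
License:      How others may reuse the data.
```

Ten minutes of the depositor's time, and suddenly the dataset has a fighting chance of being reusable.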

Now, pardon me while I go rewrite some policies and craft emails to ill-behaved depositors.



tl;dr - crappy data deposits in the repository have to stop. open science is the point of all of this business, and if you can't even be bothered to document your data so that it is reusable, it's going to be very hard to convince me to accept it into this here data repository.