Part Deux - Prioritization of Content Types

Horse pulling plow in an orchard, southern Oregon?

Oh, hey, I’m back! So, last time we talked about event-based preservation. At the end of that post, I posed a number of questions about how to move forward with this discussion/idea. I’d like to dig in on one of those questions here: What is the right balance of technical vs. art vs. sociological vs. ??? content for this corpus of recovery-related information/knowledge? Recall that the goal here is to consider information preservation in the context of catalyzing recovery after a catastrophic societal collapse. So what types of information are helpful on that front? Here is my ranked list of information types for this style of preservation.

  1. Applied Technical - Most Important - these are things like technical reports, agriculture guides, engineering guides, etc. What do we get from this information that is so important? Primarily we would hope to use this information to rebuild physical infrastructure. This is how we get our water purification, electricity generation, crop production, and food preservation back online. Depending on the speed and severity of collapse, these elementary skills and technologies may be of utmost importance for stabilizing society and setting the recovery trajectory.

    As a librarian at a Land Grant institution, I am pretty familiar with this type of content through our interactions with the State Extension Office. Many of these applied technical materials assume an element of DIY-ness, and especially older content (e.g. from the turn of the 20th century) is a treasure trove of how-tos and best-practices.

  2. Art - Medium Important - a broad category, to be sure, but contains music, literature, visual arts, performance arts, and so on. Why medium important? Well, this is the stuff that feeds the soul when everything else is in the pits. Need to unwind after a hard day of building a grain mill out of local stone? Maybe picking up a technical manual isn’t the best choice. Reading a collection of short stories, or looking through a collection of paintings from days of yore is a better choice.

  3. Humanities/Social Science/Science Academic Writing - Least Important - again, a pretty broad category. This is the stuff we often find in our institutional repositories and journals, and it is usually pretty esoteric. Important? Yes! Critical for catalyzing recovery? Unlikely. Now, this is not to say this stuff is not important, but the level of specificity here is likely to be hard to reconcile with basic survival and reconstruction.

Maybe there is something here: physical infrastructure vs. spiritual infrastructure vs. intellectual infrastructure. Again, these are all important things, but in prioritization, context is critical.

Selection is another issue - how does one select which information is most relevant? Where does this information come from? I think this is a much larger issue that I can’t muster the energy to discuss here today, but I’ve got some thoughts bouncing around that might be helpful. Will update on this topic at some point in the future.

Of course, sources of information for preservation are going to vary regionally, by organization, etc. It doesn’t necessarily make sense for one institution to attempt to preserve information for recovery if that institution currently retains only a subset of the useful information. The university I work for has a long history of agriculture and engineering research and outreach, but is particularly lacking in the areas of the arts and humanities. Down the road, another state institution has a relatively stronger history in the areas of architecture, law, and the humanities. Does it make sense to build regional caches of preserved information for recovery? How do we define those regions?

I think that’s all for now. Let’s keep talking about this (if you’ve engaged with me on the topic already) or start talking about it (if you haven’t). Looking forward to more discussion.

Event-Based Curation and Preservation - Considering the Collapse

I’ve long had an interest in the end of the world. The 1980s, the core decade of my childhood, was replete with books and films about pre, post, and mid apocalypses - nuclear, robot, and otherwise. Over the past decade, my interest in this topic has gained momentum again, aligning with the birth of my children and the ever-looming sense that humanity has done too much to the world, and to ourselves, to expect that we won't pilot this ship directly into the iceberg. Of course, during this decade I’ve also switched from research-land to library-land, with responsibilities that swirl around protecting humanity from losing the knowledge we’ve gained over a relatively short period of time.

These losses have happened in the past - we’re constantly reminded in library-land school that the great Library of Alexandria was once a global center of knowledge that was lost, apparently not due to the apocryphal fire, but due to purging of academics and loss of funding. Regardless of cause, the loss resulted, surely, in loss of information and knowledge gathered over centuries. Some of that information has come back to us; more than likely some of it will never come back. There are plenty of other examples of catastrophic losses of information in humanity’s past, some slow motion losses, some sudden. Some losses local, some regional, some national. Decades and centuries later we have the benefit of hindsight and can say to ourselves, “we’re doing fairly well now*” and the stories of knowledge lost now serve more as Preservation Parables than as actual examples of why preservation is important now.

Of course, there are loads of us out there thinking about preservation of digital and analog things, and we think about this at different time and space scales. But some recent reading and a re-engagement with that apocalyptic fascination of yore has me thinking about a particular point in time around an apocalyptic event that I don’t hear discussed much, if at all, in our field of preservation and access** to information - how information preservation might impact recovery from catastrophe.

From Baum et al. (2018). Long Term Trajectories of Human Civilization. Foresight. DOI 10.1108/FS-04-2018-0037

Take the figure here from Baum et al. (2018) - the trajectories presented represent, on a time series, theoretical current societal trajectories, catastrophe-related trajectories, and, important to this discussion, a variety of recovery trajectories. Baum et al. discuss recovery from catastrophe to regain civilization’s current agricultural or industrial condition, but little is discussed about the inflection point where catastrophe meets recovery. The question I have is this: what role does/could curation and preservation of information and knowledge play in affecting the inflection point? How does access to information affect the rate of recovery?

After Baum et al. (2018). Theoretical civilization recovery trajectories impacted by differential availability of recovery-related information. The inflection point in the red circle is where I argue we ought to focus event-based curation and preservation to impact recovery trajectories.

Let’s look at a version of these curves a little more closely. Currently, we have digital preservation thinking for cultural heritage happening at a few places along this curve.***

  1. Short Term - This is backups, local restores, and maybe some mirroring to other institutions (cf. MetaArchive) or regions (as is commonly allowed with cloud services).

  2. Long Term - This is, it seems to me, what we typically think about when we discuss digital preservation. One of the core assumptions about this kind of preservation is that the current (or more advanced) state of technological infrastructure will be available when we need to retrieve our content.

  3. Super-Long Term - this is not usually a time scale or set of conditions we in cultural heritage typically (I think?) consider seriously, but it is a time scale and set of conditions that folks like the GCRI and the Long Now Foundation take seriously. One of the core assumptions here seems to be that there will be a loss of access to our current technological infrastructure, and that rediscovery of knowledge requires considering how we can share information at very long time scales. This might look like a reconstruction of technological infrastructure to help provide access to content (like for a DVD made of special materials) or deciphering a physical (usually) information storage system in order to access critical knowledge.

Short, Long, and Super Long Term Preservation time scales. The highlighted red section is where event-based preservation comes in. How do we curate and preserve for the event?

There is a way of thinking about preservation that is missing from these approaches, which are all fixed at temporal scales. How could an event-based conception of preservation change the way we think about curation of, preservation of, and access to content? In some ways, the disaster recovery plans we often have in place for cultural heritage institutions address a type of event-based need. Typically, though, these disasters (though disastrous indeed) are relatively contained in space and time to such events as a water leak, fire, or temporary power grid disruption. By bumping up the severity of the disruption event, we are forced to think about similar issues (i.e., if the power grid goes out for a year, what would we do?), but also to think about larger issues (i.e., if the power grid goes out for a year, how do we rebuild the power grid?).

Should we seriously plan for major societal collapse events from a cultural heritage perspective? I think the answer to this is “yes”, in the same way that the answer to “should I earthquake-proof my house” is “yes” - the probability of the event is low, but the potential impact of not preparing is high enough that it is worth putting in some effort now. If the answer to this question is “yes” then a whole host of follow-on questions raise their heads, including, but not limited to:

  1. How to make information/knowledge available given a disruption of current technological infrastructure? If there is limited electricity, if there is loss of internet, how do we access digital content?

  2. What information/knowledge might need to be available in such a disruption? How could this information affect the trajectory of the recovery curve? How does this need change regionally? Based on event type (earthquake, tsunami, war)?

  3. If we assume that we will be unable (or severely challenged) to access digital content, how do we format the information for access?

  4. What is the right balance of technical vs. art vs. sociological vs. ??? content for this corpus of recovery-related information/knowledge?

  5. Are there events we (humanity) have experienced relatively recently that can help answer these and other questions, or that can raise other important questions?

There are so many other questions here, and so many assumptions embedded in the above. I’m interested in continuing this line of thought and doing so in conversation with my broader communities. Leave comments, email, tweet, visit my house - let’s chat.

*I know, this isn’t true, we’re not doing fine. It's just, we sort of tell ourselves we are fine in order to avoid thinking about how treacherous things are. That’s a separate blog post, but I get it.

**I know, preservation and access are different things. I get it. But they are coupled. Again, different blag post.

***I would be delighted to hear from archivists where they think their preservation activities and planning fit on this curve (and plan to engage my local community on the topic when I can get them in a room).

Invisible Work

What do i even do all day?

Over the past few months I've been struck by two things:
1) I don't have a strong sense for what my colleagues do all day long
2) Many of us (myself included at times) make some baseline assumptions about what people do, or don't do, with their time.

This presents itself on a daily basis as either frustration that I don't know what is even happening in my organization or a sense that others may think that I'm not doing my work because my job is opaque to them. So, to help clear the air around this, below is what I did yesterday, including all the time I spent not working. For background, I am a digital repository librarian at a research university library. My job consists of managing an institutional repository, project managing and consulting on tech projects in our library, helping manage/set direction for research services in our library, and acting as product owner for Hyrax, a repository application based on the Samvera framework.

So, without further ado, the fascinating details:

8:15 am - arrive at work. Thursdays are the day we normally have a long meeting in the morning with our entire department. This is at once a department meeting (for updating the department on business matters), a weekly standup meeting (to discuss what we've been up to and what we are planning for the next week), and a working meeting (after the other two topics, for issues that require longer discussion/work). Knowing this meeting is ahead, I spend the next hour or so prepping myself for the meeting and the rest of the day, catching up on email, and reviewing 3-4 GitHub project boards for projects I'm involved with. I also check Slack for my organization, my department, and the Samvera Community to see if there is anything I particularly need to pay attention to in the coming hours (anyone panicking, any outstanding questions, anyone looking for help but not receiving it, etc.).

9:45 am - Head upstairs to my cube to drop my stuff off, say hey to everyone before the meeting starts.

10:00 am - Department Meeting. Starts with normal "standup" format (though on a weekly time frame): everyone goes around the room, says what they've been working on this week, what they're working on in the coming week, identifies any blockers that someone else (sometimes me) can help resolve, and offers up any issues or topics that need longer discussion. These last items are written up on the whiteboard and we'll work from this list for the remainder of the meeting.

We don't have any department meeting business today, so after standup we head straight into the longer form topics including such delights as:
- defining a better process for closing GitHub issues and indicating whether new code has been deployed to our staging or production servers.
- discussion of our need to define our internal processes and expectations for project management, code review, GitHub repository maintenance, and other procedural things. This is all put off to a meeting next week where we'll discuss this topic specifically, though we have a lot of good ideas flying around...
- determine our process for managing bulk metadata changes in our institutional repository. The software we use doesn't (yet) have a good UI for making these changes, so we have to coordinate between our Metadata Technician and our development team to get this work done. We decide on a process that we'll pilot over the next week.
- discussion and implementation of new labels in our GitHub repositories and what they mean. e.g. when something is labelled "critical" that means it's something that needs immediate attention (it's a blocker or something is really broken) versus when something is labelled "high priority", which is more of a "when you get time to work on the institutional repository, do these things first"
- a couple of other topics that are not particularly germane to my work, so I kinda space out and participate in some discussion on Slack about a multi-month development effort I am coordinating in the Samvera Community.

11:30 am - I head over to a meeting/lunch for our proposed faculty union and wind up talking with library colleagues about people management, project management, and other high-ish level administrative stuff in the library.

12:30 pm - Back at the library, I run to a meeting and the person I'm meeting with can no longer meet due to a conflict. I check in with a co-worker on a project he's been working on related to another project I'm involved with. He's made amazing progress on a web-application for displaying oral histories and I make sure to tell him that he is a wizard and his work is great.

1:00 pm - Meeting with a colleague to clarify details of a large project to ingest the proceedings of an annual conference into our repository. These go back to the 1960s and the organizers of the conference were delighted to know that we would work with them to upload the content. There are lots of fiddly details and it is really a lot of content, so it's been a bit of a struggle to get our heads around the best approach, but we're getting there.

We also chat about the current state of digital scholarship services in the library. There is a contingent of faculty in the library seeking to coordinate efforts around these services and we're in the process of writing a proposal to that effect, but, again, fiddly details. This is exhausting, as we've been discussing this for over a year, but there is a lot to coordinate and not a lot of resources to go around, so we have to be patient.  

1:30 pm - a moment of down time. drink some water. drink some coffee. I check the news (why do I do that?) and try to read something not news related - this time an update on the archival preservation media (created by the Arch Mission) sent into space on Elon Musk's Giant Rocket. The Arch Mission is a bit of a mystery to me, but I'm curious about the technology and the data curated for the project - trying to keep an open mind, but, you know, I'm a bit skeptical given that the entire project has zero information workers involved...

2:00 pm - I attend a meeting with colleagues here and at a university down the road to discuss an upcoming set of Library Carpentry workshops we're going to hold (one at each institution). We've got the registration form out, the classes are filling up, and we're working through the logistics. We have enough interest that maybe we need to hold an additional workshop, so we discuss what that would look like. Next week we'll meet again and review the curriculum and our syllabus.

2:30 pm - I meet with two of our software developers to finalize what needs to be done with a library wayfinding application we're building before we launch. We review existing GitHub issues for the project, move many to the backlog (not MVP), and clarify any new work that we need to do based on current functionality. I then sit with these great coworkers and do what I can to clarify issues, discuss user interface and user experience concerns, and help resolve a few issues (what logos should we use? how should highlighting of areas on the map work? etc.). I have to say, I love working meetings, I really do.

4:15 pm - That meeting over, I stop back at my desk to pack up for the day. I spend a bit of time clearing the email inbox, flagging items to do the next day (including writing this post), and looking over Slack, again, for any issues that have emerged that I should be aware of. I note some discussion among the Samvera community members related to an ongoing development effort and write down some notes to discuss with the team tomorrow morning.

4:45 pm - head home. spring is here in the Pacific Northwest!

5:15 pm - jump back on the computer while the kids get ready for evening activities. respond to a few flagged emails, make additional notes about tasks to do tomorrow, and look over the Samvera Google Groups to see if there are any posts I should peek at.

5:45 pm - call it a day.

On changing direction

Well, here's a thing. Over the past year my position has been squished around a lot. Like, A Lot A Lot. I came into this institution, and this field, with a deep interest in research data management and the issues therein, and here, just a handful of years later, I find myself heading deep into the world of open source software projects. How'd that happen? Let's see...

At my previous place of work I was deeply involved with the creation of a nascent data services program. We didn't get everything done there (who does?) but we did get the machine started and, hopefully, I left something behind that those who followed found useful. In moving to my current institution I was tasked with splitting my work across data services and management of an institutional repository. In a way this made a lot of sense - we had a data specialist taking care of all of the public facing data work, and doing a great job, and i gather what was expected was a little more of the back end data work in the repository. Super fun, i thought - repositories are pretty cool (though I had no experience as a repo man at that time). Then we decided to launch a repository migration project and here I sit, two years later, doing exclusively repository work - product owning, going to repository conferences, sitting in on calls, all that good stuff. 

I love this repository work, it's way more exciting than I thought it could be. I get to work with great people here and at institutions the world over, have interesting problems to solve, and feel like I'm contributing to something big (ish). But, data sits on the back burner for me now and that's something I feel a bit sad about sometimes. Jobs change, sure, but I've passed by my raison d'être and moved on to other things. I'll still see the data folks I've grown to love, but less frequently than the already infrequent amount. Hopefully we can all stay in touch.




Aliens, Astronomers, and Reproducibility*

KIC 8462852 has been all the rage (sort of) in the SETI community over the past few months after an unexpectedly large dimming around the star was observed by astronomers (see Boyajian et al. 2015). The long and short of it is that one explanation for why the dimming was happening was that there could be an "alien megastructure" around the star that was big enough to cause the dimming (Wright 2015). Across a number of publications and discussions, the original findings and the alien megastructure suggestion have been investigated further and challenged several times. It is all a bit of a soap opera - but it is also a great example of how science works.

The reason I'm talking about this here is 1) I'm fascinated by SETI. But more importantly, 2) recent developments on this question highlight a major problem we see in the scientific literature - improperly documented data and methods leading to confusion and lack of reproducibility. To wit - more recent investigations of KIC 8462852 (e.g. Schaefer 2016) have suggested long-term dimming of the star based on the data provided via the Digital Access to a Sky Century @ Harvard (DASCH) Project. A response paper meant to reproduce Schaefer's results (Hippke and Angerhausen 2016) doesn't show such dimming - thus the confusion.

Now the bit about reproducibility - in a recent interview with the WOW! Signal Podcast, Dr. Jonathan Grindlay, Principal Investigator at DASCH, indicates that he suspects differences in findings using DASCH data are likely due to the problem of not knowing what calibrations and data flags Schaefer used when retrieving data for his analysis. The more recent version of Hippke and Angerhausen's paper also discusses these issues. Here we have an academic discussion in the literature that could have been almost entirely avoided if the original author had adequately described the methods used to retrieve data from a data source. Reproducibility is hindered, results are not verifiable, and a super interesting (at least to some of us!) line of research has been clouded with uncertainty because of bad documentation. Now Dr. Schaefer isn't entirely at fault here - scientific norms have created a host of barriers to best practices for data sharing, and we know (see the next paragraph) that this kind of problematic sharing behavior comes up all the time.

In our forthcoming paper (Van Tuyl and Whitmire, in press)**, Amanda Whitmire and I note this type of problem appears pretty regularly when faculty share data. It is not uncommon for researchers to "share" their data by providing a link to the data source, which, it turns out, is really just a link to the landing page for a database. Ultimately, in these cases, someone attempting to access the same data is unlikely to be able to track down the exact data the authors used, making reproducibility impossible. So what can be done? Here are some thoughts:

  1. If you share data that you extracted from a database, consider sharing the actual data you extracted, not a link to the database. Even if your methods are described very explicitly, database structure, access methods, and content can change over time.
  2. If sharing a snapshot of the data is not possible, be very explicit in your paper (or supplementary materials) about how you extracted the data. The details are extremely important. Also consider asking someone else to trace your methods to see if you get the same dataset out of the database.
  3. [Meta Problem] Maybe we need to have better ways to document, cite, and trace data extractions from these types of databases to help resolve these reproducibility issues. Not sure exactly what to do here, but one could imagine a data download being tied to a persistent and unique identifier (URI, DOI, somethingOI).
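
To make suggestions 1 and 2 a little more concrete, here is a minimal sketch (in Python) of what recording a data extraction might look like. Everything in it is an assumption for illustration: the function names, the manifest fields, the example URL, and the query parameters are all hypothetical and do not reflect how DASCH or any real database works. The idea is simply to save the extracted bytes alongside a small record of where they came from, how they were requested, when, and a checksum that someone else can use to confirm they retrieved the same thing.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_extraction_manifest(
    data: bytes,
    source: str,
    query: dict,
    retrieved_at: Optional[datetime] = None,
) -> dict:
    """Describe one data extraction (hypothetical manifest layout)."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source": source,                # where the data came from
        "query": query,                  # exact parameters/flags used
        "retrieved_at": retrieved_at.isoformat(),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the extract
    }


def matches_manifest(data: bytes, manifest: dict) -> bool:
    """Check whether a re-extracted dataset is byte-identical to the original."""
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


if __name__ == "__main__":
    # Placeholder extract and parameters, not real observations.
    extract = b"id,value\n1,0.5\n2,0.7\n"
    manifest = build_extraction_manifest(
        extract,
        source="https://example.org/some-database",
        query={"target": "example-object", "calibration": "example", "exclude_flags": ["example-flag"]},
    )
    print(json.dumps(manifest, indent=2))
    print(matches_manifest(extract, manifest))
```

A manifest like this does not solve the meta problem in suggestion 3, but shared alongside a paper it would at least let a reader tell whether a mismatch in results starts with a mismatch in the data.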

In the meantime, let's keep an eye out for more aliens, eh?


*Amateur Alert - look, I ain't no astronomer, but i've been following this discussion in popular media for a while. Anyone who knows more about this stuff than me, please feel free to make corrections or nudges in comments below.

**This paper is due out, like seriously any day now. Keep checking the DOI below!


Boyajian et al. (2015). Planet Hunters X. KIC 8462852 - Where's the Flux? arXiv:1509.03622

Hippke and Angerhausen (2016). KIC 8462852 did likely not fade during the last 100 years.

Schaefer (2016). KIC 8462852 Faded at an Average Rate of 0.165+-0.013 Magnitudes Per Century From 1890 To 1989.

Van Tuyl and Whitmire (in press). Water, water everywhere: Defining and assessing data sharing in academia. PLOS One. 10.1371/journal.pone.0147942

WOW! Signal Podcast. Burst 11- DASCH Photometry with Dr. Josh Grindlay. [Retrieved 2016-02-11]

Wright (2015). [Retrieved 2016-02-11]

Am I here to make myself redundant?

Warning: half formed thoughts ahead

Self-preservation is a terrible motivator in the workplace. Really it is. Who wants to go through their day just trying to keep their job? I mean, don't get me wrong, I like my job and I don't want to be without one, but it can be hard to stay on target with persistent double vision: do good work and make sure you keep your job.

By keeping my job, here, I mean "getting tenure." I'm lucky (??!?!?!) enough to be in a tenure-track position at a large university, which brings with it all of the immediate horrors and potential glories of such an esteemed position. And by doing good work, here, I mean doing things that are meaningful and impactful. Obviously one can do both - these two visions can be aligned perfectly, or nearly so, such that doing good work results in keeping the job (getting tenure). But this isn't always the case. I see folks around me (in my field and others) who are highly motivated by keeping their job and I wonder whether their good work suffers. 

I think it pays, and certainly has for me, to stop all the work and reflect on why I'm choosing to do what I'm doing. Is this project I'm working on useful for something other than churning out a publication or a talk or a poster? Am I contributing to a useful conversation with this work? Is anyone better off for this work being done? The answer is usually fuzzy, but I hope it trends towards "yes, this is useful and it is good work."

In many ways, the work that I do (facilitating the process of faculty sharing their scholarship, more or less) is the kind of thing that we'd (libraries) really like to not have to do. We'd like faculty to just want to do the sharing and know how to do it and have tools to do it on their own. In the end, the successful outcome of my position is the redundancy of my position - If I'm being honest about what I do, I should be trying to work my way out of a job.


The Model We Use for Research Data Services

Over the past few months I've been reconsidering my long-standing assertions/assumptions about the necessity of library involvement in research data services. For the extent of my admittedly short career in the research data services world, I've convinced myself, and I'm not alone, that libraries have a natural place in data services due to our long-standing tradition of making information accessible to others. It is a common refrain, but I'm not really convinced by it any more. 

I want to be careful about what I'm *not* saying here. I'm not saying that libraries don't or shouldn't have a place in data services. I'm not saying that libraries are doing it wrong or badly. And I'm not saying that everyone should drop everything and do something different. 

What I am saying is that I think there are other models for how research data services could be provided at a university and very few of them, as best I can tell, have been tested. As someone at a university that may be in the (questionably) luxurious state of having an existing data services program *and* an opportunity to rethink how we structure data services, I figured this would be a good time to try to take these assumptions apart a bit and examine the pieces.

I know that the research data services program I helped build from scratch at my previous place of work was modeled on successful programs I saw elsewhere (Cornell, Minnesota), and the research data services at my current place of work have tried to move in that direction, too. That model, the now classic trio of Libraries, Research Office, and Central Computing, is useful and logical in many ways. But the truly logical (I think) units in that trio are the Research Office (for compliance) and Central Computing (for infrastructure) - the Library fits in that trio for less obvious reasons. One of the reasons libraries fit into this space is very important, though: libraries asserted themselves in this space and filled a vacuum. Few were stepping up to the plate way back when, and libraries took on the task of anticipating and filling a void. Someone has to actually *do* this stuff, and libraries have stepped up in a major way.

But what other models could there be? Well, I think it helps to consider the components that are required. In my estimation these include:

  • Computing Infrastructure - "Obvious" stuff including storage and backup/replication, but also discovery, hosting, online tools, etc.
  • Compliance Infrastructure - Making sure researchers do all the required data management things so the money stuff keeps flowing
  • Outreach and Education - Facilitating data activities and helping researchers understand best practices
  • Coordination - Central body to ensure services are being provided in a useful way and that researcher needs are being met

I don't think I've left anything major out of that list. And I don't think there is anything in that list that specifically calls for the library to be involved. Now, depending on your institution, the library actually might be the best unit to fill one or all of those roles. On the flip side, depending on your institution, the library may not be the best unit to fill any of those roles. What would that look like? Here are some examples one might consider at a large institution like mine - all without libraries:

  1. College level outreach, education, and computing infrastructure coordinated by the research office
  2. College level outreach and education, centralized computing infrastructure, coordination by the research office
  3. Research office coordinated outreach and education with centralized computing infrastructure

Of course, each of these models has its own problems including but not limited to issues of trust, recognition of competency of service providers, costs, etc. But those issues do not necessarily go away when the library is involved. 

I'll also note that there are institutions that are doing great research data services work that do not include libraries - look at domain repositories like some of the NASA DAACs or at other long-standing data providers like the National Weather Service*. 

Where does that leave us? Not sure where this leaves you, dear reader, but it makes me want to step back and think about how the services that are needed by researchers could be provided differently at my institution. Should we try to lean more heavily on our university colleges? The Research Office? Computing? I think the answer to all of these might be yes. Should our goal in the library be to focus our role around the repository aspect of our work, outreach and education, coordination? 

Discuss, please. Help me think through this beast.


*pretty sure neither of these explicitly includes libraries in their research data services/curation/sharing but correct me if I'm wrong


What is Shared Data?

Wherein I reiterate what others have said and express frustration with The Powers That Be

Of all of the problems in the world of data sharing, the most frustrating to me is the data that are shared but are completely unusable to others. As I've argued before (and of course, I'm not alone in this argument), the point of all of this sharing and management of data business is to facilitate access and reuse. Now, access seems to be moving along at a not unreasonable pace. More and more I'm seeing data deposits in our local repository and, even more so, data sharing through journals in the form of supplementary data. General purpose repository services (e.g. figshare, Dryad, etc.) seem to be doing pretty well for themselves (right?), and a number of high profile journals have implemented requirements of one sort or another for data sharing (cf. PLOS). 

But the usability part seems to be a real sticking point for some. Poking through journals looking for shared data, I regularly (more times than not?) find data shared in the following ways (note, these are real examples, I've seen all of these many, many times):

  • Data are shared as PDFs (or other document type) - for instance, a table of numbers in a PDF, a set of figures in a PDF
  • Claims that "all the data are in the paper" when, in fact, it is clear that this is not the case
  • Data are shared, but do not include any documentation, code books, data dictionary, or metadata
  • Shared data are supplementary figures
  • Links to the source of the raw data (e.g. a database of some sort) with insufficient documentation for how to query the database to get ahold of the same dataset used in the research (e.g. a link to

These are just a few of the usability issues I've seen with shared data, and what this amounts to is actually not sharing. It's almost worse to share this stuff than to share nothing at all. 

But what should shared data look like? Lucky for us there are loads of resources out there offering best practices for data sharing (cf: White et al. 2013, Kervin et al. 2015). But I think a lot of this just comes down to common sense, and, as I've mentioned, remembering the point of all of this sharing business: can someone take action on your data (e.g. in a statistical software package) without an inordinate amount of work such as reformatting, and without sending you any emails or calling you on the phone or knocking on your door? If the answer is "yes", you've shared your data; if the answer is "no" or "meh", you ain't shared nothing. 
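To make that test a little more concrete, here is a minimal Python sketch of one small piece of it: checking whether every column in a shared table is actually described somewhere. Everything here is an assumption for illustration - the idea that the data arrive as CSV with a companion data dictionary, the `column` and `description` field names, and the function name `undocumented_columns` are all made up by me and don't come from any journal's or repository's tooling.

```python
import csv
import io


def undocumented_columns(data_csv, dictionary_csv):
    """Return the data columns that the data dictionary never describes.

    Both arguments are CSV text. The dictionary is assumed (hypothetically)
    to have a 'column' field naming each variable and a 'description' field.
    """
    # Only the header row of the data file is needed for this check.
    data_header = next(csv.reader(io.StringIO(data_csv)))
    # Collect the names of columns that have a non-empty description.
    described = {
        (row.get("column") or "").strip()
        for row in csv.DictReader(io.StringIO(dictionary_csv))
        if (row.get("description") or "").strip()
    }
    return [name for name in data_header if name.strip() not in described]


# Made-up example data, not from any real deposit.
data = "site,temp_c,n\nA,12.5,3\nB,14.1,5\n"
dictionary = (
    "column,description\n"
    "site,Sampling site code\n"
    "temp_c,Water temperature in degrees Celsius\n"
)

print(undocumented_columns(data, dictionary))  # the 'n' column has no description
```

A check like this obviously can't tell you whether the descriptions are any good, but an empty list is at least a sign that someone could open the file without having to email you first.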

Another part of this discussion - related but separate - is why is this happening? One obvious answer to this is that data sharing is not normal practice for researchers in many fields, and it will take time for folks to get used to doing this in the right way. I'm totally okay with that - I've been there on the giving and receiving end of the sharing question and I fully understand the overhead of preparing data and the lack of clarity and training in this area. So, I'm not really going to pick a fight on that front. 

However, the lack of institutional (universities, funding agencies, journals, etc.) scrutiny of the data sharing question seems pretty problematic to me. I'm going to pick on journals here because this is where I am personally seeing a lot of problems with sharing (I'll spare funding agencies and universities for now...). My issue with what I'm seeing in journals is this: if the journal doesn't have policies/guidelines for data sharing, but researchers are sharing through their venue, they need a policy and guidelines - full stop. I'll bet my dollars it is being done poorly without guidance in many if not most cases. If a journal has a policy calling for data sharing, then they'd better be prepared to vet the data being shared in the journal and to enforce that policy by determining whether data is being shared in a meaningful way.

If a journal (or agency, or university) policy or mandate isn't being met, something should be done. Does that thing need to be drastic? No, but I think it does need to enforce some standard for quality data sharing, instead of letting researchers think their data has been shared adequately, when, in fact, it has not. I'm always going to come back to the fact that if the shared data aren't comprehensible, then the data haven't been shared.

The excuse of "we're all new at this" is really only applicable, in my opinion, to the nuances and vagaries of domain specific data sharing practices or data formats. There are many basics of data sharing, however, that I think we can mostly agree on (or if we can't we're in serious trouble) and I think these basics are where the bar should be set (for now). Can we all do this better? I think we can. 

That Open Letter

Warning: wall of text/rant. 

Earlier this week I had the good fortune to hear about a seminar on campus at the last minute. The head of the Office of Scientific Information Management at the NIH National Institute of Environmental Health Sciences (NIEHS), Dr. Allen Dearry, was giving a talk to environmental health researchers - "Towards Biomedical Research as Digital Enterprise". Dr. Dearry's talk was fine - an introduction to new and impending data stewardship expectations out of NIH (and many other agencies), funding opportunities from NIH to support these goals, and a discussion of the new (to me at least) Precision Medicine movement. There were plenty of questions from the audience, some of them lobbed by my prickly self, about the data stewardship elements of the talk and what kind of support and guidance NIH would offer and what it all meant for the researcher. Unfortunately, the speaker was hard pressed to answer most of the questions with any real level of clarity. What is data? What should be shared and how? What resources were available? What standards should be used? These questions were all met with a smile, a shrug of shoulders, and promises that more information would be forthcoming. But, as we've heard so many times before, the agency couldn't be responsible for making this happen - the research communities need to step up to the plate and sort it out.

I'm being hard on Dr. Dearry not because his talk was especially problematic or unique - he is just a convenient and recent example. His talk was fine - it offered information on what NIH is doing to address and support these mandates (some stuff) and he was quite honest about how soon we might expect to see real guidance (5-10 years). It is exactly what I have come to expect over the past few years when discussing the impact of and support for the famous OSTP mandate from 2013 and previous mandates (explicit and implicit) for data sharing and curation from funding agencies. Communities of researchers are expected to apply their best practices to enable data curation, data sharing, and all that other data stuff and they should do it because it is the right thing to do - agencies can't force the issue - the researchers need to do this themselves. This is what we've heard over and over. But this is a false dichotomy that has been perpetuated for too long - it is not a choice between an agency "forcing" research communities to be better data stewards and research communities self organizing to do the same. There is a middle ground that is not being explored there, and this lack of exploration is at the expense of a potentially more thorough and holistic suite of support services for meeting NIH (and all of those other agency) goals and the goals of open science. 

So what is that middle ground? Well, for starters, it would be helpful to see some more meaningful engagement from these agencies with communities that are trying to provide guidance and services - something I don't feel like I've seen much of from where I sit. If the agencies aren't going to offer guidance, and the researchers are looking for guidance, maybe there is someone out there who can help kick-start the process. And, it turns out, there is such a someone - in fact, there are many of them. 

Some of these someones already exist in areas of research that have historically been better at data stewardship. We point to them all the time - "Why can't you be more like the astrophysicists?" we say. Great, let's figure out how to make some of those practices extensible and identify which practices are simply a function of the type of science that community does. Let's have a nuanced discussion.

We can't forget our friends in the commercial world, many of whom are providing high quality services and products to facilitate data stewardship and open science in a way that is effective and easy for researchers to incorporate into existing workflows (I'm looking at you, figshare and the Center for Open Science). 

Last, we have a growing community in academia in libraries, IT organizations, and research offices who are trying to develop service profiles to help meet these data stewardship needs. These communities are providing guidance on best practices, providing repositories (often at a very low cost if at any cost at all to users) for data and other digital assets, and a host of other services to facilitate data stewardship and open science. This community is growing and is excited and seems to have a pretty good handle on how to approach this problem. But this community can only get so far with waiting to be invited to the table and waiting to see what guidance comes down the road. I feel strongly about this because this is the community in which I sit and this is the community that is primed to really make a difference to the data stewardship needs of our researchers.

I'd like to call on that last community (and the others if they're up for it) to step up to the table and assert our role in this grand challenge. We're doing this already in our own ways with local projects, regional collaborations, and national grants. But after so many years of, to quote a speaker (sorry I don't remember which one!) at the recent RDAP meeting, "We Exist!", we still are relative unknowns. Luckily some of this is happening and it seems like there is momentum around asserting ourselves into this space. I'm less than a day past the DataQ editor's meeting and, as I mentioned, a few weeks out from RDAP, where there seems to be a common understanding of many of these issues. But what's next?

We've had calls for open letters to the funding agencies to ask them to acknowledge the value of libraries in the data management process. I'd like to call for an open letter that is not apologetic and isn't asking for inclusion; rather, a letter that asserts the value of the data management communities in libraries (and affiliates) to the process of opening science. A letter that points out that agencies have failed to do so thus far. A letter that points out that by pushing off responsibility for providing guidance for data management issues onto the research communities, and by offering a level of financial support that, to many of us, seems misdirected or too small to matter, they have shirked their duties and have created, unnecessarily, a landscape of confusion and frustration.

tl;dr - maybe next time an agency representative comes to campus to talk about data stewardship, they could invite to the table the people that are already providing these services to the campus community


hard-assing the repo

I've just returned (a couple of weeks ago now) from the annual Research Data Access and Preservation (RDAP) Summit in Minneapolis. Apart from being a great conference this year (as always), I've come away with what at first I thought was a bad attitude, but I now realize was a fire in my belly for changing the way we talk about ingesting data into our repository. What does that even mean? It means I'm going to start being a hard-ass about data deposits. 

It's probably important to back up a bit and explain. The repository in question is a typical (though pretty big) institutional repository at a research university. Over the past few years we've started accepting datasets in the repository and growing our services around research data management. As those services were growing the focus was, understandably, on driving data deposit traffic to the repository and sorting out the details of how best to gather sufficient documentation to make data usable. In some ways this approach was necessarily vague and unformed, but the result is that it empowered depositors to submit datasets to the repository that were not documented or formatted well enough to share properly. 

Fast forward to about two months ago. We received a large supplementary data file with a dissertation that included scads of poorly documented data - it appeared to me that the student had zipped their working directory and sent it along with very little thought put to whether the data were actually usable. How prevalent was this problem? I decided to take a look - and the results were disappointing, though not unexpected. Zip files associated with scholarship (e.g. Dissertations, etc.) in our repository were replete with problems ranging from unreadable formats to spreadsheets with macros that crashed my machine. Format and obsolescence problems aside, almost none of the data was documented in any way whatsoever. None - no readme files, no metadata, not even field descriptions in spreadsheets.

I laughed. I cried. I considered a different career path. 

In the end, though, this experience makes me realize a few things:

  1. The point of all of this data management stuff, really, is to facilitate open science and the scientific process. Opening up science is the reason we see an increase in funder mandates for data management and sharing. Opening up science can be an effective route to better and more impactful science. So that's the goal - opening up science - and that is going to be the goal from this point forward.
  2. File formats are important. Better judgement aside, I'd long thought that if we could just get data IN the repository that would be a huge step in the right direction. But if that data isn't usable because the format is funked-out, then what's the point? Flatten those files! Make them non-proprietary! Assume the worst!
  3. Documentation is important and even a tiny bit can make a huge difference. Those zip files with a simple readme, a spreadsheet with a readme tab, any documentation will help. 
  4. We might be setting the bar too low for minimum participation. The "please deposit, we'll take anything" approach can be useful for getting services up and running but at some point that insanity has to stop. Proper formatting and documentation need to happen in order for any of this stuff to be usable in the future, and that's the point, right?
  5. It's time to be a hard ass about this. I'm ready to reject content and/or make it unavailable if depositors can't meet basic data sharing requirements. 
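For what it's worth, the first pass of a check like this doesn't have to be fancy. Below is a minimal Python sketch of the kind of screening I have in mind for zipped deposits: is there anything that looks like a readme, and are there files in formats worth a second look? It is an illustration only, not what our repository actually runs - the function name `audit_deposit` and the list of flagged extensions are assumptions I made up for the example, and the list is nowhere near complete.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumption: an illustrative, incomplete list of formats to flag for review.
FLAGGED_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".sav", ".mat"}


def audit_deposit(zip_source):
    """Report whether a zipped deposit has a readme and which files use flagged formats."""
    with zipfile.ZipFile(zip_source) as archive:
        # Skip directory entries; only look at actual files.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    has_readme = any(
        PurePosixPath(n).name.lower().startswith("readme") for n in names
    )
    flagged = sorted(
        n for n in names if PurePosixPath(n).suffix.lower() in FLAGGED_EXTENSIONS
    )
    return {"has_readme": has_readme, "flagged_formats": flagged}


# Build a tiny in-memory zip (made-up contents) so the example runs without touching disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("thesis_data/results.xlsm", "placeholder")
    archive.writestr("thesis_data/sites.csv", "site,temp_c\nA,12.5\n")
buffer.seek(0)

print(audit_deposit(buffer))
```

This only looks at file names, so it won't catch a readme tab buried inside a spreadsheet or tell you whether the documentation is useful. It is a way to decide which deposits need a conversation with the depositor, not a replacement for a human looking at the files.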

Now, pardon me while I go rewrite some policies and craft emails to ill-behaved depositors.



tl;dr - crappy data deposits in the repository have to stop. open science is the point of all of this business and if you can't even bother to document your data so that it is reusable, it's going to be very hard to convince me to accept it into this here data repository. 

The Degree (sigh)

I try really hard to avoid all of the identity crisis shenanigans that librarians love to engage in. Questions like "why don't people think we're important?" or "can you believe nobody knows you have to have a degree to be a librarian?" are just really not interesting to me. I think we all in library land (and, like, in the entire world) have better things to talk about. 

But the topic of The Degree has recently been chapping my hide. Looking through position descriptions that have come across my screen lately, I can't help but notice the number of job searches for unicorns seems to be on the rise (though I could be imagining it). Organizations are looking for people to do things like in-depth analytics or assist with research data management and expecting candidates to have extensive experience in the area of focus (analytics or data management), a degree in a relevant field (e.g. statistics, 'science'), and a Library Degree. I mean, do you even want to get applicants? Do you want to fill the position?

How about this - and bear with me here because a lot of people don't like to hear this stuff - why not just hire someone without The Degree to do work that they are better suited to do? How many Library Land Programs are offering meaningful analytics and data management courses and/or experiences? And by meaningful I don't mean learning how to calculate an H-Index, I mean courses in statistics, embedded experience in the research process (of actual researchers), or experience with analytical tools (and I don't mean Excel)?

Do I mean having The Degree prevents you from being able to do this work? Or that The Degree or Degree Programs are 'bad' in some way? No. I mean that if we want to hire quality people into our organizations to do things that having The Degree doesn't help them to do, maybe we should just, you know, do that. 

I could go on all day about this, but I'll stop now. 

Big Idea Collective - A Non-Committal Ideas Club for Lazy Bums Like Me

I'm sort of a fan of coming up with and sharing crack-pot schemes with my friends and colleagues. I think it helps fuel the fires of creativity and offers opportunities to think about ideas without having to commit to the dreadful details of budgeting and Gantt charts and human resources and annual reports. All those things just get in the way of the ideas and anyone saying otherwise is a liar or a swindler*. 

In the spirit of sharing crack-pot schemes I recently reeled in a colleague to participate in weekly (if we remember to do it) Big Idea Collective meetups. Here's the setup. We meet very briefly, like while waiting in line at the coffee shop, to talk about our big idea for the week. No commitments. No follow-through required. Just tell your Big Idea for the week. Nothing much emerges from these meetings, apart from a little head-scratching and a few laughs, but occasionally one of these ideas actually seems to have traction. And we put that idea off to the side for further investigation - for the doing part.

Not much to it. Just a commitment to not committing to doing the things we talk about.


*I know, I know. At some point you have to actually do things and not just talk about them. But the freedom to talk about this stuff without worrying about how you'd do it is pretty valuable. 

On Existential Crises

A few months ago I watched a lecture by famous computer scientist (among other things) Allen Newell (thanks for the prompt, CMU Computer Science!). In this lecture, which is fascinating even though I know very little about much of the content, Newell discusses his career and how he wound up where he is (or was at the time of the lecture). While the lecture was fascinating and inspiring, I spent weeks after watching it in an existential funk. What was the purpose and direction of my career? What were the important big questions in my field and who was working on them? How could I find and do interesting things and escape the day-to-day drudgery of the profession?

So what did I do? I did what any self-respecting human would do - I spent a week pouting. While that might sound completely useless, I came to a number of important realizations during my pout:

  1. I have so many wonderful colleagues. Really. In my days crying into my coffee, all these great folks spent time talking with me about my crisis and their thoughts on how to approach the problem. 
  2. One way to resolve this multitude of Big Deal Crises is group therapy. So naturally, we started a club to talk about all the big idea issues we grapple with (or should). Monthly meetings. Reading and discussion and ennui. 
  3. My main takeaway from Newell's lecture is this: you will have many distractions during your career (and some distractions can last years), but you need to make sure that you learn something, that you get something out of each distraction that you can apply to those Big Deal Questions. 
  4. Existential crises are contagious, and that is a good thing. I'm happy to say I've sent (in part or in whole) no fewer than four colleagues into downward spirals since I entered my funk. Some have emerged, some are still fighting in the maelstrom. Misery loves company, but more than that, I suspect that all these spin-off crises will result in new directions and new focus for everyone involved. I must say, though, that I can't take any credit for what emerges, just for sending everyone into the darkness.

An existential crisis every now and again seems to be a good thing. Give it a whirl.