Aliens, Astronomers, and Reproducibility*

KIC 8462852 has been all the rage (sort of) in the SETI community over the past few months after astronomers observed an unexpectedly large dimming of the star (see Boyajian et al. 2015). The long and short of it is that one proposed explanation for the dimming was an "alien megastructure" around the star big enough to block that much light (Wright 2015). Through a number of publications and discussions, the original findings and the megastructure suggestion have been refuted and further investigated a number of times. It is all a bit of a soap opera - but it is also a great example of how science works.

The reason I'm talking about this here is 1) I'm fascinated by SETI. But more importantly, 2) recent developments on this question highlight a major problem we see in the scientific literature - improperly documented data and methods leading to confusion and a lack of reproducibility. To wit - more recent investigations of KIC 8462852 (e.g. Schaefer 2016) have suggested long-term dimming of the star based on data provided via the Digital Access to a Sky Century @ Harvard (DASCH) Project. A response paper meant to reproduce Schaefer's results (Hippke and Angerhausen 2016) doesn't show such dimming - thus the confusion.

Now the bit about reproducibility - in a recent interview with the WOW! Signal Podcast, Dr. Josh Grindlay, Principal Investigator at DASCH, indicates that he suspects the differences in findings using DASCH data are likely due to not knowing what calibrations and data flags Schaefer used when retrieving data for his analysis. The more recent version of Hippke and Angerhausen's paper also discusses these issues. Here we have an academic discussion in the literature that could have been almost entirely avoided if the original author had adequately described the methods used to retrieve data from a data source. Reproducibility is hindered, results are not verifiable, and a super interesting (at least to some of us!) line of research has been clouded with uncertainty because of bad documentation. Now, Dr. Schaefer isn't entirely at fault here - scientific norms have created a host of barriers to best practices for data sharing, and we know (see the next paragraph) that this kind of problematic sharing behavior comes up all the time.

In our forthcoming paper (Van Tuyl and Whitmire, in press)**, Amanda Whitmire and I note this type of problem appears pretty regularly when faculty share data. It is not uncommon for researchers to "share" their data by providing a link to the data source, which, it turns out, is really just a link to the landing page for a database. Ultimately, in these cases, someone attempting to access the same data is unlikely to be able to track down the exact data the authors used, making reproducibility impossible. So what can be done? Here are some thoughts:

  1. If you share data that you extracted from a database, consider sharing the actual data you extracted, not a link to the database. Even if your methods are described very explicitly, database structure, access methods, and content can change over time.
  2. If sharing a snapshot of the data is not possible, be very explicit in your paper (or supplementary materials) about how you extracted the data. The details are extremely important. Also consider asking someone else to retrace your methods to see if they get the same dataset out of the database (a sketch of what this kind of documentation might look like follows this list).
  3. [Meta Problem] Maybe we need to have better ways to document, cite, and trace data extractions from these types of databases to help resolve these reproducibility issues. Not sure exactly what to do here, but one could imagine a data download being tied to a persistent and unique identifier (URI, DOI, somethingOI).
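To make the first two suggestions concrete, here is a minimal sketch in Python of what saving a snapshot plus its provenance might look like. Everything specific - the database file, table, query, and star ID - is a hypothetical placeholder; the point is the pattern: keep the extracted bytes, the exact query, the retrieval date, and a checksum together.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical source: the database path, table, and query are placeholders.
DB_PATH = "photometry.db"
QUERY = "SELECT obs_date, magnitude, quality_flag FROM observations WHERE star_id = ?"
STAR_ID = "KIC8462852"

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(QUERY, (STAR_ID,)).fetchall()
conn.close()

# Share the extracted snapshot itself, not just a pointer to the database.
snapshot = "extraction_snapshot.csv"
with open(snapshot, "w") as f:
    f.write("obs_date,magnitude,quality_flag\n")
    for row in rows:
        f.write(",".join(str(value) for value in row) + "\n")

# Record provenance: the exact query, its parameters, the retrieval time,
# and a checksum so others can verify they hold the same bytes you analyzed.
with open(snapshot, "rb") as f:
    checksum = hashlib.sha256(f.read()).hexdigest()

provenance = {
    "source": DB_PATH,
    "query": QUERY,
    "parameters": [STAR_ID],
    "retrieved": datetime.now(timezone.utc).isoformat(),
    "sha256": checksum,
}
with open("extraction_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Deposit the CSV and the JSON alongside the paper, and suggestion 3's identifier problem gets easier too - you now have a concrete artifact to mint a DOI (or somethingOI) against.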

In the meantime, let's keep an eye out for more aliens, eh?


*Amateur Alert - look, I ain't no astronomer, but I've been following this discussion in popular media for a while. Anyone who knows more about this stuff than me, please feel free to make corrections or nudges in the comments below.

**This paper is due out, like seriously any day now. Keep checking the DOI below!


Boyajian et al. (2015). Planet Hunters X. KIC 8462852 - Where's the Flux? arXiv:1509.03622

Hippke and Angerhausen (2016). KIC 8462852 did likely not fade during the last 100 years.

Schaefer (2016). KIC 8462852 Faded at an Average Rate of 0.165 ± 0.013 Magnitudes Per Century From 1890 To 1989.

Van Tuyl and Whitmire (in press). Water, water everywhere: Defining and assessing data sharing in academia. PLOS One. 10.1371/journal.pone.0147942

WOW! Signal Podcast. Burst 11 - DASCH Photometry with Dr. Josh Grindlay. [Retrieved 2016-02-11]

Wright (2015). [Retrieved 2016-02-11]

Am I here to make myself redundant?

Warning: half-formed thoughts ahead

Self-preservation is a terrible motivator in the workplace. Really, it is. Who wants to go through their day just trying to keep their job? I mean, don't get me wrong, I like my job and I don't want to be without one, but it can be hard to stay on target with persistent double vision: do good work and make sure you keep your job.

By keeping my job, here, I mean "getting tenure." I'm lucky (??!?!?!) enough to be in a tenure-track position at a large university, which brings with it all of the immediate horrors and potential glories of such an esteemed position. And by doing good work, here, I mean doing things that are meaningful and impactful. Obviously one can do both - these two visions can be aligned perfectly, or nearly so, such that doing good work results in keeping the job (getting tenure). But this isn't always the case. I see folks around me (in my field and others) who are highly motivated by keeping their job and I wonder whether their good work suffers. 

I think it pays, and certainly has for me, to stop all the work and reflect on why I'm choosing to do what I'm doing. Is this project I'm working on useful for something other than churning out a publication or a talk or a poster? Am I contributing to a useful conversation with this work? Is anyone better off for this work being done? The answer is usually fuzzy, but I hope it trends towards "yes, this is useful and it is good work."

In many ways, the work that I do (facilitating the process of faculty sharing their scholarship, more or less) is the kind of thing that we (libraries) would really like to not have to do. We'd like faculty to just want to do the sharing, know how to do it, and have the tools to do it on their own. In the end, the successful outcome of my position is the redundancy of my position - if I'm being honest about what I do, I should be trying to work my way out of a job.


The Model We Use for Research Data Services

Over the past few months I've been reconsidering my long-standing assertions/assumptions about the necessity of library involvement in research data services. For the extent of my admittedly short career in the research data services world, I've convinced myself, and I'm not alone, that libraries have a natural place in data services due to our long-standing tradition of making information accessible to others. It is a common refrain, but I'm not really convinced by it any more. 

I want to be careful about what I'm *not* saying here. I'm not saying that libraries don't or shouldn't have a place in data services. I'm not saying that libraries are doing it wrong or badly. And I'm not saying that everyone should drop everything and do something different. 

What I am saying is that I think there are other models for how research data services could be provided at a university, and very few of them, as best I can tell, have been tested. As someone at a university that may be in the (questionably) luxurious state of having an existing data services program *and* an opportunity to rethink how we structure data services, I figured this would be a good time to take these assumptions apart a bit and examine the pieces.

I know that the research data services program I helped build from scratch at my previous place of work was modeled on successful programs I saw elsewhere (Cornell, Minnesota), and the research data services at my current place of work have tried to move in that direction, too. That model, the now classic trio of Libraries, Research Office, and Central Computing, is useful and logical in many ways. But the truly logical (I think) units in that trio are the Research Office (for compliance) and Central Computing (for infrastructure) - the Library fits in for less obvious reasons. One of those reasons is very important, though: libraries asserted themselves into this space and filled a vacuum. Few were stepping up to the plate way back when, and libraries took on the task of predicting and filling a void. Someone has to actually *do* this stuff, and libraries have stepped up in a major way.

But what other models could there be? Well, I think it helps to consider the components that are required. In my estimation these include:

  • Computing Infrastructure - "Obvious" stuff including storage and backup/replication, but also discovery, hosting, online tools, etc.
  • Compliance Infrastructure - Making sure researchers do all the required data management things so the money stuff keeps flowing
  • Outreach and Education - Facilitating data activities and helping researchers understand best practices
  • Coordination - Central body to ensure services are being provided in a useful way and that researcher needs are being met

I don't think I've left anything major out of that list. And I don't think there is anything in that list that specifically calls for the library to be involved. Now, depending on your institution, the library actually might be the best unit to fill one or all of those roles. On the flip side, depending on your institution, the library may not be the best unit to fill any of them. What would that look like? Here are some examples one might consider at a large institution like mine - all without libraries:

  1. College level outreach, education, and computing infrastructure coordinated by the research office
  2. College level outreach and education, centralized computing infrastructure, coordination by the research office
  3. Research office coordinated outreach and education with centralized computing infrastructure

Of course, each of these models has its own problems including but not limited to issues of trust, recognition of competency of service providers, costs, etc. But those issues do not necessarily go away when the library is involved. 

I'll also note that there are institutions that are doing great research data services work that do not include libraries - look at some domain repositories like some of the NASA DAACs or to other long-standing data providers like the National Weather Service*. 

Where does that leave us? Not sure where this leaves you, dear reader, but it makes me want to step back and think about how the services that researchers need could be provided differently at my institution. Should we try to lean more heavily on our university colleges? The Research Office? Computing? I think the answer to all of these might be yes. Should our goal in the library be to focus our role on the repository aspect of our work, on outreach and education, or on coordination?

Discuss, please. Help me think through this beast.


*pretty sure neither of these explicitly includes libraries in their research data services/curation/sharing, but correct me if I'm wrong


What is Shared Data?

Wherein I reiterate what others have said and express frustration with The Powers That Be

Of all of the problems in the world of data sharing, the most frustrating to me is data that are shared but completely unusable to others. As I've argued before (and of course, I'm not alone in this argument), the point of all of this sharing and management of data business is to facilitate access and reuse. Now, access seems to be moving along at a not-unreasonable pace. More and more I'm seeing data deposits in our local repository and, even more so, data sharing through journals in the form of supplementary data. General-purpose repository services (e.g. figshare, Dryad) seem to be doing pretty well for themselves (right?), and a number of high-profile journals have implemented requirements of one sort or another for data sharing (cf. PLOS).

But the usability part seems to be a real sticking point for some. Poking through journals looking for shared data, I regularly (more often than not?) find data shared in the following ways (note: these are real examples; I've seen all of these many, many times):

  • Data are shared as PDFs (or other document type) - for instance, a table of numbers in a PDF, a set of figures in a PDF
  • Claims that "all the data are in the paper" when, in fact, it is clear that this is not the case
  • Data are shared, but do not include any documentation, code books, data dictionary, or metadata
  • Shared data are supplementary figures
  • Links to the source of the raw data (e.g. a database of some sort) with insufficient documentation of how to query the database to get ahold of the same dataset used in the research (e.g. a bare link to the database's landing page)

These are just a few of the usability issues I've seen with shared data, and what this amounts to is actually not sharing. It's almost worse to share this stuff than to share nothing at all.

But what should shared data look like? Lucky for us, there are loads of resources out there offering best practices for data sharing (cf. White et al. 2013, Kervin et al. 2015). But I think a lot of this just comes down to common sense and, as I've mentioned, remembering the point of all of this sharing business: can someone take action on your data (e.g. in a statistical software package) without an inordinate amount of work such as reformatting, and without sending you any emails or calling you on the phone or knocking on your door? If the answer is "yes," you've shared your data; if the answer is "no" or "meh," you ain't shared nothing.
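To put a floor under "actionable," here is a tiny sketch of the minimum I have in mind - a flat, non-proprietary CSV that travels with a plain-text data dictionary. The file names and fields are made up for illustration:

```python
import csv

# Hypothetical dataset: a tidy, non-proprietary CSV with one row per record.
rows = [
    {"site": "A1", "date": "2015-06-01", "temp_c": 14.2},
    {"site": "A1", "date": "2015-06-02", "temp_c": 15.1},
]
with open("stream_temps.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)

# A data dictionary travels with the data: every column gets a name,
# a definition, and units. Plain text, so it outlives any software.
with open("README.txt", "w") as f:
    f.write(
        "stream_temps.csv - daily stream temperatures (illustrative example)\n"
        "site:   monitoring site identifier\n"
        "date:   observation date, ISO 8601 (YYYY-MM-DD)\n"
        "temp_c: mean daily water temperature, degrees Celsius\n"
    )
```

A stranger can now pull this into R or pandas with one line and know what every column means, without emailing anyone or knocking on any doors.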

Another part of this discussion - related but separate - is why is this happening? One obvious answer is that data sharing is not normal practice for researchers in many fields, and it will take time for folks to get used to doing this in the right way. I'm totally okay with that - I've been there on the giving and receiving end of the sharing question, and I fully understand the overhead of preparing data and the lack of clarity and training in this area. So I'm not really going to pick a fight on that front.

However, the lack of institutional (universities, funding agencies, journals, etc.) scrutiny of the data sharing question seems pretty problematic to me. I'm going to pick on journals here because this is where I am personally seeing a lot of problems with sharing (I'll spare funding agencies and universities for now...). My issue with what I'm seeing in journals is this: if a journal doesn't have policies/guidelines for data sharing, but researchers are sharing through its venue, it needs a policy and guidelines - full stop. I'll bet my dollars it is being done poorly without guidance in many if not most cases. And if a journal has a policy calling for data sharing, then it had better be prepared to vet the data being shared in the journal and to enforce that policy by determining whether data are being shared in a meaningful way.

If a journal (or agency, or university) policy or mandate isn't being met, something should be done. Does that something need to be drastic? No, but I think it does need to enforce some standard for quality data sharing, instead of letting researchers think their data have been shared adequately when, in fact, they have not. I'm always going to come back to the fact that if the shared data aren't comprehensible, then the data haven't been shared.

The excuse of "we're all new at this" is really only applicable, in my opinion, to the nuances and vagaries of domain-specific data sharing practices or data formats. There are many basics of data sharing, however, that I think we can mostly agree on (or, if we can't, we're in serious trouble), and I think these basics are where the bar should be set (for now). Can we all do this better? I think we can.

That Open Letter

Warning: wall of text/rant. 

Earlier this week I had the good fortune to hear about a seminar on campus at the last minute. The head of the Office of Scientific Information Management at the NIH National Institute of Environmental Health Sciences (NIEHS), Dr. Allen Dearry, was giving a talk to environmental health researchers - "Towards Biomedical Research as Digital Enterprise". Dr. Dearry's talk was fine - an introduction to new and impending data stewardship expectations out of NIH (and many other agencies), funding opportunities from NIH to support these goals, and a discussion of the (to me at least) new Precision Medicine movement. There were plenty of questions from the audience, some of them lobbed by my prickly self, about the data stewardship elements of the talk, what kind of support and guidance NIH would offer, and what it all meant for the researcher. Unfortunately, the speaker was hard pressed to answer most of the questions with any real level of clarity. What is data? What should be shared and how? What resources are available? What standards should be used? These questions were all met with a smile, a shrug of the shoulders, and promises that more information would be forthcoming. But, as we've heard so many times before, the agency couldn't be responsible for making this happen - the research communities need to step up to the plate and sort it out.

I'm being hard on Dr. Dearry not because his talk was especially problematic - it wasn't unique - but because he is a convenient and recent example. His talk was fine - it offered information on what NIH is doing to address and support these mandates (some stuff), and he was quite honest about how soon we might expect to see real guidance (5-10 years). It is exactly what I have come to expect over the past few years when discussing the impact of and support for the famous OSTP mandate from 2013 and previous mandates (explicit and implicit) for data sharing and curation from funding agencies. Communities of researchers are expected to apply their best practices to enable data curation, data sharing, and all that other data stuff, and they should do it because it is the right thing to do - agencies can't force the issue - the researchers need to do this themselves. This is what we've heard over and over. But this is a false dichotomy that has been perpetuated for too long - it is not a choice between an agency "forcing" research communities to be better data stewards and research communities self-organizing to do the same. There is a middle ground that is not being explored, and this lack of exploration comes at the expense of a potentially more thorough and holistic suite of support services for meeting NIH (and all of those other agency) goals and the goals of open science.

So what is that middle ground? Well, for starters, it would be helpful to see some more meaningful engagement from these agencies with communities that are trying to provide guidance and services - something I don't feel like I've seen much of from where I sit. If the agencies aren't going to offer guidance, and the researchers are looking for guidance, maybe there is someone out there who can help kick-start the process. And, it turns out, there is such a someone - in fact, there are many of them. 

Some of these someones already exist in areas of research that have historically been better at data stewardship. We point to them all the time - "Why can't you be more like the astrophysicists?" we say. Great - let's figure out how to make some of those practices extensible and to identify which practices are simply a function of the type of science that community does. Let's have a nuanced discussion.

We can't forget our friends in the commercial world, many of whom are providing high quality services and products to facilitate data stewardship and open science in a way that is effective and easy for researchers to incorporate into existing workflows (I'm looking at you, figshare and the Center for Open Science). 

Last, we have a growing community in academia - in libraries, IT organizations, and research offices - trying to develop service profiles to help meet these data stewardship needs. These communities are providing guidance on best practices, providing repositories for data and other digital assets (often at very low cost to users, if any cost at all), and a host of other services to facilitate data stewardship and open science. This community is growing, is excited, and seems to have a pretty good handle on how to approach this problem. But this community can only get so far by waiting to be invited to the table and waiting to see what guidance comes down the road. I feel strongly about this because this is the community in which I sit, and this is the community that is primed to really make a difference to the data stewardship needs of our researchers.

I'd like to call on that last community (and the others, if they're up for it) to step up to the table and assert our role in this grand challenge. We're doing this already in our own ways with local projects, regional collaborations, and national grants. But after so many years of, to quote a speaker (sorry, I don't remember which one!) at the recent RDAP meeting, "We Exist!", we are still relative unknowns. Luckily, some of this is happening, and it seems like there is momentum around asserting ourselves into this space. I'm less than a day past the DataQ editors' meeting and, as I mentioned, a few weeks out from RDAP, where there seemed to be a common understanding of many of these issues. But what's next?

We've had calls for open letters to the funding agencies asking them to acknowledge the value of libraries in the data management process. I'd like to call for an open letter that is not apologetic and isn't asking for inclusion - rather, a letter that asserts the value of the data management communities in libraries (and affiliated units) to the process of opening science. A letter that points out that the agencies have failed to do so thus far. A letter that points out that, by pushing responsibility for data management guidance off onto the research communities, and by offering a level of financial support that, to many of us, seems misdirected or too small to matter, they have shirked their duties and have created, unnecessarily, a landscape of confusion and frustration.

tl;dr - maybe next time an agency representative comes to campus to talk about data stewardship, they could invite to the table the people who are already providing these services to the campus community.


hard-assing the repo

I've just returned (a couple of weeks ago now) from the annual Research Data Access and Preservation (RDAP) Summit in Minneapolis. Apart from being a great conference this year (as always), I've come away with what at first I thought was a bad attitude, but I now realize was a fire in my belly for changing the way we talk about ingesting data into our repository. What does that even mean? It means I'm going to start being a hard-ass about data deposits. 

It's probably important to back up a bit and explain. The repository in question is a typical (though pretty big) institutional repository at a research university. Over the past few years we've started accepting datasets into the repository and building out our services around research data management. As those services were growing, the focus was, understandably, on driving data deposit traffic to the repository and sorting out the details of how best to gather sufficient documentation to make data usable. In some ways this approach was necessarily vague and unformed, but the result was that it empowered depositors to submit datasets that were not documented or formatted well enough to share properly.

Fast forward to about two months ago. We received a large supplementary data file with a dissertation that included scads of poorly documented data - it appeared to me that the student had zipped up their working directory and sent it along, with very little thought given to whether the data were actually usable. How prevalent was this problem? I decided to take a look - and the results were disappointing, though not unexpected. Zip files associated with scholarship (e.g. dissertations) in our repository were replete with problems ranging from unreadable formats to spreadsheets with macros that crashed my machine. Format and obsolescence problems aside, almost none of the data were documented in any way whatsoever. None - no readme files, no metadata, not even field descriptions in spreadsheets.

I laughed. I cried. I considered a different career path. 

In the end, though, this experience makes me realize a few things:

  1. The point of all of this data management stuff, really, is to facilitate open science and the scientific process. Opening up science is the reason we see an increase in funder mandates for data management and sharing. Opening up science can be an effective route to better and more impactful science. So that's the goal - opening up science - and that is going to be the goal from this point forward.
  2. File formats are important. Better judgement aside, I'd long thought that if we could just get data IN the repository, that would be a huge step in the right direction. But if that data isn't usable because the format is funked-out, then what's the point? Flatten those files! Make them non-proprietary! Assume the worst!
  3. Documentation is important and even a tiny bit can make a huge difference. Those zip files with a simple readme, a spreadsheet with a readme tab, any documentation will help. 
  4. We might be setting the bar too low for minimum participation. The "please deposit, we'll take anything" approach can be useful for getting services up and running, but at some point that insanity has to stop. Proper formatting and documentation need to happen in order for any of this stuff to be usable in the future, and that's the point, right?
  5. It's time to be a hard-ass about this. I'm ready to reject content and/or make it unavailable if depositors can't meet basic data sharing requirements (the sketch after this list shows the kind of minimal, automatable check I have in mind).
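For what it's worth, the bar I'm imagining is low enough to script. Here is a minimal sketch - the README names and the list of risky formats are my own choices for illustration, not any repository's actual policy:

```python
import zipfile
from pathlib import Path

# Illustrative thresholds only: a real policy would tune both of these lists.
REQUIRED_DOCS = {"readme", "readme.txt", "readme.md"}
RISKY_FORMATS = {".xlsx", ".xls", ".mat", ".sav", ".accdb"}  # proprietary-ish

def check_deposit(zip_path: str) -> list[str]:
    """Return a list of problems with a zipped data deposit."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Bare minimum documentation: some README anywhere in the package.
    basenames = {Path(name).name.lower() for name in names}
    if not basenames & REQUIRED_DOCS:
        problems.append("no README: data are undocumented")
    # Bare minimum formats: flag files likely to be unreadable down the road.
    for name in names:
        if Path(name).suffix.lower() in RISKY_FORMATS:
            problems.append(f"proprietary/fragile format: {name}")
    return problems

if __name__ == "__main__":
    for problem in check_deposit("deposit.zip"):
        print("REJECT?", problem)
```

Run something like this against an incoming zip and you have a ready-made checklist for the (polite but firm) rejection email.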

Now, pardon me while I go rewrite some policies and craft emails to ill-behaved depositors.



tl;dr - crappy data deposits in the repository have to stop. open science is the point of all of this business and if you can't even bother to document your data so that it is reusable, it's going to be very hard to convince me to accept it into this here data repository. 

The Degree (sigh)

I try really hard to avoid all of the identity crisis shenanigans that librarians love to engage in. Questions like "why don't people think we're important?" or "can you believe nobody knows you have to have a degree to be a librarian?" are just really not interesting to me. I think we all in library land (and, like, in the entire world) have better things to talk about.

But the topic of The Degree has recently been chapping my hide. Looking through position descriptions that have come across my screen lately, I can't help but notice that the number of job searches for unicorns seems to be on the rise (though I could be imagining it). Organizations are looking for people to do things like in-depth analytics or research data management support, and expecting candidates to have extensive experience in the area of focus (analytics or data management), a degree in a relevant field (e.g. statistics, 'science'), and a Library Degree. I mean, do you even want to get applicants? Do you want to fill the position?

How about this - and bear with me here, because a lot of people don't like to hear this stuff - why not just hire someone without The Degree to do work that they are better suited to do? How many Library Land programs are offering meaningful analytics and data management courses and/or experiences? And by meaningful I don't mean learning how to calculate an h-index; I mean courses in statistics, embedded experience in the research process (of actual researchers), or experience with analytical tools (and I don't mean Excel).

Do I mean having The Degree prevents you from being able to do this work? Or that The Degree or Degree Programs are 'bad' in some way? No. I mean that if we want to hire quality people into our organizations to do things that having The Degree doesn't help them to do, maybe we should just, you know, do that. 

I could go on all day about this, but I'll stop now. 

Big Idea Collective - A Non-Committal Ideas Club for Lazy Bums Like Me

I'm sort of a fan of coming up with and sharing crack-pot schemes with my friends and colleagues. I think it helps fuel the fires of creativity and offers opportunities to think about ideas without having to commit to the dreadful details of budgeting and Gantt charts and human resources and annual reports. All those things just get in the way of the ideas and anyone saying otherwise is a liar or a swindler*. 

In the spirit of sharing crack-pot schemes I recently reeled in a colleague to participate in weekly (if we remember to do it) Big Idea Collective meetups. Here's the setup. We meet very briefly, like while waiting in line at the coffee shop, to talk about our big idea for the week. No commitments. No follow-through required. Just tell your Big Idea for the week. Nothing much emerges from these meetings, apart from a little head-scratching and a few laughs, but occasionally one of these ideas actually seems to have traction. And we put that idea off to the side for further investigation - for the doing part.

Not much to it. Just a commitment to not committing to doing the things we talk about.


*I know, I know. At some point you have to actually do things and not just talk about them. But the freedom to talk about this stuff without worrying about how you'd do it is pretty valuable. 

On Existential Crises

A few months ago I watched a lecture by the famous computer scientist (among other things) Allen Newell (thanks for the prompt, CMU Computer Science!). In this lecture, which is fascinating even though I know very little about much of the content, Newell discusses his career and how he wound up where he is (or was at the time of the lecture). Fascinating and inspiring as it was, I spent weeks after watching this video in an existential funk. What was the purpose and direction of my career? What were the important big questions in my field, and who was working on them? How could I find and do interesting things and escape the day-to-day drudgery of the profession?

So what did I do? I did what any self-respecting human would do - I spent a week pouting. While that might sound completely useless, I came to a number of important realizations during my pout:

  1. I have so many wonderful colleagues. Really. In my days crying into my coffee, all these great folks spent time talking with me about my crisis and their thoughts on how to approach the problem. 
  2. One way to resolve this multitude of Big Deal Crises is group therapy. So naturally, we started a club to talk about all the big idea issues we grapple with (or should). Monthly meetings. Reading and discussion and ennui. 
  3. My main takeaway from Newell's lecture is this: you will have many distractions during your career (and some distractions can last years), but you need to make sure that you learn something, that you get something out of each distraction that you can apply to those Big Deal Questions. 
  4. Existential crises are contagious, and that is a good thing. I'm happy to say I've sent (in part or in whole) no fewer than four colleagues into downward spirals since I entered my funk. Some have emerged, some are still fighting in the maelstrom. Misery loves company, but more than that, I suspect that all these spin-off crises will result in new directions and new focus for everyone involved. I must say, though, that I can't take any credit for what emerges, just for sending everyone into the darkness.

An existential crisis every now and again seems to be a good thing. Give it a whirl.