Aliens, Astronomers, and Reproducibility*

KIC 8462852 has been all the rage (sort of) in the SETI community over the past few months after astronomers observed an unexpectedly large dimming of the star (see Boyajian et al. 2015). The long and short of it is that one proposed explanation for the dimming was an "alien megastructure" around the star that was big enough to cause it (Wright 2015). Through a number of publications and discussions, the original findings and suggestions of alien megastructures have been refuted and further investigated a number of times. It is all a bit of a soap opera - but it is also a great example of how science works.

The reason I'm talking about this here is 1) I'm fascinated by SETI, but more importantly, 2) recent developments on this question highlight a major problem we see in the scientific literature: improperly documented data and methods leading to confusion and a lack of reproducibility. To wit - more recent investigations of KIC 8462852 (e.g. Schaefer 2016) have suggested long-term dimming of the star based on data provided via the Digital Access to a Sky Century @ Harvard (DASCH) Project. A response paper meant to reproduce Schaefer's results (Hippke and Angerhausen 2016) doesn't show such dimming - thus the confusion.

Now the bit about reproducibility - in a recent interview with the WOW! Signal Podcast, Dr. Jonathan Grindlay, Principal Investigator at DASCH, says he suspects that the differences in findings based on DASCH data are likely due to not knowing which calibrations and data flags Schaefer used when retrieving data for his analysis. The more recent version of Hippke and Angerhausen's paper also discusses these issues. Here we have an academic discussion in the literature that could have been almost entirely avoided if the original author had adequately described the methods used to retrieve data from a data source. Reproducibility is hindered, results are not verifiable, and a super interesting (at least to some of us!) line of research has been clouded with uncertainty because of bad documentation. Now Dr. Schaefer isn't entirely at fault here - scientific norms have created a host of barriers to best practices for data sharing, and we know (see the next paragraph) that this kind of problematic sharing behavior comes up all the time.

In our forthcoming paper (Van Tuyl and Whitmire, in press)**, Amanda Whitmire and I note this type of problem appears pretty regularly when faculty share data. It is not uncommon for researchers to "share" their data by providing a link to the data source, which, it turns out, is really just a link to the landing page for a database. Ultimately, in these cases, someone attempting to access the same data is unlikely to be able to track down the exact data the authors used, making reproducibility impossible. So what can be done? Here are some thoughts:

  1. If you share data that you extracted from a database, consider sharing the actual data you extracted, not a link to the database. Even if your methods are described very explicitly, database structure, access methods, and content can change over time.
  2. If sharing a snapshot of the data is not possible, be very explicit in your paper (or supplementary materials) about how you extracted the data. The details are extremely important. Also consider asking someone else to trace your methods to see if they get the same dataset out of the database.
  3. [Meta Problem] Maybe we need to have better ways to document, cite, and trace data extractions from these types of databases to help resolve these reproducibility issues. Not sure exactly what to do here, but one could imagine a data download being tied to a persistent and unique identifier (URI, DOI, somethingOI).
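
To make points 1 and 2 a bit more concrete, here is a minimal sketch in Python of what recording an extraction could look like: alongside the snapshot, you save a small "manifest" that lists where the data came from, the exact query settings, when it was retrieved, and a checksum of the bytes you got back. Everything specific here is an assumption made up for illustration: the function names, the example.org URL, the query field names, and the placeholder rows. None of it reflects the actual DASCH interface or anyone's actual workflow.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(data: bytes, source_url: str, query: dict, retrieved_at=None) -> dict:
    """Describe one data extraction so someone else can check they got the same bytes."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source_url": source_url,            # where the data was requested from
        "query": query,                      # every setting used for the extraction
        "retrieved_at": retrieved_at.isoformat(),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact extract
    }


def matches_manifest(data: bytes, manifest: dict) -> bool:
    """Return True if a re-extracted dataset is byte-identical to the documented one."""
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


# Placeholder rows standing in for a real extract (not real photometry).
extracted = b"date,magnitude\nYYYY-MM-DD,0.0\n"

manifest = build_manifest(
    extracted,
    source_url="https://example.org/some-database",  # hypothetical URL
    query={                                          # hypothetical field names
        "object": "KIC 8462852",
        "calibration": "name-of-calibration-used",
        "quality_flags": ["flag-a", "flag-b"],
    },
)

# Publish this JSON next to the paper, ideally together with the snapshot itself.
print(json.dumps(manifest, indent=2))
```

Someone trying to reproduce the work can rerun the documented query and call `matches_manifest` on what they get back. If the check fails, at least everyone knows the underlying data differ before arguing about the analysis.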

In the meantime, let's keep an eye out for more aliens, eh?


*Amateur Alert - look, I ain't no astronomer, but I've been following this discussion in popular media for a while. Anyone who knows more about this stuff than me, please feel free to make corrections or nudges in comments below.

**This paper is due out, like seriously any day now. Keep checking the DOI below!


Boyajian et al. (2015). Planet Hunters X. KIC 8462852 - Where's the Flux? arXiv:1509.03622

Hippke and Angerhausen (2016). KIC 8462852 did likely not fade during the last 100 years.

Schaefer (2016). KIC 8462852 Faded at an Average Rate of 0.165+-0.013 Magnitudes Per Century From 1890 To 1989.

Van Tuyl and Whitmire (in press). Water, water everywhere: Defining and assessing data sharing in academia. PLOS ONE. doi:10.1371/journal.pone.0147942

WOW! Signal Podcast. Burst 11- DASCH Photometry with Dr. Josh Grindlay. [Retrieved 2016-02-11]

Wright (2015). [Retrieved 2016-02-11]