Friday, January 22, 2016

Thoughts on the NEJM editorial: what’s good for the (experimental) goose is good for the (computational) gander

Huge Twitter explosion about this editorial in the NEJM about “research parasites”. Basically, the authors say that computational people interested in working with someone else’s data should work together with the experimenters (which, incidentally, is how I would approach something like that in most cases). Things get a bit darker (and perhaps more revealing) when they also call out “research parasites”–aka “Mountain Dew chugging computational types”, to paraphrase what I’ve heard elsewhere–who to them are just people sitting around, umm, chugging Mountain Dew while banging on their computers, stealing papers from those who worked so hard to generate these datasets.

So this NEJM editorial is certainly wrong on many counts, and I think that most people have that covered. Not only that, but it is particularly tone-deaf: “… or even use the data to try to disprove what the original investigators had posited.” Seriously?!?

The response has been particularly strong from the computational genomics community, who are often reliant on other people’s data. Ewan Birney had a nice set of Tweets on the topic, first noting that “For me this is the start of clinical research transitioning from a data limited to an analysis limited world”, and further that “This is what mol. biology / genomics went through in the 90s/00s and it’s scary for the people who base their science on control of data.” True, perhaps.

He then goes on to say: “1. Publication means... publication, including the data. No ifs, no buts. Patient data via restricted access (bonafide researcher) terms.”

Agreed, who can argue with that! But let’s put this chain of reasoning together. If we are moving to an “analysis limited world”, then it is the analyses that are the precious resource. And all the arguments for sharing data are just as applicable to sharing analyses, no? Isn’t the progress of science impeded by people not sharing their analyses? This is not just an abstract argument: for example, we have been doing some ATAC-seq experiments in the lab, and we had a very hard time finding out exactly how to analyze that data, because there was no code out there for how to do it, even in published papers (for the record, Will Greenleaf has been very kind and helpful via personal communication, and this has been fine for us).

What does, say, Genome Research have to say about it? Well, here’s what they say about data:
Genome Research will not publish manuscripts where data used and/or reported in the paper is not freely available in either a public database or on the Genome Research website. There are no exceptions.
Uh, so that’s pretty explicit. And here’s what they say about code:
Authors submitting papers that describe or present a new computer program or algorithm or papers where in-house software is necessary to reproduce the work should be prepared to make a downloadable program freely available. We encourage authors to also make the source code available.
Okay, so only if there’s some novel analysis, and then only if you want to or if someone asks you. Probably via e-mail. To which someone may or may not respond. Hmm, kettle, the pot is calling…

So what happens in practice at Genome Research? I took a quick look at the first three papers from the current TOC (1, 2, 3).

The first paper has a “Supplemental PERL.zip” that contains a few files of very poorly documented code and, as far as I can tell, is missing a file called “mcmctree_copy.ctl” that I’m guessing is pretty important for running the mcmctree algorithm.

The second paper is perhaps the best, with a link to a software package that seems fairly well put together. But still, no link to the actual code used to make the actual figures in the paper, as far as I can see, just “DaPars analysis was performed as described in the original paper (Masamha et al. 2014) by using the code available at https://code.google.com/p/dapars with default settings.”

The third paper has no code at all. They have a fairly detailed description of their analysis in the supplement, but again, no actual code I could run.

Aren’t these the same things we’ve been complaining about in experimental materials and methods forever? First paper: missing steps of a protocol? Second paper: vague prescription referencing previous paper and a “kit”? Third paper: just a description of how they did it, just like, you know, most “old fashioned” materials and methods from experimental biology papers.

Look, trust me, I understand completely why this is the case in these papers, and I’m not trying to call these authors out. All I’m saying is that if you’re going to get on your high horse and say that data is part of the paper and must be distributed, no ifs, no buts, well, then distribute the analyses as well–and I don’t want to hear any ifs or buts. If we require authors to deposit their sequence data, then surely we can require that they upload their code. Where is the mandate for depositing code on the journal website?

Of course, in the real world, there are legitimate ifs and buts. Let me anticipate one: “Our analyses are so heterogeneous, and it’s so complicated for us to share the code in a usable way.” I’m actually very sympathetic to that. Indeed, we have lots of data that is very heterogeneous and hard to share reasonably–for anyone who really believes all data MUST be accessible, well, I’ve got around 12TB of images for our next paper submission that I would love for you to pay to host… and that probably nobody will ever use. Not all science is genomics, and what works in one place won’t necessarily make sense elsewhere. (As an aside, in computational applied math, many people keep their codes secret to avoid “research parasites”, so it’s not just data gatherers who feel threatened.)

Where, might you ask, is the moral indignation on the part of our experimental colleagues complaining about how computational folks don’t make their codes accessible? First off, I think many of these folks are in fact annoyed (I am, for instance), but are much less likely to be on Twitter and the like. Secondly, I think that many non-computational folks are brow-beaten by p-value toting computational people telling them they don’t even know how to analyze their own data, leading them to feel like they are somehow unable to contribute meaningfully in the first place.

So my point is, sure, data should be available, but let’s not all be so self-righteous about it. Anyway, there, I said it. Peace. :)

PS: Just in case you were wondering, we make all our software and processed data available, and our most recent paper has all the scripts to make all the figures–and we’ll keep doing that moving forward. I think it’s good practice; my point is just that reasonable people could disagree about mandating it.

Update: Nice discussion with Casey Bergman in the comments.
Update (4/28/2016): Fixed links to Genome Research papers (thanks to Quaid Morris for pointing this out). Quaid also pointed out that I was being unreasonable, and that 2/3 of the papers actually did provide code. So I looked at the next 3 papers from that issue (4, 5, 6); none of them provided any code. For what it's worth, I agree with Quaid that it is not necessarily reasonable to require code. My point is that we should be similarly reasonable about data.

Saturday, January 2, 2016

A proposal for how to label small multiples

I love the concept, invented/defined/popularized/whatever by Tufte, of small multiples. The general procedure is to break apart data into multiple small graphs, each of which contain some subset of the data. Importantly, small multiples often make it easier to compare data and spot trends because the cognitive load is split in a more natural way: understand the graph on a small set of data, then once you get the hang of it, see how that relationship changes across other subsets.

For instance, take this more conventionally over-plotted graph of city vs. highway miles per gallon, with different classes of cars labeled by color:

library(ggplot2)  # for qplot, ggsave, and the built-in mpg dataset
# Over-plotted version: every class of car on one graph, coded by color
q2 <- qplot(cty, hwy, data = mpg, color = class) + theme_bw()
ggsave("color.pdf", q2, width = 8, height = 6)



Now there are a number of problems with this graph, but the most pertinent is that there are so many colors corresponding to the different categories of car that the plot takes a lot of effort to parse. The small multiple solution is to make a bunch of small graphs, one for each category, that allow you to see the differences between them. By the power of ggplot, behold!

# Horizontal small multiples: facet into one panel per class
q <- qplot(cty, hwy, data = mpg, facets = . ~ class) + theme_bw()
ggsave("horizontal_multiples.pdf", q, width = 8, height = 2)


Or vertically:

# Vertical small multiples: stack the panels instead
q <- qplot(cty, hwy, data = mpg, facets = class ~ .) + theme_bw()
ggsave("vertical_multiples.pdf", q, width = 2, height = 8)


Notice how much easier it is to see the differences between categories of car in these small multiples than the more conventional over-plotted version, especially the horizontal one.

Most small multiple plots look like these, and they’re typically a huge improvement over heavily over-plotted graphs, but I think there’s room for improvement, especially in the labeling. The biggest problem with small multiple labeling is that most of the axis labels are very far away from the graphs themselves. This seems logical enough, since the labels apply to all the multiples, but it forces a lot of mental gymnastics to figure out what the axes are for any one particular multiple.

Thus, my suggestion is actually based on the philosophy of the small multiple itself: explain a graph once, then rely on that knowledge to help the reader parse the rest of the graphs. Check out these before and after comparisons:


The horizontal small multiples also improve, in my opinion:


To me, labeling one of the small multiples directly makes it a lot easier to figure out what is in each graph, and thus makes the entire graphic easier to understand quickly. It also adheres to the principle that important information for interpretation should be close to the data. The more people’s eyes wander, the more opportunities they have to get confused. There is of course the issue that by labeling one multiple, you are calling attention to that one in particular, but I think the tradeoff is acceptable. Another issue is a loss of precision in the other multiples. One could include tick marks as more visible markers on those, but again, I think the tradeoff is acceptable.

Oh, and how did I perform this magical feat of alternative labeling of small multiples (as well as general cleanup of ggplot's nice-but-not-great output)? Well, I used this amazing software package called “Illustrator” that works with R or basically any software that spits out a PDF ;). I’m of the strong opinion that being able to drag around lines and manipulate graphical elements directly is far more efficient than trying to figure out how to do this stuff programmatically most of the time. But that’s a whole other blog post…
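(If you did want to stay in R, here’s a rough, hypothetical sketch of one way to approximate the idea: build each multiple as its own plot with shared axis ranges, strip the axis labels from everything but the first, and arrange them in a row. This assumes the gridExtra package and is just a sketch of the approach, not the code behind the figures above.)

library(ggplot2)
library(gridExtra)  # assumed available; used only to arrange the separate plots

# One plot per class of car, with shared axis ranges so the panels stay comparable
classes <- sort(unique(mpg$class))
plots <- lapply(seq_along(classes), function(i) {
  p <- ggplot(subset(mpg, class == classes[i]), aes(cty, hwy)) +
    geom_point() +
    ggtitle(classes[i]) +
    xlim(range(mpg$cty)) +
    ylim(range(mpg$hwy)) +
    theme_bw()
  if (i > 1) {
    # Only the first multiple keeps its axis titles and tick text; the reader
    # "learns" the axes there and carries that knowledge to the rest
    p <- p + theme(axis.title = element_blank(), axis.text = element_blank())
  }
  p
})
grid.arrange(grobs = plots, nrow = 1)

Swapping nrow = 1 for ncol = 1 would give the vertical version.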