Thursday, July 25, 2013

Fewer variables make for happy analysis in R, MATLAB, etc.

- Gautham

The more experience you have in R or MATLAB, the fewer variables you'll generally have in your workspace. There are good reasons for this. As beginners we try to compensate for a cluttered workspace, sometimes by clever means, but the problems catch up with us eventually.

Suppose you measure the temperature and vapor pressure of water one day, and on another day you add some sugar to the water and measure temperature and vapor pressure again. When I was just getting started with R or MATLAB, my analysis script might have looked like this:

temp_water <- read.table( ... )
vp_water <- read.table( ... )
temp_sugar <- read.table( ... )
vp_sugar <- read.table( ... )

As the number of experimental conditions increases, I'll end up making more and more variables (or maybe different variables in different scripts). Worse, if I try to plot something, I'll end up doing a bunch of copy-paste to plot it for every condition, changing the variables each time. By this kind of logic, and by refusing to write common procedures into functions (a topic for later), version 1 of the code I used to plot the figures of our initial worm paper submission turned into such a deep morass that I had to rewrite it nearly from scratch when the revisions came.

The "clever" beginner, myself and members of my old lab included, would try to get around the copy-pasting by exploiting functions in R and MATLAB that let you execute commands stored in strings. This is where the dreaded eval function and its close relative, assign, come in. Unless you are developing an R package to submit to CRAN and know your way around namespace hierarchies, you are probably headed down the wrong road. There are much less ugly ways to accomplish what you are trying to do.
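
For concreteness, here is a minimal sketch of the pattern I mean, next to a tamer alternative. The condition names and the numbers are made up for illustration; they are not real measurements.

```r
# Hypothetical conditions and fake numbers, for illustration only.
conditions <- c("water", "sugar")

# The "clever" way: build variable names as strings and create them with assign().
for (cond in conditions) {
  assign(paste0("vp_", cond), data.frame(T = c(20, 30), vp = c(1, 2)))
}
# Getting the data back out needs get() on yet more strings.
for (cond in conditions) {
  print(nrow(get(paste0("vp_", cond))))
}

# A tamer way: one named list holds every condition.
vp <- lapply(conditions, function(cond) data.frame(T = c(20, 30), vp = c(1, 2)))
names(vp) <- conditions
# The same operation is now an ordinary call over the list elements.
n_rows <- sapply(vp, nrow)
```

The list version already removes the string-building, but it still keeps each condition in its own little table.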

Instead, you can organize your data the way you would if you were making a database. Principle 3 of Dr. Wickham's paper on 'tidy data' suggests that:
3. Each table (or file) stores information about one observational type.

Go one step further, and store *all* information of that one observational type (that one type of experiment) in a single table. In our fake vapor pressure example, we'd just have one variable:

> vp_temp = 
T        vp         sugar_frac
..       ..         0
..       ..         0
..       ..         ..
..       ..         0.1
..       ..         0.1

R is very good at helping you then take out the parts of this data that you are interested in for any particular plot or analysis. Tools like plyr and ggplot2 work like magic with data that looks like this. This form of data is also good for merging with other tables that contain data from other kinds of experiments (maybe heat capacity against temperature?).
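
As a rough sketch of what "taking out the parts" looks like, here is a base R version. The numbers in vp_temp are invented placeholders, not measurements, and the column names just follow the table above. I've stuck to base R so the snippet runs without extra packages; plyr offers more convenient versions of the same grouping.

```r
# Invented placeholder values; column names follow the vp_temp table above.
vp_temp <- data.frame(
  T          = c(20, 30, 40, 20, 30, 40),
  vp         = c(2.3, 4.2, 7.4, 2.2, 4.1, 7.2),
  sugar_frac = c(0, 0, 0, 0.1, 0.1, 0.1)
)

# Pull out one condition for a particular analysis.
sugar_only <- subset(vp_temp, sugar_frac == 0.1)

# Summarize every condition at once: mean vapor pressure per sugar fraction.
mean_vp <- aggregate(vp ~ sugar_frac, data = vp_temp, FUN = mean)

# Or get one small table per condition, without naming each one by hand.
by_condition <- split(vp_temp, vp_temp$sugar_frac)
```

Adding a third sugar fraction means adding rows to vp_temp; none of the lines after the table definition need to change. With ggplot2 installed, something like ggplot(vp_temp, aes(T, vp, colour = factor(sugar_frac))) + geom_point() plots every condition in one call.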

To make this kind of master data table, follow this two-step procedure:
1) Maintain a master index of all your vapor pressure vs. temperature experiments as a .csv file (you could even keep it on Google Docs). Every row is an experimental run. The table has columns for everything you think is relevant about the experiment, the so-called metadata, like the date and the experimental conditions. Most importantly, there are columns for the location and name of the file that contains that run's raw data.
2) Have a script read the master index file and use the table to read all your data files and add informative columns (like the sugar fraction). In R you can read each file into a list and then do.call(rbind, the list) or melt the list to get the full data table. Then merge with the master index table to attach the metadata columns.
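
The two steps can be sketched as follows. To keep the snippet self-contained, it first writes a tiny fake master index and two fake raw data files to a temporary directory. The file names, the column names (run_id, date, sugar_frac, file), and all the numbers are assumptions made up for this example.

```r
# --- Fake setup so the sketch runs anywhere (all names and numbers are made up) ---
data_dir <- tempfile("vp_runs_")
dir.create(data_dir)
write.csv(data.frame(T = c(20, 30), vp = c(2.3, 4.2)),
          file.path(data_dir, "run1.csv"), row.names = FALSE)
write.csv(data.frame(T = c(20, 30), vp = c(2.2, 4.1)),
          file.path(data_dir, "run2.csv"), row.names = FALSE)
write.csv(data.frame(run_id     = c(1, 2),
                     date       = c("2013-07-01", "2013-07-02"),
                     sugar_frac = c(0, 0.1),
                     file       = c("run1.csv", "run2.csv")),
          file.path(data_dir, "master_index.csv"), row.names = FALSE)

# --- Step 1: the master index, one row per experimental run ---
index <- read.csv(file.path(data_dir, "master_index.csv"),
                  stringsAsFactors = FALSE)

# --- Step 2: read every raw file listed in the index, tagging rows with run_id ---
runs <- lapply(seq_len(nrow(index)), function(i) {
  raw <- read.csv(file.path(data_dir, index$file[i]))
  raw$run_id <- index$run_id[i]
  raw
})

# Stack the per-run tables into one, then attach the metadata columns.
vp_temp <- do.call(rbind, runs)
vp_temp <- merge(vp_temp, index, by = "run_id")
```

After this, recording a new experiment means adding one row to the index file; the script itself does not change.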

Since 1 is a good idea no matter how you analyze data, you may as well tack on 2 and get rid of that mess of variables in your workspace.
