Friday, January 4, 2013

Some R tips


Greetings, present or future UseRs. Last night I did a typical data processing task similar to the ones I used to have to do when I began working on the Nanog project. This time it took a few hours instead of a day, and far fewer lines of bookkeeping code. A good part of this has to do with packages that extend base R with capabilities that seemed irrelevant when I first started learning how to do this stuff. Here they are:

String manipulation


library(stringr)

Allows for easy, sensible and consistent vectorized string operations, including anything you might want to do with regular expressions. Documentation: browse the help index for the stringr package. The function names transparently reveal what each one does.
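To give a flavor of what "vectorized" means here, below is a minimal sketch. The sample labels are made up for illustration; the functions (str_extract, str_detect, str_split_fixed, str_replace) are regular stringr functions, each applied to the whole vector at once with no loop.

```r
library(stringr)

# Hypothetical sample labels, invented for this example
samples <- c("nanog_t0_drug", "nanog_t12_ctrl", "myc_t24_drug")

# Extract the first run of digits from every label (the time point)
time_h <- as.numeric(str_extract(samples, "[0-9]+"))

# Test every label against a regular expression
has_drug <- str_detect(samples, "drug$")

# Split on underscores into a character matrix, one row per label
parts <- str_split_fixed(samples, "_", 3)

# Replace the first match in every label
renamed <- str_replace(samples, "^nanog", "Nanog")
```

Every call takes the string vector first and the pattern second, which is most of what makes the package feel consistent.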


Data wrangling

A lot of work goes into massaging data, your own or someone else's, into different formats. This is especially relevant if you have a bunch of measurements that are each classified along several categories at once. For example, gene expression from a time series with and without a drug (two categories: time and drug). Or ChIP binding affinities for a bunch of transcription factors to every gene.

library(reshape2)

Fundamentally, this package gives you a sensible way to move information between being encoded as row entries and being encoded as column names. I know, why go through the trouble, right? I thought so too. But after I learned it, I found a huge number of applications, and I actually store most of my data as 'melted' data now. The documentation and philosophy are in Chapter 2 of the creator's thesis: http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
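A small sketch of the round trip between row entries and column names, using melt() and dcast() from reshape2. The table and its column names (gene, t0, t12, expression) are hypothetical, chosen only to mirror the time-series example above.

```r
library(reshape2)

# Hypothetical wide table: one row per gene, one column per time point
wide <- data.frame(
  gene = c("geneA", "geneB"),
  t0   = c(1.0, 2.0),
  t12  = c(1.5, 0.5),
  stringsAsFactors = FALSE
)

# melt(): the column names t0 and t12 become entries in a 'time' column
long <- melt(wide, id.vars = "gene",
             variable.name = "time", value.name = "expression")

# dcast(): the entries of 'time' become column names again
wide_again <- dcast(long, gene ~ time, value.var = "expression")
```

The 'melted' form (long) has one measurement per row, which is the shape that plyr, data.table and ggplot2 all work with most naturally.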

library(plyr)

If you want to split up data by category, do something to each piece (possibly summarize it), and put the pieces back together, with your choice of input and output formats, this is your ticket. This package makes obsolete almost every for loop and a whole bunch of now-pointless bookkeeping code I had to write, plus it makes my intent very clear to myself. Documentation: (another one of Mr. Wickham's creations) http://www.jstatsoft.org/v40/i01/paper
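As a rough sketch of split, apply, combine: the data frame below is made up, and ddply() with summarise is only one of several plyr entry points (the first letter of the function name is the input format, the second the output format, so ddply goes from data frame to data frame).

```r
library(plyr)

# Hypothetical melted data: expression for two genes, with and without drug
expr <- data.frame(
  gene       = rep(c("geneA", "geneB"), each = 4),
  drug       = rep(c("ctrl", "drug"), times = 4),
  expression = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Split by gene and drug, summarize each piece, combine into one data frame
means <- ddply(expr, c("gene", "drug"), summarise,
               mean_expression = mean(expression),
               n = length(expression))
```

The result has one row per gene and drug combination, and there is no loop or index bookkeeping anywhere in sight.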

library(data.table)

If you have a huge number of categories, like if your categories are gene names, plyr can be slow. In that case you can trade some flexibility for speed, sometimes as much as a 1000x speedup, by using data.table. Good for simple calculations like within-group averaging or finding maxima. Documentation: various vignettes and FAQs are available on the web.
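Here is a minimal sketch of within-group averaging and maxima with data.table. The table is tiny and invented; the point is only the dt[, j, by] form, which is what stays fast when there are thousands of groups. No timing is shown, so take the speed claim as my experience rather than something this snippet demonstrates.

```r
library(data.table)

# Hypothetical data: a few measurements per gene
dt <- data.table(
  gene       = rep(c("geneA", "geneB", "geneC"), each = 2),
  expression = c(1, 3, 10, 20, 5, 7)
)

# Within-group mean and maximum, one row per gene
by_gene <- dt[, list(mean_expression = mean(expression),
                     max_expression  = max(expression)),
              by = gene]
```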


Plotting

library(ggplot2)


This package is hard to describe. It is a totally different way of specifying plots than I am used to in a command line environment. It is almost like a command line version of bringing up the plot wizard in Excel, except it has many more options, and the output with default settings is even more stunning and efficient than base R graphics. It is another package that makes obsolete a lot of painstaking graphics adjustment, subplot accounting, and complications in changing what kind of plot I want. Documentation is in Chapter 3 of the creator's thesis, once again: http://had.co.nz/thesis/practical-tools-hadley-wickham.pdf
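To make the "different way of specifying plots" a bit more concrete, here is a small sketch with made-up data. You map columns to visual properties with aes(), then add layers and facets with +; the column names (time, drug, expression) are hypothetical.

```r
library(ggplot2)

# Hypothetical melted data: expression over time, with and without drug
expr <- data.frame(
  time       = rep(c(0, 12, 24), times = 2),
  drug       = rep(c("ctrl", "drug"), each = 3),
  expression = c(1.0, 1.2, 1.1, 1.0, 2.0, 3.5)
)

# Map columns to aesthetics, then add layers; facet_wrap() makes the subplots
p <- ggplot(expr, aes(x = time, y = expression, colour = drug)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ drug)

# Typing p at the console (or calling print(p) in a script) draws the plot
```

Changing the kind of plot is a matter of swapping one geom for another, and dropping the facet_wrap() line puts both conditions on a single panel.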

