Introduction to R, CDC, June 25-28th 2012


Instructor information

Aedin Culhane contact: aedin@jimmy.harvard.edu


Install and Set up script

Install script

Datasets for Exercises

FUN

Manual and R code


Questions from Class

  1. How do I sort a matrix by 2 columns, one in decreasing order, the second ascending?
    There are 2 ways to do this.
    ## create an example dataset
    x <- matrix(c(2,1,1,3,.5,.3,.5,.2), ncol=2)
    x
         [,1] [,2]
    [1,]    2  0.5
    [2,]    1  0.3
    [3,]    1  0.5
    [4,]    3  0.2
    
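    Sort the data in 2 steps:
    # First sort by the second column in decreasing order
    x1 <- x[order(x[,2], decreasing=TRUE),]
    # Then sort the partially sorted matrix by the first column.
    # order() leaves ties in their existing order, so rows tied on column 1 stay sorted by column 2
    x2 <- x1[order(x1[,1]),]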
    
    Or, if both columns are numeric, sort in one step: negating a column reverses its sort order
    x[order(x[,1], -x[,2]),] 
    
    If the values aren't known to be numeric (for example character or factor columns), use xtfrm() to convert them to numeric values that sort in the same order
    x[order(xtfrm(x[,1]), -xtfrm(x[,2])),]
    
    Note that with both of these, NA values are placed at the end of the result, whether the sort is increasing or decreasing
    z.vec<-c(5,NA,8,2,3.2)
    order(z.vec)
    z.vec[order(z.vec)]
    ## Results in  2.0 3.2 5.0 8.0  NA
    
    z.vec[order(z.vec, decreasing=TRUE)]
    ## Results in 8.0 5.0 3.2 2.0  NA
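
    The position of NA values is controlled by the na.last argument of order(). A minimal sketch, reusing the same z.vec (the variable names naFirst and naDropped are only for illustration):

    ```r
    z.vec <- c(5, NA, 8, 2, 3.2)

    # na.last = FALSE puts NA values first
    naFirst <- z.vec[order(z.vec, na.last = FALSE)]
    ## Results in  NA 2.0 3.2 5.0 8.0

    # na.last = NA drops NA values from the ordering
    naDropped <- z.vec[order(z.vec, na.last = NA)]
    ## Results in 2.0 3.2 5.0 8.0
    ```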
    
  2. Reading compressed data into R
    Files compressed with gzip can be read through connections created by the function gzfile, and files compressed with bzip2 through bzfile. If your data is in a gzip-compressed file, gzfile decompresses it on the fly (a tar.gz archive must first be unpacked, for example with untar). Do this:
    myDataFrame <- read.table(gzfile("myData.gz"), header=TRUE)
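
    To try this without an existing compressed file, the round trip below may help. It is only a sketch: the temporary files and the small example table are made up for illustration.

    ```r
    # A small made-up table to write out and read back
    original <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

    # Write it through a gzip connection
    gzPath <- tempfile(fileext = ".gz")
    con <- gzfile(gzPath, open = "wt")
    write.table(original, con, row.names = FALSE, quote = FALSE)
    close(con)

    # Write it through a bzip2 connection
    bzPath <- tempfile(fileext = ".bz2")
    con <- bzfile(bzPath, open = "wt")
    write.table(original, con, row.names = FALSE, quote = FALSE)
    close(con)

    # Read both back; decompression happens on the fly
    fromGz <- read.table(gzfile(gzPath), header = TRUE)
    fromBz <- read.table(bzfile(bzPath), header = TRUE)
    ```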
    
  3. Functions for analyzing survey data are available in the survey package
  4. Reading specialized data types: SEER data can be read with SEER2R
  5. Merging data frames or matrices: see the attached Reports doc
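
The Reports doc is not reproduced here. As a rough sketch of merging with base R, assuming two made-up data frames (patients and visits) that share an id column:

```r
# Two hypothetical data frames with a common key column
patients <- data.frame(id = c(1, 2, 3), age = c(34, 51, 47))
visits   <- data.frame(id = c(2, 3, 4), nvisits = c(5, 1, 2))

# Keep only the ids present in both data frames
inner <- merge(patients, visits, by = "id")

# Keep every id; values missing from one side are filled with NA
outer <- merge(patients, visits, by = "id", all = TRUE)
```

merge() also accepts matrices, which it coerces to data frames. When the rows of two matrices are already in the same order, cbind() is the simpler choice.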

Making Reports in R


Resources for Spatial Data Analysis


If you have your own laptop, I recommend the following software for this course

  • R Software: Download R from the R home page and, if you wish, the integrated development environment (IDE) RStudio, which is available for Windows, Mac or Linux OS
  • Windows software: Download MiKTeX and an editor such as TeXworks or TeXnicCenter, or simply use an enhanced notepad like Notepad++
  • Windows: There is also an easy-to-install TeX software bundle called proTeXt, which includes MiKTeX, TeXnicCenter and Ghostscript
  • MacOS software: Download MacTeX, with TeXShop for editing
  • Texmaker is a free cross-platform TeX editor
  • Linux: I tend to use either Kate (within KDE), Emacs or TeXworks, which is cross-platform
  • More on LaTeX editors from Wikipedia
  • Convert TeX to an MS Word document using TeX4ht

These should each be straightforward to download and install, but more detailed instructions are provided in installing R (from May 2011)


Exploratory Data Analysis

  • Importance of EDA
  • Clustering Data using hierarchical cluster analysis
  • Dimension reduction using principal components analysis
  • Interpreting Results of EDA
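
As a small illustration of the two techniques listed above, the sketch below uses only base R. The USArrests dataset, the average linkage and the choice of 4 clusters are assumptions made for this example and are not taken from the course material.

```r
# Built-in example data: 50 US states, 4 numeric variables, scaled to unit variance
dat <- scale(USArrests)

# Hierarchical cluster analysis on Euclidean distances with average linkage
hc <- hclust(dist(dat), method = "average")
groups <- cutree(hc, k = 4)   # cut the tree into 4 groups
table(groups)

# Principal components analysis
pca <- prcomp(dat)
summary(pca)                  # proportion of variance per component

# Dendrogram, then the first two components coloured by cluster
plot(hc, cex = 0.6)
plot(pca$x[, 1:2], col = groups, pch = 19, xlab = "PC1", ylab = "PC2")
```

Comparing the dendrogram with the PCA plot is one way to check whether the clusters correspond to visible structure in the leading components.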

Exploratory Data Analysis (slides)

Exercise/tutorial

## Install R packages for this tutorial

install.packages("RCurl")
install.packages("gplots")
install.packages("scatterplot3d")

## Install Bioconductor Packages
source("http://www.bioconductor.org/biocLite.R")
#biocLite()
biocLite("made4")
biocLite("hgu95av2.db")


Reproducible Research

  • Importance of Reproducible Research
  • Why should we perform reproducible research
  • A survey of reproducibility, case studies
  • Using Sweave

Reproducible Research (A short course)


Updated June 2012. Aedin Culhane