Courses in R and a bit more from Aedin Culhane

Intro to Bioconductor

Introduction to R and Bioconductor, May 2012


Course Description

This course is an introduction to R, a powerful and flexible statistical language, and Bioconductor, a collection of R packages for the analysis of genetic and genomic data (http://www.bioconductor.org/). The course will introduce attendees to the basics of using R for statistical programming, computation, graphics, and modeling, especially for analyzing high-throughput genomic data. We will start with a basic introduction to the R language, reading and writing data, and plotting data. Case studies and data will all be based on real gene expression and genomics data. We will introduce the main classes and packages in Bioconductor. Our goal is to get attendees up and running with R and Bioconductor so that they can use them in their research and are in a good position to expand their knowledge of R and Bioconductor on their own.

Instructors

  • Aedin Culhane contact: aedin@jimmy.harvard.edu
  • Guest Lecturers:
    • Nicolai Juul Birkbak (SNP data)
    • Svitlana Tyekucheva, Alejandro Quiroz (Gene Set Analysis)
    • J. Fah Sathirapongsasuti (Exome Seq CNV)

Schedule

May 16th: 9:15am – 5:00pm
May 17th: 9:15am – 12:00pm

Required Software

I recommend the following software.
  • R. Download R from the R home page
  • The integrated development environment (IDE) RStudio, available for Windows, Mac or Linux OS
  • Install Bioconductor. Start R and type the following commands (this requires an internet connection):
    source("http://www.bioconductor.org/biocLite.R")
    biocLite()
    
  • LaTeX
    • Windows: MiKTeX, or the easy-to-install TeX software bundle proTeXt, which includes MiKTeX, the LaTeX editor TeXnicCenter and Ghostscript
    • MacOS: MacTeX
  • LaTeX Editor (a comparison of editors)
    • Windows: Either TeXworks, TeXnicCenter or simply an enhanced notepad like Notepad++
    • MacOS: TeXShop
    • Linux: I tend to use either Kate (within KDE), Emacs or TeXworks, which is cross-platform
    • Convert TeX to an MS Word document using TeX4ht
These should be straightforward to download and install, but this document (from May 2011) provides more detailed instructions on installing R and Bioconductor extension packages (not required for the course).
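
The `biocLite.R` script shown above was the standard installer when this course ran in 2012. On recent R releases, Bioconductor packages are installed through the `BiocManager` package from CRAN instead. The sketch below assumes such a recent R. The helper names `missing_packages` and `install_course_packages` are made up for illustration and are not part of any package.

```r
# Return the subset of `pkgs` that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing via BiocManager (needs an internet connection).
# BiocManager::install() handles both CRAN and Bioconductor packages.
install_course_packages <- function(pkgs) {
  todo <- missing_packages(pkgs)
  if (length(todo) == 0) return(invisible(character(0)))
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(todo)
  invisible(todo)
}

# Nothing is installed until the function is called, for example:
# install_course_packages(c("GEOquery", "made4", "limma"))
```

Checking first and installing only what is missing keeps repeated runs of a setup script quick.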

Agenda

Please install the following packages by copying and pasting this into R, or sourcing this script.
install.packages(c("R2HTML", "lme4", "manipulate", "RColorBrewer", "googleVis", "network", "igraph", "lattice", "gplots", "venneuler"))
Day 1
  1. History and Background to R, Installing R (website, slides)
  2. Introduction to R language (classes, subsetting etc)
    • Introduction Notes
    • R code for Introduction Notes
    • Women.xls
    • Women.txt
  3. Plotting in R
    • Notes on Plotting in R
    • R code for Plotting Notes
    • Chart of colors in R
  4. Reading Data into R and Bioconductor
    ## Install Bioconductor Packages
    source("http://www.bioconductor.org/biocLite.R")
    biocLite(c("GEOquery", "ArrayExpress", "frma", "arrayQualityMetrics", "affycoretools", "made4"))
    biocLite("HGU95Av2_Hs_ENSG", repos="http://brainarray.mbni.med.umich.edu/bioc")
    
    • Notes on reading/writing data in R
    • Rcode for reading/writing data in R
    • Women.xls
    • Women.csv
    • Women.txt
  5. Expression Set Class, S4 Class, Example data workflows in R
    • slides
    • Function to create an ExpressionSet given 2 matrices (or data.frames) containing 1) expression data and 2) annotation
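      library(Biobase)  # provides the ExpressionSet and AnnotatedDataFrame classes
      makeEset <- function(eSet, annt) {
          # Create an ExpressionSet from eSet, a normalized gene expression matrix
          # (or data.frame, or existing ExpressionSet), and annt, a data.frame of
          # sample annotation whose row names match the column names of eSet
          metadata <- data.frame(labelDescription = colnames(annt), row.names = colnames(annt))
          phenoData <- new("AnnotatedDataFrame", data = annt, varMetadata = metadata)
          # Coerce the expression data to a plain matrix
          if (inherits(eSet, "data.frame")) eSet <- as.matrix(eSet)
          if (inherits(eSet, "ExpressionSet")) eSet <- exprs(eSet)
          data.eSet <- new("ExpressionSet", exprs = eSet, phenoData = phenoData)
          print(varLabels(data.eSet))  # show the annotation column names
          return(data.eSet)
      }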
      
  6. Normalization and custom CDF files (install packages for this tutorial)
    • Notes on normalization including using custom cdf file
    • R code for Normalization Notes
    • zip of cel files
    • Results Report from arrayQualityMetrics
  7. Clustering and Exploratory Data Analysis in R (slides)
      Install packages for this tutorial:
      ## Install Bioconductor Packages
      source("http://www.bioconductor.org/biocLite.R")
      biocLite("made4")
      biocLite("hgu95av2.db")
      
    • Notes on EDA in R
    • R code for EDA Notes
    • data.vsn
    • annt.txt
    • Other potentially useful packages: EDASeq (EDA for sequence data)
  8. Feature Selection (limma etc), GSEA and Annotating data using annotate/biomaRt
    • Feature Selection and Gene Annotation: Notes on Feature Selection using limma and Annotating Genes in R
    • R code for Feature Selection/Annotation Notes
    • References which compare different feature selection approaches
      • Jeffery IB, Higgins DG, Culhane AC. (2006) Comparison and evaluation of microarray feature selection methods. BMC Bioinformatics 7:359.
      • Murie C, Woody O, Lee AY, Nadon R. (2009) Comparison of small n statistical tests of differential expression applied to microarrays. BMC Bioinformatics. 10:45.
    • Gene Annotation HTML output: results of aafTableAnn()
  9. Gene Set Analysis. For this analysis you will need the following packages:
    source("http://bioconductor.org/biocLite.R")
    biocLite("RTopper")
    biocLite("limma")
    
    • Slides (Svitlana)
    • Gene Set Data for Exercise (Svitlana)
    • Exercise R Code (Svitlana)
      GeneSet and Pathway Databases
    • GeneSigDB http://compbio.dfci.harvard.edu/genesigdb
    • MSigDB http://www.broadinstitute.org/gsea/msigdb/index.jsp
    • KEGG http://www.genome.jp/kegg/
    • GO http://www.geneontology.org/
  10. Literature mining, tag cloud (will not be presented during class)
    • See the text mining R packages tm and wordcloud. There are also many network and graph packages, which will be covered later
    • Sample R script to create a word cloud from frequently occurring words in PubMed abstracts
    • Example of script output
    • Also see packages annotate, XML etc
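
The word cloud script mentioned in item 10 starts from a table of word frequencies. A minimal base-R sketch of that counting step follows. The three sentences are made-up stand-ins for PubMed abstracts and the stop-word list is deliberately tiny; the `tm` package offers more complete preprocessing.

```r
# Made-up example text standing in for downloaded abstracts (an assumption).
abstracts <- c(
  "Gene expression profiling of breast cancer",
  "Expression of the gene in cancer cells",
  "Cancer gene signatures and expression data"
)

# Lower-case, split on anything that is not a letter, and drop empty strings.
words <- unlist(strsplit(tolower(abstracts), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Remove a few very common words (a tiny illustrative stop-word list).
stopWords <- c("of", "the", "in", "and")
words <- words[!words %in% stopWords]

# Count and sort; the names are the words, the values their frequencies.
wordFreq <- sort(table(words), decreasing = TRUE)
print(head(wordFreq))

# With the wordcloud package installed, these counts can be drawn with
# wordcloud::wordcloud(names(wordFreq), as.integer(wordFreq), min.freq = 1)
```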
Day 2

      For today please install the required packages by copying these commands into R

      1. Recap on previous day - Quiz
      2. Overview of CCCB and its resources (Mick Correll)
      3. Exome Data (Fah)
      4. Data Integration and Gene Set Analysis - RTopper, GeneGroupAnalysis (Svitlana, Alejandro)
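
One simple idea behind many gene set analyses is over-representation testing: does a gene set turn up among the selected genes more often than expected by chance? The sketch below uses the hypergeometric test from base R with made-up counts. It is an illustration only and is not the method implemented in RTopper or GeneGroupAnalysis.

```r
# Made-up counts for illustration (assumptions, not course data).
nUniverse <- 10000  # genes measured on the array
nSet      <- 100    # genes in the gene set of interest
nSelected <- 200    # genes called differentially expressed
nOverlap  <- 10     # selected genes that belong to the gene set

# Expected overlap if the selected genes were drawn at random.
expectedOverlap <- nSelected * nSet / nUniverse

# P(overlap >= nOverlap) under the hypergeometric distribution.
# phyper(q, m, n, k) gives P(X <= q), hence nOverlap - 1 with lower.tail = FALSE.
pHyper <- phyper(nOverlap - 1, nSet, nUniverse - nSet, nSelected, lower.tail = FALSE)

# The same test as a one-sided Fisher exact test on the 2 x 2 table.
# The matrix is filled column by column: selected = yes first, then selected = no.
tab <- matrix(c(nOverlap,        nSelected - nOverlap,
                nSet - nOverlap, nUniverse - nSet - nSelected + nOverlap),
              nrow = 2,
              dimnames = list(inSet = c("yes", "no"), selected = c("yes", "no")))
pFisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(expected = expectedOverlap, hypergeometric = pHyper, fisher = pFisher))
```

The choice of gene universe (here, all genes on the array) strongly affects the result, so it is worth stating explicitly in any real analysis.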



      Additional manuals

      These will not be covered in the course, but may be helpful if you are new to R. Lecture notes from Bio503 Programming and Statistical Modeling in R (Jan 2011)

      New to R: Installation and getting help. Basic Introduction to R and Bioconductor
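
For readers new to R, the short sketch below touches the introductory agenda topics (classes, subsetting, and reading and writing data). It uses the `women` data set that ships with R. Whether the course file Women.txt holds the same data is an assumption, so the file here is written to a temporary location rather than taken from the course site.

```r
# Built-in data: average heights and weights for 15 American women.
data(women)
print(class(women))
str(women)

# Subsetting rows by a logical condition and a column by name.
tall <- women[women$height > 65, ]
weights <- women[["weight"]]

# Write a tab-delimited text file and read it back.
txtFile <- tempfile(fileext = ".txt")
write.table(women, file = txtFile, sep = "\t", quote = FALSE, row.names = FALSE)
women2 <- read.table(txtFile, header = TRUE, sep = "\t")

# Simple summaries of the re-read data.
print(dim(women2))
print(summary(women2$height))
```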


      R Resources

      • An excellent beginners guide to R is from Emmanuel Paradis
      • Introduction to R classes and objects on R site
      • Tom Short’s R reference card and other contributed documents are available from the R contributed documentation
      • Stephen Eglen’s publication in PLoS Computational Biology, A Quick Guide to Teaching R Programming to Computational Biology Students, includes links to lecture notes and an overview of useful introductory books on R.
      • Simple Intro to Linear Models view
      • R Web Services
        • Rweb
        • caBIG

      • R Cloud Computing
        • There was an RCloud at EBI but I think this might be down
        • CRdata.org

      • R Web Development Kits
        • RWebServices - exposes R packages as SOAP-based web services
        • Biocep-R
        • Galaxy - Web Integration Framework. Includes SAMtools and tools for analysis of NGS data
      (Thanks to Thomas Girke for some of these links)

      Bioconductor Resources

      • Bioconductor Courses
      • A starter to Affymetrix data analysis: Jean Wu’s old but still useful lab on Affymetrix data analysis
      • Thomas Girke’s (UC Riverside) excellent and extensive intro into R and Bioconductor
      • Thomas Girke’s (UC Riverside) Analysis of RNA-Seq, ChIP-Seq and SNP-Seq Data using R and Bioconductor


    1. Survival Analysis (Survcomp, Ben)
      • Notes on Survival
      • R code for survival
      • Rnw file for survival
      • Manuscript describing survcomp

      A few Code Tips

      Function to create an ExpressionSet given 2 data matrices (or data.frames) containing 1) expression data and 2) annotation

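      library(Biobase)  # provides the ExpressionSet and AnnotatedDataFrame classes
      makeEset <- function(eSet, annt) {
          # Create an ExpressionSet from eSet, a normalized gene expression matrix
          # (or data.frame, or existing ExpressionSet), and annt, a data.frame of
          # sample annotation whose row names match the column names of eSet
          metadata <- data.frame(labelDescription = colnames(annt), row.names = colnames(annt))
          phenoData <- new("AnnotatedDataFrame", data = annt, varMetadata = metadata)
          # Coerce the expression data to a plain matrix
          if (inherits(eSet, "data.frame")) eSet <- as.matrix(eSet)
          if (inherits(eSet, "ExpressionSet")) eSet <- exprs(eSet)
          data.eSet <- new("ExpressionSet", exprs = eSet, phenoData = phenoData)
          print(varLabels(data.eSet))  # show the annotation column names
          return(data.eSet)
      }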
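
An `ExpressionSet` is only valid when the sample names of the annotation match the column names of the expression matrix, in the same order. A small base-R check run before calling `makeEset` can catch mismatches early. The helper name `checkEsetInputs` and the toy data are made up for illustration.

```r
# Hypothetical helper: TRUE when the samples in `annt` line up with the
# columns of `exprsData` (same names, same order), otherwise FALSE.
checkEsetInputs <- function(exprsData, annt) {
  if (is.null(colnames(exprsData)) || is.null(rownames(annt))) return(FALSE)
  identical(as.character(colnames(exprsData)), as.character(rownames(annt)))
}

# Toy data: 4 genes measured in 3 samples (made-up values).
exprsData <- matrix(1:12, nrow = 4,
                    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
annt <- data.frame(group = c("A", "A", "B"), row.names = colnames(exprsData))

print(checkEsetInputs(exprsData, annt))                        # matching order
print(checkEsetInputs(exprsData, annt[3:1, , drop = FALSE]))   # reversed order

# If the names match but the order differs, reorder the annotation first:
# annt <- annt[colnames(exprsData), , drop = FALSE]
# eset <- makeEset(exprsData, annt)   # needs Biobase and makeEset from above
```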

      Starting with biomaRt

       
      library(biomaRt)
      # Connect to Ensembl and select the human gene dataset
      mart <- useMart("ensembl")
      mart <- useDataset("hsapiens_gene_ensembl", mart)
      # Annotate three HG-U95Av2 probe sets
      geneAnnt <- getBM(attributes = c("affy_hg_u95av2", "hgnc_symbol", "chromosome_name", "band", "entrezgene"),
                        filters = "affy_hg_u95av2", values = c("1939_at", "1503_at", "1454_at"),
                        mart = mart)
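
`getBM` returns a data.frame with one row per combination of attribute values, so a probe set that maps to several genes appears on several rows. The base-R sketch below collapses such a table to one row per probe set. The table is hand-made to mimic some of the columns requested above; the symbols and Entrez IDs in it are placeholders, not real annotation.

```r
# Hand-made stand-in for getBM() output (placeholder values, not real annotation).
geneAnnt <- data.frame(
  affy_hg_u95av2 = c("1939_at", "1503_at", "1503_at", "1454_at"),
  hgnc_symbol    = c("SYMBOL1", "SYMBOL2", "SYMBOL3", "SYMBOL4"),
  entrezgene     = c(101, 102, 103, 104),
  stringsAsFactors = FALSE
)

# Paste together the unique values seen for each probe set.
collapse <- function(x) paste(unique(x), collapse = ";")
perProbe <- aggregate(geneAnnt[, c("hgnc_symbol", "entrezgene")],
                      by = list(affy_hg_u95av2 = geneAnnt$affy_hg_u95av2),
                      FUN = collapse)

print(perProbe)
```

Collapsing to one row per probe set makes it straightforward to merge the annotation with a results table keyed by probe set ID.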
      

      Updated.May 2012. Aedin Culhane