Introduction to R and Bioconductor, May 2012

Course Description

This course is an introduction to R and Bioconductor, a powerful and flexible statistical language for analysis of genetic and genomics data ( The course will introduce attendees to the basics of using R for statistical programming, computation, graphics, and modeling, especially for analyzing high-throughput genomic data. We will start with a basic introduction to the R language, reading and writing data, and plotting data. Case studies and data will all be based on real gene expression and genomics data. We will introduce the main classes and packages in Bioconductor. Our goal is to get attendees up and running with R and Bioconductor such that they can use it in their research and are in a good position to expand their knowledge of R and Bioconductor on their own.


  • Aedin Culhane contact:
  • Guest Lecturers:
    • Nicolai Juul Birkbak (SNP data)
    • Svitlana Tyekucheva, Alejandro Quiroz (Gene Set Analysis)
    • J. Fah Sathirapongsasuti (Exome Seq CNV)


May 16th- 9:15am – 5:00pm
May 17th – 9:15am – 12:00pm

Required Software

I recommend the following software.
  • R. Download R from from the R home page
  • The integrated development envirnoment (IDE) R Studio available for Windows, Mac or Linux OS
  • Install Bioconductor. Start R and type the following command (it requires an internet connection)
  • Latex
    • Windows: MikTex or there is a easy-to-install Tex software bundle called proTeXt which includes MikTex, the latex editor TeXnicCenter and Ghostscript
    • MacOS: MaxTEx
  • Latex Editor ( a comparison of editors )
These should be pretty straightforward download and install, but this document provides a little more detail instruction on installing R and Bioconductor (from May 2011) extension packages (not required for course).


Please install the following packages by copying and pasting this into R, or sourcing this script.
install.packages(c("R2HTML", "lme4", "manipulate", "RColorBrewer",
"googleVis","network","igraph","lattice","gplots", "venneuler")
Day 1
  1. History and Background to R, Installing R website slides
  2. Introduction to R language (classes, subsetting etc)
  3. Plotting in R
  4. Reading Data into R and Bioconductor
    ## Install Bioconductor Packages
    biocLite(c("GEOquery", "ArrayExpress", "frma", "arrayQualityMetrics", 
    "affycoretools", "made4"))
  5. Expression Set Class, S4 Class,Example data workflows in R
  6. Reading in SNP data
  7. Normalization and custom cdf files Install packages for this tutorial
  8. Clustering and Exploratory Data Analysis in R slides
  9. Feature Selection (Limma etc), GSEA and Annotating data using annotate/BiomaRt
  10. Literature mining, tag cloud (will not be presented during class)
    • See the Text Mining Package R library tm and wordcloud . There are also many network and graph packages will be covered later
    • Sample R script to create a word cloud from frequently occuring words in pubmed abstracts
    • Example of script output
    • Also see packages annotate, XML etc
      Day 2

      For today please install the required packages by coping these commands in R

      1. Recap on previous day
      2. Overview of CCCB and its resources (Mick Correll)
      3. Exome CNV analysis (Fah Sathirapongsasuti)
        • Slides (powerpoint)
        • Slides (pdf)
        • source("")

        • Userguide to Exome CNV
        • reference Sathirapongsasuti et al., 2011 Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV

        • library(ExomeCNV)
          chr.list = c("chr19","chr20","chr21")
          suffix = ".coverage"
          prefix = ""
          normal = read.all.coverage(prefix, suffix, chr.list, header=T)
          prefix = ""
          tumor  = read.all.coverage(prefix, suffix, chr.list, header=T)
          demo.logR = calculate.logR(normal, tumor)
          demo.eCNV = c()
          for (i in 1:length(chr.list)) {
              idx = (normal$chr == chr.list[i])
              ecnv = classify.eCNV(normal=normal[idx,], tumor=tumor[idx,], logR=demo.logR[idx], min.spec=0.9999, min.sens=0.9999, option="spec", c=0.5, l=70)
              demo.eCNV = rbind(demo.eCNV, ecnv)
          demo.cnv = multi.CNV.analyze(normal, tumor, logR=demo.logR,, coverage.cutoff=5, min.spec=0.99, min.sens=0.99, option="auc", c=0.5)
          do.plot.eCNV(demo.eCNV, lim.quantile=0.99, style="idx", line.plot=F)
          do.plot.eCNV(demo.cnv, lim.quantile=0.99, style="bp", bg.cnv=demo.eCNV, line.plot=T)
          write.output(demo.eCNV, demo.cnv, "demo")

        • Circos Circular image of genes
      4. Data Integration and Gene Set Analysis (Svitlana, Alejandro)
        For this analysis you will need the following packages

      Additional manuals

      These will not be covered in the course, but maybe helpful if you are new to R. Lecture notes from Bio503 Programming and Statistical Modeling in R (Jan 2011)

      New to R: Installation and getting help. Basic Introduction to R and Bioconductor

      R Resources

      R Web Services

      R Cloud Computing

      • There was an RCloud at EBI but I think this might be down

      R Web Development Kits

      (Thanks to Tom Girke for some of these links)

      Bioconductor Resources

      Methods comparing different feature selection approaches

      Links to Gene Set Analysis Resources

      RNA Seq Analysis Resources

      • General review: Oshlack A, Robinson MD, Young MD. (2010) From RNA-seq reads to differential expression results.Genome Biol. 2010;11(12):220. [PMID:21176179]
      • Useful tutorial from SeqAnswers
      • File Formats
        • FASTQ is the sequence of a short read. FASTQ is an extension to a FASTA file and bundles the sequence and its quality data. Increasing NGS data is stored as SAM/BAM files which contain aligned reads.
        • SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. It is rapidly becoming the standard for reporting short read alignment and is supported by a wide range of downstream analysis tools. SAM Tools provide various utilities for manipulating alignments in the SAM format
        • BAM files contain all the same information as SAM files, but is just compressed to be more space efficient and searchable, BAM files use a variant of gzip compression called BGZF. You should always convert your SAM files into BAM to save storage space and BAM files are faster to manipulate.
        • For more information see on SAM/BAM see: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools . Bioinformatics, 25, 2078-9. [PMID: 19505943]
      • RNA Seq Tutorial (by Marc Carlson, Valerie Obenchain, Herve Pages, Paul Shannon,Daniel Tenenbaum, Martin Morgan∗ from May 2012)
        • Complete RNA seq Tutorial Tutorial (pdf file) .
        • Download the R packages to run this tutorial
          pkgInstall("SeattleIntro2012", type="source")
        • The R code just to run the RNA-seq part of this tutorial (page 49-56 of the pdf) SeattleIntro2012_RNAseq.R and the data counts which are RNA-seq counts from the Pasilla data set.
          Brooks, AN, ̃Yang,L Duff, MO, Hansen, KD, Park,JW, ̃Dudoit, S, Brenner, SE and Graveley.BR (2011) Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research, 193–202,
      • Repositories:
        • Sequence Read Archive ( SRA) at NCBI. Use the GEO browser to find data
        • European Nucleotide Archive (ENA) at EBI, Cambridge
        • The ReCount repository from Bowtie also has several example datasets
      • A few of the many Bioconductor Packages
        • easyRNASeq a wrapper packages around several other packagse to make count summarization, normalization and analysis of RNA-seq data easier.
        • Normalization and exploratory data analysis of RNAseq data using EDAseq
        • Differential Gene Expression analysis (among replicates) edgeR and DESeq based on the negative binomial distribution.
        • Differential exon level expression analysis: DEXSeq
        • The package tweeDEseq makes the data package tweeDEseqCountData contains three data sets of counts from RNA-seq experiments by Cheung et al. (2010), Montgomery et al. (2010) and Pickrell et al. (2010). The latter is RNA seq data on lymphoblastoid cell lines derived from 69 unrelated nigerian individuals.
        • Gene Set Enrichment Analysis goseq
        • To install these
          biocLite(c("easyRNAseq", "EDAseq", "edgeR", "DEseq",
           "DEXseq", "tweeDEseq", "tweeDEseqCountData", goseq))

      Survival Analysis (Survcomp)

      A few Code Tips

      Function to create an ExpressionSet given 2 data matrices (or data.frames) containing 1) expression data and 2) annotation

      makeEset<-function(eSet, annt){
                #Creating an ExpressionSet from eSet, a normalized gene expression matrix
                # and annt, a data.frame containing annotation
          metadata <- data.frame(labelDescription = colnames(annt), row.names=colnames(annt))
          phenoData<-new("AnnotatedDataFrame", data=annt, varMetadata=metadata)
          if (inherits(eSet, "data.frame")) eSet= as.matrix(eSet)
          if (inherits(eSet, "ExpressionSet")) eSet=exprs(eSet)
          data.eSet<-new("ExpressionSet", exprs=eSet, phenoData=phenoData)

      Starting with Biomart

      geneAnnt<-getBM(attributes=c("affy_hg_u95av2","hgnc_symbol","chromosome_name","band", "entrezgene"),filters="affy_hg_u95av2",values=c("1939_at","1503_at","1454_at"), mart=mart)

      Updated.May 2012. Aedin Culhane