Biostatistics 140.688: Statistics for Gene Expression, Spring 2002

Monday and Wednesday from 10:30 to 12
School of Public Health Building, room W3031

COURSE SYLLABUS AND MATERIALS

A Calder mobile is a good way to picture how a dendrogram, which is like a photograph of the mobile,
can be a deceptive map of the positions of objects in higher-dimensional spaces.

OVERVIEW

The goal of this class is to introduce the statistical concepts and tools necessary to interpret and critically evaluate the literature on gene expression array data. Advanced statistical material will be presented at an intuitive level. However, knowledge of basic hypothesis testing, ANOVA, and linear and logistic regression is a prerequisite.

MIDTERM EXAM

The midterm exam is a take-home data analysis project. The experiment to be analyzed consists of 8 two-channel arrays, 4 in each of two groups. The green channel is always reference RNA. The red channel is from wild type mice in group 1 and from knockout mice in group 2. The reference is unrelated to either type. The goal is to identify genes that may be differentially expressed in the two groups. A project should contain two parts: a) normalization; and b) identification of differentially expressed genes.
You are free to use any software package you like for the project, but you should explain the procedure followed by the software and why that procedure is appropriate for the data at hand. Suggested software includes sma, SNOMAD, SAM and Cyber-T. Links for these are in the syllabus. Additional packages are linked below.

Data are available in R format as midterm.data.dget, which you can load into R directly using the command
midterm.data <- dget("midterm.data.dget")

Data are also available as tab-delimited ASCII as midterm.data.txt, if you want to import them into Excel or other programs.

Columns are labeled R1-R8 and G1-G8. R and G indicate the red and green channels. Numbers indicate arrays. Arrays 1-4 are the wild type group, arrays 5-8 are the knockout group.
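
As a rough illustration of the two parts of the project, here is a minimal R sketch that computes log-ratios from columns named as above, centers each array, and computes a t statistic per gene. The intensities are simulated (the object sim and the gene count are made up for the example), and global median centering followed by a Welch t-test is only the simplest placeholder; it is not meant as the recommended analysis, and intensity-dependent normalization as in the suggested packages may well be more appropriate.

```r
# Sketch only: simulated intensities stand in for the real midterm data.
set.seed(1)
n.genes <- 200  # hypothetical number of spots

# Hypothetical data frame using the column labels R1-R8 and G1-G8
sim <- as.data.frame(matrix(rexp(n.genes * 16, rate = 1 / 1000) + 50,
                            nrow = n.genes))
names(sim) <- c(paste0("R", 1:8), paste0("G", 1:8))

R <- as.matrix(sim[, paste0("R", 1:8)])
G <- as.matrix(sim[, paste0("G", 1:8)])

# Log-ratio (M) and average log-intensity (A), one column per array
M <- log2(R) - log2(G)
A <- (log2(R) + log2(G)) / 2

# Placeholder normalization: subtract each array's median log-ratio
M.norm <- sweep(M, 2, apply(M, 2, median))

# Arrays 1-4 are wild type, arrays 5-8 are knockout
group <- rep(c("wt", "ko"), each = 4)

# Welch two-sample t statistic for each gene (row)
t.stat <- apply(M.norm, 1, function(m)
  t.test(m[group == "ko"], m[group == "wt"])$statistic)

# Row indices of the genes with the largest absolute t statistics
head(order(abs(t.stat), decreasing = TRUE))
```

With only 4 arrays per group, gene-by-gene t statistics are unstable, which is one reason packages such as SAM and Cyber-T modify the denominator.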

The layout of the array in sma format is in midterm.setup.dget. If you don't use sma, the layout is included in the tab-delimited file and is the same as that of the arrays described in the Yang et al. paper linked to the Apr 3 lecture.

I tried things out on R 1.4.1 on Linux and on R 1.4.0 on Windows 98. On Windows, I got it to work by first changing the working directory from the File pulldown menu at the top left, and then loading the data from the R prompt as described above.

Don't forget to assign the result of the dget command to something, as in
midterm.data <- dget("midterm_data.dget")
This will read the file and put it into an R object called midterm.data; if you just call dget without assigning the result, R will print the whole object at the prompt.

Also, don't try to read the .txt file using dget. You can read the .txt file into R using read.table, but if you dget the file "midterm_data.dget", the results will already be in the format you need for sma, without additional work.
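
The difference between the two ways of reading data can be sketched as follows. The example writes a tiny made-up table to a temporary file so that it runs anywhere; the object toy and its values are invented, and header = TRUE is an assumption about the .txt file (that its first line holds the column labels).

```r
# Sketch only: a tiny stand-in table, not the real midterm data.
toy <- data.frame(R1 = c(1200, 850), G1 = c(1100, 900))

# Tab-delimited text is read with read.table()
txt.file <- tempfile(fileext = ".txt")
write.table(toy, txt.file, sep = "\t", quote = FALSE, row.names = FALSE)
# header = TRUE is an assumption: first line taken as column labels
dat <- read.table(txt.file, header = TRUE, sep = "\t")
str(dat)

# A file written by dput() is read back with dget(); assigning the
# result keeps the object instead of printing it at the prompt
dget.file <- tempfile(fileext = ".dget")
dput(toy, file = dget.file)
obj <- dget(dget.file)
names(obj)
```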

The deadline for the midterm is May 8 by class time. Please email your project.

The list of truly differentially expressed genes in the simulated data set is here

FINAL EXAM

The final exam is a take-home data analysis project. The experiment to be analyzed consists of 75 samples, each hybridized to one array. Samples are from one of two known classes. The goal is to construct and evaluate a classification algorithm to predict class based on the expression profiles. I have divided the samples into a training set of 50 arrays and a validation set of 25 arrays. Data files are here:

final.expr.train.txt Expression matrix for the 50 training samples
final.expr.valid.txt Expression matrix for the 25 validation samples
final.phen.train.txt Phenotype (binary class indicator) matrix for the 50 training samples
final.phen.valid.txt Phenotype (binary class indicator) matrix for the 25 validation samples

All files are tab delimited text. Expression matrices have one row per gene, one column per sample. Entries are expression measures. They are already normalized and roughly centered by gene. Phenotypes are vectors of length 50 and 25 respectively.

You can use any approach you like and any software you like. Many of the packages under "data mining" in this List have classification tools. Excel users may want to check out the BRB site.

Please develop your classifier on the 50 training samples, and report to me how you developed your classifier and how it performs on the validation set. You can present more than one classifier (although one is enough). The only rule is this: please do not use the validation sample to train the classifier, and do not go back and retrain the classifier if it does poorly on the validation sample. I will not grade you on how well you do on the validation sample.
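
To make the rule concrete, here is a minimal sketch of the workflow in R, using simulated data with genes in rows and samples in columns. Everything here is an assumption made for illustration: the simulated matrices, the 0/1 class coding, the choice of k = 10 genes, and the nearest-centroid rule itself, which is just one simple classifier among many. The point is only that gene selection and centroid estimation use the training set alone, and the validation set is touched once at the end.

```r
# Sketch only: simulated stand-ins for the training and validation data.
set.seed(2)
n.genes <- 100
y.train <- rep(c(0, 1), each = 25)
y.valid <- rep(c(0, 1), times = c(12, 13))
x.train <- matrix(rnorm(n.genes * 50), nrow = n.genes)
x.valid <- matrix(rnorm(n.genes * 25), nrow = n.genes)
# Make the first 10 genes informative in both sets
x.train[1:10, y.train == 1] <- x.train[1:10, y.train == 1] + 2
x.valid[1:10, y.valid == 1] <- x.valid[1:10, y.valid == 1] + 2

# Step 1 (training data only): keep the k genes with largest |t|
k <- 10
t.stat <- apply(x.train, 1, function(g)
  t.test(g[y.train == 1], g[y.train == 0])$statistic)
keep <- order(abs(t.stat), decreasing = TRUE)[1:k]

# Step 2 (training data only): class centroids over the selected genes
cent0 <- rowMeans(x.train[keep, y.train == 0])
cent1 <- rowMeans(x.train[keep, y.train == 1])

# Nearest-centroid rule: assign each sample (column) to the closer centroid
nearest.centroid <- function(x) {
  d0 <- colSums((x[keep, , drop = FALSE] - cent0)^2)
  d1 <- colSums((x[keep, , drop = FALSE] - cent1)^2)
  as.numeric(d1 < d0)
}

# Step 3: apply the frozen classifier to the validation set exactly once
pred <- nearest.centroid(x.valid)
table(truth = y.valid, predicted = pred)
mean(pred != y.valid)  # validation error rate
```

If you want to choose k or compare classifiers, do it by cross-validation within the training set, repeating the gene selection inside each fold.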

R has software for all the classification algorithms described in class. To load the data in R,
expr <- read.table("final.expr.train.txt", header=FALSE)
will create a data frame named expr with your expression matrix for the 50 training samples, while
expr <- as.matrix(read.table("final.expr.train.txt", header=FALSE))
will create a matrix named expr with your expression matrix for the 50 training samples.
Similar commands will work for the other files.
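
A quick sanity check after loading might look like the sketch below. It uses small temporary files with made-up names and contents rather than the course files, and it assumes that the phenotype file holds numeric class indicators, which scan() reads into a plain vector whether they sit on one line or several.

```r
# Sketch only: small temporary files stand in for the course data files.
tmp.dir <- tempdir()
expr.file <- file.path(tmp.dir, "toy.expr.txt")  # hypothetical file name
phen.file <- file.path(tmp.dir, "toy.phen.txt")  # hypothetical file name
write.table(matrix(rnorm(5 * 4), nrow = 5), expr.file, sep = "\t",
            row.names = FALSE, col.names = FALSE)
write.table(matrix(c(0, 1, 0, 1), nrow = 1), phen.file, sep = "\t",
            row.names = FALSE, col.names = FALSE)

# Expression matrix: one row per gene, one column per sample
expr <- as.matrix(read.table(expr.file, header = FALSE))
# Phenotype vector (assumes numeric class indicators)
phen <- scan(phen.file, quiet = TRUE)

# There should be one phenotype entry per sample (column)
stopifnot(ncol(expr) == length(phen))

# Many R modeling functions expect samples in rows, so transpose
x <- t(expr)
dim(x)
```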

The deadline for the final is Friday, May 24th, by 11:30 pm. Please email your project.

SOFTWARE

Students are free to use any software they like to analyze array data in this class. Some software packages will be illustrated in class, and occasionally a live R analysis will be shown step by step. These links include a variety of free, state-of-the-art tools for array analysis.

Follow this Link to download the free statistical package R
Bioconductor software for bioinformatics, free, based on R
A great crash course about R, by Bates and Lumley
R packages for gene expression analysis
The analysis of gene expression data: Methods and Software. Table of contents of a forthcoming book and links to software sites
A comprehensive survey of microarray data analysis software. Link

First-time R users on Windows: to install R, go to CRAN, click on "Windows (95 and later)", then click on "base", then click on "SetupR.exe"; this will start the download. Once you have the exe file, click on it and it will install R. Once R is installed, click on it to start.

INSTRUCTOR

The instructor for this class is Giovanni Parmigiani. He is available via email at gp@jhu.edu, after class every Wednesday until 1pm, and by appointment.

RELATED LINKS

The web site of the microarray working group "Hopkins Expressionists", with links to software and papers by Hopkins faculty.

A description of and information for the course "Introduction to Bioinformatics" M.E:440.714