BIOSTATISTICS 140.688: STATISTICS FOR GENE EXPRESSION, Spring 2003
FINAL EXAM

The final exam is a take-home data analysis project. The experiment to be analyzed consists of 197 lung cancer-related samples, each hybridized to one array. The data are from Bhattacharjee et al. (2001).

Each sample belongs to one of four known classes. The goal is to construct and evaluate a classification algorithm that predicts class from the expression profiles. I have divided the samples into a training set of 131 arrays and a validation set of 66 arrays. The data file is here:

     Bhattacharjee2001.dput

The file is an R dump and can be loaded using dget. It contains expression matrices X.train and X.valid and phenotype vectors Y.train and Y.valid. The expression matrices have one row per gene and one column per sample. Entries are expression measures and are already normalized. The phenotype vectors have lengths 131 and 66, respectively.
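As a quick check after loading, the sketch below shows how the dimensions described above can be verified in R. It is only an illustration: it assumes that dget returns a named list with elements X.train, X.valid, Y.train and Y.valid, and it uses a small simulated stand-in (toy, toy.file and n.genes are hypothetical names) so that it runs without the real file.

```r
# Stand-in for Bhattacharjee2001.dput: a small simulated list with the same
# element names. ASSUMPTION: the real file holds a named list like this one.
set.seed(1)
n.genes <- 50  # hypothetical; the real number of genes is not stated here
toy <- list(
  X.train = matrix(rnorm(n.genes * 131), nrow = n.genes),
  X.valid = matrix(rnorm(n.genes * 66), nrow = n.genes),
  Y.train = sample(1:4, 131, replace = TRUE),
  Y.valid = sample(1:4, 66, replace = TRUE)
)
toy.file <- tempfile(fileext = ".dput")
dput(toy, file = toy.file)

# For the real data, replace toy.file with the path to Bhattacharjee2001.dput.
dat <- dget(toy.file)

# Sanity checks: genes in rows, samples in columns, one label per sample.
stopifnot(
  nrow(dat$X.train) == nrow(dat$X.valid),
  ncol(dat$X.train) == length(dat$Y.train),
  ncol(dat$X.valid) == length(dat$Y.valid)
)

dim(dat$X.train)    # genes x training samples
table(dat$Y.train)  # number of training samples per class
```

If the real file turns out to be structured differently, str(dat) shows what dget actually returned.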

You can use any approach you like and any software you like. Many of the packages under "data mining" in this list have classification tools.

Please develop your classifier in the training samples, and report to me how you developed your classifier and how it performs on the validation set. You can present more than one classifier (although one is enough). The only rule is this: please do not use the validation sample to train the classifier, and do not go back and retrain the classifier if it does poorly in the validation sample. I will not grade you on how well you do on the validation sample.
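To make the rule concrete, here is a minimal sketch in R of one possible workflow that keeps the validation samples out of training. It is not a recommended method, only an example: the gene filter (a one-way ANOVA F statistic), the nearest-centroid classifier, the number of selected genes, the class labels and all of the data are assumptions made up for illustration. Gene selection and centroid estimation use the training set only, and the validation set is used once, at the end.

```r
# Simulated stand-in data with the same layout as the exam data
# (genes in rows, samples in columns). All values here are made up.
set.seed(2)
n.genes <- 200; n.train <- 131; n.valid <- 66
classes <- c("A", "B", "C", "D")  # hypothetical class labels
Y.train <- factor(sample(classes, n.train, replace = TRUE), levels = classes)
Y.valid <- factor(sample(classes, n.valid, replace = TRUE), levels = classes)

# Each class has 10 marker genes whose expression is shifted upward by 2.
simulate <- function(y) {
  X <- matrix(rnorm(n.genes * length(y)), nrow = n.genes)
  for (k in seq_along(classes)) {
    rows <- (10 * (k - 1) + 1):(10 * k)
    X[rows, y == classes[k]] <- X[rows, y == classes[k]] + 2
  }
  X
}
X.train <- simulate(Y.train)
X.valid <- simulate(Y.valid)

# Step 1 (training set only): rank genes by a one-way ANOVA F statistic.
f.stat <- apply(X.train, 1, function(g) anova(lm(g ~ Y.train))[1, "F value"])
top <- order(f.stat, decreasing = TRUE)[1:40]

# Step 2 (training set only): class centroids over the selected genes.
centroids <- sapply(classes, function(k) {
  rowMeans(X.train[top, Y.train == k, drop = FALSE])
})

# Step 3: assign each sample to the class with the nearest centroid,
# using squared Euclidean distance on the selected genes.
nearest_centroid <- function(X) {
  d <- apply(X[top, , drop = FALSE], 2, function(s) colSums((centroids - s)^2))
  factor(classes[apply(d, 2, which.min)], levels = classes)
}

# Step 4: evaluate once on the validation set; do not retrain afterwards.
pred <- nearest_centroid(X.valid)
confusion <- table(truth = Y.valid, predicted = pred)
error.rate <- mean(pred != Y.valid)
confusion
error.rate
```

Any tuning, such as choosing how many genes to keep, would be done by cross-validation within the training set so that the validation set stays untouched until the final evaluation.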

The deadline for the final is Friday, May 23rd, by 11:30pm. Please email your project.