My research focuses on developing statistical methodology and computational tools for the analysis of high-dimensional genomics data. Most of my work has revolved around experiments that use next-generation sequencing technologies to characterize the genomic basis of complex traits. On this page you'll find a descriptive overview of some of the problems I've worked on; see my CV or my Google Scholar page for a more complete list of my publications. You can also check out my GitHub page for open source software I've developed, and keep up with the latest news on Twitter.



Differential Distributions in single-cell RNA-seq experiments

With collaborators at UW's Morgridge Institute for Research, we developed a method to detect differences in expression from single-cell RNA-seq experiments that explicitly accounts for the possibility of multiple distinct expression states within and among biological conditions. The ability to detect distinct expression states is a key advantage of single-cell technologies, since measurements do not represent averages over populations of cells, as they do in bulk RNA-seq. In contrast to traditional differential expression analyses, which generally assume differences can be characterized by a mean shift, we use flexible nonparametric Bayesian mixture modeling to detect subgroups of cells expressing a given gene and determine whether they differ by condition. In addition to detecting genes with subtle and complex differences in expression among single cells, the framework provides a summary of the key features that can differ between two conditions by classifying genes into meaningful patterns. The predominant features represented by the patterns are differences in means, differences in the number of subgroups, and differences in the proportion of cells belonging to each subgroup. Compared to methods that do not account for distinct expression states, our method shows increased sensitivity in simulation, and is able to detect and classify changes in key pluripotency genes and cell cycle regulators when comparing differentiated cells to embryonic stem cell lines.
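To make the limitation of mean-shift analyses concrete, here is a minimal toy sketch in Python. It is not the scDD method: it swaps in a plain two-sample Kolmogorov-Smirnov statistic for the nonparametric Bayesian mixture model, and the expression values are made up for illustration. It only shows that two conditions can share the same mean while one has a single expression state and the other has two.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    # The gap between two step functions is maximized at a jump point,
    # so it is enough to check every observed value.
    for point in sorted(set(xs) | set(ys)):
        cdf_x = bisect_right(xs, point) / len(xs)
        cdf_y = bisect_right(ys, point) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


# Hypothetical log-expression values for one gene (not real data).
# Condition A: a single expression state centered at 5.
condition_a = [5.0 + 0.1 * k for k in range(-10, 11)]
# Condition B: two expression states centered at 2 and 8, half the cells in each.
condition_b = ([2.0 + 0.1 * k for k in range(-5, 6)]
               + [8.0 + 0.1 * k for k in range(-5, 6)])

mean_diff = mean(condition_a) - mean(condition_b)
ks = ks_statistic(condition_a, condition_b)

print("difference in means:", round(mean_diff, 6))
print("KS statistic:", ks)
```

In this toy example the difference in means is essentially zero, so a test based on a mean shift has nothing to detect, while the distributional statistic is 0.5 because half of the cells in condition B sit entirely below every cell in condition A.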

DDdiagram

The scDD R package is available on GitHub. The manuscript is published in Genome Biology. Here are my slides from JSM 2016, where I presented on this work in the ASA Biometrics Section Travel Award Session.



MADGiC: A Model-based Approach for identifying Driver Genes in Cancer

We know that cancer arises from the accumulation of genetic alterations that provide a selective advantage to a cancer cell (drivers), but identifying which changes provide that advantage is a difficult and open problem. Alterations that are irrelevant to the disease process (passengers) also occur by chance, and the key challenge is to separate these two classes of alterations. We have developed a statistical method to address this problem.

MADGiC prior


As we detail in the manuscript published in Bioinformatics, existing statistical methods for identifying driver genes in cancer rely primarily on frequency-based criteria (i.e., identifying driver genes as those showing higher mutation rates than expected by chance). However, recent studies have identified many other properties of drivers, such as increased functional impact, enrichment for specific mutations, and highly structured spatial patterns, that have not yet been utilized in statistical approaches. Our approach incorporates all three of these criteria and in doing so shows substantially increased power (with a well-controlled false discovery rate) over competing methods in simulation studies.

The R package MADGiC fits an empirical Bayesian hierarchical model to obtain posterior probabilities that each gene is a driver. The model accounts for (1) frequency of mutation compared to a sophisticated background model that accounts for gene-specific factors in addition to mutation type and nucleotide context, (2) predicted functional impact (in the form of SIFT scores) of each specific change, and (3) positional patterns in mutations that have been deposited into the COSMIC (Catalogue of Somatic Mutations in Cancer) database. Example data from The Cancer Genome Atlas (TCGA) project ovarian cohort is provided.
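The idea of a posterior probability that a gene is a driver can be sketched with a much simpler two-component mixture. The Python snippet below is an illustration only, not the MADGiC model: it uses mutation counts alone, assumes Poisson distributions for both classes, and fixes the prior and the two rates at made-up values, whereas an empirical Bayesian approach estimates such quantities from the data and MADGiC also uses functional impact and positional patterns.

```python
import math


def poisson_pmf(count, rate):
    """Probability of observing `count` events under a Poisson(rate) model."""
    return math.exp(-rate) * rate ** count / math.factorial(count)


def posterior_driver(count, prior, bg_rate, driver_rate):
    """Bayes' rule for a two-class mixture: P(driver | observed mutation count)."""
    driver_term = prior * poisson_pmf(count, driver_rate)
    passenger_term = (1.0 - prior) * poisson_pmf(count, bg_rate)
    return driver_term / (driver_term + passenger_term)


# All three values are hypothetical and chosen only for illustration.
PRIOR = 0.01        # assumed prior probability that a gene is a driver
BG_RATE = 1.0       # assumed expected mutation count for a passenger gene
DRIVER_RATE = 5.0   # assumed expected mutation count for a driver gene

posteriors = [posterior_driver(k, PRIOR, BG_RATE, DRIVER_RATE) for k in range(11)]

for k, p in enumerate(posteriors):
    print(f"{k:2d} mutations -> posterior driver probability {p:.4f}")
```

Under these assumptions, a gene with no observed mutations ends up with a posterior below the prior, and the posterior rises steadily as the mutation count exceeds what the background rate would explain.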

The latest version of the MADGiC R package is available on GitHub. Alternatively, the source code can be found here.



Predicting Cancer Subtypes using survival-supervised Latent Dirichlet Allocation Models

Latent Dirichlet allocation models, also referred to as topic models, are commonly applied to text corpora to discover underlying themes in the text. Supervised versions of topic models do so while relating topics to an outcome of interest. Instead of literal documents, we applied supervised topic models to diverse genomic data in order to discover cancer subtypes. Here, a topic represents a collection of co-occurring genomic features.

survLDA
Personalized genomic medicine aims to predict clinical response using genomic predictors, but integration of multiple data types (e.g. gene expression, methylation, SNP genotypes) remains a challenge. Using a novel translation of diverse genomic information to construct patient-specific 'documents', we are able to discover collections of genomic features in cancer patients that are related to survival. As we demonstrate in simulation studies, this type of inference is feasible even with modest sample sizes. However, more exploration is needed to determine the optimal way to translate the genomic features into 'documents'.
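As a rough illustration of what translating genomic features into a 'document' could look like, here is a minimal Python sketch. The encoding is hypothetical and is not the translation used in the book chapter: each feature is compared against made-up low and high cutoffs, and a word such as `GENE_A_expr_high` is added when the value is extreme. The feature names, values, and thresholds are all assumptions.

```python
def patient_to_document(features, thresholds):
    """Turn one patient's numeric features into a list of word tokens.

    A feature below its low cutoff yields '<name>_low', a feature above its
    high cutoff yields '<name>_high', and anything in between adds no word.
    """
    words = []
    for name, value in sorted(features.items()):
        low, high = thresholds[name]
        if value < low:
            words.append(f"{name}_low")
        elif value > high:
            words.append(f"{name}_high")
    return words


# Hypothetical features for one patient, mixing data types (not real data).
patient = {
    "GENE_A_expr": 9.1,   # expression well above the assumed high cutoff
    "GENE_B_meth": 0.05,  # methylation below the assumed low cutoff
    "GENE_C_expr": 5.0,   # unremarkable value, contributes no word
}

# Hypothetical (low, high) cutoffs per feature.
cutoffs = {
    "GENE_A_expr": (3.0, 8.0),
    "GENE_B_meth": (0.2, 0.8),
    "GENE_C_expr": (3.0, 8.0),
}

document = patient_to_document(patient, cutoffs)
print(document)
```

A collection of such word lists, one per patient, is the kind of input a topic model expects. Choices like the cutoffs or whether to repeat words in proportion to how extreme a value is are exactly the open questions about the translation noted above.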

Book Chapter: See our book chapter for more information (chapter 18 in Advances in Statistical Bioinformatics, 2013).

Last updated February 2017