Statistical methods in genomic pathway analysis

Date
2014-12
Abstract

Identifying genomic pathways that are related to clinical phenotypes is an important concern in current biological research. I discuss several statistical frameworks that are well suited to microarray gene expression analysis. I address common issues with this type of analysis by offering new mathematical perspectives and by developing new statistical models that help explain the biological phenomena being studied. For the purpose of this thesis, I define a pathway as a collection or subset of genes with a common biological process, molecular function, or cellular location. Typically, over-representation analysis (ORA) methods are used to predict gene expression as a function of gene membership in a pathway and to rank pathways based on estimated expression levels and/or p-values. I start by showing that traditional hypergeometric ORA methods are fully described by, and can be considered a special case of, logistic regression methods. Logistic regression has the advantage that, although its models are simple, they are richer and describe the biological process more accurately. While logistic regression has been proposed before as an improvement over ORA, I prove the all-encompassing nature of the method and propose variants of regression aimed at different scenarios. Furthermore, logistic regression has a solid mathematical basis and produces results that have biological justification. I continue by developing a Bayesian hierarchical regression model that solves three important problems of ORA: it reduces type I error rates, it disentangles effects in cases of overlapping pathways, and it shrinks probability estimates depending on the length of the pathways, providing sensible estimates and rankings. Our method is able to emphasize pathways that are biologically relevant to the phenotype in a real study. Simulations further show when this method is preferable to over-representation analysis. I present the cases in which our method is more accurate in ranking relevant pathways and in which it is better at demoting pathways that are irrelevant to the phenotype.
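
As a rough illustration of the link between hypergeometric ORA and logistic regression described in the abstract, the sketch below works on a single pathway summarized as a 2x2 table (pathway membership by differential expression). It is not the thesis's proof or model; the function names and the toy counts are hypothetical and chosen only for illustration. It computes the upper-tail hypergeometric p-value used by ORA and, from the same four counts, the closed-form maximum-likelihood coefficients of a logistic regression with one binary predictor, whose slope equals the log odds ratio of the table.

```python
from math import comb, log


def ora_pvalue(n_genes, n_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genes   : total genes on the array
    n_pathway : genes annotated to the pathway
    n_de      : genes called differentially expressed (DE)
    n_overlap : DE genes that are also in the pathway
    """
    upper = min(n_pathway, n_de)
    favorable = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_genes, n_de)


def logistic_mle_binary(n_genes, n_pathway, n_de, n_overlap):
    """Closed-form MLE for logit P(DE) = b0 + b1 * member.

    With a single 0/1 predictor the model is saturated, so the fitted
    probabilities equal the observed proportions in each group.
    Assumes all four cells of the 2x2 table are nonzero.
    """
    a = n_overlap                  # in pathway, DE
    b = n_pathway - n_overlap      # in pathway, not DE
    c = n_de - n_overlap           # outside pathway, DE
    d = n_genes - n_pathway - c    # outside pathway, not DE
    b0 = log(c / d)                # log odds of DE outside the pathway
    b1 = log((a * d) / (b * c))    # log odds ratio for pathway membership
    return b0, b1


# Toy counts (hypothetical): 1000 genes, 50 in the pathway, 100 DE, 12 in both.
p_value = ora_pvalue(1000, 50, 100, 12)
b0, b1 = logistic_mle_binary(1000, 50, 100, 12)
print(f"ORA p-value: {p_value:.4g}, intercept: {b0:.3f}, slope: {b1:.3f}")
```

Both quantities are functions of the same contingency table, which is one informal way to see why a regression formulation can reproduce the ORA setting while leaving room for additional covariates or overlapping pathways.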

Keywords
Bayes, Pathway analysis, Regression