Application of advanced machine learning based approaches in cancer precision medicine
Dhruba, Saugato Rahman
Precision medicine aims to design therapies tailored to individual patient characteristics, both genetic and other relevant features. A major step in designing such individualized therapies is to develop models that can accurately predict a patient's response to an anticancer drug or drug combination from those characteristics, without actual drug administration. Because clinical data have limited availability, the usual practice is to learn these functional relationships from in vitro (i.e., cell line) experiments and then attempt to translate the models to the target in vivo scenarios. Researchers use various statistical techniques and diverse Machine Learning models to learn the functional dependence between large-scale genetic characterization ("-omics") data and drug response values. In this dissertation, I have presented a number of approaches based on advanced Machine Learning techniques to design suitable predictive models for estimating drug sensitivity in various scenarios. These approaches fall broadly into two categories: Transfer Learning based modeling techniques and Functional Regression modeling techniques. Transfer learning (TL) uses related but different tasks and aims to transfer the "knowledge" learned from one task to another to improve target task performance. For drug response modeling, this translates to using multiple pharmacogenomics studies to build robust predictive models that estimate drug response in a target domain, with the model itself built using data from the other domain. The motivation for TL is that suitable biological data are often too scarce to design a statistical model with reliable predictive capabilities.
Using data from multiple studies with significant overlaps, such as CCLE and GDSC, holds the promise of mitigating this small-sample issue, but is often difficult to implement because of the distribution shift between studies. I present multiple novel TL approaches for incorporating information from a source (secondary) database to improve prediction in a target (primary) space. The first TL approach generates mapping functions that transfer data from the target domain to the source domain and builds predictive models using the available source data. I have described two separate techniques to generate such maps: one uses polynomial regression mapping to produce one-to-one sample maps, where the degree of the polynomial is decided based on the consistency between datasets; the other is a more generalized distribution matching based map, in which the target distribution is matched to the source distribution, and which opens up a wide range of applications. I have demonstrated the performance of these mapping based domain transfer approaches in various scenarios and shown improvement over existing ML approaches as well as established TL approaches. Furthermore, I present a distribution mapping based transfer learning software framework, DMTL, developed as an R package to give researchers easy access to distribution mapping based domain transfer and the corresponding predictive modeling capabilities. I have shown the application of DMTL in various pharmacogenomic modeling scenarios, including cell lines, tumor cultures, and patient derived xenograft models. The second TL approach is based on a latent variable based cost optimization technique that can incorporate samples from two sources, such as CCLE and GDSC, together to improve the final predictive performance on GDSC.
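To make the distribution matching idea concrete, the following is a minimal Python sketch of one common way to match a target feature's distribution to a source distribution, namely empirical quantile mapping. It is an illustration under stated assumptions, not the DMTL implementation: the function name `match_distribution` and the simulated data are hypothetical, and the dissertation's actual mapping may be constructed differently.

```python
import numpy as np

def match_distribution(target, source):
    """Map each target value onto the source distribution by quantile matching.

    Hypothetical helper for illustration only; not the DMTL API.
    """
    target = np.asarray(target, dtype=float)
    source = np.asarray(source, dtype=float)
    # Empirical CDF position of each target sample within the target set.
    ranks = np.argsort(np.argsort(target))
    quantiles = (ranks + 0.5) / target.size
    # Inverse empirical CDF of the source, evaluated at those positions.
    return np.quantile(source, quantiles)

# Simulated stand-ins for one feature measured in two studies (assumed data).
rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, scale=0.5, size=500)  # target (primary) domain
source = rng.normal(loc=0.0, scale=1.0, size=800)  # source (secondary) domain

mapped = match_distribution(target, source)
print(mapped.mean(), mapped.std())
```

After mapping, the target samples keep their rank order but follow the source distribution, so a model trained on source data could in principle be applied to them.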
I have presented three such approaches using varying degrees of transcriptomics and drug sensitivity data and performed a comparative analysis among them. The analysis identified the incorporation of latent variables from two approaches as the best performer, while each approach showed superior performance over regular ML modeling efforts. The last TL approach is based on extracting common representations using the dimensionality reduction technique Principal Component Analysis (PCA). It showed that transcriptomics data from two different studies with significant overlaps, such as CCLE and GDSC, can possibly be represented by a common underlying set of basis vectors, which can subsequently be used to design predictive models that borrow information through the PCA coefficients and perform better than even multivariate modeling techniques such as Multivariate Random Forest. Functional regression is significant in scenarios that involve estimating the complete drug response profile following the application of a drug (or drug combination) over a range of drug concentrations and/or multiple time points. The majority of drug sensitivity modeling efforts model a characteristic summary metric of the dose-response curve, such as AUC or IC50, which often does not capture the full response behavior. Using suitable functional characterization data from studies such as CCLE, GDSC, and HMS-LINCS, I have demonstrated two approaches for functional response modeling. Functional random forest (FRF) is an extension of the popular Random Forest (RF) based modeling techniques that modifies the node cost calculation and prediction techniques from the functional perspective and provides dose-response curve predictions at various given doses using either baseline (static) or functional genetic characterizations.
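As a rough illustration of what a function-aware node cost could look like, the Python sketch below scores a candidate split by the integrated squared deviation of each sample's dose-response curve from the mean curve of its child node. This particular cost (squared error integrated over dose with the trapezoidal rule) is an assumption chosen for illustration; the names `functional_node_cost` and `best_split` and the toy data are hypothetical, and the dissertation's FRF cost functions may differ.

```python
import numpy as np

def functional_node_cost(curves, doses):
    """Sum over samples of the integrated squared deviation from the node mean curve."""
    curves = np.asarray(curves, dtype=float)  # shape (n_samples, n_doses)
    doses = np.asarray(doses, dtype=float)
    sq_dev = (curves - curves.mean(axis=0)) ** 2
    # Trapezoidal rule along the dose axis, written out explicitly.
    widths = np.diff(doses)
    integrals = ((sq_dev[:, 1:] + sq_dev[:, :-1]) / 2.0 * widths).sum(axis=1)
    return float(integrals.sum())

def best_split(feature, curves, doses):
    """Scan midpoint thresholds on one feature; return the (threshold, cost) with lowest cost."""
    feature = np.asarray(feature, dtype=float)
    curves = np.asarray(curves, dtype=float)
    best = (None, np.inf)
    values = np.unique(feature)
    for thr in (values[:-1] + values[1:]) / 2.0:
        left = feature <= thr
        cost = (functional_node_cost(curves[left], doses)
                + functional_node_cost(curves[~left], doses))
        if cost < best[1]:
            best = (float(thr), cost)
    return best

# Toy data (assumed): two resistant and two sensitive samples, four doses.
doses = np.array([0.0, 1.0, 2.0, 3.0])
feature = np.array([0.1, 0.2, 0.8, 0.9])
curves = np.array([
    [1.0, 0.9, 0.8, 0.7],
    [1.0, 0.9, 0.8, 0.7],
    [1.0, 0.5, 0.2, 0.1],
    [1.0, 0.5, 0.2, 0.1],
])

threshold, cost = best_split(feature, curves, doses)
print(threshold, cost)
```

In a tree built this way, a leaf would predict the mean curve of its training samples, so the output is a whole dose-response curve instead of a single summary value such as AUC.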
I have demonstrated the superior functional modeling capabilities of FRF in various scenarios, varying the number of dose-response points and the cost function calculations, and even in function-to-function regression scenarios with HMS-LINCS. Furthermore, I have shown the biological significance of the FRF results via a pathway analysis with STRING. The recursive hybrid model provides a generalized model for capturing the complete dose-time drug response behavior following drug administration. Using functional predictors, i.e., the dose-expression proteomics data from HMS-LINCS (which, to our knowledge, is the only publicly available source providing both functional predictors and corresponding functional response data), this model can predict the dose-response curve at the next time point using the dose-response curve from the immediately preceding time point. I have laid out some desirable properties for modeling dose-time drug response, along with the corresponding implications and limitations of this approach and a performance evaluation against the powerful RF methodology. The hybrid approach can follow the drug response behavior much more closely, making it valuable in clinical applications where patient responses are tracked over both drug concentrations and elapsed time. Overall, I have attempted to design robust predictive modeling approaches to improve in vitro drug sensitivity estimation and, ultimately, to translate these models to in vivo scenarios to facilitate precision therapy design for better health outcomes.
Embargo status: Restricted until 06/2022.