Data representation and machine learning methodologies for drug discovery and precision medicine
Abstract
Drug discovery is a long, expensive, and risky process. According to a recent survey, the median cost of developing a new drug is estimated at 1.1 billion dollars, and it takes approximately 12 years to go from discovery to market. Efforts to speed up the drug discovery process, reduce costs, and increase efficiency are therefore highly desirable. Scientists are turning to new technologies, including artificial intelligence (AI) and machine learning (ML), which have shown great potential in drug discovery. With advances in computational technology, data availability, and new models, ML in drug discovery is on the brink of unprecedented progress. Data in the biomedical field have unique types and characteristics that require careful consideration when ML is applied. For instance, the human genome contains more than 20,000 genes, and genomics data are often noisy and high-dimensional, while the number of patients or cell lines is usually limited. Proteins and peptides can be viewed as both sequences and molecules, and how best to model them is still under debate. Modeling the chemical structures of small molecules directly is desirable; however, there is still no common understanding of how to interpret these structures. In this dissertation, we focus on three topics in drug discovery and precision medicine. (1) Precision medicine for oncology drugs. Precision medicine involves selecting personalized treatment plans or drugs for individuals based on their specific characteristics, mostly genetic traits. Scientists are looking for better ways to identify these traits, and ML can help by identifying patterns in genomic data and predicting drug responses. (2) Functional prediction of proteins and peptides. Proteins and peptides are of great interest for understanding biological processes and have recently emerged as therapeutics. Protein sequences are believed to contain all the information about protein properties, and predicting their functions using AI is a desirable goal. (3) Modeling of small molecules. Predicting properties from structures is a long-standing aspiration of drug chemists. Small molecules are chemical structures represented by atoms and bonds. Understanding how to convert these structures into machine-readable data formats and how to model them is a significant research problem. We discuss the effectiveness of data representation methods in light of the characteristics of the data, and explore model frameworks tailored to different data types. This dissertation starts with a brief introduction to the drug discovery process and the applications of AI in drug discovery. In Chapter 2, we introduce the post-prediction covariance alignment - ranking transformation (PPCA-RT) framework for oncology drug sensitivity prediction. PPCA-RT incorporates the covariance of target responses into modeling and is shown to improve predictive power. In Chapter 3, we introduce REFINED as a data representation and visualization tool. In Chapter 4, we introduce a similarity-based prediction framework named topological regression (TR), which is designed for data types on which only similarities are defined. In Chapters 5 and 6, we discuss the effectiveness of data representation and modeling methods for protein sequences and molecules, and illustrate the advantages of TR in these two problems.
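To make the idea of predicting from similarities alone more concrete, the following is a minimal sketch of a generic similarity-weighted nearest-neighbor predictor. It is not the topological regression method described in the dissertation, whose details are not given in this abstract; the function name `similarity_weighted_predict`, its parameters, and the toy numbers are all assumptions made for illustration. It only shows that a response can be estimated when the sole available input is a matrix of similarities to items with known responses.

```python
import numpy as np


def similarity_weighted_predict(sim_to_train, y_train, k=3):
    """Estimate responses from similarities to training items (hypothetical helper).

    sim_to_train: (n_query, n_train) array of non-negative similarities.
    y_train:      (n_train,) array of known responses.
    k:            number of most-similar training items to average over.
    """
    sim = np.asarray(sim_to_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    k = min(k, sim.shape[1])

    # Indices of the k most similar training items for each query
    # (stable sort so that ties are broken by original order).
    top = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    w = np.take_along_axis(sim, top, axis=1)

    # Normalize weights per query; fall back to uniform weights
    # when a query has zero similarity to all selected neighbors.
    row_sums = w.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    w = np.where(row_sums > 0, w / safe_sums, 1.0 / k)

    # Weighted average of the neighbors' known responses.
    return (w * y[top]).sum(axis=1)


if __name__ == "__main__":
    # Toy data: two query items, three training items (made-up numbers).
    toy_sim = [[0.9, 0.1, 0.0],
               [0.0, 0.0, 0.0]]
    toy_y = [1.0, 3.0, 5.0]
    print(similarity_weighted_predict(toy_sim, toy_y, k=2))
```

The sketch never touches raw features, only pairwise similarities, which is the setting the abstract describes for data such as sequences or molecular structures where a similarity measure is easier to define than a fixed-length feature vector.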
Embargo status: Restricted until 06/2028.