Analytical mathematics & machine learning - A hybrid approach to deciphering patterns in genomic data
Abstract
An important goal in molecular biology is to quantify both the patterns across a genomic sequence, such as DNA, RNA, or protein, and the relationship between phenotype and underlying sequence structure. As more genomic data become available, there is a growing need to identify patterns across sites in a sequence and ultimately connect sequence structure to the resulting phenotype. Current methods that attempt to solve this problem are largely computational approaches, such as machine learning (ML) algorithms. Here, we present a further application of an analytical method we previously established, applying it to transcription factor binding site data comprising sequence information and corresponding binding energies. This method is the multivariate tensor-based orthogonal polynomial (OP) approach, which represents the nucleotides or amino acids in a given DNA/RNA or protein sequence as vectors and builds a sequence space onto which phenotypes can be mapped.
This method not only captures higher-order interactions between different parts of a sequence but also quantifies the effect of the phenotype on having particular nucleotides at given sites along the sequence, at first and higher orders. We compare this method to a previously published deep neural network (NN) that aimed to understand the binding affinity landscape of two model yeast transcription factors (TFs). In addition, we applied the OP method to NN-derived binding affinities, which yielded further insight into the binding energy landscape of these TFs. We argue that combining an analytical approach such as the OP method with a computational approach such as deep learning yields more biological insight into a given system than either method alone. The analyses presented here showcase the wide applicability of this hybrid approach and its ability to quantify sequence-phenotype relationships in other biological systems. Furthermore, we developed a command line tool that computes orthogonal polynomials for sequence data and projects corresponding phenotypes onto this space. This tool can be used in conjunction with experimental validation to build sequence-phenotype landscapes for a wide variety of applications in molecular biology.
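To make the idea of representing nucleotides as vectors and mapping a phenotype onto the resulting sequence space more concrete, the following is a minimal first-order sketch only. It is not the command line tool described above and does not reproduce the OP method's higher-order terms, which require orthogonalising tensor products across sites. The function names `one_hot` and `first_order_projection`, the toy sequences, and the toy binding values are all hypothetical and chosen for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed DNA alphabet; a protein alphabet would work the same way


def one_hot(seqs):
    """Encode equal-length DNA sequences as an array of shape (n_seqs, n_sites, 4)."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    out = np.zeros((len(seqs), len(seqs[0]), len(ALPHABET)))
    for s, seq in enumerate(seqs):
        for site, base in enumerate(seq):
            out[s, site, index[base]] = 1.0
    return out


def first_order_projection(seqs, phenotypes):
    """Covariance between the phenotype and each nucleotide indicator at each site.

    Hypothetical simplification: the first-order term for a site is taken to be
    the nucleotide vector minus its mean over the data set, and the phenotype is
    projected onto it by a plain covariance. Returns shape (n_sites, 4).
    """
    x = one_hot(seqs)
    y = np.asarray(phenotypes, dtype=float)
    x_centred = x - x.mean(axis=0)  # deviation of each site vector from its mean
    y_centred = y - y.mean()
    # Sum over sequences (s), keeping site (j) and nucleotide (k) axes.
    return np.einsum("s,sjk->jk", y_centred, x_centred) / len(seqs)


# Toy data: the phenotype depends only on whether site 0 is 'A'.
toy_seqs = ["AC", "AG", "TC", "TG"]
toy_phenotypes = [1.0, 1.0, 0.0, 0.0]
projection = first_order_projection(toy_seqs, toy_phenotypes)
```

In this toy case the projection is non-zero only at site 0, for the nucleotides A and T, which is the kind of site-level signal a first-order analysis is meant to expose. Interactions between sites would not be visible at this order.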
Embargo status: Restricted until January 2022.