Application of multivariate tensor-based orthogonal polynomials to biological sequences
Abstract
An important goal in molecular biology is to quantify both the patterns across a biological sequence, such as DNA, RNA, or protein, and the relationship between phenotype and underlying sequence structure. As the volume of available genomic data grows, so does the need to identify patterns across sites in a sequence and ultimately to connect sequence structure to phenotype. We propose a multivariate tensor-based orthogonal polynomial approach to characterize nucleotides or amino acids in a given DNA/RNA or protein sequence.
We applied this method to a previously published case of small transcription activating RNAs (STARs) and used it to quantify higher-order relationships between parts of the terminator RNA sequence. We then mapped the regulatory activity of these RNA structures onto those higher-order relationships. This revealed an interplay between intramolecular interactions within the target RNA and intermolecular interactions between the STAR and the target RNA, and showed how these interactions affect the system’s regulatory activity.
This method not only captures higher-order interactions between different parts of a sequence but also quantifies the effect on the phenotype of having particular nucleotides at given sites along the sequence. We show proof of concept of this approach as applied to a case of regulatory RNA and subsequently demonstrate its application to transcription factor binding site data consisting of sequences and their corresponding binding energies. These two distinct applications illustrate the broad applicability of the method and its potential to quantify sequence-phenotype relationships in other biological systems. Furthermore, we have developed a command line tool that computes orthogonal polynomials for sequence data and projects the corresponding phenotypes onto this space. Used in conjunction with experimental validation, this tool can help build sequence-phenotype landscapes for a wide variety of applications in molecular biology.
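To make the idea of projecting a phenotype onto a sequence-derived space concrete, the following is a minimal sketch of the first-order case only. It is not the command line tool described above, and it rests on assumptions that do not come from the abstract: each site is one-hot encoded over a fixed RNA alphabet, the first-order term at a site is taken to be that encoding minus its population mean, and the projection is computed as a plain covariance with the phenotype. The names `ALPHABET`, `one_hot`, and `first_order_projection` are hypothetical. Higher-order terms, which would require tensor products of site vectors orthogonalized against the lower orders, are not shown.

```python
# Illustrative sketch only: first-order, mean-centered site encodings and
# their covariance with a phenotype. Not the authors' implementation.

ALPHABET = "ACGU"  # assumed RNA alphabet; swap in "ACGT" or amino acids as needed


def one_hot(seq):
    """Encode a sequence as a list of per-site indicator vectors."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def first_order_projection(seqs, phenotypes):
    """Return, for each site, the covariance between the phenotype and the
    mean-centered indicator of each letter at that site."""
    n = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs) or len(phenotypes) != n:
        raise ValueError("sequences must share one length and match phenotypes")

    encoded = [one_hot(s) for s in seqs]
    mean_f = sum(phenotypes) / n

    result = []
    for site in range(length):
        site_cov = {}
        for a, letter in enumerate(ALPHABET):
            # Population mean of the indicator for this letter at this site.
            mean_a = sum(encoded[k][site][a] for k in range(n)) / n
            # Covariance of the centered indicator with the centered phenotype.
            site_cov[letter] = sum(
                (encoded[k][site][a] - mean_a) * (phenotypes[k] - mean_f)
                for k in range(n)
            ) / n
        result.append(site_cov)
    return result


# Toy data (made up for illustration): the phenotype depends only on site 0.
toy_seqs = ["AC", "AG", "UC", "UG"]
toy_phenotypes = [1.0, 1.0, 0.0, 0.0]
toy_result = first_order_projection(toy_seqs, toy_phenotypes)
```

In this toy data set the phenotype tracks the nucleotide at the first site and is independent of the second, so the first-site covariances for A and U are nonzero with opposite signs while every second-site covariance is zero.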