The genetic code not only determines the amino acid sequence of proteins but also defines the 'splicing code' of cis- and trans-acting regulatory elements that control pre-mRNA splicing. Single nucleotide variants (SNVs) in key regions of the pre-mRNA may disrupt splicing, resulting in disease [1, 2]. Distinguishing SNVs that cause aberrant splicing from those that are benign is important for understanding disease pathogenesis. SNVs at consensus splice sites, located at exon-intron junctions, are known to cause aberrant splicing and contribute to at least 10% of inherited diseases [2]. However, SNVs outside consensus splice sites can also disrupt splicing [3]. Current bioinformatics tools limit analysis to SNVs at or near consensus splice sites and do not generalize to SNVs beyond them [4-7]. In this application, I propose to substantially improve the ability to interpret the consequences of mutations on pre-mRNA splicing. This goal will be achieved by: 1) developing novel features useful in predicting the impact of variation on cis-splicing regulation; 2) training a supervised machine learning algorithm that uses the novel features to predict the impact of SNVs; 3) sharing the algorithm in a publicly available software package; and 4) comparing algorithm predictions to the relationships between SNVs and splicing patterns derived from matched DNA- and RNA-sequencing studies.
Genetic sequences not only encode the amino acids of proteins but also regulate many critical biological functions, including pre-mRNA splicing. The impact of genetic variation on splicing is not well understood. The goal of this research project is to computationally identify features of variants that are useful in predicting aberrant splicing, incorporate those features into a machine learning algorithm, and test the utility of the predictions using publicly available sequencing studies.
Douville, Christopher; Springer, Simeon; Kinde, Isaac et al. (2018) Detection of aneuploidy in patients with cancer through amplification of long interspersed nucleotide elements (LINEs). Proc Natl Acad Sci U S A 115:1871-1876