Ebola is an RNA virus with a high mutation rate. The genetic diversity of RNA viruses enables them to adapt to varying conditions over the course of an infection and to keep proliferating. Estimating this diversity is essential for understanding the origin and mutation patterns of these viruses and for developing effective drug treatments. A viral population is characterized by the sequences and frequencies of the genomes that comprise it. High-throughput DNA sequencing technologies enable fast and affordable analysis of viral genomes. However, the sequencing errors and limited read lengths of high-throughput platforms make estimating viral genetic diversity challenging.
The aim of this research is to develop novel algorithms for determining and analyzing the genetic diversity of RNA viruses and to apply them to the analysis of the Ebola virus. The investigator specifically aims to: (1) Develop a correlation clustering framework and computationally efficient methods for estimating viral genetic diversity from high-throughput sequencing data. In this line of research, reconstruction of viral genomes is cast as a max-k-cut problem and solved efficiently using semidefinite programming. (2) Design graphical models and belief propagation algorithms for inferring the individual viral genomes in a diverse population analyzed with high-throughput sequencing technologies. The focus of this research thrust is on scalable message-passing methods for estimating viral genetic diversity. (3) Apply the developed methods to analyze the diversity of the Ebola virus using publicly available high-throughput sequencing data. The results of the outlined work are expected to have an immediate impact on the understanding of Ebola outbreak mechanisms and virus mutation patterns.
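To make the max-k-cut formulation in aim (1) concrete, the following is a minimal Python sketch under stated assumptions, not the investigator's method. It assumes reads are already aligned to a common reference and given as (start, sequence) pairs, scores each pair of reads by mismatches minus matches at polymorphic positions, and partitions the reads into k groups so that the total weight across groups is large. For brevity it uses a simple local search in place of the semidefinite programming approach named above; all function names and the weighting scheme are hypothetical.

```python
from itertools import combinations


def variable_positions(reads):
    """Return positions where the covering reads show more than one base."""
    seen = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            seen.setdefault(start + offset, set()).add(base)
    return {pos for pos, bases in seen.items() if len(bases) > 1}


def conflict_weight(read_a, read_b, sites):
    """Mismatches minus matches at polymorphic sites covered by both reads."""
    (start_a, seq_a), (start_b, seq_b) = read_a, read_b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    weight = 0
    for pos in range(lo, hi):
        if pos in sites:
            same = seq_a[pos - start_a] == seq_b[pos - start_b]
            weight += -1 if same else 1
    return weight


def local_search_max_k_cut(reads, k, max_sweeps=100):
    """Heuristic max-k-cut: move one read at a time while the cut improves.

    Assumption: this local search stands in for the SDP-based solver and
    only reaches a local optimum.
    """
    n = len(reads)
    sites = variable_positions(reads)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = conflict_weight(reads[i], reads[j], sites)

    labels = [i % k for i in range(n)]  # arbitrary starting partition
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            # Weight that would stay uncut if read i sat in each cluster.
            inside = [0] * k
            for j in range(n):
                if j != i:
                    inside[labels[j]] += w[i][j]
            best = min(range(k), key=lambda c: (inside[c], c))
            if inside[best] < inside[labels[i]]:
                labels[i] = best  # strictly increases the cut weight
                moved = True
        if not moved:
            break
    return labels


# Toy input: three error-free reads from each of two made-up genomes
# that differ at positions 2 and 7.
example_reads = [
    (0, "ACGTAC"), (2, "GTACGT"), (4, "ACGTAC"),
    (0, "ACCTAC"), (2, "CTACGA"), (4, "ACGAAC"),
]
example_labels = local_search_max_k_cut(example_reads, k=2)
```

On this toy input the reads from each genome end up sharing a label. In a fuller pipeline, a consensus sequence per group would give the reconstructed genomes and the group sizes would estimate their frequencies.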