Recent advances in high-throughput sequencing (HTS) technologies provide opportunities to study genome structure, function, and evolution at an unprecedented scale, and are profoundly transforming genomic research. However, fully realizing the potential of HTS technologies requires sophisticated data analysis methods. This research project is aimed at developing efficient computational methods for reconstructing the full spectrum of haplotype sequences from HTS data. Working in collaboration with molecular biologists from the University of Connecticut Health Center and the Centers for Disease Control, the investigators will develop methods enabling three novel applications of HTS, namely (a) reconstruction of diploid genome sequences, including complete haplotype sequences of each CNV copy, (b) reconstruction of alternative splicing isoform sequences and their frequencies, and (c) reconstruction of viral quasispecies sequences and their frequencies. Major outcomes of the project will include the development of a comprehensive analytical toolkit for these problems, and high-quality open source software implementations that will be made available free of charge to the research community. The project will provide opportunities for participation of undergraduate and graduate students in bioinformatics research at UCONN and Georgia State University, and will especially encourage participation of women and underrepresented groups.
Recent advances in high-throughput sequencing (HTS) technologies have provided new opportunities to study genome structure, function, and evolution at an unprecedented scale, and are profoundly transforming genomic research. However, fully realizing the potential of HTS technologies requires sophisticated data analysis methods. In this project we aimed to develop efficient combinatorial algorithms for reconstructing the full spectrum of haplotype sequences from HTS data, specifically: reconstruction of diploid genome sequences, including complete haplotype sequences of each CNV copy; reconstruction of alternative splicing isoform sequences and their frequencies; and reconstruction of viral quasispecies sequences and their frequencies.
The new sequencing technologies have enabled cost-effective shotgun sequencing of individual genomes. Sequencing can be used to discover new single nucleotide polymorphisms (SNPs) and other forms of sequence variation, such as small insertions and deletions, copy number variants, and genome rearrangements, thus providing a complete picture of individual genome variation. The ideal outcome of such a sequencing project is the diploid genome of the individual, i.e., full haplotype sequences for the individual’s maternal and paternal chromosomes, since haplotype sequences provide the detailed context required for accurate predictions of translation in protein-coding regions. We introduced a novel problem formulation for the single individual haplotyping problem, similar to the well-known max-cut problem. Our algorithm first finds the best cut using a heuristic algorithm for max-cut and then builds haplotypes consistent with that cut. Experiments conducted on both simulated and real sequencing data show that this algorithm runs significantly faster than previous methods without loss of accuracy.
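To make the cut-then-build idea concrete, the following is a minimal sketch and not the project's actual algorithm. It assumes that fragments are given as dictionaries mapping SNP position to allele (0 or 1), that the edge weight between two fragments is the number of mismatches minus the number of matches at shared positions, and that a simple single-vertex local search stands in for the max-cut heuristic. The function names `edge_weight`, `local_search_max_cut`, and `build_haplotypes` are hypothetical.

```python
from itertools import combinations


def edge_weight(a, b):
    """Mismatches minus matches over SNP positions covered by both fragments."""
    shared = a.keys() & b.keys()
    return sum(1 if a[p] != b[p] else -1 for p in shared)


def local_search_max_cut(fragments):
    """Heuristic max-cut: flip single fragments while the cut weight improves."""
    n = len(fragments)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = edge_weight(fragments[i], fragments[j])

    side = [0] * n  # start with every fragment on side 0 (cut weight 0)
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Flipping v adds its same-side edges to the cut and removes
            # its cross edges, so the gain is their signed sum.
            gain = sum(
                w[v][u] if side[u] == side[v] else -w[v][u]
                for u in range(n)
                if u != v
            )
            if gain > 0:  # strict improvement guarantees termination
                side[v] ^= 1
                improved = True
    return side


def build_haplotypes(fragments, side):
    """Majority vote per SNP position, consistent with the given cut."""
    votes = {}
    for frag, s in zip(fragments, side):
        for pos, allele in frag.items():
            # A fragment on side 1 supports the complementary allele
            # for the haplotype associated with side 0.
            a = allele if s == 0 else 1 - allele
            votes.setdefault(pos, [0, 0])[a] += 1
    positions = sorted(votes)
    hap0 = [0 if votes[p][0] >= votes[p][1] else 1 for p in positions]
    hap1 = [1 - a for a in hap0]  # biallelic SNPs: the other haplotype is the complement
    return positions, hap0, hap1
```

Fragments placed on opposite sides of the cut are treated as originating from different chromosome copies, so each side contributes votes to one haplotype. Real data would additionally require handling of sequencing errors, base qualities, and disconnected blocks of SNPs, none of which this sketch attempts.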
Among other applications, we have incorporated the algorithm in a multi-step bioinformatics analysis pipeline for predicting tumor-specific epitopes that can be used in cancer immunotherapy.
Massively parallel whole transcriptome sequencing has become the technology of choice for transcriptome analysis, since it supports a wider range of applications than the previously popular microarray technology. This project has focused on two of these applications, namely transcriptome reconstruction and quantification. For transcriptome reconstruction, the main outcome is a statistical genome-guided software tool called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates the fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. Experimental results on both real and synthetic datasets show that TRIP is more accurate than methods that ignore fragment length distribution information. For transcriptome quantification, the main outcomes are two Expectation-Maximization (EM) algorithms, one for the RNA-Seq and one for the Digital Gene Expression (DGE) sequencing protocol. Both algorithms take into account alternative splicing and mapping ambiguities. Experimental results on real datasets, comparing the two protocols as well as methods for each protocol, show that the EM algorithms outperform other available methods for both RNA-Seq and DGE, and that they yield comparable quantification accuracy on real data generated using the two protocols.
Many clinically relevant viruses, including hepatitis C virus (HCV) and human immunodeficiency virus (HIV), exhibit high genomic diversity within infected hosts, which may explain the failure of vaccines and resistance to existing antiviral therapies. Characterizing the viral population infecting a host requires reconstructing all co-existing (related, but non-identical) viral variants, referred to as quasispecies, and inferring their relative abundances.
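Estimating relative abundances from reads that are compatible with more than one candidate sequence, whether alternative splicing isoforms or closely related viral variants, is commonly approached with expectation-maximization. The following is a minimal sketch of that general technique and not the project's implementation. It assumes that each read is represented only by the set of candidate sequences it is compatible with, and that a read is equally likely to start at any position of its source sequence. The function name `em_quantify` and its parameters are hypothetical.

```python
def em_quantify(reads, lengths, n_iter=100):
    """Estimate abundances from ambiguously assigned reads with EM.

    reads   -- list of sets; each set holds the indices of the candidate
               sequences that one read is compatible with
    lengths -- list of candidate sequence lengths
    Returns (alpha, freqs): alpha[t] is the estimated fraction of reads
    originating from sequence t, freqs[t] its length-normalized frequency.
    """
    n = len(lengths)
    alpha = [1.0 / n] * n  # uniform starting point

    for _ in range(n_iter):
        expected = [0.0] * n
        # E-step: split each read among its compatible sequences in
        # proportion to alpha[t] / lengths[t].
        for compat in reads:
            weights = {t: alpha[t] / lengths[t] for t in compat}
            total = sum(weights.values())
            if total == 0:
                continue
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: renormalize the expected read counts.
        assigned = sum(expected)
        if assigned == 0:
            break
        alpha = [c / assigned for c in expected]

    # Convert read fractions to per-sequence frequencies by dividing
    # out the length bias (longer sequences generate more reads).
    per_base = [a / l for a, l in zip(alpha, lengths)]
    norm = sum(per_base)
    freqs = [x / norm for x in per_base]
    return alpha, freqs
```

For example, with two candidates of equal length, six reads unique to the first, two unique to the second, and four compatible with both, the estimates settle at read fractions of 0.75 and 0.25, because the ambiguous reads are apportioned according to the evidence from the unique ones.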
Next-generation sequencing is a promising approach for characterizing viral diversity due to its ability to generate large numbers of reads at low cost. However, standard assembly software was originally designed for assembling a single genome and cannot be used to assemble multiple closely related quasispecies sequences and estimate their abundances. The project has focused on the problem of reconstructing viral quasispecies populations from next-generation sequencing reads generated using the two most commonly used strategies: shotgun sequencing and sequencing of partially overlapping PCR amplicons. The main outcomes of the project are two software tools: Viral Spectrum Assembler (ViSpA), designed for shotgun reads, and Viral Assembler (VirA), which handles amplicon reads. Both tools have been tested on simulated and real read data from HCV, HIV (ViSpA) and HBV (VirA) quasispecies, and shown to compare favorably with other existing methods.
Educational activities supported by this award have included developing several sequencing-related projects for graduate-level courses taught by PI Zelikovsky. These courses cover several computational problems related to high-throughput sequencing, including short-read mapping, sequence assembly, and genetic variation calling. Two PhD students have defended their theses and continued their training as postdocs at other universities. Six other PhD students, including three women (one of whom is of African descent), have passed qualification exams, co-authored papers, and had the opportunity to present at several international conferences. Their participation in research activities related to this project has given them the chance to develop a deep understanding of the underlying biological processes, to acquire research skills in mathematical modeling, statistical analysis, and algorithm design, analysis, and implementation, and to enhance their ability to work effectively across disciplinary lines.