In this project, we propose to develop computational methods and tools for whole-genome haplotyping and small variant calling using long-read sequencing technologies, such as Pacific Biosciences and Oxford Nanopore, and linked-read technologies. Haplotype information is crucial for the interpretation of genetic variation in individual genomes, disease mapping, clinical genomics, and several other analyses of human genetic variation. The lack of phase or haplotype information in human genomes sequenced using short reads is a major barrier to identifying disease associations with compound heterozygous mutations. More than 600 genes overlap segmental duplications with high sequence identity, and variants in more than 100 such genes have been associated with rare Mendelian disorders and complex diseases, including cancer. The inability to detect variants with high accuracy in duplicated regions of the genome using short-read sequencing technologies reduces the ability to identify disease-causing mutations in medical genetics studies.
In Aim 1, we will develop a general computational method for long-read-based diploid genotyping that will enable accurate haplotyping of single nucleotide variants and short indels using long reads and linked reads, as well as accurate small variant calling using single-molecule sequencing (SMS) technologies.
In Aim 2, we will develop computational methods for sensitive mapping of SMS reads and accurate variant calling in repetitive regions of the human genome that are currently excluded from benchmark small variant call sets for reference human genomes.
Finally, in Aim 3, we will apply the methods from Aims 1 and 2 to perform variant calling on multiple genomes sequenced using SMS technologies, build a catalog of variant paralogous sequence variants (PSVs), and use this catalog to improve the read mapping and variant calling accuracy of short-read sequencing in repetitive regions of the genome. We will implement the methods in robust and computationally efficient software tools and benchmark their accuracy using publicly available long-read sequence datasets for multiple human genomes of diverse ancestries.
Massively parallel DNA sequencing technologies have revolutionized the study of human genetic variation and disease. However, whole-genome sequencing using short-read sequencing technologies such as Illumina does not provide long-range haplotype information and has limited ability to detect variants in repetitive regions of the genome. The methods developed in this project will address these two limitations of short-read sequencing and enable the use of long-read sequencing technologies in medical and population studies.