Massive sequencing is revolutionizing biological research and clinical practice. Over the past decades, projects such as the 1000 Genomes Project, TCGA, GTEx, and GEUVADIS have generated hundreds of trillions of reads. The recent completion of the UK's 100K WGS project has inspired many other nations to develop their own 100K WGS projects. Higher throughput and lower sequencing costs have enabled deeper and more thorough studies of cancer, genetic disorders, and other areas of human biology. Advanced sequence alignment and computational methodologies have played a major role in conducting these analyses. In recent years, our lab has contributed to these unprecedented, global-scale endeavors by developing several widely used bioinformatics tools for analyzing NGS reads: TopHat2 and HISAT for aligning RNA-seq reads, TopHat-Fusion for identifying gene fusions, Centrifuge for classifying metagenomic sequencing reads, HISAT2 for graph alignment at the human genome scale, and HISAT-genotype for HLA gene typing and assembly. This proposal addresses several key challenges in the areas of sequence alignment, genotyping, and diploid genome assembly. First, we plan to research and develop various indexing strategies. Virtually all alignment programs rely on a single type of index for aligning reads to a reference. Alignment accuracy and speed will be further enhanced by incorporating additional types of indexes. Second, we will develop genotyping and diploid genome assembly algorithms. As sequencing costs continue to decline, it will become routine for people to have their own genomes sequenced for clinical purposes. We will further develop our initial version of HISAT-genotype into a comprehensive suite of tools that can genotype and assemble a person's whole diploid genome in one day on a desktop. Third, we will continue to maintain and improve HISAT2, and develop a new, more versatile aligner.
We propose to unify widely used alignment programs by implementing their common functions (input processing, indexing, aligning, and reporting) as modules and providing application programming interfaces (APIs) that expose those modules. Bioinformatics engineers will be able to use the APIs to develop their own indexes and alignment algorithms, customized to best analyze their own data sets. We plan to demonstrate the usability of the new sequence aligner, SARTOR (Sequence Alignment Repertoire To Optimize Reference-guided analysis), by effectively handling different types of reads (WGS, WES, RNA-seq, ChIP-seq, BS-seq, etc.) produced by different sequencing technologies (short, long, and linked reads). Upon successful completion, the proposed software systems will promote personalized medicine by drawing upon customized personal genomes, with key functionalities including differential gene expression analysis and somatic mutation identification. The programs will also allow researchers to perform unbiased, accurate, and rapid analyses in large-scale NGS experiments.
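
To make the modular design concrete, the following is a minimal sketch of how the four stages named above (input processing, indexing, aligning, and reporting) could be separated behind small interfaces so that an index can be swapped without touching the other stages. It is an illustration only and not the SARTOR API: every name here (`Read`, `Hit`, `Index`, `KmerIndex`, `parse_reads`, `align`, `report`, `run_pipeline`) is hypothetical, the tab-separated input format is an assumption, and the toy k-mer hash index stands in for the far more sophisticated indexes the proposal describes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


@dataclass(frozen=True)
class Read:
    name: str
    seq: str


@dataclass(frozen=True)
class Hit:
    read: str
    pos: int          # 0-based reference offset, -1 if unaligned
    mismatches: int


class Index(Protocol):
    """Hypothetical index interface: propose candidate offsets for a read."""
    def candidates(self, seq: str) -> Iterable[int]: ...


class KmerIndex:
    """Toy hash index mapping each k-mer to its reference offsets."""

    def __init__(self, reference: str, k: int = 4) -> None:
        self.k = k
        self.table: dict[str, list[int]] = {}
        for i in range(len(reference) - k + 1):
            self.table.setdefault(reference[i:i + k], []).append(i)

    def candidates(self, seq: str) -> Iterable[int]:
        # Seed with the read's first k-mer only; a real aligner uses many seeds.
        return self.table.get(seq[:self.k], [])


def parse_reads(lines: Iterable[str]) -> Iterator[Read]:
    """Input module: assumed 'name<TAB>sequence' records, one per line."""
    for line in lines:
        name, seq = line.rstrip("\n").split("\t")
        yield Read(name, seq.upper())


def align(read: Read, reference: str, index: Index, max_mm: int = 1) -> Hit:
    """Alignment module: verify each candidate by counting mismatches."""
    best = Hit(read.name, -1, len(read.seq))
    for pos in index.candidates(read.seq):
        window = reference[pos:pos + len(read.seq)]
        if len(window) < len(read.seq):
            continue  # candidate runs off the end of the reference
        mm = sum(a != b for a, b in zip(read.seq, window))
        if mm <= max_mm and mm < best.mismatches:
            best = Hit(read.name, pos, mm)
    return best


def report(hits: Iterable[Hit]) -> list[str]:
    """Reporting module: one tab-separated line per read."""
    return [f"{h.read}\t{h.pos}\t{h.mismatches}" for h in hits]


def run_pipeline(lines: Iterable[str], reference: str, index: Index) -> list[str]:
    return report(align(r, reference, index) for r in parse_reads(lines))


reference = "ACGTTGCAACGGATCC"
reads = ["r1\tTTGCAACG", "r2\tGGATGC", "r3\tAAAAAA"]
result = run_pipeline(reads, reference, KmerIndex(reference))
```

Because `align` depends only on the `Index` protocol, any object with a `candidates` method can replace `KmerIndex` here, which is the kind of substitution the modular design is meant to allow.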

Public Health Relevance

High-throughput sequencing, combined with a diversity of sample preparation protocols, is being used to effectively study human biology and disease. Advanced, versatile software tools are critical for efficiently analyzing large sequencing data sets. This project will develop (1) very rapid and accurate sequence alignment methods applicable to the latest sequencing technology and (2) practical genotyping and assembly methods that can be executed within a day on a conventional desktop, both of which will be invaluable tools for studying human disease.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM135341-01
Application #
9861501
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Ravichandran, Veerasamy
Project Start
2019-09-23
Project End
2023-08-31
Budget Start
2019-09-23
Budget End
2020-08-31
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of Texas Sw Medical Center Dallas
Department
Miscellaneous
Type
Schools of Medicine
DUNS #
800771545
City
Dallas
State
TX
Country
United States
Zip Code
75390