Sequencing technologies have become an essential tool to the study of human evolution, to the understanding of the genetic bases of diseases and to the clinical detection and treatment of genetic disorders. Computational algorithms are indispensible to the analysis of large-scale sequencing data and have received broad attention. However, developed several years ago, many mainstream software packages for sequence alignment, assembly and variant calling have gradually lagged behind the rapid development of sequencing technologies. They are unable to process the latest long reads or assembled contigs, and will be outpaced by upcoming technologies in terms of throughput. The development of advanced algorithms is critical to the applications of sequencing technologies in the near future. This project will address this pressing need with four proposals: (1) developing a fast and accurate aligner that accelerates short-read alignment and can map megabase-long assemblies against large sequence collections of over 100 gigabases in size; (2) developing an integrated caller for small sequence variations that is faster to run, more sensitive to moderately longer insertions and more accessible to biologists without extended expertise in bioinformatics; (3) developing a generic variant filtering tool that uses a novel deep learning model to achieve human-level accuracy on identifying false positive calls; (4) developing a new de novo assembler that works with the latest nanopore reads of ~100 kilobases in length and may achieve good contiguity at low coverage. Upon completion, the proposed studies will dramatically reduce the computational cost of data processing in most research labs and commercial entities, and will enable the applications of long reads in genome assembly, in the study of structural variations and in cancer researches.

Public Health Relevance

Computational algorithms are essential to the analysis of high-throughput sequencing data produced for the detection, prevention and treatment of cancers and genetic disorders. The proposed studies aim to address new challenges arising from the latest sequencing data and to develop faster and more accurate solutions to existing applications. The success of this proposal is likely to unlock the full power of recent sequencing technologies in disease studies and will dramatically reduce the cost of data analyses.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Sofia, Heidi J
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Dana-Farber Cancer Institute
United States
Zip Code
Tan, Longzhi; Xing, Dong; Chang, Chi-Han et al. (2018) Three-dimensional genome structures of single diploid human cells. Science 361:924-928
Li, Heng (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100
Li, Heng; Bloom, Jonathan M; Farjoun, Yossi et al. (2018) A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15:595-597