Advanced computational methods in analyzing high-throughput sequencing data

Li, Heng

Abstract

Sequencing technologies have become essential tools for the study of human evolution, for understanding the genetic bases of diseases and for the clinical detection and treatment of genetic disorders. Computational algorithms are indispensable to the analysis of large-scale sequencing data and have received broad attention. However, many mainstream software packages for sequence alignment, assembly and variant calling were developed several years ago and have gradually lagged behind the rapid development of sequencing technologies. They are unable to process the latest long reads or assembled contigs, and will be outpaced by upcoming technologies in terms of throughput. The development of advanced algorithms is critical to the applications of sequencing technologies in the near future. This project will address this pressing need with four proposals: (1) developing a fast and accurate aligner that accelerates short-read alignment and can map megabase-long assemblies against large sequence collections of over 100 gigabases in size; (2) developing an integrated caller for small sequence variations that is faster to run, more sensitive to moderately longer insertions and more accessible to biologists without extensive expertise in bioinformatics; (3) developing a generic variant filtering tool that uses a novel deep learning model to achieve human-level accuracy in identifying false positive calls; (4) developing a new de novo assembler that works with the latest nanopore reads of ~100 kilobases in length and may achieve good contiguity at low coverage. Upon completion, the proposed studies will dramatically reduce the computational cost of data processing in most research labs and commercial entities, and will enable the applications of long reads in genome assembly, in the study of structural variations and in cancer research.

Public Health Relevance

Computational algorithms are essential to the analysis of high-throughput sequencing data produced for the detection, prevention and treatment of cancers and genetic disorders. The proposed studies aim to address new challenges arising from the latest sequencing data and to develop faster and more accurate solutions to existing applications. The success of this proposal is likely to unlock the full power of recent sequencing technologies in disease studies and will dramatically reduce the cost of data analyses.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 7R01HG010040-02
Application #: 9824311
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sofia, Heidi J

Project Start: 2018-11-16
Project End: 2022-02-28
Budget Start: 2018-11-16
Budget End: 2019-02-28
Support Year: 2
Fiscal Year: 2018

Institution

Name: Dana-Farber Cancer Institute
DUNS #: 076580745

City: Boston
State: MA
Country: United States

Related projects


NIH 2021 R01 HG	Advanced computational methods in analyzing high-throughput sequencing data Li, Heng / Dana-Farber Cancer Institute
NIH 2020 R01 HG	Advanced computational methods in analyzing high-throughput sequencing data Li, Heng / Dana-Farber Cancer Institute
NIH 2019 R01 HG	Advanced computational methods in analyzing high-throughput sequencing data Li, Heng / Dana-Farber Cancer Institute
NIH 2018 R01 HG	Advanced computational methods in analyzing high-throughput sequencing data Li, Heng / Broad Institute, Inc.
NIH 2018 R01 HG	Advanced computational methods in analyzing high-throughput sequencing data Li, Heng / Dana-Farber Cancer Institute

Publications

Tan, Longzhi; Xing, Dong; Chang, Chi-Han et al. (2018) Three-dimensional genome structures of single diploid human cells. Science 361:924-928

Li, Heng (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100

Li, Heng; Bloom, Jonathan M; Farjoun, Yossi et al. (2018) A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods 15:595-597
