The lack of complete, high-quality sequencing of human genomes is a major bottleneck for accurate and complete analyses in population and medical genetics. Advances in a variety of sequencing technologies have created enormous opportunities to produce full assemblies of every chromosome and its homologue (called haplotypes). The reconstruction of haplotype sequences from sequencing data is known as diploid assembly or haplotype-aware de novo assembly. Standard de novo assemblers are limited in their ability to combine mixed data types, and they collapse haplotype sequences, resulting in expensive, discontinuous, and inaccurate assemblies. Our interim goal is a finished human genome that would not only reveal the last remaining regions of the genome but also benefit downstream analyses by providing an unbiased reference for comparison and mapping, as well as the complete phased sequencing of several human and non-human genomes for specific research projects. This project will develop a novel computational toolkit, WHdenovo, that can optimally combine various sequencing data types to generate phased assemblies of single individuals and pedigrees.
In aim 1 (K99 phase), I will provide computationally efficient, easy-to-use, open-source, production-level tools for generating diploid assemblies of pedigrees at minimal cost.
In aim 2 (R00 phase), I will develop novel computational tools for generating pedigree-independent diploid assemblies of single individuals across whole genomes, including centromeres.
In aim 3 (R00 phase), the tools developed during aims 1 and 2 will be applied to generate diploid assemblies of diverse human and non-human genomes, and of clinically relevant regions such as the major histocompatibility complex (MHC) and the killer cell immunoglobulin-like receptor (KIR) region. My goal is to design tools that will be useful to large consortia such as Genome in a Bottle, High Quality Human Reference Genomes, and the Personal Genome Project. My extensive background in computational biology puts me in a unique position to accomplish this proposal, which requires a seamless integration of data science and genomics.
Career and Training: I received my PhD in Computer Science at the Max Planck Institute for Informatics and started postdoctoral research in the lab of Professor George Church at Harvard Medical School. During the K99 phase, I will continue to be mentored by Professor Church. Under the supervision of co-mentor Heng Li, I will advance my expertise in making computational tools efficient in practice and in tuning them for upcoming novel high-throughput sequencing (HTS) datasets. The proposed plan will prepare me to be an independent bioinformatics research scientist.

Public Health Relevance

Human genomes are diploid, consisting of two copies of each chromosome, one inherited from each parent. Determining the DNA sequences of these two chromosomal copies, called haplotypes, is important for many applications ranging from population history to clinical questions. Advances in sequencing technologies have enabled the reconstruction of the underlying diploid genomes. I propose the development of novel computational tools to assemble diploid genomes, that is, to directly solve two jigsaw puzzles of two very similar sequences.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Career Transition Award (K99)
Project #
1K99HG010906-01
Application #
9871496
Study Section
National Human Genome Research Institute Initial Review Group (GNOM)
Program Officer
Sofia, Heidi J
Project Start
2019-09-10
Project End
2021-08-31
Budget Start
2019-09-10
Budget End
2020-08-31
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Dana-Farber Cancer Institute
Department
Type
DUNS #
076580745
City
Boston
State
MA
Country
United States
Zip Code
02215