The lack of complete, high-quality sequencing of human genomes is a major bottleneck for accurate and complete analyses in population and medical genetics. Advances in a variety of sequencing technologies have created enormous opportunities to produce full assemblies of every chromosome and its homologue (called haplotypes). The reconstruction of haplotype sequences from sequencing data is known as diploid assembly or haplotype-aware de novo assembly. Standard de novo assemblers are limited in their ability to combine mixed data types and also collapse haplotype sequences, resulting in expensive, discontinuous, and inaccurate assemblies. Our interim goal is a finished human genome that would not only reveal the last remaining regions of the genome, but also benefit downstream analyses by providing an unbiased reference for comparison and mapping, as well as the complete phased sequencing of several human and non-human genomes for specific research projects. This project will develop a novel computational toolkit, WHdenovo, that can optimally combine various sequencing data types to generate phased assemblies of single individuals and pedigrees.
In aim 1 (K99 phase), I will provide computationally efficient, easy-to-use, open-source, production-level tools for generating diploid assemblies of pedigrees at minimal cost.
In aim 2 (R00 phase), I will develop novel computational tools for generating pedigree-independent diploid assemblies of single individuals over whole genomes including centromeres.
In aim 3 (R00 phase), the tools developed during aims 1 and 2 will be applied to generating diploid assemblies of diverse human and non-human genomes, and of clinically relevant regions such as the major histocompatibility complex (MHC) and the killer cell immunoglobulin-like receptor (KIR) region. My goal is to design tools that will be useful to large consortia such as Genome in a Bottle, High Quality Human Reference Genomes, and the Personal Genome Project. My extensive background in computational biology puts me in a unique position to accomplish this proposal, which requires a seamless integration of data science and genomics.
Career and Training: I received my PhD in Computer Science at the Max Planck Institute for Informatics and started postdoctoral research in the lab of Professor George Church at Harvard Medical School. During the K99 phase, I will continue to be mentored by Professor Church. Under the supervision of co-mentor Heng Li, I will advance my expertise in making computational tools efficient in practice and in tuning them for upcoming high-throughput sequencing (HTS) datasets. This proposed plan would prepare me to be an independent bioinformatics research scientist.

Public Health Relevance

Human genomes are diploid, consisting of two copies of each chromosome, one inherited from each parent. Determining the DNA sequences of these two chromosomal copies (called haplotypes) is important for many applications ranging from population history to clinical questions. Advances in sequencing technologies have enabled the reconstruction of the underlying diploid genomes. I propose the development of novel computational tools to assemble diploid genomes, that is, to directly solve two jigsaw puzzles of two very similar sequences.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Career Transition Award (K99)
Study Section
National Human Genome Research Institute Initial Review Group (GNOM)
Program Officer
Sofia, Heidi J
Dana-Farber Cancer Institute
United States