Whole-genome shotgun sequencing strategy and assembly

Jaffe, David

Abstract

The ability to read the DNA sequence of an organism's genome has revolutionized biology and biomedical research. Accurate assemblies of sequence data are critical because they provide the foundation for all subsequent work. Using capillary-based sequencing technology, high-quality drafts were generated for many genomes. Over the past several years, massively parallel sequencing technologies have lowered sequencing cost 1000-fold, but the reads from these technologies are shorter and less accurate than capillary reads, and hence harder to assemble, particularly for large genomes. We have recently demonstrated assemblies of massively parallel data that begin to approach the quality of those from capillary data. These assemblies were of genomes for which exceptionally high-quality ('finished') assemblies were already available, so we were able not only to rigorously assess the quality of our assemblies, but also to systematically diagnose their defects. Moreover, we observe that in almost all cases, defective loci have enough coverage that they could in principle be assembled correctly, provided that the right algorithms were available.
On this basis we have proposed a research program to develop computational methods for the creation of assemblies of unprecedented quality. In our first aim, we propose to develop methods to achieve high-quality draft assemblies of new genomes. Here our objective is to reach and exceed the level of quality that had been achieved using capillary sequencing. In our second aim, we will develop methods to achieve ultra-high-quality assemblies of human genomes. To do this, we will leverage the existing human reference sequence and reference sequences of other individuals, including those that we would create. In this way we aim to achieve near-finished quality for regions represented in the reference sequences (essentially via 'resequencing' methods), and at the same time (by de novo methods) capture those regions that are not present in the reference sequences. Our aim is thus to produce the best possible representation of each individual's genome. We note that as costs drop, this is likely to become 'standard of care' for patients.
In our third aim, we look beyond existing data, to the next generation of sequencing technologies, to assemble very hard regions using very long and 'strobe' reads. These hard regions include segmental duplications, which are evolutionary hotspots, are associated with many diseases, and are inaccessible to current methods other than very expensive clone-by-clone sequencing. Finally, our fourth aim is to make assembly methods accessible to the community. Here our goal is to make it as easy as possible for a range of users (including individual investigators) to match the results achievable by genome assembly experts. In short, through our four aims, we will enable the community to achieve the highest possible assembly quality using the lowest-cost data. We thus anticipate that our work will advance a broad range of investigations of importance to biology and human disease.

Public Health Relevance

This grant will develop better methods for completely and accurately determining the genome sequence of an organism, in particular producing precise representations of complex and repetitive regions of genomes. This will advance a broad range of investigations in genome evolution, cancer and human genetic disease.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG003474-06A1
Application #: 8183944
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Felsenfeld, Adam

Project Start: 2004-11-01
Project End: 2014-06-30
Budget Start: 2011-09-20
Budget End: 2012-06-30
Support Year: 6
Fiscal Year: 2011
Total Cost: $859,485

Institution

Name: Broad Institute, Inc.
DUNS #: 623544785

City: Cambridge
State: MA
Country: United States
Zip Code: 02142

Related projects


NIH 2013 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Broad Institute, Inc.	$725,799
NIH 2012 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Broad Institute, Inc.	$759,999
NIH 2011 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Broad Institute, Inc.	$859,485
NIH 2009 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Broad Institute, Inc.	$772,252
NIH 2008 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Massachusetts Institute of Technology	$700,327
NIH 2008 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Broad Institute, Inc.	$59,158
NIH 2007 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Massachusetts Institute of Technology	$751,656
NIH 2006 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Massachusetts Institute of Technology	$751,549
NIH 2005 R01 HG	Whole-genome shotgun sequencing strategy and assembly Jaffe, David B. / Massachusetts Institute of Technology	$726,285

Publications

Weisenfeld, Neil I; Yin, Shuangye; Sharpe, Ted et al. (2014) Comprehensive variation discovery in single human genomes. Nat Genet 46:1350-5

Ross, Michael G; Russ, Carsten; Costello, Maura et al. (2013) Characterizing and measuring bias in sequence data. Genome Biol 14:R51

Goldberg, Jonathan M; Griggs, Allison D; Smith, Janet L et al. (2013) Kinannote, a computer program to identify and classify members of the eukaryotic protein kinase superfamily. Bioinformatics 29:2387-94

Amemiya, Chris T; Alföldi, Jessica; Lee, Alison P et al. (2013) The African coelacanth genome provides insights into tetrapod evolution. Nature 496:311-6

Ribeiro, Filipe J; Przybylski, Dariusz; Yin, Shuangye et al. (2012) Finished bacterial genomes from shotgun sequence data. Genome Res 22:2270-7

Williams, Louise J S; Tabbaa, Diana G; Li, Na et al. (2012) Paired-end sequencing of Fosmid libraries by Illumina. Genome Res 22:2241-9

Calvo, Sarah E; Compton, Alison G; Hershman, Steven G et al. (2012) Molecular diagnosis of infantile mitochondrial disease with targeted next-generation sequencing. Sci Transl Med 4:118ra10

Jones, Felicity C; Grabherr, Manfred G; Chan, Yingguang Frank et al. (2012) The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484:55-61

Gnerre, Sante; Maccallum, Iain; Przybylski, Dariusz et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 108:1513-8

Earl, Dent; Bradnam, Keith; St John, John et al. (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 21:2224-41

Showing the most recent 10 out of 14 publications
