The most structurally complex regions in the genome are composed of ampliconic sequences, defined as repeats that display >99% identity and are >10 kb in length. Ampliconic regions are of disproportionate biomedical significance. However, these regions are inaccessible to standard genome sequencing strategies, and so are grossly misrepresented in, or entirely missing from, reference genome assemblies. Biomedical researchers cannot extract insights from parts of the genome to which they have no access, so our understanding of the frequency and mechanism of amplicon-mediated rearrangements, and of their role in disease, is far from complete. Furthermore, ampliconic sequences are systematically excluded from all experiments based on mapping to the reference sequence (e.g., exome re-sequencing, RNA-seq, ChIP-seq), severely limiting the insights to be gained from such studies. The chief obstacle to accessing entire genomes is not a lack of interest on the part of the biomedical research community, but the lack of a practical, affordable, and distributable technology with which to generate reference-quality sequence of ampliconic regions. Single Haplotype Iterative Mapping and Sequencing (SHIMS) is the only proven strategy to assemble such regions. SHIMS relies on mapped large-insert clones (usually BACs) derived from a single haplotype, so that polymorphisms do not confound the assembly of ampliconic repeats. The major bottleneck and cost in the traditional SHIMS approach (SHIMS 1.0) is the sequencing of individual BACs: with standard capillary-based sequencing, this endeavor is expensive in terms of both reagents and highly skilled labor. Here we propose to dramatically restructure the SHIMS operational paradigm, so that ultra-high-quality reference sequence can be generated by a small research team at modest cost.
We will achieve this by setting up an efficient SHIMS 2.0 pipeline encompassing all steps in generating finished BAC sequence using the Illumina MiSeq platform. We will sequence pools of 192 indexed BACs, generating deep sequence coverage that will dramatically reduce, if not eliminate, the need for directed finishing. We will optimize all components of the process, from high-throughput plasmid preparation and DNA fragmentation to de novo sequence assembly and quality assessment, with an eye toward quality of product, cost, efficiency, and reproducibility. We will ensure that this new technology and software are distributable, and we will actively promote and support the application of the SHIMS 2.0 pipeline by other researchers to complex genomic regions. For example, it will be possible to use SHIMS 2.0 to assemble multiple human genomes, providing an invaluable resource for studies in human genetics. The SHIMS 2.0 strategy can also be applied in other species, enabling insight into the evolutionary dynamics of ampliconic regions. In addition, applying SHIMS 2.0 to improve the genomes of model organisms will be of tremendous benefit to researchers in multiple biomedical disciplines.
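The two quantitative ideas in the abstract can be illustrated with a minimal sketch: the definition of an ampliconic repeat (>99% identity, >10 kb) and the mean depth each clone would receive when a pool of indexed BACs shares one sequencing run. The record type `RepeatPair`, the function names, and every numeric input in the usage note are hypothetical and chosen only for illustration; they are not part of the SHIMS 2.0 software, and the depth formula assumes reads are divided evenly across clones, which a real pool will only approximate.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract's definition of an ampliconic sequence.
MIN_IDENTITY = 0.99      # repeats must display >99% identity
MIN_LENGTH_BP = 10_000   # and be >10 kb in length


@dataclass
class RepeatPair:
    """One pair of repeat copies (hypothetical record, for illustration only)."""
    length_bp: int
    identity: float  # fraction of matching bases, from 0.0 to 1.0


def is_ampliconic(repeat: RepeatPair) -> bool:
    """Apply both thresholds; each comparison is strict, as in the definition."""
    return repeat.identity > MIN_IDENTITY and repeat.length_bp > MIN_LENGTH_BP


def mean_depth_per_clone(run_yield_bp: int, n_clones: int, insert_size_bp: int) -> float:
    """Mean fold coverage per clone, assuming reads split evenly across the pool.

    This ignores vector backbone, host DNA contamination, and uneven pooling,
    all of which lower the depth actually achieved on some clones.
    """
    if n_clones <= 0 or insert_size_bp <= 0:
        raise ValueError("n_clones and insert_size_bp must be positive")
    return run_yield_bp / (n_clones * insert_size_bp)
```

As a usage example with placeholder numbers (not platform specifications), `mean_depth_per_clone(1_920_000_000, 192, 200_000)` divides a run yield of 1.92 Gb across 192 clones of 200 kb each and returns 50.0, that is, 50-fold mean coverage per clone.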

Public Health Relevance

Structurally complex or repetitive regions of the genome are of disproportionate medical significance because of their susceptibility to large-scale rearrangements, which can add or subtract genes and cause disease. Because such complex regions are inherently difficult to sequence, especially with the latest sequencing technologies, they are missing from or poorly represented in the genome sequences of humans and important model organisms, which impedes the study of the nature and mechanism of disease-causing rearrangements. We will develop a practical, affordable, and distributable technology capable of generating accurate sequences of complex genomic regions, and we will ensure that the technology has a broad impact by actively promoting its use by other researchers.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Smith, Michael
Whitehead Institute for Biomedical Research
United States
Bellott, Daniel W; Skaletsky, Helen; Cho, Ting-Jan et al. (2017) Avian W and mammalian Y chromosomes convergently retained dosage-sensitive regulators. Nat Genet 49:387-394
Ly, Peter; Teitz, Levi S; Kim, Dong H et al. (2017) Selective Y centromere inactivation triggers chromosome shattering in micronuclei and repair by non-homologous end joining. Nat Cell Biol 19:68-75
Hughes, Jennifer F; Skaletsky, Helen; Koutseva, Natalia et al. (2015) Sex chromosome-to-autosome transposition events counter Y-chromosome gene loss in mammals. Genome Biol 16:104