Two of the major challenges in genome analysis are de novo genome sequence assembly based on "short read" shotgun sequencing and genome-wide structural variation analysis. At present, most medical sequencing projects and whole genome sequencing projects map the sequencing data onto the reference human genome sequence without performing whole genome assemblies. When whole genome assembly is attempted, it is done by generating paired-end sequencing reads from a number of sequencing libraries with different insert sizes. The paired-end sequences provide the "scaffold" that helps with sequence assembly. However, this approach increases the complexity of the sequencing project and provides limited information on the haplotypes of the diploid human genome. Similarly, current structural variation scanning based on array-based comparative genomic hybridization is unable to determine the genomic locations of duplicated regions or identify genomic inversions or balanced translocations. We propose to optimize a new, highly flexible, automated method for optical mapping for general use. Our genome mapping strategy starts with sequence-specific labeling of double-stranded genomic DNA fragments with fluorophores. The fluorescently labeled, large (100 kbp to 1 Mbp) DNA fragments are then linearized in nanochannel arrays for high-throughput, automated imaging and analysis on a commercially available instrument. As more and more groups are performing large-scale genomic sequencing and searching for structural variation, a method that average labs can use in-house will facilitate medical genomics studies. By intelligent probe design, one can create genome maps tailored to the questions being asked, be it local structural variation screening, global structural variation detection, or scaffolding for de novo genome sequence assembly.
In this proposal, we aim to improve and scale the method to generate, with ease, genome maps of >300 individuals from the 1000 Genomes Project, providing both genome-wide structural variation data and fully assembled sequencing data on these whole-genome sequenced subjects.
As sequencing platforms are producing short-read sequences at extremely high rates, the main obstacle to whole genome sequencing is the inability to assemble the sequencing data accurately and efficiently. Furthermore, structural variations in the human genome are found to be associated with a number of important diseases but genome-wide scanning for these variations is not yet feasible. In this proposal, we aim to develop and optimize a single molecule mapping approach that will make de novo sequence assembly and structural variation analysis possible.