The advent of next-generation (Next-gen) sequencing technologies has set off a surge in whole genome sequencing and resequencing, exemplified spectacularly by four papers describing five complete human genomes in 2008 alone. One company, Knome, now even offers customers their entire genome sequence using Next-gen sequencing technology. These developments, together with targeted resequencing of genomes, presage the day of the $1000 human genome. Broad-scale whole human genome resequencing (WHGR) will have an enormous impact on personalized medicine and on the study of human evolution and human diversity. To fully realize that potential, however, software capabilities must be dramatically enhanced to meet three significant challenges: the sheer volume of data generated in these projects, the diversity of technology-specific data characteristics, and the size of the 6-billion-base-pair diploid human genome itself. Moreover, we foresee the day when technology improvements and cost reductions make WHGR as commonplace as bacterial genome sequencing is today. For that to occur, assembly and analysis software must be accessible to a far broader and less computer-savvy range of researchers than the highly specialized bioinformatics teams that decode the information now. In addition, even a well-funded research laboratory has far more limited computing resources than a large sequencing center. Therefore, the overall goal of this proposal is to develop a Next-gen sequence assembly and analysis pipeline, DESKAPP, that will run on an affordable ($5000) high-end desktop computer and produce a human genome sequence in a reasonable timeframe (days, not weeks). WHGR by DESKAPP will involve a reference-guided main assembly as well as a de novo assembly branch to characterize regions of the new genome that are unique relative to the reference.
Merging of the assemblies produces a complete sequence that can be evaluated for gene content, single nucleotide polymorphisms (SNPs) and structural variation (SV; indels, inversions, translocations), both by web-based searches of external databases to identify known allelic variation and by direct examination of the sequence to identify new polymorphisms. A Disk Sort Alignment (DSA) algorithm allows data sets that are far too large for in-memory processing to be evaluated and clustered for assembly by SeqMan N-Gen (SM N-Gen), our desktop assembly engine. Using a prototype DSA-SM N-Gen pipeline, we have processed the entire 7.4x-coverage 454 data set from the James Watson genome to a layout file in 31 hours using DSA and have assembled three chromosomes (8, 21, and X) using SM N-Gen. Assembly times ranged from 1 hour for Chromosome 21 to 10.6 hours for Chromosome 8, an average-sized chromosome. Together, these results demonstrate the feasibility of constructing a DESKAPP pipeline for WHGR. The Phase II Aims are designed to build upon this foundation and produce a seamless pipeline for the desktop assembly and analysis of a human genome in a matter of days.
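The internals of DSA are not described here. Purely as an illustration of the general idea of sorting on disk when a data set exceeds available memory, the following Python sketch shows a generic external merge sort. It is not the DSA algorithm itself: the record layout (chromosome, position, read identifier), the function names, and the chunk size are all assumptions made for this example.

```python
import heapq
import os
import tempfile


def _write_run(records, tmp_dir):
    """Sort one in-memory chunk and write it to a temporary run file."""
    records.sort()
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
    with os.fdopen(fd, "w") as out:
        for chrom, pos, read_id in records:
            out.write(f"{chrom}\t{pos}\t{read_id}\n")
    return path


def _read_run(path):
    """Stream the records of one sorted run file back from disk."""
    with open(path) as handle:
        for line in handle:
            chrom, pos, read_id = line.rstrip("\n").split("\t")
            yield chrom, int(pos), read_id


def external_sort(records, chunk_size, tmp_dir):
    """Yield (chrom, pos, read_id) tuples in sorted order.

    Hypothetical helper: at most `chunk_size` records are held in memory
    while the sorted runs are being written; the runs are then merged
    lazily, one record per run at a time.
    """
    run_paths = []
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            run_paths.append(_write_run(chunk, tmp_dir))
            chunk = []
    if chunk:
        run_paths.append(_write_run(chunk, tmp_dir))
    try:
        # heapq.merge performs a k-way merge over already-sorted inputs.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        # Remove the temporary run files once merging ends.
        for p in run_paths:
            os.remove(p)
```

In this sketch, chromosome names compare as plain strings and identifiers are assumed to contain no tab characters. Once reads are ordered by reference position, those that map to the same region sit next to each other in the output, which is the property a downstream clustering step would rely on.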
Next-gen sequencing technologies have started a new revolution throughout biology by providing DNA sequence data in unprecedented quantities at continually decreasing costs. These data will be invaluable in the emerging era of personalized medicine and in exploring the immense diversity of life. The goal of this project is to develop desktop computer software that will enable research laboratories and clinics of any size to realize the promise of these new technologies.