Our goal is to develop a set of pre- and post-processing tools that are independent of the assembly software used and thus could be implemented immediately at all major sequencing/assembly centers. We will make our software and methods freely available as open source.
Pre-processing:
Specific Aim 1. Improving draft genome assemblies through better use of read data. (a) Vector trimming: Improperly trimmed vector sequences often cause genome assemblies to break unnecessarily. We propose an improved vector trimming method that determines the vector sequence automatically. (b) We propose to preprocess the read data to increase the amount of usable sequence at the 3' ends of the reads, using overlap-based error correction followed by overlap-based trimming. We also propose to use low-quality bases at the 3' ends of the reads to help resolve repetitive regions. This procedure would be used to further improve the UMD Overlapper. (c) We propose to use our read extension and error correction routines to create better assemblies of genomes sequenced at low coverage.
Post-processing: We propose to develop a set of assembler-independent techniques that can be used at any sequencing/assembly center or in the framework of any sequencing consortium.
Specific Aim 2. Assembly evaluation software. We have developed a Compression/Expansion (CE) statistic that allows us to detect misassembled regions in draft assemblies. We have also developed software that uses a shooting method to determine which inserts lie in easy-to-assemble (not necessarily unique) regions of the genome and to measure their sizes exactly. We propose to develop integrated assembly evaluation and misassembly detection software that combines statistics based on read and mate-pair placements with data obtained from the shooting method to detect problems in draft assemblies.
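As an illustration of how a mate-pair statistic of this kind can flag a suspect region, the sketch below computes a CE-style value as a z-score: the mean insert size implied by the mate pairs spanning a position, compared with the library-wide mean. This formulation, the function names (`ce_statistic`, `flag_region`), and the cutoff of 3.0 are assumptions made for the example; the text above does not give the exact definition used in our software.

```python
import math
from statistics import mean


def ce_statistic(local_insert_sizes, library_mean, library_sd):
    """Z-score of the local mean insert size against the library-wide mean.

    Assumed formulation: (local mean - library mean) / (library sd / sqrt(n)),
    where n is the number of mate pairs spanning the position.
    """
    n = len(local_insert_sizes)
    if n == 0:
        raise ValueError("no mate pairs span this position")
    if library_sd <= 0:
        raise ValueError("library_sd must be positive")
    standard_error = library_sd / math.sqrt(n)
    return (mean(local_insert_sizes) - library_mean) / standard_error


def flag_region(ce, threshold=3.0):
    """Classify a position; the threshold is an illustrative assumption."""
    if ce > threshold:
        return "expansion"    # inserts look stretched in the assembly
    if ce < -threshold:
        return "compression"  # inserts look squeezed, e.g. a collapsed repeat
    return "ok"


if __name__ == "__main__":
    # Hypothetical library: mean insert 3000 bp, standard deviation 300 bp.
    spanning = [2400] * 9
    ce = ce_statistic(spanning, library_mean=3000, library_sd=300)
    print(ce, flag_region(ce))
```

Under these assumptions, a strongly negative value suggests that sequence is missing from the assembly at that position, and a strongly positive value suggests that extra sequence was inserted.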
Specific Aim 3. Assembly reconciliation. We propose to create software that enhances a given draft assembly using alternate draft assemblies of the same genome created from the same read data with different assembly programs, or with the same assembly program using different parameters.
The U.S. government spends hundreds of millions of dollars on whole-genome shotgun sequencing. We believe that if the goals of this project are achieved, significantly better and cheaper genomes will be produced. The cost of using our techniques will be negligible compared to the cost of generating reads. Our approach may find more genes and regulatory regions and lead to a better understanding of the genetic structure of the various genomes. The ultimate goal of this project is to improve public health by better understanding the human genome and the genomes of other species.