The goal of the informatics core is to increase the rate at which finished sequence data are generated. Tools will be created to process over 1.5 Mb of raw data and 100 kb of sequence per day during Year 5 of the proposal. This will be achieved by steady improvement in software automation, hardware, and computational support. Base-calling algorithms will be developed and tested to replace the current base-calling software. The new base-calling routines will also provide base reliability estimates for use in sequence assembly and automated contig editing. The combination of improved base calling, automated proofreading, and improved sequence assembly will reduce initial proofreading by Year 2 and eliminate it by Year 3, as well as halve the final contig editing time per megabase in each of Years 3, 4, and 5. The multiplex technology sequences both ends of plasmid inserts. Sequence assembly algorithms will use the fact that plasmid ends must be separated by the length of the insert to automatically assemble clones with repetitive regions. In addition, new techniques will improve the recognition of fragment overlaps and the merging of fragments into contigs, which will contribute to the goals for proofreading and final contig editing. Contig editing will be automated. The automated contig editor will use multiple coverage in the assembled contigs to reinterpret the original image data. The likelihood of conflicting bases or indels (insertions/deletions) will be calculated for the entire set of assembled sequence reads, and the sequence variant with the highest likelihood will be selected. General computational support will be provided for the success of Projects 1 and 2.
Services include 1) installing and maintaining adequate network connectivity, both internally and externally, 2) programming support for the development and use of robotics, 3) routine acquisition, installation, and maintenance of required hardware and software, and 4) development of project management and database tools. Sophisticated sequence analysis tools will use database searching and gene identification tools to find genes in cosmids. The emphasis will be on developing visual interfaces to facilitate efficient use of existing gene identification algorithms. Daily interaction with the biologists working on Projects 1 and 2 creates an excellent environment for developing practical and elegant algorithmic and software solutions to production problems.
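As an illustration of the likelihood-based contig editing described above, the following minimal sketch picks the most likely symbol for one column of an assembled contig from the base calls and their reliability estimates. It is not the method the proposal will implement: the error model (a miscalled base is equally likely to be any other symbol), the five-symbol alphabet with `-` standing for a deletion, and the names `column_log_likelihoods` and `call_consensus` are all assumptions made for this example.

```python
import math

# Assumed alphabet: four bases plus '-' for a candidate deletion in this column.
ALPHABET = "ACGT-"


def column_log_likelihoods(calls):
    """Score each candidate symbol for one contig column.

    calls: list of (symbol, error_prob) pairs, one per read covering the
    column, where error_prob is the base caller's estimated probability
    that the call is wrong (a hypothetical reliability estimate).
    """
    scores = {}
    for candidate in ALPHABET:
        total = 0.0
        for symbol, err in calls:
            # Clamp so the logarithm stays finite for extreme estimates.
            err = min(max(err, 1e-6), 1.0 - 1e-6)
            if symbol == candidate:
                total += math.log(1.0 - err)
            else:
                # Assumed model: an error lands on any other symbol
                # with equal probability.
                total += math.log(err / (len(ALPHABET) - 1))
        scores[candidate] = total
    return scores


def call_consensus(calls):
    """Return the candidate symbol with the highest log-likelihood."""
    scores = column_log_likelihoods(calls)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Two confident reads agree; one unreliable read disagrees.
    print(call_consensus([("A", 0.01), ("A", 0.05), ("G", 0.40)]))
    # One highly reliable read outweighs two unreliable reads.
    print(call_consensus([("A", 0.45), ("A", 0.45), ("G", 0.001)]))
```

Under this simplified model, a single highly reliable call can override a majority of unreliable ones, which is the motivation for carrying base reliability estimates from base calling through to contig editing.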
Engelstein, M; Aldredge, T J; Madan, D et al. (1998) An efficient, automatable template preparation for high throughput sequencing. Microb Comp Genomics 3:237-41
Smith, D R; Richterich, P; Rubenfield, M et al. (1997) Multiplex sequencing of 1.5 Mb of the Mycobacterium leprae genome. Genome Res 7:802-19
Smith, D R (1996) Microbial pathogen genomes--new strategies for identifying therapeutics and vaccine targets. Trends Biotechnol 14:290-3