Data Driven Sequence Assembly

Baldwin, Schuyler

Abstract

This proposal aims to improve DNA sequencing software by combining basecalling, sequence assembly, and post-assembly analysis into a single integrated system. To test the performance of the software configurations, known regions of E. coli will be resequenced from the original clones using a LI-COR sequencing instrument. The data will be basecalled by neural-network-based pattern recognition and assembled with a variety of multiple-alignment methods. The algorithmic solutions will be compared and evaluated against the goal of achieving an accurate final sequence with a minimum of editing by a human expert. Reducing the need for sequence editing offers the opportunity for significant cost savings in genome projects and other research involving DNA sequencing.