The growing use of DNA sequence data in research, databases, diagnostic and therapeutic biotechnology, and even litigation dramatically increases the need to improve the quality of the data being used. This proposal addresses the problem of assembling a large set of sequenced DNA fragments into a finished consensus. For a sequencing project to produce high-quality finished sequence data, the assembly of sequence fragments must be correct both in its large-scale structure and in the fine-scale detail of the alignment of individual base calls. We propose to investigate new algorithms for consensus estimation and assembly of DNA sequence fragments. Recent word-based approaches to consensus estimation show promise as methods for de novo assembly and for exploring alternative assemblies on the large scale. The ability to explore alternative assemblies will be especially important when sequences contain large exact or approximate repeats. We propose to develop several main enhancements to these algorithms. In particular, we will develop a global optimization algorithm for determining consensus sequences, replacing current locally optimizing methods. We also propose to develop algorithms that allow alternative alignments in regions of ambiguity. This approach will allow us to assess alignment accuracy at both the large and fine scales.
Accurate assemblies are at the heart of many sequencing projects central to biopharmaceutical, agricultural, and basic research, as well as to the Human Genome Project. The proposed advances have the potential to increase both reliability and automation in a bioinformatics software market totaling about 100 million dollars per year.