Among the stated goals of the Human Genome Project are dramatic improvements in DNA sequencing technologies and corresponding reductions in the cost per finished base. As these goals are realized, genome sequencing will likely become a more automated, high-volume activity, and similar economies will be demanded in the process of assuring and documenting the quality of the data produced. In an environment where thorough, expert manual validation of new sequence data will often be prohibitively expensive, consumers of sequence data would benefit greatly if the quality of base calls were provided in databases along with the calls themselves. For this to be practical, uncertainty information must be generated automatically and unobtrusively. In the proposed research, (a) algorithms for the estimation of base probability distributions from sequencing gel lane traces will be implemented and evaluated, (b) alternative schemes for the compact storage of this information in databases will be explored, and (c) contig assembly software will be prototyped that uses such information for the input fragments and estimates a statistically consistent representation of the finished contig. The success of this research will promote improvements in the robustness and reliability of sequence data while reducing its cost through longer fragment reads and greater validation efficiency.
The results of the research will be used to extend the X/Gene(TM) sequence analysis software, a comprehensive package supporting distributed processing on Unix networks that has been under development for three years and is currently in pre-release testing. Thus enhanced, it will include facilities for automatically estimating, storing, disseminating, and robustly utilizing uncertainty information in a broad range of sequence analysis applications.
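As a rough illustration of aims (b) and (c), the sketch below combines per-read base probability distributions at a single aligned contig position into one consensus distribution, then reduces the resulting error probability to a small integer for storage. It is not the proposed method: it assumes that reads are statistically independent, that the prior over bases is uniform, and that a log-scaled integer is an acceptable compact encoding. The names consensus_column, call_and_error, and compact_quality are hypothetical and are not part of X/Gene.

```python
import math

BASES = "ACGT"


def consensus_column(read_dists, prior=None):
    """Combine per-read base distributions for one aligned column.

    Assumes the reads are independent, so the posterior is proportional
    to the prior times the product of the per-read probabilities.
    """
    if prior is None:
        # Assumption: uniform prior over the four bases.
        prior = {b: 1.0 / len(BASES) for b in BASES}
    # Accumulate in log space so deep coverage does not underflow.
    log_post = {b: math.log(prior[b]) for b in BASES}
    for dist in read_dists:
        for b in BASES:
            # Floor tiny probabilities to avoid log(0).
            log_post[b] += math.log(max(dist[b], 1e-12))
    # Subtract the peak before exponentiating, then normalize.
    peak = max(log_post.values())
    unnorm = {b: math.exp(lp - peak) for b, lp in log_post.items()}
    total = sum(unnorm.values())
    return {b: v / total for b, v in unnorm.items()}


def call_and_error(posterior):
    """Return the most probable base and the probability it is wrong."""
    base = max(posterior, key=posterior.get)
    return base, 1.0 - posterior[base]


def compact_quality(p_error, max_q=99):
    """Encode an error probability as a capped integer, -10*log10(p)."""
    if p_error <= 0.0:
        return max_q
    return min(max_q, int(round(-10.0 * math.log10(p_error))))


if __name__ == "__main__":
    # Two hypothetical reads that both weakly favor A at this position.
    reads = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    ]
    posterior = consensus_column(reads)
    base, p_err = call_and_error(posterior)
    print(base, round(p_err, 4), compact_quality(p_err))
```

Under these assumptions, two reads that each assign a probability of 0.7 to the same base yield a consensus more confident than either read alone. A real assembler would also have to account for alignment uncertainty and for errors that are correlated across reads, which this sketch ignores.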