The Human Genome Project is moving into its final phase, in which the genome sequence will be determined in large-scale efforts in a number of laboratories. Current technology appears largely adequate to the task, but it will be essential to reduce as much as possible the need for skilled human labor, which remains a bottleneck to increased throughput, a potential source of uneven sequence quality, and an obstacle to more widespread participation by the community. A major aspect of sequencing that currently requires skilled labor is human review and manipulation of data, particularly editing (revision of errors in assembly and base calls), assessment of data quality, and decisions regarding data collection. The investigators' goal is to reduce, and eventually eliminate, such human involvement in data processing while maintaining a high level of accuracy in the final sequence. They will do this by improving the accuracy of assembly and base-calling, and by developing objective criteria for estimating this accuracy, so as to delineate more precisely those regions of the sequence that may still require human review. In particular, they will develop base-specific error probabilities as a criterion to guide data collection and to measure the quality of the final sequence. These advances will be implemented in the base-calling and assembly programs phred and phrap, which are freely distributed to academic researchers and are already in use in a number of sequencing laboratories.
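As an illustration of how base-specific error probabilities could guide review, the sketch below converts per-base quality values to error probabilities and reports the stretches that exceed a chosen error threshold. It assumes the logarithmic quality scale commonly associated with phred, Q = -10 log10(P). The function names, the example quality values, and the 1-in-1000 threshold are hypothetical choices made for this example and are not taken from phred or phrap themselves.

```python
import math


def error_probability(quality):
    """Convert a quality value Q to an error probability, assuming P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def quality_from_error(p_error):
    """Inverse conversion: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def expected_errors(qualities):
    """Expected number of erroneous bases: the sum of per-base error probabilities."""
    return sum(error_probability(q) for q in qualities)


def flag_for_review(qualities, max_error=1e-3):
    """Return half-open (start, end) index ranges whose per-base error
    probability exceeds max_error (a hypothetical threshold)."""
    regions = []
    start = None
    for i, q in enumerate(qualities):
        if error_probability(q) > max_error:
            if start is None:
                start = i  # open a new low-quality region
        elif start is not None:
            regions.append((start, i))  # close the current region
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions


if __name__ == "__main__":
    # Hypothetical per-base quality values for a short stretch of sequence.
    qualities = [40, 40, 20, 15, 40, 10]
    print(flag_for_review(qualities))
    print(expected_errors(qualities))
```

Under these assumptions, only the flagged index ranges would need human review or additional data collection, and the expected error count gives a single summary of the quality of the stretch.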
Gordon, D; Desmarais, C; Green, P (2001) Automated finishing with autofinish. Genome Res 11:614-25
Garg, K; Green, P; Nickerson, D A (1999) Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Res 9:1087-92
Green, E D; Idol, J R; Mohr-Tidwell, R M et al. (1994) Integration of physical, genetic and cytogenetic maps of human chromosome 7: isolation and analysis of yeast artificial chromosome clones for 117 mapped genetic markers. Hum Mol Genet 3:489-501