The Human Genome Project spurred the development of high-throughput technologies, especially in the area of DNA sequencing. Not only has this effort produced a draft of the human genome, it has also catalyzed the development of an entire industry based on DNA sequencing and genomics. Because these technologies produce enormous amounts of data, they depend on bioinformatics programs for data management. Phrap, Cross_Match, RepeatMasker, and Consed are four programs that played an integral role in the Human Genome Project and became accepted as standards. However, as sequencing technology has evolved, so too have its applications. These new applications include sequencing additional genomes, EST cluster analysis, and genotyping, and they have highlighted the need to update the standard bioinformatics programs to meet the current needs of a broader community. In this project we will re-engineer Phrap, Cross_Match, and RepeatMasker to improve performance by optimizing their algorithms and by developing a hierarchical data file to store and manipulate assembled sequence data. Phrap and Cross_Match will also be modified to use XML-formatted data, allowing users to apply constraints to sequence assembly. Lastly, we will develop a new program to review, edit, and manipulate sequences, thus giving users unprecedented control over their data.
Phrap is widely used in industry and academia for applications involving DNA sequences. Over 100 commercial sites would benefit from new versions of Phrap that support incremental assemblies and make better use of computing resources. An API for Phrap will encourage application development, creating additional commercialization possibilities for algorithm and application developers.