The tremendous amount of sequence data made available in recent years has increased the need to re-engineer existing bioinformatics algorithms for better performance. Our ability to organize human and mouse genomic, cDNA, and EST (expressed sequence tag) data, rapidly assemble microbial genomes, and compare sequences within and between organisms depends on programs that can operate on large amounts of data and be easily incorporated into scientific applications. In the case of the popular assembly program Phrap (P. Green, unpublished), the needed performance improvements include incremental assembly, in which new sequence data are added to already assembled sequences; better memory management to accommodate larger data sets; and parallel execution of the algorithm to reduce assembly times. A further improvement is an API (Application Programming Interface) that allows Phrap to be better incorporated into bioinformatics applications. In this project we will develop a prototype of Phrap that performs incremental assemblies and has improved memory management. New versions of Phrap will be structured to run as parallel processes. Finally, we will develop specifications for an API and an XML DTD (Extensible Markup Language Document Type Definition) that will allow Phrap to be incorporated more efficiently into bioinformatics applications.
Phrap is widely used in industry and academia for applications involving DNA sequences. More than 100 commercial sites would benefit from new versions of Phrap that support incremental assemblies and make better use of computing resources. An API for Phrap will encourage application development, creating additional commercialization possibilities for algorithm and application developers.