This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator. A basic operation on biological sequence databases is to locate regions homologous to a given query sequence using pairwise alignments. Unfortunately, the dynamic programming algorithm used for sequence alignment is computationally expensive, making it prohibitively slow for today's rapidly growing sequence databases. Existing alignment tools, such as FASTA and BLAST, though fast in locating candidate homologous regions, sacrifice sensitivity for efficiency: they may miss some true homologous regions in database sequences. In this project, we will develop novel indexing algorithms for large biological databases that support efficient pairwise sequence alignments with high sensitivity. Specifically, we will incorporate widely used substitution matrices, such as PAM and BLOSUM, into the construction algorithms of the NSP-tree (an index structure designed for sequence data) so that sequences with evolutionarily related letters are grouped together in the structure of the NSP-tree. As a result, indexed sequence groups with unrelated letters will obtain a low score when aligned to a given query sequence and will be promptly pruned. By enhancing the pruning power of the NSP-tree, we expect that the new index-based approach will provide high sensitivity while maintaining efficiency comparable to, or even higher than, that of existing pairwise alignment tools.
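To illustrate why dynamic programming alignment is expensive, the following minimal sketch computes a Smith-Waterman local alignment score. It is not the project's query algorithm: the function names, the match/mismatch scoring function, and the linear gap penalty are assumptions chosen for brevity, where a real tool would look scores up in a PAM or BLOSUM matrix.

```python
def simple_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def local_alignment_score(query, target, score=simple_score, gap=-2):
    """Return the best Smith-Waterman local alignment score.

    Uses a linear gap penalty (an assumption made for this sketch) and
    keeps only two rows of the dynamic programming table in memory.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + score(q, t),  # align q with t
                prev[j] + gap,              # gap in the target
                curr[j - 1] + gap,          # gap in the query
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Every cell of the table is visited once, so the cost grows with the product of the query length and the target length. Repeating this for each database sequence is what an index with effective pruning is meant to avoid.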
The project will be conducted in four steps: 1) developing a new dynamic programming query algorithm to handle alignments between a query sequence and the sequence groups indexed in the tree; 2) analyzing, based on the substitution matrices, functionally conservative letters in biological sequences, and creating a clustering tree that hierarchically organizes the letters by their evolutionary closeness; 3) designing new heuristics that incorporate the clustering tree of letters into the construction algorithms of the NSP-tree; and 4) conducting experimental studies on the performance of the new heuristics and comparing the performance of the NSP-tree with that of existing tools.
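Step 2 can be pictured with a small sketch that builds a clustering tree of letters from pairwise substitution scores. The abstract does not say which clustering method will be used; average-linkage agglomerative clustering is assumed here, and the scores in `TOY_SCORES` are illustrative values, not entries from a real PAM or BLOSUM matrix. All names are hypothetical.

```python
from itertools import combinations

# Illustrative pairwise scores for five amino acids.
# These are made-up values, NOT real PAM/BLOSUM entries.
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("D", "E"): 2,
    ("I", "D"): -3, ("I", "E"): -3, ("L", "D"): -4, ("L", "E"): -3,
    ("V", "D"): -3, ("V", "E"): -2,
}


def similarity(a, b):
    """Look up a symmetric score for two letters."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a)))


def leaves(node):
    """List the letters under a tree node (a letter or a pair of nodes)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def average_similarity(x, y):
    """Average-linkage similarity between two clusters."""
    pairs = [(a, b) for a in leaves(x) for b in leaves(y)]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def build_letter_tree(letters):
    """Repeatedly merge the two most similar clusters into a binary tree."""
    clusters = list(letters)
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append((x, y))
    return clusters[0]


tree = build_letter_tree("ILVDE")
```

With these toy scores the hydrophobic letters I, L, and V end up in one subtree and the acidic letters D and E in another, which is the kind of grouping the construction heuristics in step 3 could exploit when deciding which sequences to place together.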