This subproject is one of many research subprojects utilizing theresources provided by a Center grant funded by NIH/NCRR. The subproject andinvestigator (PI) may have received primary funding from another NIH source,and thus could be represented in other CRISP entries. The institution listed isfor the Center, which is not necessarily the institution for the investigator.As the volume of biological data increases rapidly, index-based approaches to searching the data becomes more favorable than sequential-scan-based approaches. This subproject investigates the application of the ND-tree, a multidimensional structure specifically designed to index substrings/q-grams with discrete and non-ordered components typical of bioinformatics data, to bioinformatics queries.
The aim of this subproject is to extend the ND-tree to support the edit distance, a widely-used similarity measure for homologous region queries. The goal of the extension is to enhance the sensitivity in the filtering stage of a bioinformatics database query. To incorporate the edit distance which employs the extra insertion and deletion operations than the Hamming distance, the ND-tree must support efficient similarity queries with a relatively large search range. In the first phase of this project, we will design and evaluate novel algorithms that efficiently process queries with relatively large search ranges in the ND-tree. We plan to investigate approximation-based techniques that can improve query performance by pruning a large amount of less-promising index branches. In the second phase, a query algorithm based on the edit distance will be developed. To further enhance the performance, the construction and bulk-loading algorithms of the ND-tree will also be examined and adapted so that the data organization within the index becomes more suitable for edit distance queries. To evaluate the effectiveness of the new algorithms, we will experimentally compare them with existing algorithms. The project will lead to the design of a novel bioinformatics search engine based on the ND-tree.
Showing the most recent 10 out of 165 publications