Searching for a string in a digital library is a fundamental operation. One may seek a phrase in a file that is being manipulated by a text editor, a sentence on the web via a search engine, or a genome in the human DNA sequence. This basic tool has been well researched and, indeed, is ubiquitous in data processing.

However, string matching and motif discovery algorithms ("Stringology") that have dealt with searches in general databases, have always assumed that the sought string is ordered. The research thrusts were in directions such as exact matching, approximate matching, or grappling with the challenge of coping with errors in the data. But always the order of the data was assumed to be iron-clad.

Nevertheless, some non-conforming problems have been gnawing at the walls of this assumption. Some of the areas where applications assume erroneous order of the data are: Text Editing, where errors such as swaps or transpositions, assume that the data has not been changed -- it has rather been rearranged; Computational Biology, where reversals or transpositions of genome subsequences are part of the evolutionary process; and Linguistics, where the task of lexical categorization is aided by considering sets of parts-of-speech in various orders.

The above application areas suggest that the wonderfully ordered world of pattern matching may be too rigid to handle new challenges. This project introduces a fundamentally new model of string matching, where the order of symbols in the input pattern may be perturbed, but the content remains unchanged. This model studies unordered pattern matching universes in order to supply the above mentioned application areas with appropriate computational tools. A theory of unordered matching will give a general framework for all the above problems.

This project considers permuted Stringology as a line leading from fully ordered data all the way to data with no order at all. The ordered side has been historically amply researched. The specific technical goals of the project are:

1. To develop a theory and framework for permuted Stringology. 2. To identify the types of unordered problems that are more difficult than ordered Stringology, the conditions that make them harder, and the reason why. 3. To define the term ``Similarity'' in unordered clusters. 4. To define the world of ``almost ordered sequences'', and to compare it with unordered and ordered sequences. 5. To design tools that will be key elements in the solution for the proposed problems, and will be shared by more than one algorithm. The list of problems includes Indexing, Dictionary and Approximations.

This project's aim is to study the fundamental problems arising from a model of pattern matching with rearrangement. The immediate benefit to the field is a totally new, yet very basic, research direction that seems to defy the state-of-the-art toolkit. The historical tools of pattern matching, such as dynamic programming, FFT, sub-word trees, renaming, encodings, and embeddings do not seem suitable to handle these problems. New algorithmic tools and data structures are required. In fact, this direction already bore unexpected interesting fruits -- the solution of an open problem in graph theory posed by the mathematician Cayley in 1849! Techniques such as non-standard convolutions, group testing, and graph theoretic algorithms, have been necessary to solve some of the problems thus far.

Since this project defines a new model, there are many directions to explore. It is expected that this project will mark the beginning of intensive research in a new paradigm.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
0904246
Program Officer
Balasubramanian Kalyanasundaram
Project Start
Project End
Budget Start
2009-10-01
Budget End
2014-09-30
Support Year
Fiscal Year
2009
Total Cost
$295,912
Indirect Cost
Name
Polytechnic University of New York
Department
Type
DUNS #
City
Brooklyn
State
NY
Country
United States
Zip Code
11201