The need for more powerful and efficient query languages to find complex patterns in stored sequences and data streams is shared by a wide spectrum of applications, including software analysis, complex event processing, identification of RNA structures, temporal databases and XML queries. The goal of this project is to develop a unified framework to support very powerful pattern languages and their query optimization techniques for different application domains. To achieve this goal, the project follows the approach of using (i) nested Kleene-closure (K*) constructs to achieve greater levels of expressive power for the query languages, and (ii) Nested Words and Visibly Pushdown Automata as the basis for their unified implementation and query optimization over different application domains. This K*-based approach was previously applied successfully to relational sequences and are now generalized to different computing environments and application domains. Through the unified framework, the project designs and demonstrates XML and temporal query languages that compare favorably in terms of expressive power and performance with existing ones. It then demonstrates the use of the unified framework in new application areas. In particular, it develops efficient query languages for RNA structures and software analysis. These research results will have great impacts on many applications, such as software analysis, genomic databases, complex event processing, digital government and scientific studies. This project supports Ph.D. students to pursue research in the areas of advanced query languages and data stream management systems. A new graduate-level course covering these areas and integrating the research results from the project are introduced into the curriculum. Publications, technical reports, software and experimental data from this research are available via the project web site at: http://yellowstone.cs.ucla.edu/nsf-projects/RegExpr2NestedWords.html.
In the original relational data model, the database is viewed as an unordered collection of records, whereby no ordering of input data could be specified when expressing search and query conditions. This simple model ensures better physical data independence and has served well both database users and database management for a very many years, since it entails (i) the expression of powerful queries in high-level set-oriented query languages and (ii) the optimization of such queries into efficient execution plans using the rich repertoire of sophisticated query optimization techniques that is now incorporated in all commercial Data Base Management Systems (DBMS). However, as the use of big databasesbecame more widespread and diversified, it also became clear that more powerful and sophisticated queries are needed, and the original framework must be generalized in ways that maximize benefits while minimizing the complications brought in by such extensions. Now, the ability of treating the input data as ordered sequences of records, rather than as unordered sets, represents a simple but fundamental extension that is critical for many new applications. For instance, recent SQL standards use this generalization to support OLAP analytical functions, which proved essential in decision support. Another important extension was introduced more than a decade ago by the PI and his students who showed that (i) powerful queries on ordered sequences can be expressed in SQL enriched with simple Kleene-closure (K*) constructs, and (ii) powerful query optimization techniques based on extension of the Knuth-Morris-Pratt sequence search algorithm can be used to optimize the execution of K* queries. These ideas are having a significant impact on DBMS and vendors are adding K* constructs into their systems, and into the SQL standards. In the meantime, this PI and his student have been working on important generalizations and enrichment of K* constructs, that are needed to support (a) nested K* expression in queries, and (b) their use to find patterns in semi-structured data sequences of nested structure, including XML documents, software code, and RNA sequences. This has led to (a) the design of powerful nested K* constructs, (b) their efficient support via novel query optimization technique, and (c) their use in advanced applications for (i) searching XML documents, (ii) software analysis and (iii) identification of complex RNA structures. The optimization techniques devised for nested K* queries are of particular technical significance since they are based on the nested world model of finite automata and the use of visibly pushdown automata, which extend the expressive power of finite automata while preserving their desirable computational properties. The advances, achieved in the course of this NSF project, have led to several publications and the development of our Xseq system, which enabled us to demonstrate the high level of performance achieved by the new technology and its effectiveness over a wide range of applications that include software analysis and the search for RNA sequences among many others.