Computerized DNA sequence analysis is currently accomplished through the use of a large set of tools, ranging from generic regular expression search algorithms for pattern-matching on large sequence databases, to specialized similarity algorithms for discovering longer sets of sequences with potential evolutionary relatedness, to sophisticated ad hoc programs for search and analysis based on higher-order properties of DNA sequences. The proposed work would attempt to consolidate the wide range of approaches to such activities by treating the genome as a language, bringing to bear the tools of computational linguistics to establish a formal basis for describing genetic information. This will be done using the formalism of logic grammars (or Definite Clause Grammars), an extensible, Prolog-based system for specifying languages of greater than context-free power. This will extend DNA search capabilities well beyond the known limitations of current regular expression search programs, and should in addition subsume many specialized programs, because of the increased linguistic power available. The unified conceptual framework offered by such a system would provide a clear, hierarchical presentation of varying levels of abstraction on the genome, presenting the opportunity for (1) specifying searches for more sophisticated genetic elements over large sequence databases (such as those likely to be produced by the Human Genome Sequencing Project); (2) an interactive system for adjusting definitions of such elements to account for data; and (3) the foundation for an experiment-planning system based on a procedural interpretation of the declarative grammar. To achieve these goals, it will be necessary to address issues of computational efficiency and systematic extensions in linguistic power, making use of current approaches to parsing and natural language processing.
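
To illustrate the kind of element that lies beyond regular expression search, the following is a minimal sketch of a grammar-style recognizer for a stem-loop (a stem of complementary base pairs enclosing a short loop). It is written in Python rather than Prolog and is not the proposed system: the rule names `stem_loop` and `loop`, the loop length bounds, and the `min_pairs` threshold are all assumptions chosen for illustration. Each function plays the role of a grammar rule that consumes input from a position and yields the positions where a match could end, loosely mirroring how a Definite Clause Grammar threads the remaining input through its rules; the base bound at the start of the rule is carried forward to constrain its partner at the end, much as an argument would be in a logic grammar.

```python
# Hypothetical illustration only: a recursive, grammar-like recognizer for
# DNA stem-loops. Names and thresholds are assumptions, not from the abstract.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def loop(seq, pos, min_len=3, max_len=5):
    """Rule: loop --> between min_len and max_len arbitrary bases.

    Yields every position at which the loop could end.
    """
    for length in range(min_len, max_len + 1):
        if pos + length <= len(seq):
            yield pos + length


def stem_loop(seq, pos):
    """Rule: stem_loop --> base(X), (stem_loop | loop), complement(X).

    Yields (end, pairs): the position just past the match and the number
    of base pairs in the stem.
    """
    if pos >= len(seq) or seq[pos] not in COMPLEMENT:
        return
    partner = COMPLEMENT[seq[pos]]
    # Alternative 1: the stem continues inward (the recursive rule).
    for end, pairs in stem_loop(seq, pos + 1):
        if end < len(seq) and seq[end] == partner:
            yield end + 1, pairs + 1
    # Alternative 2: this is the innermost pair, directly enclosing the loop.
    for end in loop(seq, pos + 1):
        if end < len(seq) and seq[end] == partner:
            yield end + 1, 1


def find_stem_loops(seq, min_pairs=3):
    """Return (start, end, pairs) for every stem-loop with enough pairs."""
    hits = []
    for start in range(len(seq)):
        for end, pairs in stem_loop(seq, start):
            if pairs >= min_pairs:
                hits.append((start, end, pairs))
    return hits


if __name__ == "__main__":
    # Stem GCC, loop AAAA, then GGC (the reverse complement of GCC).
    print(find_stem_loops("GCCAAAAGGC"))  # [(0, 10, 3)]
```

Because the stem may be arbitrarily long and each base must pair with its partner on the far side of the loop, no single regular expression describes this element in general. The sketch recurses once per base, so it suits only short sequences; a practical system would need the efficiency work the abstract mentions.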

Agency
National Institutes of Health (NIH)
Institute
National Center for Research Resources (NCRR)
Type
Small Research Grants (R03)
Project #
1R03RR004522-01
Application #
3431546
Study Section
Biotechnology Resources Review Committee (BRC)
Project Start
1988-07-15
Project End
1989-07-14
Budget Start
1988-07-15
Budget End
1989-07-14
Support Year
1
Fiscal Year
1988
Total Cost
Indirect Cost
Name
Unisys
Department
Type
DUNS #
City
Paoli
State
PA
Country
United States
Zip Code
19301