Computational Linguistic Analysis of Genetic Information

Searls, David

Abstract

Computerized DNA sequence analysis is currently accomplished through the use of a large set of tools, ranging from generic regular expression search algorithms for pattern-matching on large sequence databases, to specialized similarity algorithms for discovering longer sets of sequences with potential evolutionary relatedness, to sophisticated ad hoc programs for search and analysis based on higher-order properties of DNA sequences. The proposed work would attempt to consolidate the wide range of approaches to such activities by treating the genome as a language, bringing to bear the tools of computational linguistics to establish a formal basis for describing genetic information. This will be done using the formalism of logic grammars (or Definite Clause Grammars), an extensible, Prolog-based system for specifying languages of greater than context-free power. This will extend DNA search capabilities well beyond the known limitations of current regular expression search programs, and should in addition subsume many specialized programs, because of the increased linguistic power available. The unified conceptual framework of such a system would give a clear, hierarchical presentation of varying levels of abstraction on the genome, presenting the opportunity for (1) specifying searches for more sophisticated genetic elements over large sequence databases (such as those likely to be produced by the Human Genome Sequencing Project); (2) an interactive system for adjusting definitions of such elements to account for data; and (3) the foundation for an experiment-planning system based on a procedural interpretation of the declarative grammar. To achieve these goals, it will be necessary to address issues of computational efficiency and systematic extensions in linguistic power, making use of current approaches to parsing and natural language processing.
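
As an illustration of the limitation of regular expression search mentioned in the abstract, consider a stem-loop (hairpin): a run of bases, a short loop, and then the reverse complement of the first run. Matching stems of arbitrary length requires nested, paired dependencies, which a regular expression cannot express but a grammar rule of the form "stem --> X, stem, complement(X)" can. The sketch below is not taken from the proposed system (which is Prolog-based); it is a minimal Python recognizer written under assumptions made only for this example: the function name is_stem_loop, the parameters min_stem and min_loop, and the restriction to the four bases a, c, g, t are all hypothetical.

```python
# Hypothetical illustration only; names and thresholds are assumptions,
# not part of the system described in the abstract.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def is_stem_loop(seq, min_stem=3, min_loop=3):
    """Return True if seq reads as: stem + loop + reverse complement of stem."""
    seq = seq.lower()
    i, j = 0, len(seq) - 1
    stem = 0
    # Peel complementary pairs from the outside in, mirroring the
    # recursive rule  stem --> X, stem, complement(X).
    # Stop early enough that at least min_loop bases remain for the loop.
    while (
        j - i + 1 >= min_loop + 2
        and seq[i] in COMPLEMENT
        and COMPLEMENT[seq[i]] == seq[j]
    ):
        stem += 1
        i += 1
        j -= 1
    loop = seq[i : j + 1]
    # The unpaired middle must consist of valid bases and be long enough.
    loop_ok = len(loop) >= min_loop and all(b in COMPLEMENT for b in loop)
    return stem >= min_stem and loop_ok


if __name__ == "__main__":
    print(is_stem_loop("gattcaaaagaatc"))  # stem gattc, loop aaaa
    print(is_stem_loop("gattcaaaagattc"))  # right arm is not the reverse complement
```

The first call recognizes a five-base stem around a four-base loop; the second fails because pairing breaks after two bases. A grammar-based system would state the same constraint declaratively rather than as a hand-written loop.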

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Small Research Grants (R03)
Project #: 1R03RR004522-01
Application #: 3431546
Study Section: Biotechnology Resources Review Committee (BRC)

Project Start: 1988-07-15
Project End: 1989-07-14
Budget Start: 1988-07-15
Budget End: 1989-07-14
Support Year: 1
Fiscal Year: 1988
Total Cost
Indirect Cost

Principal Investigator: Searls, David B.
Institution: Unisys, Paoli, PA, United States
