New Analysis Tools for Gene Function Identification

Smith, Randall

Abstract

New algorithms and computer software tools will be developed to aid in identifying the function of newly generated sequences. This work will have important practical applications for human and model organism genome sequencing projects. Significant insights into the potential function of newly generated sequences of unknown biological function (e.g., anonymous cDNAs) can be obtained if similarity to sequences of known function can be detected. Current sequence database search programs can fail to detect similarity between distantly related sequences in cases where functional domains contain a few key residues that are dispersed along the primary sequence (e.g., "zinc-finger" DNA binding domains). This is because, in the generation of alignment scores, mismatches at non-conserved residues can easily outweigh matches at the few key sites. To overcome this problem, we propose to develop new pattern construction and search methodologies that identify and utilize only conserved residues and domains in sequence similarity searches. First, techniques to identify conserved regions within protein sequences will be used to construct a new type of sequence database in which only the conserved regions are represented in each sequence. This database should significantly improve the ability to detect distantly related sequences by reducing the number of spurious, but statistically significant, matches to unrelated sequences during a database search. Second, methods will be developed to exploit information on 1) sequence family relationships and 2) the positions of conserved domains within related sequences in sequence database searches. These new tools will aid in distinguishing weak matches to distantly related sequences from statistically significant matches to unrelated sequences in database searches.
Third, new pattern libraries will be constructed from sequence and sequence similarity data available in the Entrez: Sequences database, produced by the National Center for Biotechnology Information (NCBI). This will allow functional information in the covering pattern databases to be directly cross-referenced to sequence and sequence annotation information in the Entrez database, providing value-added benefits for both databases. Fourth, the high-speed database search tool BLAST will be adapted for pattern database searches. This will provide a fast and sensitive search tool for identifying the function of newly generated sequences. Fifth, the use of concave gap penalties and suboptimal alignments will be incorporated into our Pattern-Induced Multi-sequence Alignment (PIMA) algorithm. These extensions will significantly enhance the quality of the patterns and multiple sequence alignments generated by PIMA. These new analysis tools should prove invaluable to genome scientists and molecular biologists as they isolate genes and proteins of unknown biological function.
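
The scoring problem described in the abstract can be illustrated with a minimal sketch. It is not the project's software: the sequences, the simplified zinc-finger-like regular expression, the +1/-1 scoring scheme, and the function names are all assumptions made up for this example. It only shows how mismatches at non-conserved positions can swamp matches at a few key residues, and how restricting the comparison to conserved positions avoids that.

```python
import re

# Toy sequences invented for illustration (not taken from any database).
# seq_a and seq_b share only the dispersed key residues of a
# zinc-finger-like motif; seq_c is an unrelated sequence.
seq_a = "YKCPECGKSFSQKSNLQKHQRTH"
seq_b = "FACDWCDRRFARLDEMTRHIKIH"
seq_c = "MKTLLVAGGSDERLPQWNYTIVA"

# Simplified, hypothetical pattern: two Cys and two His at loosely
# constrained spacings. Real pattern libraries are more specific.
ZINC_FINGER_LIKE = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

# Assumed positions (0-based) of the conserved Cys/His residues.
CONSERVED_POSITIONS = (2, 5, 18, 22)


def ungapped_score(a, b, match=1, mismatch=-1):
    """Score every aligned position of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def conserved_score(a, b, positions, match=1, mismatch=-1):
    """Score only the positions assumed to be conserved."""
    return sum(match if a[i] == b[i] else mismatch for i in positions)


if __name__ == "__main__":
    print("full-length score:", ungapped_score(seq_a, seq_b))
    print("conserved-only score:",
          conserved_score(seq_a, seq_b, CONSERVED_POSITIONS))
    for name, seq in (("seq_b", seq_b), ("seq_c", seq_c)):
        hit = ZINC_FINGER_LIKE.search(seq) is not None
        print(name, "matches pattern:", hit)
```

With these toy inputs the full-length score is negative even though all four key residues agree, while the conserved-only score and the pattern match both flag seq_b as related and leave seq_c unmatched.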

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG000973-02
Application #: 2209206
Study Section: Genome Study Section (GNM)

Project Start: 1993-12-10
Project End: 1996-11-30
Budget Start: 1994-12-01
Budget End: 1995-11-30
Support Year: 2
Fiscal Year: 1995
Total Cost
Indirect Cost

Institution

Name: Baylor College of Medicine
Department: Genetics
Type: Schools of Medicine
DUNS #: 074615394

City: Houston
State: TX
Country: United States
Zip Code: 77030

Related projects


NIH 1996 R01 HG	New Analysis Tools for Gene Function Identification Nelson, David Loren / Baylor College of Medicine
NIH 1995 R01 HG	New Analysis Tools for Gene Function Identification Smith, Randall F. / Baylor College of Medicine
NIH 1994 R01 HG	New Analytic Tools for Identification of Gene Function Smith, Randall F. / Baylor College of Medicine

Publications

Worley, K C; Culpepper, P; Wiese, B A et al. (1998) BEAUTY-X: enhanced BLAST searches for DNA queries. Bioinformatics 14:890-1

Ladunga, I; Smith, R F (1997) Amino acid substitutions preserve protein folding by conserving steric and hydrophobicity properties. Protein Eng 10:187-96

Ladunga, I; Wiese, B A; Smith, R F (1996) FASTA-SWAP and FASTA-PAT: pattern database searches using combinations of aligned amino acids, and a novel scoring theory. J Mol Biol 259:840-54

Smith, R F (1996) Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res 6:653-60

Smith, R F; Wiese, B A; Wojzynski, M K et al. (1996) BCM Search Launcher--an integrated interface to molecular biology data base search and analysis services available on the World Wide Web. Genome Res 6:454-62

Worley, K C; King, K Y; Chua, S et al. (1995) Identification of new members of a carbohydrate kinase-encoding gene family. J Comput Biol 2:451-8

Worley, K C; Wiese, B A; Smith, R F (1995) BEAUTY: an enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. Genome Res 5:173-84

Smith, R F; King, K Y (1995) Identification of a eukaryotic-like protein kinase gene in Archaebacteria. Protein Sci 4:126-9

Korber, B T; MacInnes, K; Smith, R F et al. (1994) Mutational trends in V3 loop protein sequences observed in different genetic lineages of human immunodeficiency virus type 1. J Virol 68:6730-44
