The long-term objective of this application is to improve our ability to analyze DNA sequences, with the eventual goal of applying this ability to analysis of the human genome. The human genome consists essentially of a sequence of four variables, the nucleotides A, C, G and T. The information that will eventually emerge from the human genome project will be sequences totaling about 3 billion of these variables. The task of deciphering such a mass of information seems impossible, but this same information can produce a human being, and the sequences are the product of between 3 and 4 billion years of evolution. It therefore seems that studies of molecular evolution must be included in analyzing the human genome. This should involve comparisons of the DNA sequences of different organisms and studies of their evolutionary relationships, both of which we propose to carry out.

The health-relatedness of the human genome project has been described repeatedly, especially with reference to hereditary diseases caused by mutations, which are often recessive in their expression, and to mutations in oncogenes. These mutations occur in DNA and are translated by the genetic code. The application is relevant on a general basis to these effects. Studies of human genes are almost inherently related to health.

Computer methods will be used to examine the DNA sequences of genes, especially human genes. The human genome contains regions of high and low GC content. We are studying the relation of the GC content of such regions to the composition of the genes in the regions. The human alpha-hemoglobin gene, on chromosome 16, has 90% GC in silent nucleotide positions and 51% GC in replacement nucleotide positions, while the corresponding values for the beta-hemoglobin gene, on chromosome 11, are 68% and 50%. We plan to study the relation of the GC content of DNA to the GC content of exons, introns and intergenic regions.
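The silent versus replacement GC comparison described above can be illustrated with a minimal Python sketch. It rests on a simplifying assumption that is not stated in the abstract: third codon positions stand in for silent positions, and first and second codon positions are pooled as replacement positions. The function name `silent_and_replacement_gc` is hypothetical, and this is not the applicants' own program.

```python
def silent_and_replacement_gc(cds: str) -> tuple[float, float]:
    """Return (silent GC fraction, replacement GC fraction) for a coding sequence.

    Assumption for illustration only: codon position 3 approximates silent
    positions; codon positions 1 and 2 approximate replacement positions.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    silent_gc = silent_total = 0
    repl_gc = repl_total = 0
    for i in range(usable):
        base = cds[i]
        if base not in "ACGT":  # skip ambiguous symbols such as N
            continue
        is_gc = base in "GC"
        if i % 3 == 2:  # third codon position
            silent_total += 1
            silent_gc += is_gc
        else:  # first or second codon position
            repl_total += 1
            repl_gc += is_gc
    silent = silent_gc / silent_total if silent_total else 0.0
    replacement = repl_gc / repl_total if repl_total else 0.0
    return silent, replacement


# Toy sequence, not a real gene: codons ATG GCC GCG
silent, replacement = silent_and_replacement_gc("ATGGCCGCG")
print(f"silent GC = {silent:.2f}, replacement GC = {replacement:.2f}")
```

A stricter treatment would classify each site by whether a substitution changes the encoded amino acid, instead of relying on codon position alone.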
We plan to study the amino acid composition of proteins to obtain further information on protein function. We shall obtain nucleic acid sequences from GenBank through the BIONET resource, and from any other sources that become available. As a start, we shall use codon tables in our possession for 1,737 protein genes from eukaryotes and bacteria.
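As a rough sketch of how a codon table relates to amino acid composition, the Python example below counts codons in a coding sequence and translates the counts with the standard genetic code. The function names `codon_table` and `amino_acid_composition` are hypothetical, and the format of the applicants' 1,737 codon tables is not given in the abstract, so a plain codon-to-count mapping is assumed here.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_table(cds: str) -> Counter:
    """Count each complete, unambiguous codon in a coding sequence."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return Counter(c for c in codons if c in STANDARD_CODE)


def amino_acid_composition(table: Counter) -> dict[str, float]:
    """Convert codon counts to amino acid fractions, excluding stop codons."""
    aa_counts = Counter()
    for codon, n in table.items():
        aa = STANDARD_CODE[codon]
        if aa != "*":  # '*' marks a stop codon
            aa_counts[aa] += n
    total = sum(aa_counts.values())
    return {aa: n / total for aa, n in aa_counts.items()} if total else {}


# Toy sequence, not a real gene: codons ATG GCC GCG TAA
table = codon_table("ATGGCCGCGTAA")
print(dict(table))
print(amino_acid_composition(table))
```

Working from codon counts instead of raw sequences means the same composition calculation can be applied to any precomputed codon table.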

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 8R01HG000312-02
Application #: 3333386
Study Section: Special Emphasis Panel (SSS (A))
Project Start: 1989-07-01
Project End: 1992-06-30
Budget Start: 1990-07-01
Budget End: 1991-06-30
Support Year: 2
Fiscal Year: 1990
Total Cost:
Indirect Cost:

Name: Lawrence Berkeley National Laboratory
Department:
Type: Organized Research Units
DUNS #: 078576738
City: Berkeley
State: CA
Country: United States
Zip Code: 94720

Collins, D W; Jukes, T H (1994) Rates of transition and transversion in coding sequences since the human-rodent divergence. Genomics 20:386-96
Collins, D W; Jukes, T H (1993) Relationship between G + C in silent sites of codons and amino acid composition of human proteins. J Mol Evol 36:201-13
Jukes, T H; Osawa, S (1993) Evolutionary changes in the genetic code. Comp Biochem Physiol B 106:489-94
Collins, D W (1993) FISH: a guide to protein-coding DNA sequences in the GenBank database. Comput Appl Biosci 9:337-42
Osawa, S; Jukes, T H; Watanabe, K et al. (1992) Recent evidence for evolution of the genetic code. Microbiol Rev 56:229-64
Collins, D W; Liu, C C; Jukes, T H (1992) Numerical classification of coding sequences. Nucleic Acids Res 20:1405-10