Coding & Noncoding Regions in DNA Sequences

Jukes, Thomas

Abstract

The long-term objective of this application is to improve our ability to analyze DNA sequences, with the eventual goal of applying that ability to the human genome. The human genome is essentially a sequence over four symbols, the nucleotides A, C, G and T, and the human genome project will eventually yield about 3 billion of them. Deciphering such a mass of information seems impossible, yet this same information can produce a human being, and the sequences are the product of between 3 and 4 billion years of evolution. Studies of molecular evolution should therefore be part of any analysis of the human genome. We propose to carry out such studies by comparing the DNA sequences of different organisms and examining their evolutionary relationships. The health-relatedness of the human genome project has been described repeatedly, especially with reference to hereditary diseases caused by mutations, which are often recessive in their expression, and to mutations in oncogenes. These mutations occur in DNA and are expressed through the genetic code, so the application is relevant to these effects on a general basis; studies of human genes are almost inherently related to health. Computer methods will be used to examine the DNA sequences of genes, especially human genes. The human genome contains regions of high and low GC content, and we are studying how the GC content of such regions relates to the composition of the genes within them. The human alpha-hemoglobin gene, on chromosome 16, has 90% GC in silent nucleotide positions and 51% GC in replacement nucleotide positions, while the corresponding values for the beta-hemoglobin gene, on chromosome 11, are 68% and 50%. We plan to study the relation of the GC content of DNA to the GC content of exons, introns and intergenic regions.
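The comparison of silent and replacement positions can be illustrated with a minimal sketch. It is not the method of this application: it assumes an in-frame coding sequence with no introns, and it uses the third codon position as a rough proxy for silent positions and the first and second positions as a proxy for replacement positions. The function name `gc_by_codon_position` is hypothetical.

```python
def gc_by_codon_position(cds):
    """Return the GC fraction at codon positions 1, 2 and 3.

    Assumes `cds` is an in-frame coding sequence (A, C, G, T only).
    Any trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # length covered by complete codons
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")

    gc_counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc_counts[i % 3] += 1      # i % 3 is the position within the codon

    return [count / n_codons for count in gc_counts]


# Toy sequence for illustration only; it is not taken from any real gene.
pos1, pos2, pos3 = gc_by_codon_position("ATGGCCGCG")
print(f"GC at positions 1+2 (replacement proxy): {(pos1 + pos2) / 2:.2f}")
print(f"GC at position 3 (silent proxy):         {pos3:.2f}")
```

Because some first-position changes are silent and some third-position changes alter the amino acid, a fuller treatment would classify each site by its degeneracy rather than by position alone.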
We plan to study the amino acid composition of proteins to obtain further information on protein function. We shall obtain nucleic acid sequences from GenBank through the BIONET resource, and from any other sources that become available. As a start, we shall use codon tables in our possession for 1,737 protein genes from eukaryotes and bacteria.
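As a sketch of how amino acid composition might be derived from a codon table, the following assumes that a table is a mapping from codon to observed count and that the standard genetic code applies. The format of the tables referred to above is not specified here, so both the input layout and the function name `amino_acid_composition` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks stops.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def amino_acid_composition(codon_counts):
    """Convert codon counts into amino acid fractions.

    `codon_counts` maps DNA codons (e.g. "GCT") to counts; this layout is
    an assumption. Stop codons are excluded from the totals.
    """
    totals = Counter()
    for codon, count in codon_counts.items():
        aa = STANDARD_CODE[codon.upper()]
        if aa != "*":
            totals[aa] += count

    n = sum(totals.values())
    if n == 0:
        raise ValueError("no sense codons in table")
    return {aa: count / n for aa, count in totals.items()}


# Made-up counts for illustration only.
print(amino_acid_composition({"GCT": 3, "AAA": 1, "TAA": 1}))
```

Organisms or organelles that use a variant genetic code would need a different lookup table.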