Methods are described to clone promoter regions and 5' exon sequences from the majority of expressed cellular genes. The strategy uses a retrovirus gene trap shuttle vector to selectively disrupt genes expressed in cultured cells. Regions adjacent to the integrated proviruses are cloned by plasmid rescue and sequenced. Because the proviruses are positioned in or near transcribed exons, it is only necessary to sequence 300 nucleotides upstream of the virus to identify the disrupted gene. The process will generate a database of expressed sequence tags, designated "promoter tagged sites" (PTSs). Presently, 7% of random PTSs match known genes in the nucleic acid databases, and the proportion of matching genes has doubled every 18-24 months for the past 8 years. PTS libraries will assist efforts to identify candidate genes responsible for disease phenotypes in both mice and humans. PTSs have several advantages over cDNA sequence tags (ESTs) for use in genome studies. (1) Gene representation among PTSs is more uniform. Thus, the vast majority of PTSs correspond to transcripts that each comprise less than 0.01% of random cDNA clones. (2) cDNA probes are unable to distinguish between pseudogenes and functional genes, whereas PTSs are linked to expressed genes. (3) PTSs frequently include 5' exon sequences that may be missing from cDNAs. (4) PTSs associated with different exons of the same gene will be valuable for characterizing large genes with many small exons and for verifying the integrity of cDNAs and ESTs. (5) The size of genomic DNA fragments cloned by plasmid rescue will facilitate mapping of candidate genes by fluorescence in situ hybridization (FISH) and identification of simple sequence repeats (SSRs) and single copy regions for use by other mapping strategies.
PTS libraries will be isolated from human cortical neuron and HepG2 hepatocellular carcinoma cell lines and analyzed to determine (i) the minimum number of cellular genes that can be disrupted by gene trap mutagenesis, (ii) the extent to which segments of expressed cellular genes can be recovered by plasmid rescue, and (iii) the randomness of retrovirus integration throughout the genome. Gene representation among PTSs will be compared to that of EST databases from similar sources. PTSs derived from expressed genes on human chromosome 19 will be isolated from hybrid cell lines containing chromosome 19 as their only human chromosome. The chromosome-specific PTSs will be mapped within cosmid contigs, greatly expanding the number of candidate genes on this chromosome.