The last few years have seen a dramatic increase in the number of publicly available complete genome sequences and annotations. At the same time, researchers have been taking advantage of technology developments that allow individual labs to efficiently perform experiments that generate tens of thousands of data points. This massive increase in data means that some lab projects are no longer tractable by individual biologists but, rather, require large-scale data analysis capabilities best handled by a computer programmer. The projects described here focus on developing methodologies to integrate sequence, annotation, and experimentally generated data so that bench biologists can quickly and easily obtain results for their large-scale experiments.

The goal of one such project is to take advantage of the publicly available set of sequences and annotations to develop automated tools for the computational characterization of experimentally identified genomic sequences. The first step in the process is to align each sequence to the reference genome assembly to determine its genomic location. Existing programs suffice for most sequences, but we have developed a novel set of algorithms to map short sequences of fewer than 25 nucleotides. These programs can map tens of thousands of sequences in only a few minutes, even allowing for mismatches. The second step of the process is to compare the coordinates of the sequences to the coordinates of a variety of genome annotations. Using this approach, we can assign putative functions to the experimentally identified sequences based on their proximity to known sequence features. To provide statistical rigor for the analysis, we have developed a pipeline that characterizes sequences picked at random from the genome in the same way, giving a baseline against which the experimental sequences can be compared.

We are applying the above methods to a number of research projects.
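As a rough illustration of the second step and the random-sequence control described above, the following Python sketch compares mapped positions to annotated features and draws random genomic positions for the same comparison. It is not the pipeline used in this work: the `Feature` record, the function names, the `window` parameter, and the toy coordinates are all assumptions made for the example.

```python
import random
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical annotation record; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    name: str


def features_near(chrom, pos, features, window=0):
    """Return names of features on `chrom` within `window` bases of `pos`."""
    return [
        f.name
        for f in features
        if f.chrom == chrom and f.start - window <= pos < f.end + window
    ]


def fraction_near_features(sites, features, window=0):
    """Fraction of (chrom, pos) sites within `window` bases of any feature."""
    if not sites:
        return 0.0
    hits = sum(
        1 for chrom, pos in sites if features_near(chrom, pos, features, window)
    )
    return hits / len(sites)


def random_sites(chrom_sizes, n, seed=0):
    """Draw `n` positions, choosing chromosomes in proportion to their length."""
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]
    picked = rng.choices(chroms, weights=weights, k=n)
    return [(c, rng.randrange(chrom_sizes[c])) for c in picked]


# Toy data (invented for illustration only).
features = [Feature("chr1", 100, 200, "geneA"), Feature("chr2", 50, 80, "geneB")]
sites = [("chr1", 150), ("chr1", 500), ("chr2", 85)]

observed = fraction_near_features(sites, features, window=10)
control = fraction_near_features(
    random_sites({"chr1": 1000, "chr2": 500}, 1000, seed=1), features, window=10
)
```

Comparing `observed` with `control` gives a simple sense of whether the experimental sites fall near features more often than randomly chosen positions do. The linear scan over features is chosen for clarity; a real genome-scale analysis would more likely use sorted intervals or an interval tree.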
One example is to determine whether retroviruses and retroviral vectors integrate randomly into the host genome during the process of retroviral gene therapy. With Dr. Fabio Candotti's lab at NHGRI, we have determined the integration sites in a patient treated in a retroviral gene therapy trial. We are in the process of determining whether any of these integrations could disrupt gene function and thereby affect the patient's health, as well as whether the pattern of integration sites changes in the years after gene therapy. We are also collaborating on a similar project with Dr. Cynthia Dunbar of NHLBI. Her lab is pursuing retroviral gene therapy in rhesus macaques (Macaca mulatta), with the eventual goal of improving techniques for retroviral gene therapy in humans.

In collaboration with Dr. Julie Segre's lab in the Genetics and Molecular Biology Branch (GMBB), we are characterizing skin microbes using genomic methods. This project involves sequencing the gene for the 16S ribosomal RNA (rRNA) subunit from resident microbes of the skin. Our objectives are to (1) characterize the baseline microbial diversity of the skin; (2) analyze the changes in microbial diversity in an animal model and in human patients with atopic dermatitis; and (3) characterize the skin microflora by 16S rRNA sampling to pick appropriate representative species for whole-genome shotgun sequencing and, ultimately, metagenomic studies. My group has written programs to prepare the 16S rRNA sequences for analysis, and is collaborating in the computational analysis using existing public domain software.

The completion of the human and other genome sequencing projects also makes it possible to perform comprehensive analyses of gene structure. With Dr. Lawrence Brody of NHGRI, we are exploring the role of exon size in protein evolution.
We are expanding our initial analysis to computationally characterize the lengths and other properties of all protein-coding exons from a representative set of fully sequenced genomes, including vertebrates, D. melanogaster, C. elegans, and plants. Our goal is to gain a greater understanding of how large exons came to exist in our genomes, how they evolve, and what role, if any, they play.
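To make the exon-length characterization concrete, here is a minimal Python sketch that derives exon lengths from coordinates and summarizes them. It is an illustration only: the function names, the 0-based half-open coordinate convention, the toy coordinates, and in particular the `large_threshold` cutoff of 1000 bases are assumptions for the example, not definitions taken from this project.

```python
from statistics import median


def exon_lengths(exons):
    """Lengths of exons given as (start, end) pairs, 0-based and half-open."""
    return [end - start for start, end in exons]


def summarize_lengths(lengths, large_threshold=1000):
    """Basic summary of exon lengths; `large_threshold` is a hypothetical cutoff."""
    if not lengths:
        return {"count": 0, "median": None, "max": None, "n_large": 0}
    return {
        "count": len(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "n_large": sum(1 for n in lengths if n >= large_threshold),
    }


# Toy coordinates for one gene (invented for illustration only).
exons = [(100, 250), (400, 520), (900, 2400)]
summary = summarize_lengths(exon_lengths(exons))
```

Running the same summary per genome would allow the length distributions of different species to be compared side by side.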