The main goal of comparative genomics is to identify functional elements by comparing sequences between organisms. The rapid development of next-generation sequencing technologies has made it possible to compare entire genomes, identify differences, and interpret them in the context of specific adaptations and functions. Among vertebrates, the primates provide excellent models for comparative genomics studies because the quality of their genome sequences, coverage, and gene annotation is currently unmatched in other mammalian groups. This project takes advantage of this resource and aims to identify, verify, and assess the impact and history of large insertions and deletions (indels), especially those located within known functional elements such as regulatory elements and genes. Specifically, the main objective is to use four pairwise comparisons, human to chimpanzee (HuCh), human to gorilla (HuGo), human to macaque (HuMa), and human to orangutan (HuOr), to identify computationally, and validate in the laboratory, functional indels that carry signatures of natural selection. To accomplish this objective, the experiments are designed around four specific aims: (1) identification, characterization, and analysis of indels in pairwise comparisons among the five species, (2) identification of indels in functional elements, (3) estimation of indel impact on gene structure and protein products, and (4) determination of ancestral states and interrogation of the sequences flanking each indel for signatures of natural selection. The study will concentrate on large indels (10-1,000 bp) because they are more likely to have large impacts and because they are easily identified with basic molecular techniques such as PCR and electrophoresis.
In addition, the project can serve as a model for further investigations in other mammalian and non-mammalian species, as well as for population genetics studies in any vertebrate species for which the genome sequence is available.
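The size filter described above (indels of 10-1,000 bp in a pairwise comparison) can be illustrated with a minimal sketch. It assumes the two genomes have already been aligned and that gaps are written as `-`; the function name `find_large_indels` and the tuple layout are hypothetical and are not taken from the project's own pipeline.

```python
import re


def find_large_indels(ref_aln, other_aln, min_len=10, max_len=1000):
    """Return gap runs of min_len..max_len columns from a pairwise alignment.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    Each result is (start_column, length, which_row_has_the_gap). Whether a
    gap is an insertion or a deletion cannot be decided from two species
    alone, so the label only records which row carries the gap.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    indels = []
    for label, row in (("gap_in_ref", ref_aln), ("gap_in_other", other_aln)):
        for match in re.finditer(r"-+", row):
            length = match.end() - match.start()
            if min_len <= length <= max_len:
                indels.append((match.start(), length, label))
    return sorted(indels)


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences): one 12-column gap, one 2-column gap.
    human = "ACGT" + "-" * 12 + "ACGTAAAAA"
    chimp = "ACGT" + "G" * 12 + "ACGTA--AA"
    print(find_large_indels(human, chimp))
```

Only the 12-column gap passes the default filter; the 2-column gap falls below the 10 bp threshold. Assigning each event as an insertion or a deletion requires an outgroup, which is what aim (4) addresses.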

This study will broaden participation in science by training individuals from the underrepresented Hispanic minority, who will conduct basic science research. Following their computational discovery by comparison of published genome sequences, the indel-containing fragments will be amplified by basic PCR and validated by gel electrophoresis, where different alleles will be visible because of their large differences in size. This strategy will allow the project to be divided into parts and will support several independent comparative genomics studies by undergraduate students. Thus, many participating minority students will be exposed to basic and advanced techniques in molecular biology, as well as to the most current topics in genome analysis.

Project Report

Comparative genomics is a dynamic field that focuses on finding and describing functional elements in the rapidly growing data from sequenced genomes of different species. Given the rapidly growing volume of sequencing data, we are now presented with a unique opportunity to examine the origins and function of every genetic element and to explain their evolutionary history. While many structural variations, such as insertions/deletions (indels), copy number polymorphisms, and retrotransposition events, have been well characterized in direct comparisons between reference genomes, some variants need to be validated by laboratory methods. This study concentrated on a subset of large indels (>10 bp) discovered by comparison of primate species in the evolutionary branch leading to our own species. Structural changes that arose during the differentiation between humans and their closest extant relatives are especially interesting because they provide excellent candidates for discovering the sequence variants that define our own species. We identified, classified, and analyzed 36,422 indels discovered in comparisons between the reference genomes of humans and four other primate species (24,229 from the human-chimpanzee comparison, 245 from human-gorilla, 8,895 from human-macaque, and 3,053 from human-orangutan). We studied the properties of these indels as well as their distribution in the five genomes by size, number of copies, and location. We confirmed our earlier finding that most indels are found in intergenic sequences and introns, and that they are underrepresented in the coding sequences of exons (Figure 1), the locations considered to have the highest impact on protein structure and function. A larger than expected proportion of the indels found in the Homo/Pan comparison are associated with the nervous and reproductive systems. However, it was unclear whether the remaining variants have any functional impact and whether they still carry adaptive value for the species.
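As a quick arithmetic check, the per-comparison counts reported above can be tallied and expressed as shares of the total. The counts are taken directly from the report; the variable names are illustrative only.

```python
# Indel counts per pairwise comparison, as stated in the report.
counts = {
    "human-chimpanzee": 24229,
    "human-gorilla": 245,
    "human-macaque": 8895,
    "human-orangutan": 3053,
}

total = sum(counts.values())
shares = {pair: n / total for pair, n in counts.items()}

for pair, n in counts.items():
    print(f"{pair:18s} {n:6d} {shares[pair]:6.1%}")
print(f"{'total':18s} {total:6d}")
```

The four counts sum to the reported 36,422, and the human-chimpanzee comparison contributes roughly two thirds of the dataset.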
To evaluate their possible impact, we proceeded to (1) validate the 152 indels in coding regions and splice sites, and (2) test for signatures of positive selection in the flanking regions of these genomic fragments. In the validation process, we designed degenerate primers for each site, amplified the sites by PCR, and used electrophoresis to visualize variation in product length. This simple approach was possible because we restricted our dataset to indels >10 bp in length, and it allowed us to involve many undergraduate students in the process. Some of the fragments reported in the reference comparisons were not found when their corresponding regions were amplified, and these were reported as not validated. This result underlines the importance of completing primate genomes for comparative analysis and shows the necessity of experimental review. The validated indels were retained and interrogated for polymorphism in 96 individuals each from the Kenyan and Yoruba populations, and in a Local Genome Diversity panel developed under a different NSF grant. At least one of the candidate indels (USP7) was polymorphic in the two African populations. We examined the flanking regions of each indel for signatures of natural selection by comparing synonymous and non-synonymous differences (Ka/Ks) between the five species. Several candidate regions were identified: 31 indels showed signatures of positive selection in the pairwise comparisons, more than we expected from a randomly distributed dataset, indicating that these fragments were generally retained by positive selection. To look for more recent selection within the human lineage, we developed a Python script that evaluates multi-locus heterozygosity and FST in pairwise genome-wide comparisons between human populations worldwide (HGDP). Finally, the most recent signatures of selection were identified using the extended haplotype homozygosity approach.
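The project's own script is not reproduced in this report. As a rough illustration of the two statistics named above, the following sketch computes expected heterozygosity and Wright's FST for biallelic loci in two equally weighted populations. The function names and the example allele frequencies are hypothetical, and the estimator shown (FST = (HT - HS) / HT, combined across loci as a ratio of sums) is a textbook form that may differ from the one the project used.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Wright's FST = (HT - HS) / HT for one locus, two equally weighted populations."""
    hs = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    ht = expected_heterozygosity((p1 + p2) / 2.0)
    if ht == 0.0:
        return 0.0  # same allele fixed in both populations: no differentiation
    return (ht - hs) / ht


def multilocus_fst(freq_pairs):
    """Combine loci as sum(HT - HS) / sum(HT) over (p1, p2) frequency pairs."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in freq_pairs:
        hs = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
        ht = expected_heterozygosity((p1 + p2) / 2.0)
        numerator += ht - hs
        denominator += ht
    return numerator / denominator if denominator > 0.0 else 0.0


if __name__ == "__main__":
    # Hypothetical allele frequencies, not data from the project.
    loci = [(0.2, 0.8), (0.5, 0.5), (0.1, 0.3)]
    for p1, p2 in loci:
        print(p1, p2, round(fst_two_populations(p1, p2), 4))
    print("multi-locus:", round(multilocus_fst(loci), 4))
```

Loci whose FST is unusually high relative to the genome-wide distribution are the usual candidates for population-specific selection; with real data, sample-size corrections would normally be applied to the frequency estimates.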
The only gene containing a validated polymorphism in the Yoruba population is not currently under selection in that population. This project initiated a comparative genomics laboratory at the Biology Department and provided training for many minority students in basic molecular biology techniques. Three graduate and three undergraduate students were directly supported by the current grant, while several other participants (two graduates and six undergraduates) received support from other minority programs or volunteered. Students volunteered in outreach programs to local schools and gained experience presenting their results: project participants won awards for outstanding presentation on Evolution and Genetics at two consecutive Undergraduate Research Symposiums at UPR-M (2011 and 2012), and five different poster presentations were well received at several national and international meetings (ASHG, ISHG, CSHL Biology of Genomes). Many of the students who were involved in this project have now been invited to or accepted into a higher-level graduate degree program (MS or PhD). In addition, the physical infrastructure of the laboratory has been improved with the purchase of thermocyclers, -80 °C freezers, a gel image station, a Linux workstation, and a web server. A major research paper based on this work is now in preparation for submission early in 2013, and a computer program note describing the script used in the search for selection is on the way.

Budget Start: 2010-08-15
Budget End: 2012-07-31
Fiscal Year: 2010
Total Cost: $199,974
Name: University of Puerto Rico Mayaguez
City: Mayaguez
State: PR
Country: United States
Zip Code: 00680