The study of microsatellites, a class of repetitive DNA that exhibits 10,000-fold higher mutability than single nucleotide polymorphisms, has been hampered by the lack of data at microsatellite-containing loci. Data emerging from the 1000 Genomes Project now remove that limitation. We hypothesize that these hypervariable loci, once analyzed in depth, will yield a new appreciation for their value and role in the genome as biomarkers and functional elements. Baseline measurements of the variability at these loci in the substantial 1000 Genomes Project cohort will provide the information required to exploit these loci, both computationally and in the laboratory. The primary goal of the proposed research is to complete an exhaustive analysis and interpretation of the ~700,000 microsatellite loci using the ~2,500 genome sequences becoming available from the 1000 Genomes Project. We will measure their size-, purity- and motif-dependent distributions and then overlay those data with metadata (gene ontologies, conservation and more) to create a resource where we and others can explore the significant, yet underappreciated, role of microsatellite polymorphism in human variation and disease. We have demonstrated the required techniques, and our preliminary results confirm the feasibility and potential value of the approach.
Specific aims: 1) align all 1000 Genomes Project sequence data to the microsatellite-containing loci to measure the allelic distribution, polymorphism rate, characteristics and sequence quality in these repetitive regions, and inspect and characterize groups of motif lengths and families (AAT, AAAT, AATT, etc.) for evidence of selection pressure, bias and genome-wide trends; 2) compare the distributions with models that estimate polymorphism propensity as a function of sequence motif, motif size, copy number and purity (whether the repeat is interrupted by any SNPs), thus identifying any general replication or error-correction mechanism bias, which we suspect exists; 3) annotate each locus with ontology, conservation and other positional data to identify any correlations with process, function or disease propensity; and 4) create a web resource to distribute our findings and other reagents derived from this study so that others can investigate microsatellite sequence variability at individual loci or across the genome.
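As an illustration of the per-locus quantities named in aims 1 and 2, the following minimal Python sketch computes a copy number and a purity score for one repeat tract and tallies an allelic distribution across samples. It is not the pipeline used in this project: the function names `repeat_stats` and `allele_distribution`, the definition of purity as the fraction of bases matching a perfect repeat of the motif, and the example sequences are all assumptions made for illustration.

```python
from collections import Counter


def repeat_stats(tract, motif):
    """Return (copies, purity) for one microsatellite tract.

    Assumed definitions (illustrative only):
      copies = number of whole motif units that fit in the tract
      purity = fraction of bases that match a perfect repeat of
               the motif aligned at the first base of the tract
    """
    if not tract or not motif:
        raise ValueError("tract and motif must be non-empty")
    copies = len(tract) // len(motif)
    # Build a perfect repeat of the same length as the observed tract.
    perfect = (motif * (copies + 1))[:len(tract)]
    matches = sum(a == b for a, b in zip(tract, perfect))
    return copies, matches / len(tract)


def allele_distribution(tracts, motif):
    """Count how often each copy number occurs across sampled tracts."""
    return Counter(repeat_stats(t, motif)[0] for t in tracts)


if __name__ == "__main__":
    # A pure (CA)5 tract and one interrupted by a single substitution.
    print(repeat_stats("CACACACACA", "CA"))
    print(repeat_stats("CACATACACA", "CA"))
    # Hypothetical alleles observed at one locus in three samples.
    print(allele_distribution(["CACACA", "CACACACA", "CACACA"], "CA"))
```

In practice the tracts would come from reads aligned to each locus, and a purity measure would also need to handle insertions and deletions, which this sketch ignores.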
The human genome contains over 500,000 regions of repeated DNA sequence (e.g. CACACACACA) called microsatellites. They are extremely variable, cause numerous diseases, are used in forensics and paternity testing, and may alter many of our characteristics, but they are understudied and underappreciated. The 1000 Genomes Project data enable their thorough analysis en masse by our methods.