The study of microsatellites, a class of repetitive DNA that exhibits a 10,000-fold higher mutability than single nucleotide polymorphisms, has been hampered by the lack of data at microsatellite-containing loci. That has changed with the emergence of data from the 1000 Genomes Project. We hypothesize that these hypervariable loci, once analyzed in depth, will yield a new appreciation for their value and role in the genome as biomarkers and functional elements. Baseline measurements of the variability at these loci in the substantial 1000 Genomes Project cohort will provide important information required to exploit these loci, both computationally and in the laboratory. The primary goal of the proposed research is to complete an exhaustive analysis and interpretation of the ~700,000 microsatellite loci using the ~2,500 genome sequences becoming available from the 1000 Genomes Project, to measure their size-, purity- and motif-dependent distributions, and then to overlay those data with metadata (gene ontologies, conservation and more) to create a resource where we and others can explore the significant, yet underappreciated, role of microsatellite polymorphism in human variation and disease. We have demonstrated the required techniques, and our preliminary results confirm the feasibility and potential value of this work.
Specific aims:
1) Align all 1000 Genomes Project sequence data to the microsatellite-containing loci to measure the allelic distribution, polymorphism rate, characteristics and quality of the sequence in these repetitive regions; inspect and characterize groups of motif lengths and families (AAT, AAAT, AATT, etc.) to look for evidence of selection pressure, bias and genome-wide trends.
2) Compare the distributions with models for estimating polymorphism propensity as a function of specific sequence motifs, motif size, copy number and purity (whether the repeat contains any SNPs), thus identifying any general bias in replication or error-correction mechanisms, which we suspect exists.
3) Annotate each locus with ontology, conservation and other positional data to identify any correlations with process, function or disease propensity.
4) Create a web resource to distribute our findings and other reagents derived from this study so others can investigate microsatellite sequence variability at individual loci or across the genome.
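To make the per-locus quantities named in the aims concrete (motif, copy number and purity), the following is a minimal illustrative sketch, not the project's actual pipeline. The function names `find_tracts` and `purity`, the `min_copies` threshold, and the definition of purity as the fraction of in-phase matching bases are all assumptions introduced here for illustration.

```python
import re


def find_tracts(seq, motif, min_copies=3):
    """Return (start, end, copies) for each perfect run of `motif`.

    Hypothetical helper: only uninterrupted runs with at least
    `min_copies` copies of the motif are reported.
    """
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(motif))
        for m in pattern.finditer(seq)
    ]


def purity(tract, motif):
    """Fraction of bases in `tract` matching a perfect, in-phase repeat of `motif`.

    Assumed definition: a single substitution in a 10-base tract
    gives a purity of 0.9. Indels are not handled in this sketch.
    """
    if not tract:
        return 0.0
    matches = sum(
        1 for i, base in enumerate(tract) if base == motif[i % len(motif)]
    )
    return matches / len(tract)
```

For example, `find_tracts("GGCACACACACATT", "CA")` reports one tract of five CA copies, and `purity("CACATACACA", "CA")` scores a tract with one substituted base. Collecting copy numbers for the same locus across many genomes would then give an allelic distribution of the kind described in aim 1.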

Public Health Relevance

The human genome contains over 500,000 regions of repeated DNA sequence (e.g. CACACACACA) called microsatellites. They are extremely variable, cause numerous diseases, are used in forensic and paternity testing, and may alter many of our characteristics, yet they are understudied and underappreciated. The 1000 Genomes Project data enable their thorough analysis en masse by our methods.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section
Special Emphasis Panel (ZHG1-HGR-M (J1))
Program Officer
Brooks, Lisa
Virginia Polytechnic Institute and State University
Organized Research Units
United States
Tae, Hongseok; Karunasena, Enusha; Bavarva, Jasmin H et al. (2014) Large scale comparison of non-human sequences in human sequencing data. Genomics 104:453-8
Tae, Hongseok; Kim, Dong-Yun; McCormick, John et al. (2014) Discretized Gaussian mixture for genotyping of microsatellite loci containing homopolymer runs. Bioinformatics 30:652-9
McIver, L J; McCormick, J F; Martin, A et al. (2013) Population-scale analysis of human microsatellites reveals novel sources of exonic variation. Gene 516:328-34
Tae, Hongseok; McMahon, Kevin W; Settlage, Robert E et al. (2013) ReviSTER: an automated pipeline to revise misaligned reads to simple tandem repeats. Bioinformatics 29:1734-41
Tae, Hongseok; Settlage, Robert E; Shallom, Shamira et al. (2012) Improved variation calling via an iterative backbone remapping and local assembly method for bacterial genomes. Genomics 100:271-6
McIver, L J; Fondon 3rd, J W; Skinner, M A et al. (2011) Evaluation of microsatellite variation in the 1000 Genomes Project pilot studies is indicative of the quality and utility of the raw data and alignments. Genomics 97:193-9
Galindo, Cristi L; McIver, Lauren J; Tae, Hongseok et al. (2011) Sporadic breast cancer patients' germline DNA exhibit an AT-rich microsatellite signature. Genes Chromosomes Cancer 50:275-83
Garner, H R (2011) Combating unethical publications with plagiarism detection services. Urol Oncol 29:95-9