Most eukaryotic genomes include vast numbers of interspersed repeats (IRs), which are the remnants of mostly selfishly amplified transposable elements. Transposable elements have an exceptionally wide-ranging mutagenic effect on genomes, while recognition of IRs provides unparalleled information on genome evolution and is crucial in many aspects of bioinformatics. This grant would continue support for the maintenance and further development of RepeatMasker, a computational tool that has become the de facto standard for identification and characterization of IRs; RepeatModeler, a program designed to derive RepeatMasker-grade databases of IR consensus sequences; and related software. The source code for these tools is freely available to the public. We have recently co-created a database of profile Hidden Markov Models, called Dfam, for repeat families found in the human genome. RepeatMasker can use this database, and with it we have found dramatically increased sensitivity over previous results. Our research and development plans include the following: a) We propose several ways in which sensitivity of detection can still be improved, including the creation of better profiles and the exploitation of our false positive and false negative benchmarks. b) To prepare for the onslaught of the 10,000 vertebrate genome project, we propose strategies to significantly speed up both library creation and repeat analysis, and we plan to improve repeat analysis for NextGen-generated genomes. c) Dfam is meant to eventually comprise repeats for all genomes. In collaboration with our colleagues at the Howard Hughes Medical Institute, who house Dfam, we aim to develop tools to simplify and enhance submission and curation of entries. d) We plan to optimize our method of superimposing the RepeatMasker annotation of reconstructed ancestral genomes on extant genomes. We also propose several strategies involving IRs that could improve the construction of ancestral genomes.

Public Health Relevance

RepeatMasker is the standard tool for annotating the repetitive (i.e., transposable element-derived) portion of the human and other complex genomes, which is the first step in most genome sequence analyses. In recent years it has become clear that interspersed repeats are responsible for a large fraction of major structural allelic variations, which are far more common than previously assumed and are much more likely than small mutations to be associated with phenotypic differences and genetic diseases. Furthermore, reactivation of transposable elements has been implicated in tumorigenesis and has been observed during neural development in the brain and during the creation of pluripotent stem cells.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Wellington, Christopher
Institute for Systems Biology
United States
Agarwal, Prasoon; Enroth, Stefan; Teichmann, Martin et al. (2016) Growth signals employ CGGBP1 to suppress transcription of Alu-SINEs. Cell Cycle 15:1558-71
Hubley, Robert; Finn, Robert D; Clements, Jody et al. (2016) The Dfam database of repetitive DNA families. Nucleic Acids Res 44:D81-9
Hoen, Douglas R; Hickey, Glenn; Bourque, Guillaume et al. (2015) A call for benchmarking transposable element annotation methods. Mob DNA 6:13
Suh, Alexander; Churakov, Gennady; Ramakodi, Meganathan P et al. (2015) Multiple lineages of ancient CR1 retroposons shaped the early genome evolution of amniotes. Genome Biol Evol 7:205-17
Rosenbloom, Kate R; Armstrong, Joel; Barber, Galt P et al. (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res 43:D670-81
Carbone, Lucia; Harris, R Alan; Gnerre, Sante et al. (2014) Gibbon genome and the fast karyotype evolution of small apes. Nature 513:195-201
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Knijnenburg, Theo A; Ramsey, Stephen A; Berman, Benjamin P et al. (2014) Multiscale representation of genomic signals. Nat Methods 11:689-94
Green, Richard E; Braun, Edward L; Armstrong, Joel et al. (2014) Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science 346:1254449
Chong, Amanda Y; Kojima, Kenji K; Jurka, Jerzy et al. (2014) Evolution and gene capture in ancient endogenous retroviruses - insights from the crocodilian genomes. Retrovirology 11:71

Showing the most recent 10 out of 16 publications