Most eukaryotic genomes include vast numbers of interspersed repeats (IRs), which are the remnants of mostly selfishly amplified transposable elements. Transposable elements have an exceptionally wide-ranging mutagenic effect on genomes, while recognition of IRs provides unparalleled information on genome evolution and is crucial in many aspects of bioinformatics. This grant would continue support for the maintenance and further development of RepeatMasker, a computational tool that has become the de facto standard for identification and characterization of IRs; RepeatModeler, a program designed to derive RepeatMasker-grade databases of IR consensus sequences; and related software. The source code for these tools is freely available to the public. We have recently co-created a database of profile hidden Markov models, called Dfam, for repeat families found in the human genome. RepeatMasker can use this database, and we have found dramatically increased sensitivity over previous results. Our research and development plans include the following: a) We propose several ways in which sensitivity of detection can still be improved, including creating better profiles and exploiting our false-positive and false-negative benchmarks. b) To prepare for the onslaught of the 10,000 vertebrate genome project, we propose strategies to significantly speed up both library creation and repeat analysis, and plan to improve repeat analysis for NextGen-generated genomes. c) Dfam is meant to eventually comprise repeats for "all" genomes. In collaboration with our colleagues at the Howard Hughes Medical Institute who house Dfam, we aim to develop tools to simplify and enhance submission and curation of entries. d) We plan to optimize our method of superimposing the RepeatMasker annotation of reconstructed ancestral genomes on extant genomes. We also propose several strategies involving IRs that could improve the construction of ancestral genomes.

Public Health Relevance

RepeatMasker is the standard tool to annotate the repetitive (i.e., transposable element-derived) portion of the human and other complex genomes, which is the first step in most genome sequence analysis. In recent years it has become clear that interspersed repeats are responsible for a large fraction of major structural allelic variations, which are far more common than previously assumed and are much more likely than small mutations to be associated with phenotypic differences and genetic diseases. Furthermore, reactivation of transposable elements has been implicated in tumorigenesis, and has been observed during neural development in the brain and during creation of pluripotent stem cells.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Wellington, Christopher
Institute for Systems Biology
United States
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Knijnenburg, Theo A; Ramsey, Stephen A; Berman, Benjamin P et al. (2014) Multiscale representation of genomic signals. Nat Methods 11:689-94
Wheeler, Travis J; Clements, Jody; Eddy, Sean R et al. (2013) Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 41:D70-82
Glusman, Gustavo; Qin, Shizhen; El-Gewely, M Raafat et al. (2006) A third approach to gene prediction suggests thousands of additional human transcribed regions. PLoS Comput Biol 2:e18