Most eukaryotic genomes include vast numbers of interspersed repeats (IRs), which are the remnants of mostly selfishly amplified transposable elements. Transposable elements have an exceptionally wide-ranging mutagenic effect on genomes, while recognition of IRs provides unparalleled information on genome evolution and is crucial in many aspects of bioinformatics. This grant would continue support for the maintenance and further development of RepeatMasker, a computational tool that has become the de facto standard for identification and characterization of IRs, and support the development of RepeatModeler, a program designed to derive RepeatMasker-grade databases of IR consensus sequences. The source code for these tools is freely available to the public. Development will emphasize the following: a) As sequencing of new vertebrate species continues to accelerate, further development of the de novo repeat identification program RepeatModeler is a priority. Unlike other such programs in development, RepeatModeler is specifically geared toward the analysis of mammalian and bird genomes. b) Now that the RepeatMasker code has been completely refactored, the emphasis of its development shifts towards increasing its sensitivity and accuracy, and adding options such as a mode for analyzing low-coverage assemblies and recognition of chimaeric elements created by homologous recombination. c) The maintenance of the DNA consensus sequence database with its many RepeatMasker-specific metadata, the Transposable Element protein database, and the website with, among other things, a growing number of pre-annotated genomes, will take an effort that is more likely to grow than to shrink. d) We aim to further automate and refine the process of "phylogenetic labeling" of consensus sequences in the library, and to expand the databases with refined sets of subfamily sequences, which will make it possible to predict potentially polymorphic elements and to date older insertions precisely.
As part of our efforts to increase the sensitivity and speed of the RepeatMasker program, we propose to develop an improved search engine, starting from the open-source BLASTZ code.

Public Health Relevance

RepeatMasker is the default tool for annotating the repetitive portion of complex genomes like the human genome, which is an essential and standard step in any genomic sequence analysis. In recent years it has become clear that interspersed repeats are responsible for a large fraction of "copy number" or structural allelic variations, such as large deletions, duplications and insertions of foreign DNA. These variations are far more common than previously assumed and are much more likely than small mutations to be associated with phenotypic differences and genetic diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bonazzi, Vivien
Institute for Systems Biology
United States
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Knijnenburg, Theo A; Ramsey, Stephen A; Berman, Benjamin P et al. (2014) Multiscale representation of genomic signals. Nat Methods 11:689-94
Wheeler, Travis J; Clements, Jody; Eddy, Sean R et al. (2013) Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 41:D70-82
Glusman, Gustavo; Qin, Shizhen; El-Gewely, M Raafat et al. (2006) A third approach to gene prediction suggests thousands of additional human transcribed regions. PLoS Comput Biol 2:e18