Most eukaryotic genomes include vast numbers of interspersed repeats (IRs), which are the remnants of mostly selfishly amplified transposable elements. Transposable elements have an exceptionally wide-ranging mutagenic effect on genomes, while recognition of IRs provides unparalleled information on genome evolution and is crucial in many aspects of bioinformatics. This grant would continue support for the maintenance and further development of RepeatMasker, a computational tool that has become the de facto standard for identification and characterization of IRs, and support the development of RepeatModeler, a program designed to derive RepeatMasker-grade databases of IR consensus sequences. The source code for these tools is freely available to the public. Development will emphasize the following:

a) As sequencing of new vertebrate species continues to accelerate, further development of the de novo repeat identification program RepeatModeler is a priority. Unlike other such programs in development, RepeatModeler is specifically geared toward the analysis of mammalian and bird genomes.
b) Now that the RepeatMasker code has been completely refactored, the emphasis of its development shifts towards increasing its sensitivity and accuracy and adding options such as a mode for analyzing low-coverage assemblies and recognition of chimaeric elements created by homologous recombination.
c) Maintaining the DNA consensus sequence database with its extensive RepeatMasker-specific metadata, the Transposable Element protein database, and the website, which hosts, among other things, a growing number of pre-annotated genomes, will take an effort that is more likely to grow than to shrink.
d) We aim to further automate and refine the process of "phylogenetic labeling" of consensus sequences in the library, and to expand the databases with refined sets of subfamily sequences, which will make it possible to predict potentially polymorphic elements and to date older insertions precisely.

As part of our efforts to increase the sensitivity and speed of the RepeatMasker program, we propose to develop an improved search engine, starting from the open source BLASTZ code.

Public Health Relevance

RepeatMasker is the default tool for annotating the repetitive portion of complex genomes like the human genome, an essential and standard step in any genomic sequence analysis. In recent years it has become clear that interspersed repeats are responsible for a large fraction of "copy number" or structural allelic variations, such as large deletions, duplications, and insertions of foreign DNA. These variations are far more common than previously assumed and are much more likely than small mutations to be associated with phenotypic differences and genetic diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Good, Peter J
Institute for Systems Biology
United States
Agarwal, Prasoon; Enroth, Stefan; Teichmann, Martin et al. (2016) Growth signals employ CGGBP1 to suppress transcription of Alu-SINEs. Cell Cycle 15:1558-71
Hubley, Robert; Finn, Robert D; Clements, Jody et al. (2016) The Dfam database of repetitive DNA families. Nucleic Acids Res 44:D81-9
Hoen, Douglas R; Hickey, Glenn; Bourque, Guillaume et al. (2015) A call for benchmarking transposable element annotation methods. Mob DNA 6:13
Suh, Alexander; Churakov, Gennady; Ramakodi, Meganathan P et al. (2015) Multiple lineages of ancient CR1 retroposons shaped the early genome evolution of amniotes. Genome Biol Evol 7:205-17
Rosenbloom, Kate R; Armstrong, Joel; Barber, Galt P et al. (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res 43:D670-81
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Knijnenburg, Theo A; Ramsey, Stephen A; Berman, Benjamin P et al. (2014) Multiscale representation of genomic signals. Nat Methods 11:689-94
Green, Richard E; Braun, Edward L; Armstrong, Joel et al. (2014) Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science 346:1254449
Chong, Amanda Y; Kojima, Kenji K; Jurka, Jerzy et al. (2014) Evolution and gene capture in ancient endogenous retroviruses - insights from the crocodilian genomes. Retrovirology 11:71
Carbone, Lucia; Harris, R Alan; Gnerre, Sante et al. (2014) Gibbon genome and the fast karyotype evolution of small apes. Nature 513:195-201

Showing the most recent 10 out of 16 publications