Mammalian and most other eukaryotic genomes contain a large number of interspersed repeats (IRs), most of which are copies of transposable elements (TEs) at varying levels of decay. Their presence complicates many genome sequence analyses, but accurate identification at an early stage of analysis can reduce these complications. Beyond their pervasiveness, over the last decades the research community has become widely familiar with their enormous impact on genome activity and evolution. Every species has been exposed to a unique, complex set of TEs, leaving recognizable copies from as long as 300 million years ago to as recently as the present day. These TEs are uncovered and reconstructed by de novo discovery methods, often by our RepeatModeler tool, and their copies are then annotated by our RepeatMasker software. De novo methods can create TE libraries at a reasonable pace, but the product falls far short of the quality that can be reached by hand curation. With the recent explosive growth in the number of sequenced species, these finishing steps, perhaps never fully automatable, now form a severe bottleneck in genome analyses due to a lack of manpower and expertise, while the results, especially when they come from different research groups, lack consistency and suffer from redundancy. Furthermore, the annotation of genomes for which high-quality libraries have been created is not keeping up with library improvements because of the computational burden of re-analysis. In this proposal, we describe a plan to alleviate the problems of finishing new repeat libraries: we aim to exploit the power of multi-species genome alignments, especially in revealing lineage-specific TEs; develop a web-based workbench based on our TE library finishing tools and strategies; and crowdsource the most laborious step through the use of gamification. In addition, we propose a new family-centric search strategy and an incremental annotation approach to provide a tractable solution to the re-analysis problem while also providing opportunities to improve annotation quality.

Public Health Relevance

Most of the vertebrate genome finds its ultimate origin in transposable elements, once best known as prototypical selfish DNA. Their annotation is crucial for genome sequence analysis and for understanding their unrivaled impact on genome biology and evolution. Their de novo discovery and description have become a bottleneck in the genome analysis of the thousands of new species sequenced every year. In this proposal we describe three novel approaches to alleviate this problem and dramatically improve genome annotation.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG002939-15
Application #: 9905539
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Wellington, Christopher
Project Start: 2003-08-15
Project End: 2021-03-31
Budget Start: 2020-04-01
Budget End: 2021-03-31
Support Year: 15
Fiscal Year: 2020
Total Cost:
Indirect Cost:

Name: Institute for Systems Biology
Department:
Type:
DUNS #: 135646524
City: Seattle
State: WA
Country: United States
Zip Code: 98109

Agarwal, Prasoon; Enroth, Stefan; Teichmann, Martin et al. (2016) Growth signals employ CGGBP1 to suppress transcription of Alu-SINEs. Cell Cycle 15:1558-71
Hubley, Robert; Finn, Robert D; Clements, Jody et al. (2016) The Dfam database of repetitive DNA families. Nucleic Acids Res 44:D81-9
Hoen, Douglas R; Hickey, Glenn; Bourque, Guillaume et al. (2015) A call for benchmarking transposable element annotation methods. Mob DNA 6:13
Suh, Alexander; Churakov, Gennady; Ramakodi, Meganathan P et al. (2015) Multiple lineages of ancient CR1 retroposons shaped the early genome evolution of amniotes. Genome Biol Evol 7:205-17
Rosenbloom, Kate R; Armstrong, Joel; Barber, Galt P et al. (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res 43:D670-81
Carbone, Lucia; Harris, R Alan; Gnerre, Sante et al. (2014) Gibbon genome and the fast karyotype evolution of small apes. Nature 513:195-201
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Knijnenburg, Theo A; Ramsey, Stephen A; Berman, Benjamin P et al. (2014) Multiscale representation of genomic signals. Nat Methods 11:689-94
Green, Richard E; Braun, Edward L; Armstrong, Joel et al. (2014) Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science 346:1254449
Chong, Amanda Y; Kojima, Kenji K; Jurka, Jerzy et al. (2014) Evolution and gene capture in ancient endogenous retroviruses - insights from the crocodilian genomes. Retrovirology 11:71

Showing the most recent 10 out of 16 publications