Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Thorough and accurate annotation of repetitive content in genomes depends on a comprehensive database of known TEs, along with robust statistical and procedural methods for recognizing decayed instances of elements and disentangling their complex relationships. Annotation of TE instances is usually performed using our RepeatMasker software, which compares a genome to a database containing representations of known repeat families. These representations have historically been consensus sequences, which generally approximate the sequences of the original TEs. The largest repository of such consensus sequences is Repbase, whose restrictive license and limited interface for curators have led to a lack of input from third parties and the creation of many unaffiliated, often organism-specific open databases. The parallel existence of these many databases has led to a divergence in nomenclature and repeat definition. Our Dfam database is an open access collection of repetitive DNA families, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). We have demonstrated that profile HMMs support improved annotation sensitivity, and Dfam provides numerous aids to both curators of TE families and those who make use of the resulting annotations. In this proposal, we describe a plan to develop the infrastructure of Dfam to expand to thousands of genomes, and to establish a self-sustaining TE Data Commons dependent on limited centralized curation. We further describe plans to improve the quality of repeat annotation through development of methods for more reliable alignment adjudication, to expand approaches to visualization of this complex data type, and to improve the modeling of TE subfamilies.
By further developing this open access database, we will provide a strong disincentive for the proliferation of unaffiliated non-standard repeat datasets and ease the burden of data management for those developing TE libraries.

Public Health Relevance

Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs), and the thorough annotation of TEs is a critical aspect of genome annotation pipelines. Currently, the data used for this annotation are dominated by a single database, Repbase, whose restrictive license impedes coalescence on a central standardized resource. We have developed an innovative open access database (Dfam) that yields substantial gains in TE annotation quality and provides numerous novel features in support of community TE curation. We will grow and improve Dfam to create a sustainable, standardized, and open system for TE family data. To achieve this, we will develop Dfam's infrastructure to scale to thousands of genomes, develop improved methods for computing and representing the complex relationships between TE instances, and develop a framework for aiding curators in developing and sharing their TE libraries.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Resource-Related Research Projects--Cooperative Agreements (U24)
Project #
5U24HG010136-02
Application #
9764454
Study Section
Special Emphasis Panel (ZHG1)
Program Officer
Wellington, Christopher
Project Start
2018-08-15
Project End
2023-05-31
Budget Start
2019-06-01
Budget End
2020-05-31
Support Year
2
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Institute for Systems Biology
Department
Type
DUNS #
135646524
City
Seattle
State
WA
Country
United States
Zip Code
98109