Transposable elements (TEs, also referred to as jumping genes or mobile elements) are major contributors to eukaryotic genome diversity, including in humans. TEs make up more than 50% of the human genome and are far more abundant than protein-coding genes, which comprise about 1% of the human genome. Despite their abundance, TEs are understudied, and major aspects of their biology remain elusive. Because TEs insert randomly within the genome, insertions occur in both intergenic and genic regions (including exons). As retrotransposition is ongoing, with ~1 new insertion per 20 live births, there are millions of polymorphic TEs within the human population, including some associated with disease. Highly repetitive regions are notoriously difficult to assemble, overrepresented at contig ends, and under-annotated in short-read sequencing data (currently the prevalent data type in biomedical settings).
In Aim I, we will improve the annotation of the human mobilome (the genome's entire mobile element content) by building upon the human reference genome and data from the Human Genome Structural Variation Consortium (which provides access to Illumina short-read and PacBio HiFi sequencing data). Part of our focus will be on improved calling of TEs from short-read sequencing data. We will (a) use chimpanzee as an outgroup to distinguish between TE insertions and deletions containing TE sequence; and (b) develop a targeted-sequencing approach for transposable elements. The latter will be combined with whole-genome sequencing. Our targeted-sequencing approach will provide deeper coverage of breakpoints, improving the identification of mobile elements. We will also generate a high-resolution subfamily annotation with well-resolved end-branches. The youngest subfamilies are commonly collapsed into older subfamilies because of their small size and few shared diagnostic mutations. Under-identifying the youngest subfamilies leads to an apparent relative quiescence of TEs in recent history.
Building upon the improved TE annotation from Aim I, we will identify and characterize putative source elements (i.e., TEs capable of generating offspring insertions). Most TE insertions are dead on arrival and cannot create offspring TEs. While the identification of active L1s is relatively easy, the drivers of Alu and SVA expansion have been far more elusive. A fine-scale TE subfamily resolution that includes the youngest subfamilies will both shed light on the most recent TE evolution and allow investigation of source elements (which tend to be deleterious to their host) within those subfamilies. This makes the youngest subfamilies a prime target for an integrative, comparative approach to source element identification.
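The outgroup reasoning in Aim I, step (a), can be illustrated with a minimal sketch. It assumes that the chimpanzee state at the orthologous locus reflects the ancestral state, which ignores complications such as incomplete lineage sorting or independent insertions in the chimpanzee lineage. The names `Call` and `polarize` and the boolean inputs are hypothetical, chosen for illustration only; they are not part of any existing tool or of the proposed pipeline.

```python
from enum import Enum
from typing import Optional


class Call(Enum):
    INSERTION = "TE insertion"
    DELETION = "deletion containing TE sequence"
    UNRESOLVED = "unresolved"


def polarize(te_in_reference: bool,
             te_in_sample: bool,
             te_in_chimp: Optional[bool]) -> Optional[Call]:
    """Classify a TE presence/absence difference using an outgroup.

    te_in_chimp is None when the orthologous chimpanzee locus
    could not be assessed (hypothetical input convention).
    Returns None when reference and sample agree (no variant).
    """
    if te_in_reference == te_in_sample:
        return None  # no presence/absence difference at this locus
    if te_in_chimp is None:
        return Call.UNRESOLVED  # outgroup state unknown
    if te_in_chimp:
        # Assumption: TE is ancestral, so the allele lacking it
        # arose through a deletion that removed TE sequence.
        return Call.DELETION
    # Assumption: empty site is ancestral, so the allele carrying
    # the TE arose through a new insertion in the human lineage.
    return Call.INSERTION
```

For example, a TE that is present in the reference, absent from a sample, and absent from chimpanzee would be classified as an insertion under these assumptions, whereas the same pattern with the TE present in chimpanzee would be classified as a deletion.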