Viral hepatitis caused by hepatitis B virus (HBV) establishes chronic infection in more than 250 million people worldwide; chronicity is on the rise, and approximately one-third of the world's population (2 billion) has serologic evidence of exposure. HBV coinfection with HCV and HIV is a hidden consequence of the substance use disorder epidemic. Viral populations have extremely high sequence diversity and evolve rapidly, which explains vaccine failure rates and viral resistance to existing therapies and makes discovering lasting therapies extremely challenging. Next Generation Sequencing (NGS) is the method of choice for assessing the intra-host virus population, termed a "quasispecies". Although NGS yields a large set of short DNA sequencing reads that represent the virions in the quasispecies, current computational tools are limited in their analysis capabilities, resulting in particularly low resolution of complex HBV genomic structures. Another challenge is assembling NGS reads representing short fragments of the host genome into full strains (haplotypes) without knowledge of their true occurrence in the samples. To meet these challenges, GATACA is developing pathogen-specific bioinformatics software, GAT-ML (GATACA Assembly Tool – machine learning [ML]), to support treatment discovery and improve infection control. Its purpose-built algorithm uses novel ML methodologies, adapted and modified to assist genome assembly, that will allow GAT-ML to reconstruct complete viral haplotypes and populations by learning the "language" of the sequences. Tailored initially for HBV samples, GAT and its new ML system will be integrated for feasibility testing in this Phase I project with the following Specific Aims:
Specific Aim 1. Build a joint learning system. Train and test natural language processing (NLP) methods on HBV genetic variation.
Specific Aim 2. Implement and test the machine learning methods in GAT (GAT-ML). We anticipate a working tool for characterizing HBV haplotypes, validated with multi-sourced datasets, and extensive testing and benchmarking of offline and integrated methods.
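As a rough illustration of what treating sequences as a "language" can mean in practice, the sketch below splits sequencing reads into overlapping k-mers ("words") and counts them into a vocabulary, a common first step before applying NLP-style models. This is not the GAT-ML method, which is not described here; the function names `kmer_tokens` and `build_vocab`, the choice of k = 3, and the toy reads are all assumptions made for illustration only.

```python
from collections import Counter


def kmer_tokens(read, k=3):
    """Split one read into overlapping k-mers (hypothetical helper).

    Returns an empty list when the read is shorter than k.
    """
    read = read.upper()
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocab(reads, k=3):
    """Count how often each k-mer occurs across all reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        counts.update(kmer_tokens(read, k))
    return counts


# Toy reads for illustration only; these are not real HBV data.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocab(reads, k=3)

for kmer, count in sorted(vocab.items()):
    print(kmer, count)
```

In such a scheme each read becomes a "sentence" of k-mer tokens, and the resulting vocabulary could feed a downstream embedding or language model; the value of k and the model choice would need to be tuned to the data.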
The proposed project will expand the capabilities of our novel computational tool, GAT, to help researchers identify the full spectrum of genetic features of a viral population (such as the emergence and persistence of resistance or baseline polymorphisms, regardless of their frequencies) and translate these findings into the development of new or improved antiviral drugs and other applications requiring high analytic sensitivity. GAT will particularly benefit researchers working in preclinical stages of drug development who require rapid, sensitive, and reliable results to inform decisions about which targets to advance to clinical trials.