Tuberculosis (TB) is the leading cause of infectious disease mortality worldwide. Nearly one-third of the world's population is infected with Mycobacterium tuberculosis (MTB). More than 10.4 million new cases of active TB disease develop annually, leading to 1.4 million deaths each year. Despite widespread efforts to study the etiology of the disease, the development and global introduction of an effective treatment regimen, and sensitive diagnostics for identifying pulmonary TB disease, efforts to control this pandemic are falling short, largely because the pathogenic progression from MTB infection to active clinical disease is not clearly understood. In addition, existing gene expression studies have presented more than three dozen biomarkers for predicting TB-related outcomes, such as identifying active TB disease, predicting risk of treatment failure, or predicting which patients will progress to active TB disease. These biomarkers have been developed and refined using multiple technologies and a diverse set of computational and machine learning prediction algorithms, but most focus on two-class comparisons (e.g., TB vs. latent TB infection [LTBI]). We propose to compile and harmonize dozens of existing RNA-sequencing datasets for TB outcomes. We will use these compiled data to develop a computational platform and interactive visualization tools for profiling TB signatures across all existing datasets. We plan to use these curated data and this software platform to develop a more refined molecular map of progression from TB infection to active disease. Consistent with recently presented models of TB disease development, we hypothesize that we will be able to identify gene expression patterns associated with stages on the TB disease spectrum, including: uninfected or eliminated infection, controlled or truly latent infection, future progressors or incipient disease, subclinical TB disease, and active clinical TB disease.
We believe that existing gene expression data and signatures will allow us to identify distinct transcriptional profiles for each stage, and hence to develop a multi-class machine learning approach for classifying patients into their corresponding stage. Overall, this proposal contributes to the field by compiling existing gene expression data and developing a holistic map of TB progression from infection to active disease. In addition, we will provide curated datasets and metadata in an accessible format for more than three dozen existing TB studies, and allow others to access and explore these data through a user-friendly profiling platform.
We will compile existing TB gene expression data and develop a holistic map of TB progression from infection to active disease. We will provide curated data for dozens of existing TB RNA-sequencing datasets, and allow others to access and explore these data through a user-friendly software toolkit and platform.