Immune-repertoire sequence, which consists of an individual's millions of unique antibody and T-cell receptor (TCR) genes, encodes a dynamic and highly personalized record of an individual's state of health. Our long- term goal is to develop the computational models and tools necessary to read this record, to one day be able diagnose diverse infections, autoimmune diseases, cancers, and other conditions directly from repertoire se- quence. The key problem is how to find patterns of specific diseases in repertoire sequence, when repertoires are so complex. Our hypothesis is that a combination of bottom-up (sequence-level) and top-down (systems- level) modeling can reveal these patterns, by encoding repertoires as simple but highly informative models that can be used to build highly sensitive and specific disease classifiers. In preliminary studies, we introduced two new modeling approaches for this purpose: (i) statistical biophysics (bottom-up) and (ii) functional diversity (top-down), and showed their ability to elucidate patterns related to vaccination status (97% accuracy), viral infection, and aging. Building on these studies, we will test our hypothesis through two specific aims: (1) We will develop models and classifiers based on the bottom-up approach, statistical biophysics; and (2) we will de- velop the top-down approach, functional diversity, to improve these classifiers. To achieve these aims, we will use our extensive collection of public immune-repertoire datasets, beginning with 391 antibody and TCR da- tasets we have characterized previously. Our team has deep and complementary expertise in developing computational tools for finding patterns in immune repertoires (Dr. Arnaout) and in the mathematics that under- lie these tools (Dr. Altschul), with additional advice available as needed regarding machine learning (Dr. AlQuraishi). This proposal is highly innovative for how our two new approaches address previous issues in the field. (i) Statistical biophysics uses a powerful machine-learning method called maximum-entropy modeling (MaxEnt), improving on past work by tailoring MaxEnt to learn patterns encoded in the biophysical properties (e.g. size and charge) of the amino acids that make up antibodies/TCRs; these properties ultimately determine what targets antibodies/TCRs can bind, and therefore which sequences are present in different diseases. (ii) Functional diversity fills a key gap in how immunological diversity has been measured thus far, by factoring in whether different antibodies/TCRs are likely to bind the same target. This proposal is highly significant for (i) developing an efficient, accurate, generative, and interpretable machine-learning method for finding diagnostic patterns in repertoire sequence; (ii) applying a robust mathematical framework to the measurement of immuno- logical diversity; (iii) impacting clinical diagnostics; and (iv) adding a valuable new tool for integrative/big-data medicine. The expected outcome of this proposal is an integrated pair of robust and well validated new tools/models for classifying specific disease exposures directly from repertoire sequence. This proposal in- cludes plans to make these tools widely available, to maximize their positive impact across medicine.
The proposed research is relevant to public health because B cells/antibodies and T cells play vital roles across such a vast range of health conditions, from infection, to autoimmunity, to cancer, that the ability to de- code what they are doing would be an important step forward for diagnosing these conditions. The proposed research is relevant to the NIH's mission of fostering fundamental creative discoveries, innovative research strategies, and their applications as a basis for ultimately protecting and improving health, specifically relating to the diagnosis of human diseases.