Somatic hypermutation (SHM) of the immunoglobulin (Ig) loci is a fundamental process in generating antibody diversity. High-throughput methods for profiling mutation spectra of Ig genes, such as Roche 454 deep sequencing, remain inaccurate and are expensive, in part because of the specialized bioinformatics required for post-processing. Recent improvements to the Illumina MiSeq platform allowing paired-end 2x300 nt reads will enable more accurate IgV sequencing at a >60-fold lower cost than the 454 platform. As costs fall further, platforms such as MiSeq will become common lab equipment, but will only be useful if the appropriate bioinformatics tools are available. Accurate deep sequencing of the Ig loci is rapidly becoming the standard in a broad range of clinical applications, including determining prognosis and detecting minimal residual disease in B cell malignancies, characterizing autoimmune diseases and evaluating vaccine responses.
In Aim 1 we will develop a user-friendly bioinformatics pipeline (SHMPrep) to improve mutation calls for IgV sequences from the Illumina MiSeq platform. We hypothesize that statistical modeling of independent PCR vs sequencing error effects can improve the quality of MiSeq IgV sequences to levels comparable to Sanger sequencing. The pipeline will be integrated with a previously developed analysis tool (SHMTool) to allow users without computing expertise, such as most clinicians, to process MiSeq IgV datasets on an ordinary desktop computer. As high-throughput data accumulate, it becomes more important to have methods for analyzing the data we already have than to produce yet more data. IgV mutation spectra depend on many factors, including base composition, the abundance and location of activation-induced deaminase (AID) hot and cold spots, Pol-η hot spot composition and overall mutation frequency. This complexity makes it difficult to compare mutation spectra from different IgV regions.
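To make the sequence-dependent factors listed above concrete, the following is a minimal sketch of how base composition and hot/cold spot abundance might be tallied for a single IgV sequence. It is not the SHMTool or SHMPrep implementation. The motif definitions (WRC/GYW for AID hot spots, SYC/GRS for AID cold spots, WA/TW for Pol-η hot spots) are assumptions taken from common usage in the SHM literature rather than from this abstract, and the function name profile_sequence is hypothetical.

```python
import re
from collections import Counter

# Assumed motif definitions, written with IUPAC codes expanded:
# W = A/T, R = A/G, Y = C/T, S = C/G.
# Zero-width lookaheads are used so that overlapping occurrences are counted.
MOTIFS = {
    "aid_hot_top": r"(?=[AT][AG]C)",      # WRC on the top strand
    "aid_hot_bottom": r"(?=G[CT][AT])",   # GYW, reverse complement of WRC
    "aid_cold_top": r"(?=[CG][CT]C)",     # SYC on the top strand
    "aid_cold_bottom": r"(?=G[AG][CG])",  # GRS, reverse complement of SYC
    "pol_eta_hot_top": r"(?=[AT]A)",      # WA on the top strand
    "pol_eta_hot_bottom": r"(?=T[AT])",   # TW, reverse complement of WA
}


def profile_sequence(seq):
    """Return length, base composition and motif counts for one sequence."""
    seq = seq.upper()
    motif_counts = {
        name: len(re.findall(pattern, seq)) for name, pattern in MOTIFS.items()
    }
    return {
        "length": len(seq),
        "composition": dict(Counter(seq)),
        "motif_counts": motif_counts,
    }


if __name__ == "__main__":
    # Toy sequence for illustration only, not a real IGHV gene.
    print(profile_sequence("AGCTAC"))
```

Two regions with different profiles would be expected to accumulate different raw mutation spectra even under identical mutational processes, which is why a direct comparison of unnormalized spectra can mislead.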
In Aim 2 we will develop statistical methods for comparing different IgV regions, taking into account sequence composition as well as mutation saturation and strand bias, which is important in identifying repair defects in immunodeficiencies such as AIDS and in B-cell malignancies and other cancers. We still understand little about the differences between the IGHV genes. Why are there so many V regions, and why are there such strong associations between particular Ig genes and immune responses? In Aim 3 we will develop a statistical model for predicting mutation frequencies that will allow known molecular interactions to be represented, for example the interaction between AID targeting and error-prone mismatch repair. Predicted mutation frequencies from the model will be used by SHMTool to provide a comparative benchmark in situations where no control dataset is available. The model will be used to characterize each IGHV gene at a deeper level than was previously possible, allowing cross-species comparisons. In the longer term, such a model will facilitate a better understanding of evolutionary changes in the IGHV genes and repertoire.

Public Health Relevance

Somatic hypermutation (SHM) of immunoglobulin variable (V) regions is fundamental to the generation of antibody diversity. High-throughput V-region sequencing has a broad range of clinical applications, including determining prognosis and detecting minimal residual disease in B cell malignancies, characterizing autoimmune diseases and evaluating responses to infections and vaccines. We will develop the bioinformatics tools needed to process SHM data obtained from high-throughput platforms, as well as methods for comparing mutations from different V regions and even different organisms.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Cellular and Molecular Immunology - B Study Section (CMIB)
Program Officer
Marino, Pamela
State University New York Stony Brook
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
Stony Brook
United States
Yuan, Chaohui; Chu, Charles C; Yan, Xiao-Jie et al. (2017) The Number of Overlapping AID Hotspots in Germline IGHV Genes Is Inversely Correlated with Mutation Frequency in Chronic Lymphocytic Leukemia. PLoS One 12:e0167602
Chen, Jeffrey; MacCarthy, Thomas (2017) The preferred nucleotide contexts of the AID/APOBEC cytidine deaminases have differential effects when mutating retrotransposon and virus sequences compared to host genes. PLoS Comput Biol 13:e1005471
Patten, Piers E M; Ferrer, Gerardo; Chen, Shih-Shih et al. (2016) Chronic lymphocytic leukemia cells diversify and differentiate in vivo via a nonclassical Th1-dependent, Bcl-6-deficient process. JCI Insight 1:
Shin, Jeewoen; MacCarthy, Thomas (2016) Potential for evolution of complex defense strategies in a multi-scale model of virus-host coevolution. BMC Evol Biol 16:233
Maul, Robert W; MacCarthy, Thomas; Frank, Ekaterina G et al. (2016) DNA polymerase ι functions in the generation of tandem mutations during somatic hypermutation of antibody genes. J Exp Med 213:1675-83
Shin, Jeewoen; MacCarthy, Thomas (2015) Antagonistic Coevolution Drives Whack-a-Mole Sensitivity in Gene Regulatory Networks. PLoS Comput Biol 11:e1004432
Wei, Lirong; Chahwan, Richard; Wang, Shanzhi et al. (2015) Overlapping hotspots in CDRs are critical sites for V region diversification. Proc Natl Acad Sci U S A 112:E728-37