Genomic technologies have made it possible to reconstruct, with ever-growing scope and detail, the landscape of genetic variations across individuals, tissues, and time. One consequence is a growing appreciation for the importance of somatic genetic variation. Somatic variations, and the hypermutability defects that can produce them, have been most extensively studied in the context of cancers, but they have relevance to a variety of other disorders, both germline and sporadic, as well as value in basic research into development and regeneration. Yet we are only beginning to understand the processes by which genetic variations accumulate in cancers and a few well-defined pre-cancerous conditions. We so far know little about how somatic variation processes act in potentially cancerous but asymptomatic tissues or in purely healthy tissues. Understanding somatic variation more broadly is crucial for developing informed models of cancer risk, for pursuing options to identify and potentially treat cancers before they start, and for diagnosing and treating other illnesses that arise from non-cancerous somatic hypermutability or spontaneous genetic mosaicism. Experience reconstructing cell lineages in cancers provides the foundation for building comparable methods for the broader landscape of somatic genetic variability across cancerous, pre-cancerous, and non-cancerous tissues. One crucial lesson of cell lineage reconstruction in cancers is that doing reproducible science, and planning the experimental data-generation studies needed to do so, requires first understanding the data to be generated and the mathematical/statistical models and algorithms by which it will be analyzed. The field of tumor phylogenetics has provided a groundwork of theory and tools for reconstructing cell lineages in the presence of diverse somatic mutation processes that can be adapted to address the problem of reconstructing somatic variation processes more broadly.
It has also provided crucial experience with best practices and pitfalls in designing variation studies, which will be needed to guide new large-scale experimental studies into non-cancerous somatic variation. Mapping the full landscape of human somatic variation and the mechanisms that produce it will be a large effort of numerous experimental and computational groups that will not succeed without a clear grasp of the data science problems it entails. The proposed work will help to build the informatics infrastructure needed for the study of somatic variation through four specific aims intended to leverage decades of experience with reconstructing somatic evolutionary histories in cancers. It will advance the state of the art of tools for variant calling and cell lineage tracing to handle diverse mechanisms of somatic structural variations (SVs), copy number aberrations (CNAs), and single nucleotide variations (SNVs). It will use these methods in simulation studies to assess data needs and study designs for identifying diverse somatic variation mechanisms. Finally, it will apply them in a pilot study of existing variation resources, both to validate their use in and beyond the cancer context and to perform novel discovery of cancerous, non-cancerous, and pre-cancerous somatic variation mechanisms.

Public Health Relevance

Although we canonically think of each person as having a unique genome, that genome actually varies even within a single person due to mutations, known as somatic variations, that accumulate over a lifetime. How somatic variations arise is important for understanding various disease processes and for reconstructing developmental mechanisms of tissues and organisms. The proposed work will develop computational methods needed to characterize this landscape of somatic variation processes across healthy, pre-disease, and disease tissues.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ramos, Erin
Carnegie-Mellon University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States