De novo mutations are DNA sequence changes that are not inherited from a parent, and they play an important role in many human disorders, including cancer, autism, schizophrenia, and heart conditions. However, de novo mutations can be difficult to identify because sequencing errors are more common than true mutations. Current approaches to analyzing DNA sequence data are inadequate for identifying de novo mutations at a genome scale because each candidate mutation must be confirmed through costly and time-consuming validation. Our goal is to improve the identification of de novo mutations in order to understand their role in genetic disorders. We will develop a novel statistical approach to identify de novo mutations, and we will implement it in software to make our method readily available to other researchers. Our first objective is to determine the probability that an apparent DNA sequence change is due to a de novo mutation when analyzing short-read sequencing data from families. To determine this probability, we will integrate over other possible sources of error and noise, including sequencing error, population diversity, and chromosome segregation. Second, we will expand this model to detect somatic de novo mutations between multiple tissues from the same individual (e.g., matched tumor-normal datasets). Third, we will develop new models to handle single-cell sequencing data, whose error probabilities differ from those addressed in the first two aims. These three aims are unified by a common approach: using genealogical information relating people, tissues, or individual cells to improve the accuracy of de novo mutation discovery. Finally, we will implement these methods in an easy-to-use software package that will make the identification of de novo mutations possible for scientists working on subjects ranging from variation in mutation rates to the effects of aging.
The methods developed here can benefit the hundreds, if not thousands, of studies that will search for and characterize de novo mutations in the coming decade. The software package will be open source and free for the community to use.
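To make the first objective concrete, the following is a minimal sketch of how a posterior probability of de novo mutation might be computed at a single biallelic site in a mother-father-child trio. It is an illustration under simplifying assumptions, not the model proposed here: the function names, the Hardy-Weinberg prior (standing in for population diversity), the binomial read model (standing in for sequencing error), and the default values of `alt_freq`, `mu`, and `err` are all hypothetical choices made for this example.

```python
from itertools import product
from math import comb


def genotype_prior(g, alt_freq):
    """Hardy-Weinberg prior for a genotype with g alternate alleles (assumed model)."""
    p = alt_freq
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2][g]


def read_likelihood(ref, alt, g, err):
    """Binomial probability of the observed read counts given genotype g."""
    p_alt = [err, 0.5, 1 - err][g]  # chance that a single read shows the alternate allele
    return comb(ref + alt, alt) * p_alt ** alt * (1 - p_alt) ** ref


def denovo_posterior(reads, alt_freq=0.001, mu=1e-8, err=0.005):
    """Posterior probability that at least one transmitted allele mutated.

    reads maps 'mother', 'father', and 'child' to (ref_count, alt_count).
    All default parameter values are illustrative assumptions.
    """
    total = 0.0
    with_mutation = 0.0
    for gm, gf in product(range(3), repeat=2):  # parental genotypes
        parents = (
            genotype_prior(gm, alt_freq)
            * genotype_prior(gf, alt_freq)
            * read_likelihood(*reads["mother"], gm, err)
            * read_likelihood(*reads["father"], gf, err)
        )
        for am, af in product((0, 1), repeat=2):  # alleles passed on (segregation)
            seg = (gm / 2 if am else 1 - gm / 2) * (gf / 2 if af else 1 - gf / 2)
            for mm, mf in product((0, 1), repeat=2):  # mutation on each transmitted allele
                mut = (mu if mm else 1 - mu) * (mu if mf else 1 - mu)
                gc = (am ^ mm) + (af ^ mf)  # child genotype after any mutation
                w = parents * seg * mut * read_likelihood(*reads["child"], gc, err)
                total += w
                if mm or mf:
                    with_mutation += w
    return with_mutation / total
```

For example, `denovo_posterior({"mother": (30, 0), "father": (30, 0), "child": (15, 15)})` sums over every combination of parental genotype, transmitted allele, and mutation event, so that a well-supported heterozygous child with clean parents is weighed against the competing explanations of an undetected parental variant or sequencing error in the child.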
Each child is born with 60-100 new mutations that are not present in either parent. New mutations within genes have been shown to be a risk factor for common pediatric diseases, such as autism and developmental delay. Despite the importance of mutation rates in clinical practice, our understanding of how these rates vary between individuals on the basis of sex, age, genetic background, and environmental exposure is rudimentary at best. The research proposed here will create software that improves the accuracy of new mutation detection, thereby enabling the integration of this important class of variation into medical genetics research.