This project will investigate several issues arising from the statistical and computational analysis of whole genome sequencing (WGS) based genomics studies. In the area of data management in WGS studies, we address the rapidly increasing cost associated with the transfer and storage of the massive files for the sequence reads and their associated quality scores. We will develop data compression methods to achieve a further compression of several folds beyond current standards, with minimal incurred errors. In the area of secondary analysis, we will develop new statistical learning methods to improve variant quality score recalibration and to filter out unreliable calls. This will improve te reliability of the key information provided by the WGS data, which are the variants calls indicating the locations where the genome differs from the reference and the nature of the differences. We will study methods for case-control studies based on WGS. In particular, we will develop statistical models to enable the integrating of information from multiple types of variants to obtain more powerful tests of association. We will apply the methods developed in this aim to the analysis of WGS data from a study on abdominal aortic aneurysm. Finally, we will address selected new questions associated with population scale WGS projects. Several national programs have recently been initiated to generate WGS data for hundreds of thousands of individuals with longitudinal medical records. The availability of this comprehensive data on a population scale will open up a rich frontier for genome medicine and will pose many new challenges for statistical analysis. We will formulate some of these new challenges and develop the statistical methods needed to meet these challenges.

Public Health Relevance

The research in this project concerns the design and implementation of statistical and computational methods for the analysis of data from whole genome sequencing studies. Methods will be developed for sequence quality score compression, variant call filtering, and methods for case-control association analysis and mega-cohort analysis based on whole genome sequencing.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Brooks, Lisa
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Stanford University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Zip Code
Weirather, Jason L; Afshar, Pegah Tootoonchi; Clark, Tyson A et al. (2015) Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res 43:e116
Taylor, Sarah E B; Li, Ye Henry; Wong, Wing H et al. (2015) Genome-wide mapping of DNA hydroxymethylation in osteoarthritic chondrocytes. Arthritis Rheumatol 67:2129-40
Mu, John C; Tootoonchi Afshar, Pegah; Mohiyuddin, Marghoob et al. (2015) Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods. Sci Rep 5:14493
Mohiyuddin, Marghoob; Mu, John C; Li, Jian et al. (2015) MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31:2741-4
Fang, Li Tai; Afshar, Pegah Tootoonchi; Chhibber, Aparna et al. (2015) An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol 16:197