Predicting the molecular complexity of a genomic sequencing library has emerged as a critical but difficult problem in modern applications of DNA sequencing. In applications like RNA-seq and single-cell sequencing, the molecular complexity of the underlying biological sample is also of central interest. This project will produce computational methods for predicting the number of distinct molecules that would be observed by sequencing an existing library more deeply. We will also adapt these methods to predict, as a function of sequencing depth, saturation in RNA-seq and the fraction of the genome covered above a given fold coverage in genome resequencing. In addition, we will develop methods for estimating the heterogeneity of phenotypes in a tissue based on single-cell RNA-seq experiments. These methods will allow investigators to optimize their use of DNA sequencing resources, minimizing waste and improving throughput.

Public Health Relevance

DNA sequencing technology will inevitably revolutionize the practice of medicine. Clinical DNA sequencing, for example in diagnosis or in guiding treatment, requires robust statistical methods to evaluate the information content of DNA samples and to detect technical artifacts in sequencing data. This project develops statistical methods to evaluate the quality of DNA sequencing libraries based on very small amounts of sequencing, which will assist in developing more reliable and cost-effective clinical sequencing protocols.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG007650-01A1
Application #: 8819058
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Sofia, Heidi J
Project Start: 2014-12-16
Project End: 2017-11-30
Budget Start: 2014-12-16
Budget End: 2015-11-30
Support Year: 1
Fiscal Year: 2015
Total Cost: $381,975
Indirect Cost: $116,475
Name: University of Southern California
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 072933393
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas et al. (2018) ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biol 19:36
Delás, M Joaquina; Sabin, Leah R; Dolzhenko, Egor et al. (2017) lncRNA requirements for mouse acute myeloid leukemia and normal differentiation. Elife 6:
Deng, Chao; Daley, Timothy; Smith, Andrew D (2015) Applications of species accumulation curves in large-scale biological data analysis. Quant Biol 3:135-144