Predicting the molecular complexity of a genomic sequencing library has emerged as a critical but difficult problem in modern applications of DNA sequencing. In applications like RNA-seq and single-cell sequencing, the molecular complexity of the underlying biological sample is also of central interest. This project will produce computational methods for predicting the number of distinct molecules that deeper sequencing of an existing library will yield. We will also adapt these methods to predict, as a function of sequencing depth, saturation in RNA-seq and the fraction of the genome covered above a given fold in genome resequencing. In addition, we will develop methods for estimating the heterogeneity of phenotypes in a tissue from single-cell RNA-seq experiments. These methods will allow investigators to optimize their use of DNA sequencing resources, minimizing waste and improving throughput.
DNA sequencing technology will inevitably revolutionize the practice of medicine. Clinical DNA sequencing, for example in diagnosis or guiding treatment, requires robust statistical methods to evaluate the information content of DNA samples and detect the presence of technical artifacts in sequencing data. This project develops statistical methods to evaluate the quality of DNA sequencing libraries based on very small amounts of sequencing, which will assist in developing more reliable and cost-effective clinical sequencing protocols.
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas et al. (2018) ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biol 19:36
Delás, M Joaquina; Sabin, Leah R; Dolzhenko, Egor et al. (2017) lncRNA requirements for mouse acute myeloid leukemia and normal differentiation. Elife 6:
Deng, Chao; Daley, Timothy; Smith, Andrew D (2015) Applications of species accumulation curves in large-scale biological data analysis. Quant Biol 3:135-144