Defining the features of cellular mixtures, in which diverse cell types with distinct genomic characteristics are physically intermingled, is a central problem in biology. For example, diseases such as cancer are characterized by cellular masses composed of subpopulations, each with its own set of genetic variants and transcriptional signatures, where inter-population DNA variation is compounded by cell-to-cell stochasticity in RNA expression. Characterizing genomic diversity in cellular mixtures and assessing its impact on cell-to-cell gene expression variation require analyses at the resolution of individual cells and contiguous genome molecules. This level of analytical resolution is now feasible with next-generation sequencing (NGS) assays that integrate molecular barcoding with single-cell RNA sequencing and single-molecule DNA sequencing. These technological advances surmount key challenges and open new opportunities for the study of disease, but they require new analysis methods: (1) Current NGS methods are not optimal for detecting and phasing genomic variants in cellular mixtures. For example, it is difficult to detect complex structural variants (SVs) that are carried by only a fraction of the genomes present in a mixture. Methods based on short-read data are hindered by the loss of long-range contiguity in heavily fragmented DNA as well as by the low mappability of many SV junctions. Single-molecule linked-read DNA sequencing overcomes these drawbacks but lacks reliable analysis methods. (2) Single-cell RNA sequencing allows the detection of distinct cellular subpopulations with unique transcriptional signatures; however, data from individual cell transcriptomes have high levels of error and bias, and new analysis procedures are needed to make statistically sound inferences. (3) Existing methods for single-cell expression analysis typically ignore DNA heterogeneity, which can be crucial for some studies, especially in cancer. It is not yet clear how to characterize variation at both the DNA and RNA levels simultaneously in a cellular mixture.
This proposal addresses these issues by developing new statistical methods and experimental designs that enable accurate characterization of cellular mixtures exhibiting both DNA and RNA variation. We propose to develop methods to (1) detect, characterize, and phase complex variants using new single-molecule sequencing technology, (2) improve expression estimates obtained from single-cell RNA sequencing data, and (3) combine bulk single-molecule DNA sequencing and single-cell RNA sequencing to quantify the relationship between DNA variation and transcriptomic variation in genetically heterogeneous samples such as tumors.

Public Health Relevance

This application provides statistical and computational tools for the analysis of single-cell and single-molecule sequencing data, which allow more accurate profiling of genomic and cellular heterogeneity within diseased tissues such as tumors. More accurate tissue profiling forms the foundation for more accurate disease prognosis and more effective treatment.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG006137-07
Application #: 9382546
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Brooks, Lisa
Project Start: 2011-07-06
Project End: 2020-06-30
Budget Start: 2017-09-14
Budget End: 2018-06-30
Support Year: 7
Fiscal Year: 2017
Total Cost:
Indirect Cost:

Name: University of Pennsylvania
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 042250712
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Zhang, Hanrui; Zhang, Nancy R; Li, Mingyao et al. (2018) First Giant Steps Toward a Cell Atlas of Atherosclerosis. Circ Res 122:1632-1634
Huang, Mo; Wang, Jingshu; Torre, Eduardo et al. (2018) SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15:539-542
Zhou, Zilu; Wang, Weixin; Wang, Li-San et al. (2018) Integrative DNA copy number detection and genotyping from sequencing and array-based platforms. Bioinformatics 34:2349-2355
Xia, Li Charlie; Ai, Dongmei; Lee, Hojoon et al. (2018) SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience 7:
Urrutia, Eugene; Chen, Hao; Zhou, Zilu et al. (2018) Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics 34:2126-2128
Wang, Jingshu; Huang, Mo; Torre, Eduardo et al. (2018) Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci U S A 115:E6437-E6446
Ai, Dongmei; Huang, Ruocheng; Wen, Jin et al. (2017) Integrated metagenomic data analysis demonstrates that a loss of diversity in oral microbiota is associated with periodontitis. BMC Genomics 18:1041
Chen, Hao; Jiang, Yuchao; Maxwell, Kara N et al. (2017) Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 11:1169-1192
Jiang, Yuchao; Zhang, Nancy R; Li, Mingyao (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol 18:74
Lau, Billy T; Ji, Hanlee P (2017) Single molecule counting and assessment of random molecular tagging errors with transposable giga-scale error-correcting barcodes. BMC Genomics 18:745

Showing the most recent 10 out of 38 publications