Owing to their ease of handling and inexpensive storage, Formalin-Fixed Paraffin-Embedded (FFPE) tissues are the most widely available source of tissue material for which long-term clinical follow-up data are recorded. The ubiquity of FFPE tissue specimens has made them an invaluable resource in biomedical research, with great potential for predictive and prognostic biomarker discovery. However, the quality of RNA extracted from FFPE tissues is generally poor due to chemical modifications and continued degradation over time. Consequently, assays using microarrays or quantitative polymerase chain reaction (qPCR) often have limited reproducibility and sensitivity when measuring gene expression from such samples. In order to exploit the vast collection of FFPE samples, substantial effort has been devoted to the development and validation of advanced technologies that can reliably probe their gene expression levels. For medium-throughput profiling, NanoString nCounter is frequently used with FFPE samples, as the nCounter system can accurately measure gene expression even when the target RNA is degraded. For high-throughput profiling, RNA sequencing (RNA-seq) is in common use. Recent studies have shown that for a wide variety of human tumor tissues (e.g., bladder, colon, prostate and renal carcinoma), RNA-seq can be used to measure mRNA of sufficient quality extracted from FFPE tissues to provide biologically relevant transcriptome analysis. With these advances, the use of FFPE specimens in cancer research has been growing fast, and analysis of FFPE gene expression data has become increasingly important. A crucial step when analyzing this type of data is normalization. Existing normalization methods were all designed and validated using fresh-frozen (FF) or similar samples, because such samples have been the standard in most molecular biology analyses. FFPE expression data, in contrast, have very distinct technology-specific characteristics that present many statistical challenges.
Together, these factors give rise to a pressing need for novel and rigorous statistical approaches to normalization that model the key characteristics of FFPE expression data and remove all estimable biases, in order to enhance the power and reproducibility of transcriptome analysis and ultimately to promote utilization of the large existing collections of FFPE specimens in biomedical research. To meet this need, we propose to accomplish the following specific aims.
In Aim 1, we will develop rigorous yet flexible methods to normalize FFPE expression data from experiments using NanoString nCounter, the most important medium-throughput technology compatible with FFPE samples.
In Aim 2, we will develop robust and efficient methods to normalize FFPE data for high-throughput gene expression analysis using RNA-seq.
In Aim 3, we will collaborate with leading cancer researchers to apply and refine the statistical methods, to facilitate translation from biomarker discovery to clinical practice.
In Aim 4, we will test the proposed methods using extensive simulation and multiple benchmark data sets, and develop free and open-source software for dissemination to the scientific community.
Formalin-Fixed Paraffin-Embedded (FFPE) tissues are routinely archived and extensively stored in biorepositories worldwide with highly annotated demographic, clinicopathologic and long-term follow-up information, providing an invaluable resource for translational cancer research. Advanced technologies have made it increasingly feasible to extract mRNA material and quantify gene expression levels from FFPE specimens, although the process is noise-prone and the measured gene expression levels are biased due to the distinct biochemical characteristics of such samples. With the aid of the proposed statistical methods and computational tools, biases will be reduced and signals will be recovered, facilitating utilization of FFPE specimens for the discovery of molecular biomarkers for cancer risk prediction, early diagnosis, identification of therapeutic targets and prediction of response to treatment.