Various normalization methods, including both simple re-scaling-based methods and regression-based methods, have been developed for RNA sequencing data to remove unwanted variations in sequencing depth due to experimental handling. Most of these methods presuppose that variations in the assumed scaling factor or in the projection of the assumed regression function are solely due to data artifacts and should be removed. This assumption, however, may not hold for studies of low-complexity RNA molecules such as microRNAs (a prevalent class of small RNAs that are closely related to carcinogenesis), which tend to be expressed in a tissue-specific manner with only a small number of molecules expressed dominantly. The properties of depth normalization methods have not been assessed for tumor microRNA data. In this proposal, we will conduct such an assessment using a unique pair of microRNA datasets for the same set of tumor samples, where one dataset was collected using uniform handling and balanced sequencing library assignment while the second dataset was collected using neither. The former dataset can be assessed for disease-relevant microRNAs, serving as a benchmark; the latter can be used to test normalization methods in comparison with the benchmark. An R package will be built for the proposed assessment and disseminated to the research community so that others can reproduce our study and conduct objective evaluations of new methods. In addition to method assessment, we will use the paired datasets to develop a novel statistical approach for guiding the choice of a normalization method for the data under study, and we will test this approach in microRNA data from The Cancer Genome Atlas (TCGA). An R package implementing the proposed approach will be developed and disseminated to the research community.
RNA sequencing is widely used in cancer research to decipher the causes of cancer and find improved treatment options. A critical step for analyzing RNA sequencing data is to normalize the sequencing depth so that measurements from different samples are comparable. There is an urgent need to evaluate the properties of statistical methods for depth normalization when they are applied to tumor microRNA data, and to develop a statistical approach for guiding the choice of a depth normalization method that is suited to the data under study.