Rapid and accurate detection of a biothreat is important not only for containing its potential damage, but also for determining potential medical remedies. Extensive research shows that certain genes in infected cells have different mRNA expression levels for different pathogens. Thus, accurately identifying the genes that react to pathogens and accurately quantifying their expression variations are key steps in early biothreat detection. The emerging RNA-Seq technologies provide tens of millions of short sequence reads of the expressed genes, which, after mapping to the genome, can be converted to accurately represent gene expression levels. However, the conversion from sequence reads to gene expression levels is still problematic. In this project, the investigator and her colleagues will tackle this problem by modeling RNA-Seq data through a broad class of flexible nonlinear models, called sufficient dimension reduction (SDR) models; proposing novel variable selection methods for SDR models; and developing the theoretical underpinnings of the effectiveness of the proposed methods. As a consequence, this effort will result in a powerful software suite for estimating gene expression levels from RNA-Seq data and identifying marker genes that react to specific pathogens in a unified framework.
This project not only addresses some emerging issues in biothreat detection using high-throughput sequencing technologies, but also results in novel statistical methods and theory broadly applicable to general statistical learning and prediction problems. More specifically, the proposed work will (i) produce new methodologies for analyzing ultra-high-dimensional data, (ii) inspire new lines of quantitative investigation in genomics, and (iii) offer a unique educational experience for both undergraduate and graduate students to participate in cutting-edge statistical and interdisciplinary research.