Rapid and accurate detection of biothreats is important not only for containing their potential damage, but also for determining potential medical remedies. Extensive research shows that certain genes in infected cells have different mRNA expression levels for different pathogens. Thus, accurate identification of the genes that react to pathogens and accurate quantification of their expression variations are key steps in early biothreat detection. The emerging RNA-Seq technologies provide tens of millions of short sequence reads of the expressed genes, which, after mapping to the genome, can be converted to represent gene expression levels. However, the conversion from sequence reads to gene expression levels is still problematic. In this project, the investigator and her colleagues will tackle this problem by modeling RNA-Seq data through a broad class of flexible nonlinear models, called sufficient dimension reduction (SDR) models; proposing novel variable selection methods for SDR models; and developing the theoretical underpinnings of the effectiveness of the proposed methods. This effort will result in a powerful software suite for estimating gene expression levels from RNA-Seq data and identifying marker genes that react to specific pathogens in a unified framework.
This project not only addresses emerging issues in biothreat detection using high-throughput sequencing technologies, but also results in novel statistical methods and theory broadly applicable to general statistical learning and prediction problems. More specifically, the proposed methods (i) produce innovative new methodologies for analyzing ultra-high dimensional data, (ii) inspire new lines of quantitative investigation in genomics, and (iii) offer a unique educational experience for both undergraduate and graduate students to participate in cutting-edge statistical and interdisciplinary research.
The research in this proposal focuses on developing novel statistical theory and methods to probe some of the striking new phenomena that emerge from next-generation sequencing technology. Over the past three years, we have gradually developed techniques to overcome the computational and theoretical challenges that arise in high-dimensional data analyses related to high-throughput sequence analysis. A series of high-dimensional regression methods has been developed. The methods have been successfully applied in modern genomics, epigenetics, neuroimaging and chemical sensing. Under the support of this program, 14 peer-reviewed articles have been published in top statistics and bioinformatics journals. Among them, 7 statistical methodology papers were published in top statistics journals, including Journal of the American Statistical Association, Annals of Statistics, Journal of the Royal Statistical Society: Series B, and Technometrics, and 7 bioinformatics papers were published in top bioinformatics journals, such as Integrative Biology, PLoS ONE and Molecular Vision. As a side product, our research effort in statistical methodology development also resulted in one publication and two manuscripts in neuroimaging and two published works in analytical chemistry. We have also given more than 30 invited talks and conference presentations, and we delivered our research results to high school students through guest lectures, instructional material development, and science project mentoring. In 2012, the research supported by this program was featured in the College of Liberal Arts & Science Newsletter at UIUC.
More specifically, we developed stepwise variable selection methods and theory under the sufficient dimension reduction (SDR) framework; developed trace-based stepwise variable selection methods and established asymptotic consistency theory for these methods in the SDR framework; developed variable screening methods and theory in the ultra-high dimensional SDR framework; and developed publicly available data analysis software for modeling RNA-Seq short-read counts. Our major achievements include the development of a suite of variable selection procedures under the SDR framework and of their theoretical underpinnings, including the first consistency result for a stepwise type of variable selection procedure. Similar theoretical results are still lacking even for the much simpler linear regression model. A publicly available computational tool, COP, has been deposited on CRAN for public use. A side product of the grant is a set of applications in neuroimaging, which include: 1) constructing a high-dimensional scalar-on-image regression in which human subjects’ mental disease status or mental states are the response and their brain activity, measured by fMRI at many small brain locations, serves as the predictors, and identifying brain regions that are predictive of subjects’ mental states; 2) developing fast-to-compute methods that can be applied to high-dimensional fMRI data to identify brain regions responsive to a designed stimulus sequence; and 3) developing variable selection methods for constructing high-dimensional brain networks. The project provided a rich set of research problems to the graduate students involved. On the UIUC and UGA campuses, one minority graduate student was supported for two years and obtained his doctoral degree this summer. He is currently a visiting assistant professor at UIUC. Three first-year graduate students, including one female student, have been supported for one year.
A postdoctoral researcher worked on this project for half a year as part of his postdoctoral training. On the Harvard campus, one minority graduate student was supported during the past year and is finishing his thesis this year. One postdoctoral fellow was supported for 5 months and has applied the developed methods to genetics problems. On the Virginia campus, one female Ph.D. candidate was supported by the award for two months in summer 2014. PI Zhang supervised her in conducting the proposed research and completing her Ph.D. dissertation.