Rapid and accurate detection of biothreats is important not only for containing their potential damage, but also for determining potential medical remedies. Extensive research shows that certain genes in infected cells have different mRNA expression levels for different pathogens. Thus, accurate identification of the genes that react to pathogens and accurate quantification of their expression variations are key steps in early biothreat detection. The emerging RNA-Seq technologies provide tens of millions of short sequence reads of the expressed genes, which, after mapping to the genome, can be converted to accurately represent gene expression levels. However, the conversion from sequence reads to gene expression levels is still problematic. In this project, the investigator and her colleagues will tackle this problem by modeling RNA-Seq data through a broad class of flexible nonlinear models, called sufficient dimension reduction (SDR) models; proposing novel variable selection methods for SDR models; and developing the theoretical underpinnings of the effectiveness of the proposed methods. As a consequence, this effort will result in a powerful software suite for estimating gene expression levels from RNA-Seq data and identifying marker genes reacting to specific pathogens in a unified framework.

This project not only addresses some emerging issues in biothreat detection using high-throughput sequencing technologies, but also results in novel statistical methods and theory broadly applicable to general statistical learning and prediction problems. More specifically, the proposed work (i) produces innovative methodologies for analyzing ultra-high dimensional data, (ii) inspires new lines of quantitative investigation in genomics, and (iii) offers a unique educational experience for both undergraduate and graduate students to participate in cutting-edge statistical and interdisciplinary research.

Project Report

Biothreats often involve natural or engineered pathogens (viruses or bacteria) that are used in terrorist attacks to incapacitate or kill people. Rapid and accurate detection of biothreats is important not only for containing their potential damage, but also for determining potential medical remedies. Recent research shows that certain genes can be used to distinguish the threat-related pathogens. Thus, accurate identification of the genes that react to pathogens and accurate quantification of their expression levels ("host gene expression levels") at a genome-wide scale are often the key steps in early biothreat detection. The goal of this project is to identify the pathogen by accurately quantifying the host gene expression levels and consistently selecting the marker genes that can be used as biomarkers in threat detection. The emerging RNA-Seq technologies provide tens of millions of short sequence reads of the expressed genes, which, after mapping to the genome, can be converted to accurately represent gene expression levels. In this project, we focused on predicting gene expression from sequences by using next-generation sequencing data, and formulated the quantification of gene expression in RNA-Seq data as a statistical variable selection problem. The PI, together with two other PIs, proposed a stepwise variable selection method, called correlation pursuit (COP), under a nonlinear index model for quantifying the relationship between gene expression levels and sequence features and for selecting features predictive of gene expression. A crucial advantage of the index model is that it does not rely on any stringent model assumption and is flexible enough to characterize the variation in RNA-Seq data. The PI has carefully evaluated the proposed method and developed the theoretical underpinnings of its effectiveness.
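To make the idea of stepwise selection under an index model more concrete, the following is a minimal sketch and not the project's COP software. It assumes the commonly used index-model form y = f(b1'x, ..., bK'x, e) with an unknown link f, and it uses the leading eigenvalue of a sliced inverse regression (SIR) matrix as the selection criterion. The function names, the forward-only search, and the fixed `min_gain` stopping rule are illustrative assumptions; the report describes COP only as a stepwise method and does not specify these details.

```python
import numpy as np


def sir_leading_eigenvalue(X, y, n_slices=10):
    """Largest eigenvalue of Cov(X)^{-1} Cov(E[X | slice of y]).

    Values near 1 suggest that some linear combination of the columns
    of X is strongly related to y through an arbitrary link function.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    sigma = np.atleast_2d(np.cov(Xc, rowvar=False))
    # Slice the observations into groups of similar response rank.
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    eigvals = np.linalg.eigvals(np.linalg.solve(sigma, M))
    return float(np.max(eigvals.real))


def forward_correlation_pursuit(X, y, n_slices=10, min_gain=0.05):
    """Greedy forward selection (hypothetical simplification).

    At each step, add the predictor that most increases the SIR
    eigenvalue; stop when the best increase falls below min_gain.
    """
    p = X.shape[1]
    selected, current = [], 0.0
    while len(selected) < p:
        best_j, best_val = None, current
        for j in range(p):
            if j in selected:
                continue
            val = sir_leading_eigenvalue(X[:, selected + [j]], y, n_slices)
            if val > best_val:
                best_j, best_val = j, val
        if best_j is None or best_val - current < min_gain:
            break
        selected.append(best_j)
        current = best_val
    return selected
```

In a setting like the one described here, the rows of `X` would be genes or transcripts, the columns would be sequence features, and `y` would be a measure of expression. A full stepwise procedure would normally also include a backward deletion step and a statistically calibrated threshold in place of the fixed `min_gain` used in this sketch.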
We applied the proposed COP algorithm to real RNA-Seq and ChIP-Seq data, and showed that the new method is able to identify significant sequence features that are predictive of gene expression levels. Variable selection problems occur not only in biothreat detection but also in many other scientific fields, for example neuroscience and the environmental and medical sciences. This project therefore not only addresses some emerging issues in biothreat detection using high-throughput sequencing technologies, but also results in novel statistical methods and theory broadly applicable to general statistical learning and prediction problems. The PI also developed new variable selection methods applicable to imaging data, data with spatial information, and data arising in network studies. Specifically, the PI has applied the developed methods to human brain data, one of the most cited forms of big data, and confirmed existing psychology theories regarding human brain activity under different social scenarios. With the support of the grant, the PI is also participating in the preparation of the book "Handbook of Modern Statistical Method: Neuroimaging Data Analysis". The book aims to provide an educational guide for advanced master's and Ph.D. students learning statistical methods for neuroimaging data. In the book chapter "Linear and Non-linear Models for FMRI Time Series Analysis", the PI has elaborated on her statistical methods, which were developed for analyzing multi-subject fMRI data and published in the journal NeuroImage.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1120756
Program Officer: Leland M. Jameson
Project Start:
Project End:
Budget Start: 2011-08-15
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2011
Total Cost: $53,039
Indirect Cost:
Name: University of Virginia
Department:
Type:
DUNS #:
City: Charlottesville
State: VA
Country: United States
Zip Code: 22904