Careful statistical analysis of the massive amounts of data available today can enable early threat detection. This proposal consists of two parts. The first part focuses on the detection of mutations in pathogen samples. Many emerging health threats are due to new mutations in evolving pathogen populations, which can now be profiled using massively parallel sequencing experiments. The investigators work with Dr. Hanlee Ji's laboratory at the Stanford Genome Technology Center, whose deep sequencing platform allows the detection of low-prevalence mutations in pathogen samples. Previous work has treated this problem mainly from an algorithmic perspective, without statistical models for error estimates. The investigators propose methods for the analysis of single nucleotide changes and general structural variants, and consider the analysis of single samples, the simultaneous analysis of multiple samples, and the comparison of matched samples. The second part of the proposal considers threat detection in a more general framework: detection of changes from a background condition in one or more parallel streams of data. Examples include cyber-attacks on computer networks, the introduction of belligerent agents (e.g., landmines, aircraft) into previously quiescent environments, the appearance of noxious chemicals, and genetic modifications of viruses or bacteria. The main contribution is a general conceptual framework for integrating data from a large number of distributed sources, when the signal of interest may be present in only a small fraction of the sources. This proposal motivates theoretical developments in the areas of change-point detection, mixture estimation, empirical Bayes estimation, and false discovery rate control.

Careful statistical analysis of the massive amounts of data collected in modern scientific and technological activities can enable early threat detection. This proposal consists of two parts. The first part focuses on the detection of mutations in pathogen samples. Many emerging health threats are due to new mutations in evolving pathogen populations, which can now be profiled using next-generation sequencing experiments. The accurate detection of new mutations is important because they may confer a survival advantage on the pathogen that carries them. To date, this problem has been treated mainly from an algorithmic perspective, without statistical models for error estimates. The methods developed in this proposal will bridge this gap. The second part of the proposal considers threat detection in a more general framework: detection of changes from a background condition in one or more parallel streams of data. Examples include cyber-attacks on computer networks, the introduction of belligerent agents (e.g., landmines, aircraft) into previously quiescent environments, the appearance of noxious chemicals, and genetic modifications of viruses or bacteria. The main contribution is a general conceptual framework for integrating data from a potentially large number of distributed sources, when the signal of interest may be present in only a small fraction of the sources.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1043204
Program Officer: Leland Jameson
Budget Start: 2010-10-01
Budget End: 2015-09-30
Fiscal Year: 2010
Total Cost: $710,987
Name: Stanford University
City: Stanford
State: CA
Country: United States
Zip Code: 94305