The accurate and reliable recovery of sparse signals in massive and complex data is a fundamental question in many scientific fields. The discovery process usually involves extensive screening of a large number of hypotheses to separate the signals of interest and to recognize their patterns. The situation can be described as finding needles of various shapes in a haystack. Despite enormous methodological progress in data screening, pattern recognition and related fields, there have been few theoretical studies of optimality and error control in situations where a large number of decisions are made sequentially and simultaneously. These issues are among the central topics in modern statistics; hence it is imperative to develop solid theory and powerful data-driven methods to help understand, regulate and optimize the dynamic decision process of sparse signal and pattern recovery. The specific research goals of this proposal are: to study the optimality theory and develop data-driven methods for a broad class of interrelated problems in signal detection, multiple testing and pattern classification; to develop a dynamic scheme for data acquisition, resource allocation and decision making for effective and accurate signal recovery; and to develop a compound decision-theoretic framework for large-scale simultaneous and sequential inference.
Data screening and pattern recognition problems arise in a wide range of scientific applications such as bioinformatics, finance, signal and language processing, image analysis, and geographical and astronomical surveys. These problems have contributed significantly to the rapid growth of data mining, an active interdisciplinary research area that has attracted substantial interest from applied mathematicians, statisticians and computer scientists. The proposed research provides important insights into some fundamental issues in these problems, such as how the size of large data sets can be reduced significantly without losing many signals, how the signals can be separated from noise optimally, how the shapes and patterns of different objects can be recognized accurately, and how the inflation of errors in a large number of decisions can be controlled effectively. User-friendly software will be developed and made freely available for public use. The investigator will integrate the proposed research into educational activities by developing new courses for the young USC Statistics program, and by mentoring and training both undergraduate and graduate students to help them participate effectively in an information era overwhelmed by massive data.