This award is made as part of the FY 2012 Mathematical Sciences Postdoctoral Research Fellowships Program. These fellowships support a research and training plan at a host institution in the mathematical sciences, including applications to other disciplines. The title of the research and training plan for this fellowship to Rina Foygel is "Low-rank matrix reconstruction in the presence of non-uniform sampling, logistic response, or sparsely-low-rank structure." The host institution for the fellowship is Stanford University, and the sponsoring scientist is Dr. Emmanuel Candes.
This project is centered on finding signals in high-dimensional data, a task that arises in many modern applications in fields such as genetics and image processing and compression. The main outcomes of the project are in two areas: first, the problem of corrupted sensing, where multiple structured signals are present in the data and may obscure each other, and second, the problem of false discovery control in high-dimensional regression, where the large number of potential explanatory variables makes it difficult to distinguish real effects from false positives.
In the corrupted sensing problem, limited measurements of a structured signal have been corrupted by outliers: some of the measurements are unreliable, but it is not known which measurements are reliable and which are corrupted. More generally, both the signal and the corruption can exhibit different types of structure, depending on the setting. The goal is to disentangle the signal from the corruption based on the observed measurements, and to recover the signal as accurately as possible. The main finding for this component of the project is that this is possible even without side information about the magnitude of the signal or the level of corruption present. The problem can be approached through a convex optimization problem that balances a penalty function on the estimated signal against a penalty function on the estimated corruption, with each penalty reflecting the type of structure known to be present in that component. Surprisingly, the tradeoff parameter that controls the balance between these two penalties can be chosen based on geometric properties of the signal and corruption structures. More broadly, this result may have implications for other penalized convex optimization problems that require choosing penalty or tradeoff parameters.
False discovery control in high-dimensional regression is the second area of focus for this project. Consider, for example, applications such as predicting disease susceptibility from patients' genetic information, or predicting drug resistance from the mutations present in a virus's proteins. In a linear regression with many potential explanatory variables, scientists are often interested in testing each variable for the presence of an effect. However, the large number of potentially correlated explanatory variables means that these hypothesis tests are not independent and suffer from the problem of multiple comparisons. A new method created in the course of this project, the knockoff filter, selects explanatory variables while maintaining precise control of the false discovery rate (FDR), which is the expected proportion of discoveries that are false positives. This method creates additional "knockoff" explanatory variables whose presence serves as a test of model selection methods; if too many of these knockoffs are selected by a particular model selection method, then presumably the selected model contains many false positives as well. The specific construction of the knockoffs allows for strong empirical and theoretical control of the FDR, beyond what is possible with permutation-based methods and other existing approaches. Preliminary results on publicly available data for HIV-1 drug resistance demonstrate that the method has power to select replicable effects from among many potential false positives. As a result, the broader impacts of this finding lie in its usefulness as a tool for selecting reliably replicable effects in a range of applications where there are many potential explanatory variables.
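The idea that selected knockoffs signal false positives can be illustrated with a small sketch of the final selection step. This summary does not spell out that step, so the rule below is an assumption: it follows the data-dependent threshold commonly given in the published knockoff filter literature. It also assumes that each variable j already has a statistic W[j], which is large and positive when the original variable enters the model ahead of its knockoff and negative when the knockoff wins. Constructing the knockoffs and computing W are not shown. The function names and the `offset` parameter are hypothetical, chosen only for illustration.

```python
import math


def knockoff_threshold(W, q, offset=1):
    """Smallest threshold t whose estimated false discovery proportion is <= q.

    W      : per-variable statistics (assumed already computed; positive values
             favour the original variable, negative values favour its knockoff).
    q      : target false discovery rate, e.g. 0.2.
    offset : 1 gives the more conservative variant, 0 the more liberal one.
    """
    # Only the nonzero magnitudes of W need to be tried as thresholds.
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        # Knockoffs that "win" by at least t stand in for false positives.
        est_false = offset + sum(1 for w in W if w <= -t)
        # Original variables that win by at least t would be selected.
        n_selected = max(1, sum(1 for w in W if w >= t))
        if est_false / n_selected <= q:
            return t
    # No threshold meets the target, so nothing is selected.
    return math.inf


def knockoff_select(W, q, offset=1):
    """Indices of the variables whose statistic reaches the threshold."""
    t = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= t]


if __name__ == "__main__":
    # Made-up statistics for ten variables, purely for illustration.
    W_demo = [5, 4, 3, 2.5, -1, 2, 1.5, -0.5, 0.2, 6]
    print(knockoff_threshold(W_demo, q=0.2))
    print(knockoff_select(W_demo, q=0.2))
```

The count of large negative statistics plays the role described above: each knockoff that beats its original variable is treated as evidence of one false positive among the selections, and the threshold is raised until that estimated proportion falls to the target level.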