The investigators in this project develop methods that control the selection of predictive features from multiple sources when building statistical models. A martingale representation of the number of spurious variables provides the underlying theoretic support. This martingale defines a framework for testing a possibly infinite sequence of hypotheses. This representation leads to methods for streaming feature selection that control the expected number of false discoveries (mFDR). Extensions to be developed in this project generalize prior work of the investigators, extending their results to multiple streams of potential features while maintaining the martingale representation. Whereas the previous work of the authors was in the high-noise, low-signal setting in which few features are predictive (the nearly black setting), advances in this proposal push their methods into problems characterized by many predictive features with higher signal-to-noise ratios. This proposal envisions replacing the original martingale by one directly related to the goodness of fit of the model. The investigators plan to use this revised martingale to show that an auction-based system that combines several sources of features satisfies the mFDR condition.

The investigators develop novel methods for building predictive statistical models that combine and learn from multiple sources of information. A predictive statistical model is an empirical rule constructed from data that predicts a specific characteristic of observations, the response, based on the values of other characteristics. The challenge of building these models is to identify characteristics that yield predictive insights. While ever larger amounts of data are an essential input to a statistical model, the presence of vast numbers of characteristics lead to the problem of over-fitting. Over-fitting occurs when one confuses a random coincidence among characteristics with a reproducible pattern. Modern data mining produces such a plethora of characteristics that it becomes difficult to distinguish real from imaginary associations. The investigators propose a system that makes these distinctions in the context of a common modeling paradigm. As a practical testbed, the investigators will analyze classic computational linguistic problems using regression analysis, the workhorse method of applied statistics. Given the extent of experience in linguistics, any deficiencies of a regression model will stand out. This will encourage innovations in regression that maintain their simplicity while competing with handcrafted methods in linguistics. These innovations should extend to other applications including fMRI, genetics, and more general data mining.

Project Report

The research funded by this proposal has produced novel developments in the selection of features for predictive modeling. The emphasis is upon models used in the analysis of text. Features in this context are used in predictive models to anticipate future outcomes, such as resolving questions of authorship or identifying structural characteristics of text. Such features must be identified during model development, often by searching large collections of related documents. Streaming methods preform this search sequentially, allowing for searches of exceptionally large amounts of data such as that encountered in building models from text. Alternative methods also considered include methods of random projection that avoid costly matrix operations and storage. The contributions made in this work concern both the rapid generation of features as well as insight into how to choose among these features. Text mining provides the setting for many of these applications. For example, one might try to predict a medical outcome using only the text of medical case reports, temporarily eschewing further quantitative sources. Our applications show that this approach is feasible and competitive with classical numerically driven models. The features in these applications are numerical vectors from eigen-analysis of text. These features might be generated by classical linguistic analysis or by innovations developed in our research. In one such application of text mining, the features can be ordered a priori, allowing a powerful search technique. Improvements in power are obtained by alpha-investing as well as by employing a Bayesian model that adaptively pools estimates in accordance with a monotone prior for the parameters. This contribution shows that the risk of the resulting estimates is competitive with optimal methods derived from oracle procedures. Oracle procedures provide a benchmark for comparison in this context; oracle procedures benefit from extra information that is not available in practical applications. Realizable methods that approach the levels of oracle performance are thus highly desired. Our results also concern the estimation risk produced by streaming feature selection. Typical approaches in this domain control the false discovery rate, the proportion of false positives among selected features (ie, the proportion of unnecessary features). In keeping with typical practice in the testing of statistical hypotheses, error rates are held low, on the order of 5%. Exact calculation of the risk of such estimators shows, however, that controlling the false discovery rate at such low rates exposes the modeling to exceptionally large estimator risk. Higher false discovery rates allow more over fitting, but provide more powerful estimates that capture otherwise missed structure. An additional set of results concern the speed of the computations of these estimates. Our work has developed and studied the properties of several estimation techniques, with particular emphasis on regression methods that are widely applicable.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Standard Grant (Standard)
Application #
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Pennsylvania
United States
Zip Code