This proposal focuses on sufficient dimension reduction (SDR), which comprises methods for reducing the dimension of the predictor vector X with reference to the response Y in regression or classification problems. In the last 10 to 15 years, a variety of SDR methods have been developed that do not require a regression model and that exploit the conditional moments of X given Y. These methods have accrued a striking record of successful applications and have led to a variety of techniques. The investigators propose to introduce inverse reductive models that describe the stochastic structure of X given Y, rather than Y given X as in traditional regression. Preliminary results indicate that this will lead to significant advances in theory, methods and applications. Reductive models provide a unified perspective linking traditional methods such as principal components and various recent model-free inverse methods. In addition, reductive models can provide information bounds, which make it possible to evaluate and improve upon the performance of existing model-free methods in recognizable contexts.
High-throughput technologies produce massive amounts of complex and interconnected data. More than ever before, understanding experimental evidence and exploring scientific hypotheses require methods to meaningfully reduce high-dimensional data. This is particularly the case for contemporary genomic sciences. Sequencing techniques, alignment algorithms, microarrays and other emerging experimental technologies generate information on genomes, myriads of novel functional elements within them, patterns of simultaneous expression for the thousands of genes they contain, and patterns of evolution across related species. The need to handle this growing body of information has spawned a whole new discipline, bioinformatics, at the very heart of which are data reduction methods. In this proposal the investigators plan to study a class of inverse reductive models that unify and improve on existing dimension reduction methods, and that are capable of handling situations where the number of variables far exceeds the number of subjects. Such situations are typical for genomic applications, and are difficult or impossible to study using existing methods.