Massive and high-dimensional data arise frequently in biology and genomics. Regularization and sparsity are critical components for modeling such data and extracting information from them. The broad success of sparse modeling methods, such as the Lasso, has encouraged rapid development in this area. However, most existing methods were developed within the frameworks of linear models and generalized linear models, and the complex structures in genomic data require development beyond them. To this end, this project proposes three novel sparse modeling methods with sophisticated model structures, driven by large-scale gene expression data, protein binding data and DNA sequence data.

The first method, motivated by modeling the relationship between protein binding and gene expression, constructs linear regression models on the terminal nodes of a decision tree. The decision tree partitions the population into subgroups according to the predictors, and each subgroup has its own sparse linear regression model relating the response to the predictors. Two types of regularization, one on the regression coefficients and the other on the size of the tree, are used to encourage sparsity.

The second method concerns the construction of tight clusters for gene expression data by penalizing the difference in grouped parameters between two tight clusters and between a tight cluster and the null cluster. Block-wise coordinate descent in conjunction with majorization is developed to maximize the regularized likelihood function.

The third method, motivated by the motif finding problem, aims at sequence pattern discovery. A dictionary model is used to partition a sentence into words, which represent sequence patterns, and single letters. A novel regularization through the Kullback-Leibler divergence is developed for the product-multinomial model for words, which can achieve sparsity in estimating the cell probabilities. This regularization is used to construct a sparse dictionary that contains only a small number of words. A generalized EM algorithm is proposed for parameter estimation and solution path construction.
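
To make the dictionary model in the third method more concrete, the following is a minimal sketch of how a sequence can be scored against a fixed dictionary. It assumes the simplest form of the model, in which a sequence is a concatenation of independently drawn words and single letters, and it leaves out the product-multinomial word model, the Kullback-Leibler regularization and the generalized EM algorithm described above. The function names, the toy dictionary and its probabilities are hypothetical and chosen only for illustration.

```python
def segmentation_score(seq, word_probs):
    """Sum, over every way of cutting seq into dictionary words,
    of the product of the word probabilities (forward recursion)."""
    max_len = max(len(w) for w in word_probs)
    # forward[i] holds the summed score of all segmentations of seq[:i]
    forward = [0.0] * (len(seq) + 1)
    forward[0] = 1.0
    for i in range(1, len(seq) + 1):
        for k in range(1, min(max_len, i) + 1):
            word = seq[i - k:i]
            if word in word_probs:
                forward[i] += forward[i - k] * word_probs[word]
    return forward[len(seq)]


def best_segmentation(seq, word_probs):
    """Single highest-scoring partition of seq into dictionary words,
    or None if seq cannot be segmented with this dictionary."""
    n = len(seq)
    max_len = max(len(w) for w in word_probs)
    best = [0.0] * (n + 1)   # best[i]: top score for the prefix seq[:i]
    back = [0] * (n + 1)     # back[i]: length of the last word in that top split
    best[0] = 1.0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            score = best[i - k] * word_probs.get(seq[i - k:i], 0.0)
            if score > best[i]:
                best[i], back[i] = score, k
    if best[n] == 0.0:
        return None
    words, i = [], n
    while i > 0:
        words.append(seq[i - back[i]:i])
        i -= back[i]
    return words[::-1]


# Hypothetical toy dictionary: four single letters plus one longer "word".
toy_dictionary = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.2}

print(best_segmentation("GTATAC", toy_dictionary))  # ['G', 'TATA', 'C']
```

In this toy case the sequence has two possible segmentations, one using only single letters and one using the word TATA, and the second dominates the score. A sparse dictionary in the sense of the abstract would keep only a small number of such longer words.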

Because efficient analysis of large-scale high-dimensional data is critical in many fields of science and engineering, the proposed research is of great current interest. In particular, the proposed methods are ready for application to cutting-edge research areas in genomics and molecular biology, where massive data sets are continuously generated. To accelerate such applications, free software packages and self-contained software are being developed so that users can analyze their own data. In addition, this proposal contains many innovative statistical methodologies that may contribute significantly to statistics and the computational sciences. Finally, the proposed research is integrated with educational activities through the development of new courses and the improvement of existing ones at both undergraduate and graduate levels.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1055286
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2011-06-01
Budget End: 2017-05-31
Support Year:
Fiscal Year: 2010
Total Cost: $400,000
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095