Researchers throughout academia, industry, and government are generating data at scales and levels of complexity far beyond what could previously have been imagined. Complex data demand statistical models that are flexible enough to adapt to meaningful underlying signals, allowing scientists to discover unexpected patterns. Yet as society relies more heavily on statistical algorithms to make decisions that affect everyday life, it becomes increasingly important for a method's output to be interpretable by non-experts. This demands parsimony: simpler explanations should be favored over more complicated ones.

For example, the Internet has produced unprecedented quantities of data in the form of text (articles, blogs, webpages, consumer reviews, and many other social media products). Such text data represent a potential treasure trove of insights into the world: what people are thinking, how that changes over time, and how it varies by location. The investigator develops new statistical methods that overcome major technical challenges to gleaning useful information from these data. The same methodology applies to the study of the microbiome, the vast community of microbes living in an environment such as the human gut, where better statistical methods are needed to identify the types of microbes that play a crucial role in human health and disease. Another problem tackled in this project is modeling data collected over time (such as wind-speed measurements and wildlife-monitoring records); the methods developed allow for more accurate forecasting, which is crucial in many areas, including health and medicine and the development of lower-cost energy systems.

The last major area of the project is devoted to making the process of statistical research more efficient and its software higher in quality and easier to share across the community of statistical researchers. Finally, all three research objectives are closely integrated with educational outcomes, including the supervision and teaching of graduate students, outreach to non-statisticians and non-scientists, and the release of undergraduate-accessible mini-papers describing the investigator's new research findings.

This project focuses on the design of new statistical methods that balance two important and often opposing needs: flexibility and parsimony. (1) Building predictive regression and classification models is difficult when the features are highly sparse. While many methods address the challenge of high dimensionality, relatively few have considered the obstacle posed by features that are rarely nonzero. The investigator develops a new framework for feature selection with highly sparse features that succeeds where existing methods fail; the framework is studied from both theoretical and computational standpoints. (2) High-dimensional covariance estimation and time series modeling are two rich but largely distinct areas of statistics, which the investigator combines to develop new methods for modeling locally stationary time series. The added flexibility in going from stationarity to local stationarity must be carefully balanced with parsimony. (3) A series of area-specific software modules will be distributed freely online, building on the investigator's new platform for streamlining simulation studies. Each module will implement some of the most common models, methods, and metrics used in a given area of statistics research, with the goal of facilitating the sharing of high-quality, reproducible simulation code in the statistics research community through an easily adaptable, standardized format.
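To make objective (1) concrete, the following is a minimal sketch of the highly sparse feature setting, assuming synthetic count data in the spirit of word counts or microbiome abundances. It illustrates only why the setting is hard and is not the investigator's proposed framework; all quantities here are invented for illustration.

```python
# Sketch of the "highly sparse features" setting from objective (1):
# features that are rarely nonzero, as with word counts in text data
# or taxon abundances in microbiome data. Hypothetical illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000          # many features relative to observations
nonzero_rate = 0.02       # each feature is nonzero only ~2% of the time

# Sparse design matrix: positive counts where a feature is observed,
# exact zeros everywhere else.
observed = rng.random((n, p)) < nonzero_rate
X = observed * (1 + rng.poisson(1.0, size=(n, p)))

# Only the first 10 features carry signal.
beta = np.zeros(p)
beta[:10] = 2.0
y = X @ beta + rng.normal(size=n)

# A feature seen nonzero only a handful of times provides very little
# information for selection -- the core difficulty of this setting.
support_counts = (X != 0).sum(axis=0)
print("median times a feature is observed nonzero:",
      np.median(support_counts))
```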
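Similarly, the local stationarity in objective (2) can be illustrated with a time-varying AR(1) process whose coefficient drifts slowly, so the series looks stationary within any short window but not globally. This toy construction is an assumption for illustration, not the investigator's model.

```python
# Sketch of local stationarity for objective (2): an AR(1) process
# whose coefficient varies slowly over time. Hypothetical illustration.
import numpy as np

rng = np.random.default_rng(1)
T = 2000
t = np.arange(T)
phi = 0.3 + 0.5 * np.sin(2 * np.pi * t / T)   # slowly drifting AR coefficient

x = np.zeros(T)
for s in range(1, T):
    x[s] = phi[s] * x[s - 1] + rng.normal()

# The variance within successive windows changes across time, reflecting
# the drifting dynamics; a single stationary model would miss this.
w = 200
local_var = [x[i:i + w].var() for i in range(0, T - w, w)]
print(np.round(local_var, 2))
```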
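For objective (3), one way a standardized model/method/metric module might be organized is sketched below. The abstract does not specify the platform's interface, so the structure and every function name here are hypothetical, not the investigator's actual software.

```python
# Hypothetical sketch of a standardized simulation module: a model that
# generates data, a method that produces an estimate, and a metric that
# scores it. Not the investigator's platform or its API.
import numpy as np

def model(n=100, seed=0):
    """Model: generate one simulated dataset."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return x, y

def method(x, y):
    """Method: estimate the slope by least squares."""
    return float(np.sum(x * y) / np.sum(x * x))

def metric(estimate, truth=2.0):
    """Metric: squared error of the estimate."""
    return (estimate - truth) ** 2

# A simulation study is then model -> method -> metric over replicates,
# a format that is easy to share and adapt across areas of research.
errors = [metric(method(*model(seed=s))) for s in range(100)]
print("mean squared error:", np.mean(errors))
```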

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1748166
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-07-01
Budget End: 2022-06-30
Support Year:
Fiscal Year: 2017
Total Cost: $361,149
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089