In the digital age, advances in technology have enabled data collection at an unprecedented pace, including the collection of a wide variety of dynamic data over time. Such dynamic data potentially holds the key to unlocking many mysteries in science, such as how genes interact with each other during the development of Drosophila and humans alike. However, dynamic data is notoriously challenging to analyze because of its changing nature and its massive size. The PI plans to enhance the modeling toolbox for dynamic data by designing scalable parallel algorithms that aim at both high prediction accuracy and high interpretability through decision-tree-based methods. These methods have applications across many fields, including computational biology and precision medicine. During the course of the proposed research, graduate students will receive training in domain-driven data science and open-source software development. The proposed research will be further disseminated through an upcoming book, undergraduate- and graduate-level courses, and presentations at workshops and conferences.

High-volume dynamic data poses challenges for model training because the underlying data distribution varies with time. Algorithms and models, as well as their interpretations, have to adapt to the changing dynamics. Among statistical and machine learning methods, decision-tree-based ensembles are especially well suited to large volumes of dynamic data because tree ensembles can capture flexible non-linear relationships in the data and are easy to interpret, so that people can extract useful narratives and information. The PI's prior work, such as iterative Random Forests (iRF) and signed iterative Random Forests (siRF), identifies stable, high-order biomolecular interactions that explain the high predictive accuracy of these methods, but it focuses only on cross-sectional data at a fixed time point. The proposed research will build on the iRF and siRF algorithms to develop enhanced Random Forest and iRF algorithms for modeling high-volume dynamic data with interpretable high-order feature interactions. The PI will 1) develop a communication-efficient parallel RF training algorithm (pRF) that can efficiently take advantage of a large number of machines; 2) propose a novel method, dynamic iterative Random Forests (diRF), that discovers feature interactions in dynamic data in the presence of concept drift; and 3) carry out a theoretical analysis of the pRF and diRF algorithms under time-varying change-detection models in which local stationarity conditions are satisfied.
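As a rough illustration of why tree ensembles lend themselves to parallel training, the sketch below fits each ensemble member independently on its own bootstrap sample and combines them by majority vote. It is not the proposed pRF algorithm, whose communication-efficient design is not described in this abstract. The function names, the use of one-split stumps in place of full decision trees, the thread pool standing in for a cluster of machines, and the toy data are all assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_stump(X, y, seed):
    """Fit a one-split stump on a bootstrap sample (a stand-in for a full tree)."""
    rng = random.Random(seed)
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample with replacement
    best = None  # (errors, feature, threshold, left_label, right_label)
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in idx}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if X[i][j] <= t else right) != y[i] for i in idx)
                if best is None or errors < best[0]:
                    best = (errors, j, t, left, right)
    return best[1:]


def predict_stump(stump, x):
    j, t, left, right = stump
    return left if x[j] <= t else right


def fit_forest(X, y, n_trees=25, n_workers=4):
    """Train ensemble members independently; no member needs another's result."""
    # Threads are used only to keep the sketch self-contained. Under CPython's
    # GIL they give little CPU speed-up; a real system would use processes or
    # separate machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda seed: fit_stump(X, y, seed), range(n_trees)))


def predict_forest(forest, x):
    """Majority vote over binary (0/1) predictions."""
    votes = sum(predict_stump(s, x) for s in forest)
    return 1 if 2 * votes > len(forest) else 0


# Hypothetical toy data: the label depends only on the first feature.
X = [[i / 20.0, (i * 7 % 20) / 20.0] for i in range(20)]
y = [1 if row[0] >= 0.5 else 0 for row in X]
forest = fit_forest(X, y)
```

Because each member is trained in isolation, the only data exchanged in this sketch is the fitted stump itself. Reducing communication for full trees on data spread across many machines is the harder problem that the first aim targets.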

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1953191
Program Officer: Yong Zeng
Budget Start: 2020-08-15
Budget End: 2023-07-31
Fiscal Year: 2019
Total Cost: $452,016

Name: University of California Berkeley
City: Berkeley
State: CA
Country: United States
Zip Code: 94710