Data science has emerged as an important field for making decisions based on data collected from sectors as varied as healthcare and housing. Though data are plentiful, thanks to phone apps, merchant loyalty cards, and social media accounts, there is still a question of whether more data translates to more knowledge. Furthermore, collection and storage can be problematic, especially when data are sensitive, as is often the case with clinical trials and genetic experiments. The problem of selecting information-rich data therefore becomes crucial for creating models that can reliably predict the outcome of future experiments. Few results have been published on the amount of data necessary, and there are currently no guidelines for generating specific data sets that would unambiguously identify a predictive model. As a first step towards developing a complete theory, the PIs will focus on models described by finite-valued nonlinear polynomial functions. (For example, the internal "function" in WebMD's Symptom Checker returns medical conditions according to symptoms input by the user.) They will construct the smallest data sets that have a single associated polynomial model and study the properties of such data sets. From these computational experiments, they will build the appropriate theory, design algorithms, and generate code that can later be developed into software complete with a graphical user interface. Graduate students will participate at the appropriate level in each component of the project. Such an experience will provide them with possible topics for an MS thesis or PhD dissertation and will very likely inspire a career-long involvement in the STEM disciplines. The theoretical results will advance the fields of design of experiments, network inference, and finite dynamical systems through the determination of criteria for selecting data sets that uniquely identify models. The algorithms will serve as a guide for experimentalists in determining the data that are needed to identify the structure of a network of interest. Such knowledge has the potential to drastically reduce the resources wasted on too much data with too little information.

While this is the age of big data, there is still a question of whether more data translates to more knowledge. Particularly when collecting data is expensive or time-consuming, as is often the case with clinical trials and biomolecular experiments, the problem of selecting information-rich data becomes crucial for creating relevant models. Finite-state multivariate polynomial functions have been used successfully to model complex networks from discretized data; however, few results have been published on the amount of data necessary for such models, and the majority apply to Boolean models only. It is still unknown which data points explicitly identify such discrete models, and as a consequence, there are no methods for generating the specific data sets that would unambiguously identify the model. The PIs will address the minimality and specificity of the data needed to uniquely identify discrete polynomial models by developing the appropriate theory, designing algorithms, and generating code that can later be built into software. Graduate students will participate at the appropriate level in each component of the project. This project will resolve some important computational issues in network inference and will improve experimental design and model selection by eliminating the effect of computational artifacts that arise when working with nonlinear multivariate polynomials. The theoretical results will advance the fields of design of experiments and network inference through the establishment of criteria for selecting data sets that uniquely identify models. The proposed work will also increase the utility of polynomial dynamical systems as models of complex networks by establishing the minimal amount of data for unique model identification. The algorithms will serve as a guide for experimentalists in determining the data that are needed to identify the structure of a network of interest. Such knowledge has the potential to drastically reduce the number of experiments performed and to eliminate the generation of data with little intrinsic value.
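
As a toy illustration of what it means for a data set to uniquely identify a discrete polynomial model, the sketch below enumerates every affine polynomial a + b*x + c*y over the integers modulo 3 and counts how many reproduce a given set of observations. It is not the PIs' method: the field size, the restriction to affine polynomials, the "true" model, and all function names are assumptions chosen only to keep the example small enough for brute force.

```python
from itertools import product

P = 3  # field size Z_3; an assumption made for this illustration


def affine_models(p=P):
    """All polynomials a + b*x + c*y over Z_p, as coefficient triples (a, b, c)."""
    return list(product(range(p), repeat=3))


def evaluate(coeffs, point, p=P):
    """Evaluate a + b*x + c*y modulo p at the point (x, y)."""
    a, b, c = coeffs
    x, y = point
    return (a + b * x + c * y) % p


def consistent_models(data, p=P):
    """Return every affine model that reproduces all (input, output) pairs."""
    return [m for m in affine_models(p)
            if all(evaluate(m, pt, p) == out for pt, out in data)]


# Hypothetical "true" network function used to generate observations: f(x, y) = 1 + 2x.
true_model = (1, 2, 0)


def observe(points):
    """Simulate experiments: record the true model's output at each input."""
    return [(pt, evaluate(true_model, pt)) for pt in points]


# Two data sets of the same size, differing only in which inputs were measured.
spread = observe([(0, 0), (1, 0), (0, 1)])
collinear = observe([(0, 0), (1, 1), (2, 2)])

n_spread = len(consistent_models(spread))
n_collinear = len(consistent_models(collinear))

print("models consistent with spread inputs:   ", n_spread)
print("models consistent with collinear inputs:", n_collinear)
```

Both data sets contain three observations, yet in this toy setting the first leaves exactly one consistent model while the second leaves several, because its inputs all lie on the line x = y and so carry no information that separates the coefficients of x and y. The question of which inputs to measure, rather than how many, is the one the abstract raises.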

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1720341
Program Officer: Leland Jameson
Project Start:
Project End:
Budget Start: 2017-08-15
Budget End: 2019-07-31
Support Year:
Fiscal Year: 2017
Total Cost: $100,000
Indirect Cost:
Name: Clemson University
Department:
Type:
DUNS #:
City: Clemson
State: SC
Country: United States
Zip Code: 29634