Modern sequencing platforms can sequence tens of billions of bases per run and generate petabytes of data, but individual study sizes may be small. Similarly, a wide variety of health data are now publicly available to inform health policy decisions, and it may be advantageous to use data from several different surveys. The ability to aggregate and compare heterogeneous data across different datasets would be critical to expanding the usable data available for any individual study. We propose to systematically study two major barriers to this effort: (1) aggregating different medical and biological datasets; (2) dealing with batch effects and structured heterogeneous data.
Aim 1 allows us to fully utilize information on related topics from diverse datasets. Information across different experiments needs to be combined in a statistically rigorous, reliable way: the process must fully exploit the available information, avoid introducing biases, and remain systematic and reproducible. Not all experiments study the same set of variables or features, so combining this information is a non-trivial task.
Aim 2 allows researchers to handle heterogeneity between individuals or samples, which is ubiquitous in biological and health data. For instance, sequencing machines evolve over time, and samples obtained with new technologies cannot be directly compared to samples taken on older systems, even if the data were collected in the same lab. The same applies to samples obtained under different environmental conditions. Currently, researchers are forced either to ignore such biases, potentially violating statistical validity, or to limit their analysis to data generated in one batch of samples. This work will extend the set of useful data available to researchers in a wide variety of domains and provide methods to compare and synthesize disparate datasets. The proposed work will result in: (1) development of algorithms with theoretical performance guarantees for combining information from datasets with a small number of overlapping features; (2) development of rigorous statistical procedures for hypothesis testing in the presence of within-group heterogeneity, which are particularly helpful for pre-/post-treatment studies, studies containing batch effects, or studies where samples are collected over long time periods using different technologies; (3) application of these methods in case studies in molecular biology (genetic pathway hypothesis generation) and in population survey data for health policy modeling.

Public Health Relevance

This project develops statistical methods that allow researchers to aggregate small datasets, extending the set of useful data available, and to compare and synthesize disparate datasets. These tools would be useful across a wide variety of fields, and we will demonstrate their relevance by performing two case studies in systems/molecular biology and health policy applications.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
1R01LM013315-01
Application #
9916886
Study Section
Special Emphasis Panel (ZLM1)
Program Officer
Ye, Jane
Project Start
2019-09-01
Project End
2022-08-31
Budget Start
2019-09-01
Budget End
2020-08-31
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of Southern California
Department
Engineering (All Types)
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
072933393
City
Los Angeles
State
CA
Country
United States
Zip Code
90089