With the rapid growth of modern technology, many large-scale biomedical studies generate massive datasets with multi-modality imaging, genetic, neurocognitive, and clinical information from increasingly large cohorts. We consider six publicly available datasets: the Human Connectome Project (HCP) study, the UK Biobank study, the Pediatric Imaging, Neurocognition, and Genetics study, the Philadelphia Neurodevelopmental Cohort, the Alzheimer's Disease Neuroimaging Initiative study, and the UNC Early Brain Development Study. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets may transform our understanding of how genetic variants impact brain structure and function, cognitive function, and brain-related disease risk across the lifespan. This is critical for diagnosis, prevention, and treatment of brain-related disorders (e.g., schizophrenia and Alzheimer's disease). However, the development of methods for the joint analysis of high-dimensional imaging-genetic data, called big data squared, presents major theoretical and computational challenges due to the complexities of imaging phenotypes such as regional volumetric measurements, cortical thickness maps, subcortical structures, structural and functional connectivity matrices, white matter tracts, and activation images. We will address three imminent challenges in the analysis of big data squared: (CH1) carrying out genome-wide association analysis for functional imaging phenotypes (e.g., white matter tracts, cortical thickness, and subcortical structures); (CH2) carrying out genome-wide association analysis for high-dimensional imaging phenotypes with strong spatial structure (e.g., regional volumetric measurements, and structural and functional connectivity matrices); and (CH3) integrating multi-modality imaging, genetic, and clinical data to predict clinical outcomes (e.g., disease status or time-to-disease onset).
To this end, we will develop (Aim 1) a functional genome-wide association analysis (FGWAS) framework for (CH1);
(Aim 2) a network genome-wide association analysis (NGWAS) framework for (CH2);
(Aim 3) a multi-scale prediction modeling (MSPM) framework for (CH3);
and (Aim 4) verify the efficacy of the newly developed analytical tools using simulations and the six imaging genetic datasets. Finally, we will develop companion software for the methods to be developed in this project. The software, which will provide much-needed analytic tools for big data squared, will be disseminated to the public through http://c2s2.yale.edu/software/, https://github.com/BIG-S2, http://odin.mdacc.tmc.edu/bigs2/software.html, and www.nitrc.org/. Our novel methods are applicable to a variety of imaging genetic studies for neuropsychiatric disorders, major neurodegenerative diseases, substance use disorders, and normal brain development. A deeper understanding of genetic mechanisms, brain development, and neurocognitive maturation has the potential to inspire new and urgently needed approaches to the prevention, diagnosis, and treatment of many illnesses (e.g., schizophrenia and Alzheimer's disease).

Public Health Relevance

This project aims to develop statistical methods for integrating big data squared, with applications in connectome genetics and genomics. The methods will be verified on real datasets in order to better address important public health problems. We expect the accomplishments from this project to have a significant impact on the understanding of major neuropsychiatric disorders and major neurodegenerative diseases.

National Institutes of Health (NIH)
National Institute of Mental Health (NIMH)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Bennett, Yvonne
Yale University
Public Health & Prev Medicine
Schools of Medicine
New Haven
United States