The University of Rochester and Cornell University jointly establish the Greater Data Science Cooperative Institute (GDSC). The GDSC is based on two founding tenets. The first is that enduring advances in data science require combining techniques and viewpoints across electrical engineering, mathematics, statistics, and theoretical computer science. The investigators' goal is to forge a consensus perspective on data science that transcends any individual field. The second is that data-science research must be grounded in an application domain. This grounding helps to ensure that assumptions about the availability and quality of data are realistic, and it allows methodological results to be tested experimentally as well as theoretically. Accordingly, the GDSC aims to consider applications in medicine and healthcare, an important domain in which advances in data science can have a direct, positive impact on society. The GDSC aims to tackle foundational questions that are motivated by problems in healthcare, obtain solutions that fuse domain expertise with application-agnostic methodologies, and ultimately yield scientific advances that impact the way healthcare is provided. The GDSC aims to leverage the physical proximity of the two institutions and their unique strengths in each of the core disciplines above and in medicine.

The GDSC's cross-disciplinary research directions include: (i) Topological Data Analysis. High-dimensional, incomplete, and noisy data present great challenges, but in many applications it is possible to exploit the topological nature of the problem. The GDSC aims to develop new fundamental methods and theory to rigorously explore the promise of this approach. (ii) Data Representation. Data compression, embeddings, and dimension reduction play a fundamental role in data science. Inspired by new core challenges in biomedical imaging, genomics, and neural spike-train data, the GDSC aims to develop novel source models and distortion measures, and ultimately to seek a unifying theoretical framework across domains and disciplines. (iii) Network & Graph Learning. Many of the fundamental challenges in applying data science to non-homogeneous populations are best explored through a network or graph structure. The GDSC aims to develop new techniques for parameter-dependent eigenvalue problems in spectral community detection, density-estimation methods on networks, and a theoretical framework for time-varying graphical models to study dynamic variable relations in time-evolving networks. (iv) Decisions, Control & Dynamic Learning. Sequential decisions are high-stakes in medicine. The GDSC aims to use systems and control-engineering methods to improve health and disease management, and to develop new foundational theories and methods for label-efficient active learning and dynamic treatment regimes. (v) Diverse & Complex Modalities. Big data is complex data, and major innovations are needed. The GDSC aims to develop theoretical frameworks for inference under computational and privacy constraints and for high-dimensional data without parametric model assumptions. Text, image, and audio data present further challenges. To address such challenges, the GDSC aims to explore transition systems for graph parsing of natural language and new fusion approaches for fully multimodal analysis.

This project is part of the National Science Foundation's Harnessing the Data Revolution (HDR) Big Idea activity.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Computing and Communication Foundations (CCF)
Program Officer: Huixia Wang
Cornell University
United States