In response to NSF's TRIPODS Phase I initiative, the PIs, with expertise in theoretical and applied statistics, computer science, and mathematics at the University of California, Berkeley, will create a Foundations of Data Analysis (FODA) Institute to address cutting-edge foundational issues in interdisciplinary data science. The Institute will advance foundational research and the application of foundational methods through an intensive program of cross-disciplinary outreach to application domains in and beyond the campus research community. In parallel with the massive technological and methodological advances in the underlying disciplines over the past decade, a thriving array of data-related research and training programs has emerged across campus. Yet none of these programs within the campus data science ecosystem is devoted to addressing the interdisciplinary foundations of data analysis in a focused, mission-driven manner. The FODA Institute will address this crucial unmet need. This interdisciplinary project will lay the groundwork for more productive interactions between theoretically inclined data science researchers and researchers in diverse domains that rely upon, but do not always explicitly appreciate, foundational concepts. Advances in this area will lead to more principled extraction of insights from data across a wide range of domains. The three-year Phase I pilot will pave the way for institutionalization of the project as a larger center that will be the subject of a potential Phase II application.
The technical research component of the project addresses four fundamental challenges in data science: the characterization of what is, and what is not, possible in terms of upper and lower bounds for inferential optimization problems; probing more deeply the notion of stability as a computational-inferential principle; exploring the complementary roles of randomness as a statistical resource, as an algorithmic resource, and as a tool for data-driven computational mathematics; and developing methods to combine science-based with data-driven models in a principled manner. Each of these challenges addresses old questions in light of new needs, each has important synergies with the other challenges, and each is situated squarely at the interface of theoretical computer science, theoretical statistics, and applied mathematics. The project will bridge the underlying interdisciplinary gaps to address some of the most important questions at the heart of data science today. Funds for the project come from CISE Computing and Communications Foundations and MPS Division of Mathematical Sciences.