Statistical tools for the analysis of heterogeneous data have become increasingly important, leading both to the prominence of the new field of data science and to a need for more researchers at all levels. This project will train the next generation of leaders in the field. The Statistics and Geometry Research Training Group (RTG) will provide gateways for students at each academic level. The program will fund undergraduate research programs that get students excited about research, so that they can see themselves pursuing graduate study in statistics. Graduate students will have the opportunity to develop their own classes in the summer, giving them a gateway to an academic career as they come to see themselves as teachers and mentors. Three newly created postdoctoral positions will train doctoral recipients with a strong quantitative background in the emerging field of data science. The common theme linking all participants will be the use of geometric methods in statistics and their applications to high-dimensional complex data in modern biology. The program is structured to maximize mentoring and interaction opportunities, giving each participant a broad perspective on the state of the art in statistics through highly relevant courses and working groups in statistics and geometry and their applications to biology. A salient feature of the program is that it will develop both the intellectual breadth and the mentoring and communication skills of the participants. The program will allow more U.S. students to supplement their interests in mathematics, applied mathematics, or computer science with involvement in data science, which is a crucial current need in many areas of information technology.
The progress made in the analysis of heterogeneous data will enhance medical discovery through improved visualization and geometrical representations, as well as through the ability to assess levels of uncertainty when making decisions based on image or network data. Domains of application will include computational anatomy, neuroscience, immunology, and genomics.
The methods developed will extend multivariate statistics, where specific metrics (such as the Fisher information metric, the Mahalanobis distance, or the L1 distance) have already provided successful projections and geometric representations. Differential geometry will provide a rigorous framework that enables the incorporation of local information. The questions raised by using statistics on non-Euclidean varieties will be addressed in collaboration with computational geometers, statisticians, and mathematicians. Creating formal initiatives in geometry, statistics, and information science will bring new energy and focus into the curriculum, with mathematically grounded students encouraged to attack applied data-analytic challenges. A key component of this initiative is outreach to women and minority students. Materials and tools developed by the program will include open-source visualization and approximation packages written in the R programming language.