Geoffrey Hutchison and David Koes of the University of Pittsburgh are supported by the Chemical Theory, Models and Computational Methods Program in the Division of Chemistry to develop and apply novel statistical machine learning, or "artificial intelligence", techniques to analyze the structure and flexibility of molecules. Most molecules are flexible, and at room temperature can exist in multiple interconverting geometries called conformers. As molecules increase in size, complexity, and flexibility, the number of possible stable conformers increases exponentially. To accurately predict molecular properties, it is important to take into account all geometries, including lower-probability conformers, if they have some probability of spontaneously forming at a given temperature. Thus, a challenge is to efficiently identify these conformers despite an extremely complex, multidimensional geometric search space. Professors Hutchison and Koes implement a unique grid-based neural network approach to predict conformers, trained on databases of high-quality calculations generated using NSF XSEDE supercomputing resources. The machine learning techniques are validated against experimental and computational benchmarks. They are also applied to developing improved methods for identifying drug targets, and for optimizing the design of plastic electronic materials. The databases and software developed for this project are publicly disseminated. New tutorial resources are developed for the Avogadro and 3DMol.js software tools, and new educational components on programming, visualization, and statistical machine learning are incorporated into the "Mathematics for Chemists" course taught by Professor Hutchison. Both investigators give open lectures on data science and chemistry to the public and to local high school students. They are also actively engaged in outreach to underrepresented groups through the American Chemical Society Project Seed, Pittsburgh Public Schools Science and Technology Academy, and the Drug Discovery, Systems, and Computational Biology Summer Academy for high school students.
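The exponential growth in conformer count can be illustrated with a rough back-of-the-envelope sketch. It assumes a common rule of thumb that is not stated in the abstract: each rotatable bond contributes about three low-energy torsional minima, and the bonds are treated as independent. The function name and default value are hypothetical, and real molecules will deviate from this upper-bound estimate.

```python
def estimated_conformer_count(rotatable_bonds: int, minima_per_bond: int = 3) -> int:
    """Rough upper bound on conformers, assuming independent torsions.

    The default of 3 minima per bond is an illustrative assumption
    (e.g. gauche+, anti, gauche-), not a value from the project.
    """
    if rotatable_bonds < 0:
        raise ValueError("rotatable_bonds must be non-negative")
    return minima_per_bond ** rotatable_bonds


if __name__ == "__main__":
    # Show how quickly the search space grows with molecular flexibility.
    for n in (2, 5, 10, 15):
        print(f"{n:2d} rotatable bonds -> ~{estimated_conformer_count(n):,} candidate conformers")
```

Even under this simplified assumption, exhaustive enumeration quickly becomes impractical, which is the motivation for guided search methods.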

The first part of this project draws upon a connection between statistical thermodynamics and Bayesian statistics. Using experimental and computational data, one can estimate distributions of dihedral angles for most molecules. From such probabilities, Bayesian optimization can be used to accurately explore and sample Boltzmann-weighted ensembles of the potential energy surface. The second aim takes advantage of recent improvements in quantum chemical methods for predicting thermochemistry for organic molecules. Using these accurate energies in tandem with experimental and other computational databases, recurrent neural-network thermochemical models are produced for molecules of different sizes and containing a wide range of elements. The resulting large data repositories and open-source software tools are disseminated to the community, and used as the foundation for educational materials in a new curriculum for chemistry. These resources provide the basis for ongoing outreach and broadening-participation activities for high school and undergraduate students, involving a state-of-the-art, interdisciplinary mix of data science, machine learning, and computational chemistry.
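To make the idea of a Boltzmann-weighted ensemble concrete, the following minimal sketch computes conformer populations from relative energies using the standard Boltzmann distribution. It is not the project's software: the function name, the example energies, and the choice of kcal/mol units are all illustrative assumptions.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_populations(energies_kcal_per_mol, temperature_k=298.15):
    """Return the fractional population of each conformer at a given temperature."""
    if not energies_kcal_per_mol:
        raise ValueError("at least one energy is required")
    rt = R_KCAL_PER_MOL_K * temperature_k
    # Shift by the lowest energy so the largest exponent is zero (numerical stability).
    lowest = min(energies_kcal_per_mol)
    factors = [math.exp(-(e - lowest) / rt) for e in energies_kcal_per_mol]
    total = sum(factors)  # partition function over the listed conformers
    return [f / total for f in factors]


if __name__ == "__main__":
    # Hypothetical relative conformer energies in kcal/mol.
    energies = [0.0, 0.5, 1.5, 3.0]
    for energy, population in zip(energies, boltzmann_populations(energies)):
        print(f"{energy:4.1f} kcal/mol -> {population:.3%}")
```

In this sketch, higher-energy conformers receive small but nonzero weights, which is why they can still contribute to predicted ensemble properties.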

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Chemistry (CHE)
Type: Standard Grant (Standard)
Application #: 1800435
Program Officer: Richard Dawes
Project Start:
Project End:
Budget Start: 2018-08-15
Budget End: 2021-07-31
Support Year:
Fiscal Year: 2018
Total Cost: $411,333
Indirect Cost:

Name: University of Pittsburgh
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15260