This project, NSF Convergence Accelerator Track D: A Community Resource for Innovation in Polymer Materials, will develop a data-centric infrastructure for findable, accessible, interoperable, and reusable (FAIR) data that will be usable by a wide variety of stakeholders to accelerate the pace of materials innovation. The field is currently hampered by small, disparate data sets, and there is a clear need for a community-wide database effort. This project brings together a team with members from industry (Dow, Citrine), academia (MIT), and government laboratories (NIST) to directly address the challenge of building a data and modeling infrastructure that serves as a community resource for polymeric material design. The project focuses on a sharing infrastructure for polymers and other soft materials, addressing the current inability to deal with molecular distributions and stochastic reaction networks; characterization and data-generation challenges for stochastic chemistry; challenges with nomenclature and molecular representation; and polymer properties determined on multiple scales, from the chemical bond to the molecule to collective molecular interactions. The approach will utilize novel graph-based representations that can be widely adopted for the storage and exchange of data by all stakeholders in the polymer field. It will also explore how widely such data could be shared by different stakeholders, including paradigms that mix embargoed and open data, as well as models for ownership and credit that enable wider sharing of data across the community.
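As a minimal illustration of the graph-based idea (the names, toy chemistry, and schema below are illustrative assumptions, not the project's actual representation), a stochastic chain can be encoded as a graph whose nodes are repeat units and whose weighted edges give the probability that one unit follows another, rather than as a single fixed molecule:

```python
# Illustrative sketch only: a toy graph encoding of a statistical copolymer.
# Nodes are repeat units; directed, weighted edges give the probability that
# one unit is followed by another along the chain -- one simple way to capture
# a molecular distribution instead of a single molecule.

# Hypothetical polyurethane-like repeat units (placeholder labels, not real SMILES)
nodes = {
    "diisocyanate": {"fragment": "O=C=N-R-N=C=O"},
    "diol":         {"fragment": "HO-R'-OH"},
}

# Directed edges with transition probabilities along the chain
edges = {
    ("diisocyanate", "diol"): 0.95,          # usual alternating step-growth linkage
    ("diisocyanate", "diisocyanate"): 0.05,  # rare defect, illustrative value only
    ("diol", "diisocyanate"): 1.0,
}

def outgoing_probability(graph_edges, unit):
    """Sum of transition probabilities leaving a given repeat unit."""
    return sum(p for (src, _), p in graph_edges.items() if src == unit)

# For a valid stochastic chain description, each unit's outgoing
# transition probabilities should sum to 1.
for unit in nodes:
    assert abs(outgoing_probability(edges, unit) - 1.0) < 1e-9
```

A shared schema of this general shape is one way nodes, edges, and weights could be serialized and exchanged by stakeholders, though the project's actual format may differ substantially.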
The approach will employ natural language processing (NLP) and computer vision techniques for automated information extraction from the polymer literature to generate a large, public structure-property database. Key elements include text-based extraction schemes that exploit chemical rationales and specific shared synthesis techniques for more efficient data extraction; extraction of polymer chemical structures from images using an optical chemical structure recognition system; and development of new machine learning methods of data curation to integrate a wider range of data and overcome data sparsity and diversity. Together, these elements will yield a populated data structure by the end of Phase I that will form a foundation for further efforts by the community. Although the techniques employed will generalize to all synthetic polymers, the initial testbed for these developments will be polyurethanes, a large polymer market with diverse chemistry, substantial data availability in the patent and journal literature, and structure-processing-property relationships that remain a playground for continued material innovation. Community-wide engagement will be sought via a digital symposium to assist in identifying additional interests and partners that need to be represented and included; disseminating information about this effort; and receiving input on how deliverables should be planned and designed during Phase II execution. An educational planning exercise in Phase I will identify educational partners and needs in this space and develop both pedagogical and assessment plans that can be acted upon in Phase II, so that the tools that are built are freely available to the community via wide dissemination of knowledge and training. All of the tools and standards will be made publicly available through open-source development projects.
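To make the text-extraction idea concrete (the pattern and property below are a toy stand-in, not the project's actual pipeline, which combines NLP, computer vision, and machine-learned curation), a rule-based extractor for one property might look like:

```python
import re

# Illustrative sketch only: a tiny rule-based extractor that pulls
# glass-transition temperatures (Tg, in deg C) out of literature-style
# sentences, standing in for the idea of converting free text into
# structured structure-property records.

TG_PATTERN = re.compile(
    r"(?:glass transition temperature|Tg)\s*(?:of|=|was|is)?\s*"
    r"(-?\d+(?:\.\d+)?)\s*(?:°|deg\s*)?C",
    re.IGNORECASE,
)

def extract_tg(sentence):
    """Return all glass-transition temperatures (in deg C) found in a sentence."""
    return [float(value) for value in TG_PATTERN.findall(sentence)]

# Toy example sentences (fabricated for illustration, not real data)
sentences = [
    "The polyurethane exhibited a glass transition temperature of -45 C.",
    "DSC gave Tg = 62.5 °C for the hard segment.",
]
extracted = [extract_tg(s) for s in sentences]
```

Real literature text is far messier than this, which is precisely why the project pairs such rules with chemical rationales, image-based structure recognition, and learned curation rather than regular expressions alone.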
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.