Cells are the basic structural and functional units of all known living organisms. Understanding the structures of large individual macromolecules and their complexes inside cells is fundamental to the biological research community. However, such structural information has been extremely difficult to obtain owing to a lack of suitable data acquisition techniques. Recent advances in cellular electron cryo-tomography (CECT) have enabled 3D imaging of cellular structure and organization at sub-molecular resolution and in a near-native state. CECT images contain tens of millions of structurally highly diverse macromolecules, which poses a major challenge for the throughput of subsequent computational analysis. Processing the images efficiently and accurately to identify each distinct type of macromolecular complex has been challenging. Conventional methods that try to align and classify each type of complex are too slow to process the large volumes of data being acquired. This project will focus on reducing the cost of data annotation, which is a major bottleneck for macromolecule structural recognition using supervised deep learning. This project will train graduate and undergraduate students in computational biology, bioinformatics, and bioimage analysis, as well as integrate research results into university curricula.

The project will develop novel approaches to significantly reduce the cost of annotating training data. This will be achieved by (1) improving the generalization ability of a classification model by training it with randomized simulated imaging conditions; (2) enhancing the training of the classification model using structural representations learned from unlabeled subtomograms via a semi-supervised autoencoder; and (3) reducing annotation by using active learning to select the minimal set of subtomograms that are most important for improving the classification. By integrating the three objectives into a unified system, this project will provide users with learning-based subtomogram classification models that offer improved accuracy while requiring a minimal amount of annotated training data. The success of the project is a critical step towards fully automating the systematic in situ structural classification of macromolecules in single cells captured by heterogeneous CECT datasets. The algorithms and software developed in this project will directly benefit the structural biology community. To facilitate broad use of the methods developed in this project, an open-source CECT analysis software package, AITom (https://github.com/xulabs/aitom), will be developed. Results of this research will be provided on the Xu Lab website (http://xulabs.org) and GitHub site (https://github.com/xulabs).
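
To make objective (3) concrete, the following is a minimal sketch of one common active learning strategy, entropy-based uncertainty sampling. It is an illustration only and is not taken from the project or from AITom: the function names `predictive_entropy` and `select_for_annotation`, the `budget` parameter, and the example probabilities are all hypothetical, and the project may use a different selection criterion.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(
    class_probs: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Return indices of the `budget` most uncertain unlabeled samples.

    Assumption: uncertainty is measured by predictive entropy; other
    criteria (margin, expected model change, ...) are equally plausible.
    """
    ranked = sorted(
        range(len(class_probs)),
        key=lambda i: predictive_entropy(class_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical classifier outputs for four unlabeled subtomograms
# over three macromolecule classes (made-up numbers).
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

chosen = select_for_annotation(pool, budget=2)
print(chosen)  # indices of the subtomograms to send to an annotator
```

In this sketch, only the subtomograms whose predicted class distribution is closest to uniform are passed to a human annotator, so the annotation budget is spent where the current classifier is least certain.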

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1949629
Program Officer: Jean Gao
Project Start:
Project End:
Budget Start: 2020-06-01
Budget End: 2023-05-31
Support Year:
Fiscal Year: 2019
Total Cost: $427,114
Indirect Cost:

Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213