Roughly 40% of the US population will be diagnosed with some form of cancer in their lifetime. In a majority of these cases, a definitive cancer diagnosis is only possible via histopathologic confirmation using a tissue slide. Increasingly, these slides are being digitally scanned as high-resolution images for use in both clinical and research digital pathology (DP) workflows. Our group has been pioneering the use of deep learning (DL), a form of machine learning, for segmentation, detection, and classification of various cancers using digital pathology images. DL learns features and their associated weighting from large datasets to maximally discriminate between user-labeled data (e.g., cancer vs non-cancer, nuclei vs non-nuclei), a paradigm known as "learn from data". Unfortunately, this paradigm makes DL especially sensitive to low-quality slides, noise induced by small errors in the manual user labeling process, and general dataset heterogeneity. Because many groups do not intentionally account for these problems, they learn that the successful use of DL technologies relies heavily on explicitly addressing the challenges associated with (a) carefully curating high-quality slides without preparation or scanning artifacts, (b) obtaining a large, precise collection of annotations delineating objects of interest, and (c) selecting diverse datasets to ensure robust classifier performance when clinically deploying the model. To address these challenges, we propose HistoTools, a suite of three modules or "Apps": (1) HistoQC examines slides for artifacts and computes metrics associated with slide presentation characteristics (e.g., stain intensity, compression levels), helping to quantify ranges of acceptable characteristics for downstream algorithmic evaluation.
(2) HistoAnno drastically improves the efficiency of annotation efforts using a combined active learning and deep learning approach to ensure experts focus only on regions that are important for classifier improvement. (3) HistoFinder aids in selecting suitable training and test cohorts to guarantee that various tissue-level characteristics are well balanced, leading to increased reproducibility. Our team already has working prototypes of HistoQC (100% concordance with a pathologist, evaluated on n>1200 slides) and HistoAnno (30% efficiency improvement during annotation tasks). In this U01, we seek to further develop and evaluate HistoTools in the context of enhancing two companion diagnostic (CDx) assays being developed in our group. First, we will use HistoTools to quality control and annotate nuclei, tubules, and mitoses for improving our CDx classifier for predicting recurrence in breast cancers using a cohort of n>900 patients from completed trial ECOG 2197. Second, HistoTools will be employed for quality control and identification of tumor infiltrating lymphocytes and cancer nuclei towards improving our CDx classifier for predicting response to immunotherapy in lung cancer using the n>700 patients from completed clinical trials Checkmate 017 and 057. These tools will build on our existing open-source tool repository to aid in real-time feedback and dissemination throughout the ITCR and cancer research community.
This project will result in the development of HistoTools, a new digital pathology toolkit for common pre-experiment machine learning tasks in the oncology domain, such as (a) timely identification of poor-quality slides and slide regions, (b) quantitative metrics driving optimized cohort selection, and (c) generation of highly precise and relevant annotations. Each component is designed to directly combat an existing bottleneck in the evolving use of digital pathology. HistoTools will significantly enhance the functionality of existing toolboxes and pipelines, facilitating increasingly sophisticated machine learning applications in oncology.