Modern digital pathology departments produce a tremendous amount of whole slide image data, which is quickly growing to petabyte scale. This plethora of data presents an unprecedented treasure chest for all kinds of medical machine learning tasks, including improvements in precision medicine. Unfortunately, the vast majority of digital slides are not annotated at the image level, so histological points of interest are not integrated with the clinical notes or genetics associated with a patient. In contrast to other disciplines, manual labeling is not only cumbersome and time-consuming but, given the decades-long training of a pathologist, exorbitantly expensive and, due to clinical time constraints, impractical. Moreover, the time-dependent relationship between a patient's histology and genotype is not quantitatively leveraged to recommend combination therapies. Genetics informs us of important driver mutations, but how multiple cell types interact with these mutants over time in the tumor microenvironment to become histologically evident is less clear. Deep learning can synthesize generations of pathologist knowledge into accurate quantitative models. Given a picture of a patient's morphology, I provide a tool that in four seconds finds the ten most similar patients, with their diagnoses, to support a pathologist's diagnostic decisions under the time pressures of active surgery. Recording the pathologist inspecting a slide at the microscope automatically annotates observed slide regions with time. These annotations are not only amenable to learning models that predict whether or not a region is salient to a pathologist making a diagnosis; they also allow all slides in a hospital to be annotated to identical criteria from only a representative sample of slides. I have submitted a manuscript reporting 85.15% accuracy in this saliency prediction task.
This annotation greatly simplifies machine learning tasks, which can now focus on a non-redundant set of diagnostic regions in the slide, whether the application is to (a) find similar patients by morphology for diagnosis or (b) relate diagnostic morphology to the genetics. Statistically modeling the relationship of the genotype to the histological phenotype in cancer opens promising new avenues in precision medicine. Taking a Big Data approach, I will leverage over 18,244 paired genome-histology samples to learn this model, using transfer learning techniques to maximize the value of all of these samples for each tissue type. With the genotype-phenotype model in hand, I will simulate the molecular clock in cancer, incrementally mutating the genome and predicting the corresponding histology at each molecular time step. Using Q-learning, a reinforcement learning technique related to those that power Google's champion artificial intelligence "AlphaGo", I will learn an agent that inhibits expression of a small set of mutant genes to maximize cancer progression-free survival time, measured in molecular time in the simulator. This not only measures the therapy's evolutionary durability, but also leads directly to experimentally testable hypotheses in 3D cell culture.
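
As a rough illustration of the agent-in-a-simulator idea above, the following is a minimal tabular Q-learning sketch in Python. It is not the proposed system: the simulator is a hypothetical toy in which the tumor occupies one of a few histological stages, each stage has one assumed driver gene (`DRIVER`), inhibiting that gene yields a progression-free time step (reward 1), and inhibiting any other gene advances the stage (reward 0). The names `step`, `train`, and `greedy_policy`, and all parameter values, are assumptions made for this example; the learned genotype-phenotype model would take the place of `step`.

```python
import random

# Hypothetical toy simulator, standing in for a learned genotype-phenotype model.
N_STAGES = 4            # non-terminal stages 0..3; reaching stage 4 means progression
N_GENES = 3             # candidate mutant genes the agent may inhibit
DRIVER = [2, 0, 1, 2]   # assumed driver gene for each stage (illustrative only)


def step(stage, gene):
    """Advance one molecular time step; return (next_stage, reward, done)."""
    if gene == DRIVER[stage]:
        return stage, 1.0, False           # driver inhibited: progression-free step
    nxt = stage + 1                        # wrong target: the tumor advances a stage
    return nxt, 0.0, nxt == N_STAGES


def train(episodes=3000, horizon=20, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * N_GENES for _ in range(N_STAGES)]
    for _ in range(episodes):
        stage = rng.randrange(N_STAGES)    # random start so every stage is visited
        for _ in range(horizon):
            if rng.random() < epsilon:
                gene = rng.randrange(N_GENES)                             # explore
            else:
                gene = max(range(N_GENES), key=lambda g: q[stage][g])     # exploit
            nxt, reward, done = step(stage, gene)
            # Bootstrap from the next stage unless the episode has terminated.
            target = reward if done else reward + gamma * max(q[nxt])
            q[stage][gene] += alpha * (target - q[stage][gene])
            if done:
                break
            stage = nxt
    return q


def greedy_policy(q):
    """For each stage, the gene with the highest learned Q-value."""
    return [max(range(N_GENES), key=lambda g: row[g]) for row in q]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

In this toy setting the greedy policy should select the assumed driver gene at every stage, because the per-step reward counts progression-free time steps. A realistic simulator would be stochastic and far higher-dimensional, which is why a function approximator rather than a table would likely be needed.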

Public Health Relevance

Digitized whole slide images of cancer tissue are a "dark matter" in the clinic: data that are often collected but rarely used quantitatively to model cancer or personalize medicine. Through close collaborations with practicing pathologists at globally leading cancer research hospitals, I will form statistical models of the histology evident in whole slides to (i) create clinical decision support tools that find similar patients by morphology alone, to disambiguate cases that are difficult for a pathologist to diagnose under the time pressures of ongoing surgery, (ii) create a tool that learns which parts of a whole slide image are salient for a pathologist at the microscope to make a diagnosis, which greatly facilitates the deep learning techniques underlying (i) and (iii) by discarding redundant histological image data, and (iii) create a system that relates genotype to histological phenotype in cancer and learns an agent that leverages this system to determine the optimal multistage combination therapy to drive the histology towards health, such that cancer progression-free survival time is maximized in the patient of interest. In this way, my statistical models will transform the vast dark matter of diagnosed histology data, representing generations of pathologist knowledge at these leading hospitals, into an indispensable resource for surgery, pathology, genetics, and precision medicine.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Predoctoral Individual National Research Service Award (F31)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Perkins, Susan N
Weill Medical College of Cornell University
Schools of Medicine
New York
United States