My long-term career goal is to accelerate the use of data to improve population health. As an Assistant Professor of Epidemiology, I have focused my previous work on advancing access to biomedical data in public health. Having worked in several countries around the world, I have become acutely aware of the great potential for new discoveries offered by the vast amount of population health data currently collected by researchers and health agencies. Most of these data are stored in different formats across thousands of data systems and may never be used for new research to better understand health and disease because they cannot be easily integrated (that is, made to work together).
I aim to redirect my career track from working on one dataset at a time to improving the availability and use of thousands or millions of datasets at a time by researchers and practitioners around the world. I plan to become an independent investigator and data scientist and to establish my own research group at the interface between public health and Big Data.
Candidate: This K01 project will help me achieve my long-term career goal through training in new knowledge and skills. My background in medicine and epidemiology has enabled me to improve access to datasets for epidemiological analysis, but I lack the essential technical skills and knowledge needed to create new technology that improves the integration of population-scale data in general. My mentors and I have developed this K01 training and research plan so that I can acquire these skills and knowledge.
Training plan: This plan includes formal coursework, seminars, personal mentoring, and an immersive research experience across world-class institutes in Pittsburgh. Throughout this project, I will dedicate 75% effort to K01 training and research in the Department of Biomedical Informatics with my primary mentor, Dr. Mike Wagner. Dr. Wagner is a leading expert in the application of intelligent systems and data systems to problems in public health. My co-mentor, Dr. Greg Cooper, has an established track record in computer and information science and is now the director of the newly created Center for Causal Discovery, funded by the NIH Big Data to Knowledge (BD2K) mechanism. My third mentor, Dr. Mark Roberts, is a practicing clinician and a leader in computer modeling of diseases. He is also the new director of the University of Pittsburgh Public Health Dynamics Laboratory (PHDL) at the Graduate School of Public Health, where I will continue my epidemiological research as co-PI on the NIH Models of Infectious Disease Agent Study (MIDAS) Center of Excellence.
My specific training goals during this K01 program are to master: 1) Data standards and ontology development; 2) Logic and logic programming; 3) Computer programming for disease simulation; and 4) Publication and grant writing skills in biomedical informatics and Big Data. I will develop this mastery in the context of the K01 research project.
Research plan: The goal of my K01 research is to improve the integration of population-scale data required by epidemic simulators. An epidemic simulator is a software system that can represent epidemics; it typically requires a large diversity of datasets to represent the many interacting processes that result in a particular epidemic. Currently, the use of epidemic simulators is limited by data, partly because of the effort required to integrate datasets. My specific research aims are to: 1) Standardize a wide range of datasets for the mosquito-borne diseases dengue and chikungunya from a variety of countries; 2) Develop computer algorithms that will search across all available datasets and all available epidemic simulators to identify those epidemics that can be studied by simulation. These algorithms will also identify data gaps; that is, epidemics that could be studied by simulation if a particular datum or dataset were to become available; and 3) Quantify the importance of different datasets for simulation of specific epidemics. This new technology will replace laborious manual processes with fast computer algorithms that can be scaled up to search across millions of datasets and simulators.
Impact: Easier and faster discovery of appropriate datasets or data gaps for simulation will expand the use of epidemic simulation for public health research and practice, leading to more efficient integration of available data. Using data more efficiently for innovative analyses will lead to new knowledge and discoveries that can improve global population health.
Efficient use of data will also lead to cost savings by avoiding redundant data investments. Finally, wider use of epidemic simulators will improve preparedness against new epidemic threats. Outcomes of this project can be used across the biomedical sciences and will prepare me to become an independent investigator at the interface between public health and Big Data.
Many valuable datasets that could be used to improve population health are not used due to challenges in accessing and standardizing datasets, and in integrating data (making the data work together) into novel analyses. This project will develop new technology to use available datasets more efficiently for computer simulation of epidemics and to better integrate data for population health. Using data more efficiently for epidemic simulation will lead to: 1) New knowledge and discoveries that can improve global population health; 2) Cost savings by avoiding redundant data investments; and 3) Better protection of the health of millions against new epidemic threats.