SARS-CoV-2 has caused a global pandemic with 4.2M cases and 290K deaths worldwide (as of May 12, 2020). In the United States, there are over 1.3M cases and 81K deaths. Locally, Arizona has over 11K cases and 562 deaths. In response to this public health emergency, several studies have been published that describe patient characteristics in terms of signs, symptoms, and clinical endpoints. In addition, epidemiologists and infectious disease researchers have used next-generation sequencing technology to produce complete genomes of the virus for clinical and epidemiologic investigation. Genomic epidemiology has enabled scientists to understand localized transmission while determining the geographic sources of introductions from different states and countries. However, most of the sequencing for SARS-CoV-2 (as well as for other viruses) is performed outside of state or local health departments, at institutions such as the Centers for Disease Control and Prevention (CDC), universities, or private labs. It can then be difficult to link the pathogen, once sequenced, back to the data collected by the health department for case investigation. Without a link between sequences of viral isolates and epidemiologic case data, genomic epidemiology is hindered. There is limited research on how to link pathogen sequences to epidemiologic case data, especially for COVID-19. Thus, despite the abundance of clinical and epidemiologic data collected during this pandemic, more informatics research is needed to understand how to link viral genetic and epidemiologic data and to demonstrate the value of this linkage for disease surveillance. The goal of this supplement is to link epidemiologic data from COVID-19 positive patients in Arizona with viral genetics from sequenced isolates to better understand the relationship between viral genetics and epidemiologic and clinical phenotypes.
We will accomplish this by using Arizona's disease surveillance system together with the sequences and metadata published in online nucleic acid databases. We will use different probabilistic matching strategies to link the two sources (Aim 1) and then use Bayesian phylogenetics and phylogeography to study clustering of epidemiologic cases (Aim 2). Epidemiologists can use these findings to understand how local viruses cluster genetically in relation to specific epidemiologic and clinical cases. While disease severity depends on individual immune response and environmental factors, linking viral genetic data to the corresponding epidemiologic case could also support hypothesis generation for future reverse genetics and immunological studies in animal models.
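As an illustration of what probabilistic matching in Aim 1 could look like, the following is a minimal sketch of Fellegi-Sunter style scoring between a sequence metadata record and candidate surveillance cases. It is not the project's actual method: the field names (`collection_date`, `county`, `age_group`, `sex`), the m/u probabilities, the threshold, and the function names `match_weight` and `best_link` are all hypothetical and chosen only for demonstration.

```python
import math

# Hypothetical agreement probabilities per field (assumed values, not estimated
# from real surveillance data):
#   m = P(field agrees | the two records describe the same patient)
#   u = P(field agrees | the two records describe different patients)
FIELD_PARAMS = {
    "collection_date": (0.95, 0.02),
    "county": (0.90, 0.10),
    "age_group": (0.85, 0.15),
    "sex": (0.98, 0.50),
}


def match_weight(case, seq_meta):
    """Sum the log2 likelihood-ratio weights over all comparable fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = case.get(field), seq_meta.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def best_link(seq_meta, cases, threshold=5.0):
    """Return the case_id of the highest-scoring case, or None if below threshold."""
    scored = [(match_weight(c, seq_meta), c["case_id"]) for c in cases]
    if not scored:
        return None
    weight, case_id = max(scored)
    return case_id if weight >= threshold else None


# Toy records for illustration only; identifiers and values are invented.
cases = [
    {"case_id": "AZ-001", "collection_date": "2020-03-20", "county": "Maricopa",
     "age_group": "40-49", "sex": "F"},
    {"case_id": "AZ-002", "collection_date": "2020-04-02", "county": "Pima",
     "age_group": "60-69", "sex": "M"},
]
sequence = {"collection_date": "2020-03-20", "county": "Maricopa",
            "age_group": "40-49", "sex": "F"}

print(best_link(sequence, cases))
```

In practice the m and u probabilities would be estimated from the data (for example with an expectation-maximization procedure) rather than fixed by hand, and pairs scoring near the threshold would typically be sent for manual review.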
This biomedical informatics project will leverage probabilistic matching to link reportable disease data with viral sequence data for SARS-CoV-2. This will support the analysis of local SARS-CoV-2 cases by linking them with the genetics of the virus for ongoing public health surveillance.