Phylogeography of zoonotic viruses studies the geographical spread and genetic lineages of viruses that are transmittable between animals and humans such as avian influenza and rabies. This science can help state public health and agriculture agencies identify the animal hosts that most impact virus propagation in a particular geographic region, the migration path of the virus including its origin, and the patterns of infection in various host populations, including humans, over time. The National Center for Biotechnology Information (NCBI), specifically GenBank, provides an abundance of available viral sequence data for phylogeography. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic trees and models for surveillance. However, geospatial metadata such as host location is inconsistently represented and sparse across GenBank entries, with our preliminary studies showing only about 20% of the GenBank records contain specific information such as a county, town, or region within a state. While this detailed geospatial information might be included in the corresponding journal article, it is not available for immediate use in a bioinformatics or GIS application unless it is manually extracted and linked back to the appropriate sequence. Absence of precise sampling locations from easily-computable secondary data sources such as GenBank increases the difficulty of achieving accurate phylogeographic models of virus migration. We propose an infrastructure to improve phylogeographic models of virus migration by linking relevant geospatial data from the literature. This work represents the first effort to use automatically extracted geospatial data present in journal articles corresponding to GenBank records in order to enhance modeling of virus migration. Our research will extend phylogeography and zoonotic surveillance by: creating a Natural Language Processing (NLP) infrastructure that will improve the level of detail of geospatial data for phylogeography of zoonotic viruses (Aim 1), develop phylogeographic models using the data extracted in Aim 1 with adequate biostatistical models (Aim 2), and evaluating the impact of our approach for phylogeography and surveillance of zoonotic viruses (Aim 3). Thus, this work will provide researchers with a framework for population surveillance using an integrated biomedical informatics approach including NLP, biostatistics, bioinformatics, and database design.
We will create Natural Language Processing Infrastructure and novel phylogeographic models of zoonotic viruses that will allow state public health and agriculture agencies and other researchers to study virus migration. This will enhance population health surveillance including identification of the animal hosts that most impact virus propagation in a particular geographic region, the migration path of zoonotic pathogens, and the patterns of infection in various host populations over time, including humans. This resource will enable state agencies to implement improved public health control measures that will reduce morbidity and mortality of animals and humans from zoonotic diseases.