Tracking evolutionary changes in viral genomes and their spread often requires data deposited in public databases such as GenBank, the Influenza Research Database (IRD), or the Virus Pathogen Resource (ViPR). GenBank provides an abundance of viral sequence data for phylogeography. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic trees and models for surveillance. IRD and ViPR are NIH/NIAID-funded programs that import data from GenBank and add further data sources, visualization, and search tools for their users. Tracking evolutionary changes and spread also requires the geospatial assignment of taxa, which is often obtained from GenBank metadata. Unfortunately, geospatial metadata such as host location is often uncertain in GenBank entries, with only 36% containing a precise location such as a county, town, or region within a state. For example, an entry may list only China or USA instead of Beijing or Bedford, NH. While the town or county might be included in the corresponding journal article, this valuable information is not available for immediate use unless it is extracted and then linked back to the appropriate sequence. The goal of our work is to enable health agencies and other researchers to automatically generate phylogeographic models that incorporate enhanced geospatial data for better estimates of virus spread. This proposal focuses on developing and applying information extraction and statistical phylogeography approaches to enhance models that track evolutionary changes in viral genomes and their spread. We propose a framework that uses natural language processing (NLP) to automatically extract relevant geospatial data from the literature and to assign a confidence estimate to the link between each geospatial mention and the GenBank record.
We will then use these locations and their confidence estimates as observation error in the creation of phylogeographic models of zoonotic virus spread. We hypothesize that models produced by a combined NLP-phylogeography infrastructure, which include observation error in the geospatial assignment of taxa, will be closer to a gold standard than phylogeographic models that do not include it. Our research will extend phylogeography and zoonotic surveillance by: creating an NLP infrastructure that will improve the level of detail of geospatial data for phylogeography of zoonotic viruses (Aim 1), developing phylogeographic models using the estimates from Aim 1 as observation error (Aim 2), and evaluating our approach by comparing the models it produces to models that do not account for observation error in the geospatial assignment of taxa (Aim 3). We will allow users to generate enhanced models and view results on a web portal accessible via a LinkOut feature from GenBank, IRD, and ViPR. The addition of more precise geospatial information in building such models could enable health agencies to better target areas that represent the greatest public health risk.
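
As a rough illustration of how per-location confidence estimates could be carried into a model as observation error, the sketch below normalizes the confidence scores attached to one sequence's candidate locations into a probability distribution. It is only a sketch under stated assumptions: the function name `location_probabilities`, the example place names and scores, and the choice of simple normalization are all hypothetical and are not taken from the project's actual pipeline.

```python
from typing import Dict


def location_probabilities(confidences: Dict[str, float]) -> Dict[str, float]:
    """Normalize non-negative confidence scores into probabilities that sum to 1.

    `confidences` maps each candidate location for one sequence record to a
    confidence score (hypothetical output of an NLP extraction step).
    """
    if not confidences:
        raise ValueError("at least one candidate location is required")
    if any(score < 0 for score in confidences.values()):
        raise ValueError("confidence scores must be non-negative")

    total = sum(confidences.values())
    if total == 0:
        # No evidence favors any candidate: fall back to a uniform distribution.
        return {loc: 1.0 / len(confidences) for loc in confidences}
    return {loc: score / total for loc, score in confidences.items()}


# Hypothetical candidates for a record whose metadata says only "China".
candidates = {"Beijing": 3.0, "Hebei": 1.0}
probs = location_probabilities(candidates)
for location, p in probs.items():
    print(f"{location}: {p:.2f}")
```

A distribution of this kind, rather than a single hard-coded location, is one way the uncertainty in a taxon's geospatial assignment might be represented before it is passed to a phylogeographic model.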

Public Health Relevance

We will develop and evaluate an infrastructure that uses Natural Language Processing (NLP) to identify more precise geographic information for modeling spread of zoonotic viruses. These new models could enable public health agencies to identify the most at-risk areas. In addition, by improving geospatial information in popular sequence databases such as GenBank, we will enrich other sciences that utilize this information such as molecular epidemiology, population genetics, and environmental health.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Allergy and Infectious Diseases (NIAID)
Type
Research Project (R01)
Project #
1R01AI117011-01A1
Application #
9065021
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Brown, Liliana L
Project Start
2016-04-01
Project End
2020-03-31
Budget Start
2016-04-01
Budget End
2017-03-31
Support Year
1
Fiscal Year
2016
Total Cost
Indirect Cost
Name
Arizona State University-Tempe Campus
Department
Biomedical Engineering
Type
Sch Allied Health Professions
DUNS #
943360412
City
Tempe
State
AZ
Country
United States
Zip Code
85287
Magge, Arjun; Weissenbacher, Davy; Sarker, Abeed et al. (2018) Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics 34:i565-i573
Tolkoff, Max R; Alfaro, Michael E; Baele, Guy et al. (2018) Phylogenetic Factor Analysis. Syst Biol 67:384-399
Tahsin, Tasnia; Weissenbacher, Davy; O'Connor, Karen et al. (2018) GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records. Bioinformatics 34:1606-1608
Al-Qahtani, Ahmed A; Baele, Guy; Khalaf, Nisreen et al. (2017) The epidemic dynamics of hepatitis C virus subtypes 4a and 4d in Saudi Arabia. Sci Rep 7:44947
Tahsin, Tasnia; Weissenbacher, Davy; Jones-Shargani, Demetrius et al. (2017) Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. Database (Oxford) 2017:
Dudas, Gytis; Carvalho, Luiz Max; Bedford, Trevor et al. (2017) Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544:309-315
Weissenbacher, Davy; Sarker, Abeed; Tahsin, Tasnia et al. (2017) Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods. AMIA Jt Summits Transl Sci Proc 2017:114-122
Vrancken, Bram; Suchard, Marc A; Lemey, Philippe (2017) Accurate quantification of within- and between-host HBV evolutionary rates requires explicit transmission chain modelling. Virus Evol 3:vex028