Accelerating Curation of GWAS Catalog by Automatic Text Mining

Hsu, Chun-Nan

Abstract

A genome-wide association study (GWAS) detects genetic variations associated with particular diseases or traits by scanning markers across the genomes of a large sample of subjects in a high-throughput manner. In less than a decade, GWAS have led to the discovery and replication of many new disease loci. The discovered genetic associations have led to the development of better strategies to diagnose, treat, and prevent diseases. The number of GWAS is growing rapidly, so there is a need for a database that allows researchers to easily query and search previous results. A well-curated database also provides a resource for overview and summarization investigations of associated genetic sites and may help suggest pleiotropic genes. Such a database, "A Catalog of Published Genome-Wide Association Studies" (Catalog of GWAS), has been created and is maintained by the National Human Genome Research Institute (NHGRI). The catalog has led to interesting characterizations of previous GWAS results, and NHGRI continues to update and curate it regularly. However, curation is performed by manually extracting information from published GWAS articles. As a result, coverage is low compared to the volume of all GWAS publications, and manual curation cannot keep pace with new publications. The goal of this project is to develop a new tool to automatically extract the information from research articles for the curation of the Catalog of GWAS. Our proposal is to use the curated data currently available from NHGRI as training examples and apply novel machine-learning algorithms to train an information extractor that allows accurate automatic extraction. Given our recent success in applying machine learning to biological text mining, we are confident that this will lead to a useful tool to improve the productivity of curators and solve the coverage problem.
Our first specific aim is to develop an accurate information extractor. Our second specific aim is to develop an easy-to-use curation tool that lets curators efficiently check and correct errors from automatic information extraction, so that their curation productivity can be improved 18-fold. We will then adapt the tool to the extraction and curation of research papers reporting association studies that use data from next-generation sequencing (NGS). Currently, study design and the reporting of GWAS results using NGS data are not standardized, and these results have not yet been considered for inclusion in the catalog. However, we expect that these limitations will be overcome and that the methodology will converge soon. We will closely monitor progress and adapt the tool to allow for inclusion of NGS data. Finally, we will release the software to the public domain so that volunteers or interested parties can create their own catalogs locally. It is our goal to share the developed software with the research community to advance the field. The new algorithms developed in this project and the entire development cycle, from design to deployment, will also contribute to the state of the art in biological text mining.
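
The idea of using already-curated catalog entries as training examples can be illustrated with a minimal sketch. This is not the project's actual method: it only shows one simple way curated records could be turned into sentence-level training labels. The record fields, the rs identifiers, and the example sentences are all made up for illustration.

```python
import re

# Hypothetical curated record; field names and values are illustrative only.
curated = {"snp": "rs1234567", "trait": "type 2 diabetes"}

# Made-up sentences standing in for the text of a published article.
sentences = [
    "We found that rs1234567 was associated with type 2 diabetes.",
    "Genotyping was performed on a commercial array.",
    "The variant rs7654321 showed no association with type 2 diabetes.",
]

# Matches dbSNP-style identifiers such as "rs1234567".
RSID = re.compile(r"\brs\d+\b")

def label_sentences(sentences, record):
    """Label a sentence 1 if it mentions both the curated SNP and trait, else 0."""
    labeled = []
    for s in sentences:
        snps = set(RSID.findall(s))
        has_trait = record["trait"].lower() in s.lower()
        labeled.append((s, int(record["snp"] in snps and has_trait)))
    return labeled

labels = [y for _, y in label_sentences(sentences, curated)]
```

Labels produced this way are noisy, since a sentence can mention a SNP and a trait without reporting an association, so a real extractor would need a learning method that tolerates such noise.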

Public Health Relevance

The goal of this project is to develop a new tool to automatically extract information from research articles for the curation of the Catalog of GWAS. The catalog contains information about SNP-trait/disease associations extracted from published genome-wide association studies. A well-curated catalog of GWAS allows researchers to easily query and search previous results, provides a useful resource for overview and summarization investigations of associated genetic sites, and may help suggest pleiotropic genes.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG006894-02
Application #: 8549925
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Hindorff, Lucia

Project Start: 2012-09-24
Project End: 2015-06-30
Budget Start: 2013-07-01
Budget End: 2014-06-30
Support Year: 2
Fiscal Year: 2013
Total Cost: $247,917
Indirect Cost: $47,917

Institution

Name: University of Southern California
Department: Biostatistics & Other Math Sci
Type: Schools of Engineering
DUNS #: 072933393

City: Los Angeles
State: CA
Country: United States
Zip Code: 90089

Related projects


NIH 2015 U01 HG	Accelerating Curation of GWAS Catalog by Automatic Text Mining Hsu, Chun-Nan / University of California San Diego	$130,994
NIH 2014 U01 HG	Accelerating Curation of GWAS Catalog by Automatic Text Mining Hsu, Chun-Nan / University of California San Diego
NIH 2013 U01 HG	Accelerating Curation of GWAS Catalog by Automatic Text Mining Hsu, Chun-Nan / University of Southern California	$247,917
NIH 2013 U01 HG	Accelerating Curation of GWAS Catalog by Automatic Text Mining Hsu, Chun-Nan / University of California San Diego	$411,067
NIH 2012 U01 HG	Accelerating Curation of GWAS Catalog by Automatic Text Mining Hsu, Chun-Nan / University of Southern California	$247,879

Publications

Jain, Suvir; Tumkur, Kashyap R; Kuo, Tsung-Ting et al. (2016) Weakly supervised learning of biomedical information extraction from curated data. BMC Bioinformatics 17 Suppl 1:1
