A genome-wide association study (GWAS) detects genetic variations associated with particular diseases or traits by scanning markers across the genomes of a large sample of subjects in a high-throughput manner. In less than a decade, GWAS have discovered and replicated many new disease loci. The discovered genetic associations have led to better strategies to diagnose, treat and prevent diseases. The number of GWAS is growing rapidly, and there is a need for a database that allows researchers to easily query and search previous results. A well-curated database also provides a resource for overview and summary investigations of associated genetic sites and may help suggest pleiotropic genes. Such a database, 'A Catalog of Published Genome-Wide Association Studies' (Catalog of GWAS), has been created and is maintained by the National Human Genome Research Institute (NHGRI). The catalog has led to interesting characterizations of previous GWAS results, and NHGRI continues to update and curate it regularly. However, curation is performed by manually extracting information from published GWAS articles. As a result, coverage is low compared to the volume of all GWAS publications, and manual curation cannot keep pace with new publications. The goal of this project is to develop a new tool that automatically extracts information from research articles for the curation of the Catalog of GWAS. We propose to use the curated data currently available from NHGRI as training examples and to apply novel machine-learning algorithms to train an information extractor that allows accurate automatic extraction. Given our recent success in applying machine learning to biological text mining, we are confident that this will lead to a useful tool that improves the productivity of curators and solves the coverage problem.
Our first specific aim is to develop an accurate information extractor. Our second specific aim is to develop an easy-to-use curation tool that lets curators efficiently check and correct errors from automatic information extraction, so that their curation productivity can be improved 18-fold. We will then adapt the tool to the extraction and curation of research papers reporting association studies that use data from next-generation sequencing (NGS). Currently, study design and the reporting of GWAS results using NGS data are not standardized, and these results have not yet been considered for inclusion in the catalog. However, we expect that these limitations will be overcome and that the methodology will converge soon. We will closely monitor progress and adapt the tool to allow for the inclusion of NGS data. Finally, we will release the software into the public domain so that volunteers or interested parties can create their own catalogs locally. It is our goal to share the developed software with the research community to advance the field. The new algorithms developed in this project and the entire development cycle, from design to deployment, will also contribute to the state of the art in biological text mining.
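The idea of using already-curated catalog entries as training examples can be illustrated with a minimal Python sketch. It is not the project's actual method: the record layout, the `label_sentences` helper, the naive sentence splitter, and the placeholder PMID and rsID values are all assumptions made for illustration. The sketch labels each SNP mention in an article as positive when the catalog already lists that SNP for the same article, producing weakly labeled examples on which an extractor could later be trained.

```python
import re

# Hypothetical curated records, loosely shaped like catalog rows.
# The PMID and rsID values are placeholders, not real catalog data.
CURATED = [
    {"pmid": "0000001", "snp": "rs0000001", "trait": "example trait"},
]

# dbSNP-style identifiers: "rs" followed by digits.
RS_PATTERN = re.compile(r"\brs\d+\b")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(pmid, text, curated):
    """Return (sentence, snp, label) triples for one article.

    label is 1 when the mentioned SNP appears in a curated record
    for this article, and 0 otherwise.
    """
    known = {r["snp"] for r in curated if r["pmid"] == pmid}
    examples = []
    for sentence in split_sentences(text):
        for snp in RS_PATTERN.findall(sentence):
            examples.append((sentence, snp, 1 if snp in known else 0))
    return examples


text = (
    "We found rs0000001 associated with the example trait. "
    "A nearby marker, rs0000002, did not replicate."
)
examples = label_sentences("0000001", text, CURATED)
```

Labels produced this way are noisy, since a curated SNP can also be mentioned in sentences that do not report the association, which is one reason a curator-facing checking tool remains part of the plan.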

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Hindorff, Lucia
University of California San Diego
Internal Medicine/Medicine
Schools of Medicine
La Jolla
United States
Jain, Suvir; Tumkur, Kashyap R; Kuo, Tsung-Ting et al. (2016) Weakly supervised learning of biomedical information extraction from curated data. BMC Bioinformatics 17 Suppl 1:1