The velocity, variety, volume, and veracity of data from relevant information sources make it extremely challenging for oncologists to collect and review the data needed to support routine personalized treatment of their patients. There is an urgent need for data wrangling approaches, including natural language processing (NLP) and information retrieval methods, to extract and curate publications and clinical trials related to personalized therapy. Once curated, the structured data can be used by biomedical researchers to generate novel scientific hypotheses, design new studies, better understand the biological mechanisms of disease, perform meta-analyses, and create clinical decision support systems. There is a similarly urgent need for improved search interfaces specific to the field of personalized therapy, including ways for end users to display, rank, and save results. While several database and web-based keyword search engines exist, tools that meet the unique challenges of personalized medicine are lacking. There is also an urgent need for software that lets subject matter experts verify and validate the information extracted and ranked by computational methods, thereby improving the gold standard corpus available for biomedical research into personalized therapies. To address these issues, we will build an innovative software stack (MACE2K) that adapts and extends widely tested BioCreative NLP tools to automatically retrieve and pre-process targeted therapy information from PubMed abstracts, open access articles, and conference proceedings. We will build an entity extraction cartridge to accurately parse gene mutations, translocations, gene expression, protein expression, and protein phosphorylation. A marker disambiguation cartridge will be built to assess trial inclusion or exclusion criteria and to determine marker-related primary endpoints. 
We will include a ranking cartridge that uses the disambiguated information on markers, drugs, and trials to score trials and studies rigorously according to their relevance for personalized medicine. A novel gamification cartridge will be built to allow subject matter experts to verify and validate the information corpus. To accomplish our aims efficiently, our research leverages the National Cancer Institute's investments in several programs (many of which we are involved in), including the NCI Drug Dictionary, the National Cancer Informatics Program (NCIP), the I-SPY trials, and the Center for Cancer Systems Biology (CCSB).
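
As an illustration only, the entity extraction and ranking steps described above might be sketched as follows. This is not the MACE2K implementation: the gene and drug vocabularies, the regular expression for point mutations, the scoring weights, and all function names are hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass

# Assumption: protein-level point mutations are written as
# <amino acid letter><position><amino acid letter>, e.g. V600E.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b[{AMINO}]\d+[{AMINO}]\b")

# Toy vocabularies (hypothetical; a real system would use curated dictionaries).
GENES = {"BRAF", "EGFR", "KRAS"}
DRUGS = {"vemurafenib", "erlotinib"}


@dataclass
class Extraction:
    genes: set
    drugs: set
    mutations: set


def extract_entities(text: str) -> Extraction:
    """Find gene, drug, and point-mutation mentions in one abstract."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    genes = {t for t in tokens if t in GENES}
    drugs = {t.lower() for t in tokens if t.lower() in DRUGS}
    mutations = set(POINT_MUTATION.findall(text))
    return Extraction(genes, drugs, mutations)


def relevance_score(e: Extraction) -> int:
    """Count distinct mentions; add a bonus when a mutation and a drug co-occur."""
    score = len(e.genes) + len(e.drugs) + len(e.mutations)
    if e.mutations and e.drugs:
        score += 2  # arbitrary illustrative weight
    return score


def rank(abstracts: list) -> list:
    """Order abstracts from most to least relevant under the toy score."""
    return sorted(
        abstracts,
        key=lambda text: relevance_score(extract_entities(text)),
        reverse=True,
    )
```

Under these assumptions, an abstract that mentions a gene, a mutation, and a drug together ranks above one that mentions a gene alone. A production system would replace the dictionaries and the pattern with trained NLP components.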

Public Health Relevance

This project will develop new computational methods and software to retrieve targeted molecular and drug therapy information from multiple sources of big data, including PubMed abstracts, open access articles, and conference proceedings. Biomedical researchers can use the software to generate new hypotheses for research on personalized cancer treatment decisions, drawing on the enormous volumes of public data already in existence. A novel gamification component will be built to allow subject matter experts to verify and validate the information corpus, enhancing the accuracy of the software.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Sofia, Heidi J
Georgetown University
Internal Medicine/Medicine
Schools of Medicine
United States
Madhavan, Subha; Ritter, Deborah; Micheel, Christine et al. (2018) ClinGen Cancer Somatic Working Group - standardizing and democratizing access to cancer molecular diagnostic data to drive translational research. Pac Symp Biocomput 23:247-258
Mahmood, A S M Ashique; Rao, Shruti; McGarvey, Peter et al. (2017) eGARD: Extracting associations between genomic anomalies and drug responses from text. PLoS One 12:e0189663
Rao, Shruti; Beckman, Robert A; Riazi, Shahla et al. (2017) Quantification and expert evaluation of evidence for chemopredictive biomarkers to personalize cancer treatment. Oncotarget 8:37923-37934
Wang, Qinghua; Ross, Karen E; Huang, Hongzhan et al. (2017) Analysis of Protein Phosphorylation and Its Functional Impact on Protein-Protein Interactions via Text Mining of the Scientific Literature. Methods Mol Biol 1558:213-232
Ritter, Deborah I; Roychowdhury, Sameek; Roy, Angshumoy et al. (2016) Somatic cancer variant curation and harmonization through consensus minimum variant level data. Genome Med 8:117
Bhuvaneshwar, Krithika; Sulakhe, Dinanath; Gauba, Robinder et al. (2015) A case study for cloud based high throughput analysis of NGS data using the globus genomics system. Comput Struct Biotechnol J 13:64-74
Madhavan, Subha; Gauba, Robinder; Song, Lei et al. (2013) Platform for Personalized Oncology: Integrative analyses reveal novel molecular signatures associated with colorectal cancer relapse. AMIA Jt Summits Transl Sci Proc 2013:118
Madhavan, Subha; Gusev, Yuriy; Natarajan, Thanemozhi G et al. (2013) Genome-wide multi-omics profiling of colorectal cancer identifies immune determinants strongly associated with relapse. Front Genet 4:236
Gusev, Yuriy; Riggins, Rebecca B; Bhuvaneshwar, Krithika et al. (2013) In silico discovery of mitosis regulation networks associated with early distant metastases in estrogen receptor positive breast cancers. Cancer Inform 12:31-51
Madhavan, Subha; Gusev, Yuriy; Harris, Michael et al. (2011) G-DOC: a systems medicine platform for personalized oncology. Neoplasia 13:771-83