Text Analytics, Machine Learning & Biomedical Data Science

Johnson, Calvin

Abstract

The Text Analytics, Machine Learning, and Biomedical Data Science program, which operates within the Collaborative Research Office in Computer and Information Science (CROCIS), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build a critical mass in text and numerical analytics. The effort is envisioned to encompass a number of related disciplines in biomedical research, including semantic interoperability, knowledge engineering, computational linguistics, text and data mining, natural language processing, machine learning, and visualization. The program is intended to foster advances in critical domains at NIH, including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, "big data" analysis, and portfolio analysis.

In 2013, collaborative efforts in support of these goals included the following.

- In collaboration with NIAID, CROCIS is developing a new algorithm capable of analyzing V(D)J recombination in thousands of immunoglobulin gene sequences produced by high-throughput sequencing.
- CROCIS is working with Melissa Friesen of NCI to develop methodologies to improve exposure classification in occupational epidemiologic studies. The initial effort of this collaboration involves a tool that helps experts classify free-text job descriptions into standard occupational codes. Machine-learning-based classification methods will also be used to help evaluate exposure-disease associations.
- In collaboration with NINDS, CROCIS has implemented and compared several methods to locate and characterize lysosomes in 3-D fluorescence images. The goal is to calculate the pH of each lysosome in the image, for which resolving the lysosomes' locations is an important step.
- In collaboration with NIA, we are applying machine learning and visualization techniques to large biological datasets to discover novel patterns of functional gene or protein interactions related to aging. Omnimorph, a graphic data analysis tool, is being developed for multidimensional data visualization. In this collaboration, we are also developing a model to predict the progression of Alzheimer's disease using plasma proteomic biomarker data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
- Machine-learning methods have been devised and implemented to identify and refine transcription start sites in the fruit fly genome found using cap analysis gene expression (CAGE). This effort is in collaboration with Brian Oliver of NIDDK.
- CROCIS is collaborating with NIAID in developing an image analysis pipeline to quantify individual transcript molecules in macrophage cells, to help understand the molecular mechanism of macrophage adaptation to various stimuli at the single-cell level.
- A freely available plasmid database that is interoperable with popular freeware is currently being developed for the NIDA Optogenetics and Transgenic Technology Core. The plasmid database offers a versatile yet simple platform for scientists to store and analyze their plasmid data. Motivated by the need for a more comprehensive approach to archiving plasmid data, the database platform is enriched with numerous components beyond the repository, serving as an informatics platform designed to enhance the efficiency and analytic capabilities of scientists.
- In collaboration with CSR, CROCIS is applying text analytics to provide CSR leadership with evidence-based decision support in evaluating the grant review process. The effort so far has concentrated on exploratory analysis of the NIH portfolio to evaluate clustering methods and assess intrinsic measures of cluster quality. Content-based application referral tools are being developed to help evaluate the merit of PIs' study section requests, and to recommend the most suitable study section for an application if no requests are made. In addition, CROCIS is analyzing text from quick feedback surveys on peer review. This effort includes a pilot study to evaluate the feasibility of analyzing free text from peer reviewers on their perception of study section quality. If successful, the pilot results will be used as initial input for a full-scale implementation.
- CROCIS has been collaborating with the Molecular Libraries Program (MLP), part of the NIH Common Fund, to develop the Common Assay Reporting System (CARS). CARS is an integrated system for managing bioassay information and facilitating communication among all the high-throughput screening centers within the Molecular Libraries Probe Production Centers Network (MLPCN). Goals for this collaboration include: 1) Track project status and related issues at each of the screening centers within the MLPCN, and provide the means for information collection, sharing, and retrieval among the centers and the program office at NIH. 2) Establish a standardized protocol to describe raw data from the experiments and report screening data to the scientific community.
- The human salivary protein catalog has been made available online on a community-based Web portal developed by CROCIS, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- CROCIS investigators worked with the Office of Extramural Research (OER) on applying machine-learning methods to identify important terms that peer reviewers use to describe innovative applications. The goal of the effort was to develop a lexicon of terms that can help estimate the innovation level of a grant application based on peer review critiques from the application's NIH Summary Statement.
- Although the scientific impact of NCI consortia on the advancement of cancer epidemiology research is understood to be significant, program leadership needs accurate quantitative metrics of this impact. We are developing methods to track citations to clinical guidelines in the context of evidence-based medicine that could give funding agencies and program directors insight into individual consortia's contributions to advancing medical knowledge. This work is being conducted in collaboration with the Epidemiology and Genomics Research Program (EGRP), NCI.
- Based on its experience in building novel models for classifying research grants and projects, CROCIS is collaborating with DPCPSI/OD and other ICs to develop the Portfolio Learning Tool, a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes to build and run their own classifiers. A particular prototype of this system is being tailored to assist NCI Intramural investigators in reporting their research to the Annual Report system. CROCIS has been developing an augmented support vector machine (SVM) that augments a training set by sampling from a corpus of unknowns and runs a large ensemble on various samples of this augmented space. The results obtained from this classifier suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at categorizing a research portfolio.
- The Office of Behavioral and Social Sciences Research (OBSSR) is conducting a pilot investigation in collaboration with CROCIS to evaluate the efficacy of machine learning models for the classification of five BSSR-relevant research categories.
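
As a rough illustration only of the augmented SVM idea mentioned under the Portfolio Learning Tool, the Python sketch below trains an ensemble of small linear SVMs, each on the labeled examples plus a fresh random sample drawn from a pool of unlabeled items. The abstract does not say how sampled unknowns are labeled, how ensemble members are combined, or what features are used. Treating sampled unknowns as provisional negatives, averaging decision scores, the toy 2-D feature vectors, and every function name here are assumptions made for the example, not a description of the CROCIS implementation.

```python
import random


def train_linear_svm(X, y, epochs=300, lr=0.01, lam=0.01, seed=0):
    """Fit (w, b) by stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # Regularization step: shrink the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step: move toward the example if its margin is violated.
            if y[i] * score < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def train_augmented_ensemble(pos, neg, unknown, n_models=25, n_sampled=3, seed=0):
    """Train one SVM per random augmentation of the labeled training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # ASSUMPTION: sampled unknowns are treated as provisional negatives.
        sampled = rng.sample(unknown, n_sampled)
        X = pos + neg + sampled
        y = [1] * len(pos) + [-1] * (len(neg) + len(sampled))
        models.append(train_linear_svm(X, y, seed=rng.randrange(10**6)))
    return models


def ensemble_score(models, x):
    """Average the members' decision values (ASSUMPTION: simple mean)."""
    total = sum(sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in models)
    return total / len(models)


# Toy 2-D vectors standing in for document features (hypothetical data).
labeled_pos = [(2, 2), (2, 3), (3, 2), (3, 3), (2.5, 2), (2.5, 3)]
labeled_neg = [(-x1, -x2) for x1, x2 in labeled_pos]
unknown_pool = [
    (2.5, 2.5), (2.4, 2.6),                    # unlabeled, but positive-like
    (-2.5, -2.5), (-2.4, -2.6), (-2.6, -2.4),  # unlabeled, negative-like
    (-2.5, -2.4), (-2.3, -2.5), (-2.6, -2.6),
]

models = train_augmented_ensemble(labeled_pos, labeled_neg, unknown_pool)
in_category = ensemble_score(models, (2.5, 2.5)) > 0
out_of_category = ensemble_score(models, (-2.5, -2.5)) > 0
print(in_category, out_of_category)
```

Because every member sees a different sample of the unlabeled pool, an unlabeled item that is wrongly treated as a negative affects only some members, and averaging dilutes its influence. A production system would presumably use a tuned SVM library and text-derived features rather than this hand-rolled trainer.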

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: Center for Information Technology (CIT)
Type: Scientific Computing Intramural Research (ZIH)
Project #: 1ZIHCT000200-25
Application #: 8941588
Study Section

Project Start
Project End
Budget Start
Budget End
Support Year: 25
Fiscal Year: 2014
Total Cost
Indirect Cost

Institution

Name: Computer Research and Technology
Department
Type
DUNS #

City
State
Country
Zip Code

Related projects


NIH 2019 ZIH CT	Informatics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Center for Information Technology
NIH 2018 ZIH CT	Informatics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Computer Research and Technology
NIH 2017 ZIH CT	Informatics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Computer Research and Technology
NIH 2016 ZIH CT	Informatics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Computer Research and Technology
NIH 2015 ZIH CT	Informatics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Computer Research and Technology
NIH 2014 ZIH CT	Text Analytics, Machine Learning & Biomedical Data Science Johnson, Calvin A. / Computer Research and Technology
NIH 2013 ZIH CT	Text Analytics, Machine Learning & High Performance Computing Johnson, Calvin A. / Center for Information Technology	$2,419,860
NIH 2012 ZIH CT	Text Analytics, Knowledge Engineering, & High Performance Computing Johnson, Calvin A. / Center for Information Technology	$2,726,852
NIH 2010 ZIH CT	Collective Intelligence, Knowledge Infrastructure, & High Performance Computing Johnson, Calvin A. / Center for Information Technology	$2,823,000
NIH 2009 ZIH CT	Collective Intelligence, Knowledge Infrastructure, & High Performance Computing Johnson, Calvin A. / Center for Information Technology	$2,941,656

Publications

Schmitz, Roland; Wright, George W; Huang, Da Wei et al. (2018) Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N Engl J Med 378:1396-1407

Martins, Andrew J; Narayanan, Manikandan; Prüstel, Thorsten et al. (2017) Environment Tunes Propagation of Cell-to-Cell Variation in the Human Macrophage Gene Network. Cell Syst 4:379-392.e12

Wilcox, Amber N; Silverman, Debra T; Friesen, Melissa C et al. (2016) Smoking status, usual adult occupation, and risk of recurrent urothelial bladder carcinoma: data from The Cancer Genome Atlas (TCGA) Project. Cancer Causes Control 27:1429-1435

Liang, Ma; Raley, Castle; Zheng, Xin et al. (2016) Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads. BioData Min 9:13

Lau, William W; Tsang, John S (2016) Humoral Fingerprinting of Immune Responses: 'Super-Resolution', High-Dimensional Serology. Trends Immunol 37:167-169

Lau, William W; Sparks, Rachel; OMiCC Jamboree Working Group et al. (2016) Meta-analysis of crowdsourced data compendia suggests pan-disease transcriptional signatures of autoimmunity. F1000Res 5:2884

Sparks, Rachel; Lau, William W; Tsang, John S (2016) Expanding the Immunology Toolbox: Embracing Public-Data Reuse and Crowdsourcing. Immunity 45:1191-1204

Russ, Daniel E; Ho, Kwan-Yuet; Colt, Joanne S et al. (2016) Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 73:417-24

Maudsley, Stuart; Martin, Bronwen; Gesty-Palmer, Diane et al. (2015) Delineation of a conserved arrestin-biased signaling repertoire in vivo. Mol Pharmacol 87:706-17

Russ, Daniel E; Ho, Kwan-Yuet; Longo, Nancy S (2015) HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs. BMC Bioinformatics 16:170

Showing the most recent 10 out of 14 publications
