The Text Analytics, Knowledge Engineering, and High Performance Computing Program, which operates within the High Performance Computing and Informatics Office (HPCIO), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build a critical mass in text and numerical analytics that is envisioned to encompass a number of pertinent and related disciplines in biomedical research, including semantic interoperability, knowledge engineering, computational linguistics, text and data mining, natural language processing, machine learning, and visualization. The program is intended to foster advances in critical domains at NIH, including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, "big data" analysis, and portfolio analysis. In 2012, collaborative efforts in support of these goals included the following.
- The human salivary protein catalog has been made available online on a community-based Web portal developed by HPCIO, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- In collaboration with NCI, HPCIO is investigating document classifiers trained using machine-learning methods. One aspect of this collaboration involves the development of a system to match protocols with their funding source in IMPAC II. The need for such a system is motivated by the fact that the NIH project number is specified in only 20% to 25% of all NIH-sponsored protocols. An intended outcome of this matching system is improved classifier performance by augmenting grant document text with matching protocol text.
- In response to input from various collaborative groups, HPCIO is developing a portfolio visualization resource, dubbed PViz, that integrates visualization of categorical data with results of clustering algorithms, to allow analysts to gain new insight into their data. Users may either construct a portfolio from IMPAC II data or import their own custom portfolio of categorical data.
- In collaboration with the Division of Planning, Coordination, and Strategic Initiatives (DPCPSI/OD), we have trained a "one-sided" classifier on a set of Comparative Effectiveness Research (CER) exemplars. The results of this investigation suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at retrospectively identifying CER grants.
- HPCIO has demonstrated the utility of its integrated portfolio clustering and visualization resource on NIAID's Anti-Microbial Resistance portfolio. The current focus of the collaboration with NIAID is to investigate various machine-learning methods (including unsupervised, semi-supervised, and fully supervised algorithms) to map projects to NIAID HIV/AIDS priorities, objectives, and initiatives.
- HPCIO has been collaborating with the Molecular Libraries Program (MLP), part of the NIH Common Fund, to develop the Common Assay Reporting System (CARS). CARS is an integrated system for managing bioassay information and facilitating communication between all the high-throughput screening centers within the Molecular Libraries Probe Production Centers Network (MLPCN). Goals for this collaboration include:
  1) Track project status and related issues at each of the screening centers within the MLPCN, and provide the means for information collection, sharing, and retrieval among the centers and the program office at NIH.
  2) Establish a standardized protocol to describe raw data from the experiments and report screening data to the scientific community.
- A novel statistical test has been developed to identify differentially expressed RNA from RNAseq count data. This work will provide a better idea of the biological differences between cell types.
- HPCIO is working with Melissa Friesen of NCI to develop methodologies to improve exposure classification in occupational epidemiologic studies. The initial effort in this collaboration involves a tool that helps experts classify free-text job descriptions into standard occupational codes. Machine-learning-based classification methods will also be used to help evaluate exposure-disease associations.
- In collaboration with NINDS, HPCIO has implemented and compared several methods to locate and characterize lysosomes in 3-D fluorescence images. The goal is to be able to calculate the pH of each lysosome in the image, for which the ability to resolve their locations is an important step.
- Machine-learning methods have been devised and implemented to identify and refine transcription start sites in the fruit fly genome found using cap analysis gene expression (CAGE). This effort is in collaboration with Brian Oliver of NIDDK.
- We are applying machine-learning methods to identify important terms that peer reviewers use to describe innovative applications. The goal of the effort is to develop a lexicon of terms that can help estimate the innovation level of a grant application based on peer review critiques from the application's NIH Summary Statement.
- HPCIO is working with NINDS and the Office of Extramural Research (OER) to determine peer-review sentiment of grant applications based on the NIH Summary Statement. The sentiment analysis results can provide decision support information to NIH program directors considering applications for selective pay.
- In collaboration with NIA, we are applying machine learning and visualization techniques on mass biological datasets to discover novel patterns of functional gene or protein interactions as related to aging.
Omnimorph, a graphic data analysis tool, is being developed for multidimensional data visualization.
- Although the scientific impact of NCI consortia on the advancement of cancer epidemiology research is understood to be significant, accurate quantitative metrics of this impact are needed by program leadership. We are developing methods to track citations to clinical guidelines in the context of evidence-based medicine that could provide funding agencies and program directors insight into individual consortia's contributions in advancing medical knowledge. This work is being conducted in collaboration with the Epidemiology and Genomics Research Program (EGRP), NCI.
- In collaboration with George Chacko of CSR, HPCIO is applying text analytics to provide CSR leadership with evidence-based decision support in evaluation of the grant review process. The effort so far has concentrated on exploratory analysis against the NIH portfolio to evaluate clustering methods and assess intrinsic measures of cluster quality.
- Based on its experience in building novel models for classifying research grants and projects, HPCIO is collaborating with DPCPSI/OD and NCI to develop a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes, to build and run their own classifiers.
- The Office of Behavioral and Social Sciences Research (OBSSR) is conducting a pilot investigation in collaboration with HPCIO to evaluate the efficacy of machine learning models for the classification of five BSSR-relevant research categories.
- NIA and the Alzheimer's Association have developed a Common Alzheimer's Disease Research Ontology (CADRO) to categorize Alzheimer's disease research. HPCIO is collaborating with NIA to develop classifiers for the six categories, 45 topics, and 145 themes.

National Institutes of Health (NIH)
Center for Information Technology (CIT)
Scientific Computing Intramural Research (ZIH)
Martins, Andrew J; Narayanan, Manikandan; Prüstel, Thorsten et al. (2017) Environment Tunes Propagation of Cell-to-Cell Variation in the Human Macrophage Gene Network. Cell Syst 4:379-392.e12
Lau, William W; Sparks, Rachel; OMiCC Jamboree Working Group et al. (2016) Meta-analysis of crowdsourced data compendia suggests pan-disease transcriptional signatures of autoimmunity. F1000Res 5:2884
Wilcox, Amber N; Silverman, Debra T; Friesen, Melissa C et al. (2016) Smoking status, usual adult occupation, and risk of recurrent urothelial bladder carcinoma: data from The Cancer Genome Atlas (TCGA) Project. Cancer Causes Control 27:1429-1435
Liang, Ma; Raley, Castle; Zheng, Xin et al. (2016) Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads. BioData Min 9:13
Lau, William W; Tsang, John S (2016) Humoral Fingerprinting of Immune Responses: 'Super-Resolution', High-Dimensional Serology. Trends Immunol 37:167-169
Sparks, Rachel; Lau, William W; Tsang, John S (2016) Expanding the Immunology Toolbox: Embracing Public-Data Reuse and Crowdsourcing. Immunity 45:1191-1204
Russ, Daniel E; Ho, Kwan-Yuet; Colt, Joanne S et al. (2016) Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 73:417-24
Maudsley, Stuart; Martin, Bronwen; Gesty-Palmer, Diane et al. (2015) Delineation of a conserved arrestin-biased signaling repertoire in vivo. Mol Pharmacol 87:706-17
Russ, Daniel E; Ho, Kwan-Yuet; Longo, Nancy S (2015) HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs. BMC Bioinformatics 16:170
De, Supriyo; Zhang, Yongqing; Garner, John R et al. (2010) Disease and phenotype gene set analysis of disease based gene expression in mouse and human. Physiol Genomics :

Showing the most recent 10 out of 13 publications