The Text Analytics, Machine Learning, and High Performance Computing Program, which operates within the High Performance Computing and Informatics Office (HPCIO), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build critical mass in text and numerical analytics. The effort is envisioned to encompass a number of related disciplines in biomedical research, including semantic interoperability, knowledge engineering, computational linguistics, text and data mining, natural language processing, machine learning, and visualization. The program is intended to foster advances in critical domains at NIH, including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, "big data" analysis, and portfolio analysis.

In 2013, collaborative efforts in support of these goals included the following.

- In collaboration with NIAID, HPCIO is developing a new algorithm capable of analyzing V(D)J recombination in thousands of immunoglobulin gene sequences produced by high-throughput sequencing.
- HPCIO is working with Melissa Friesen of NCI to develop methodologies to improve exposure classification in occupational epidemiologic studies. The initial effort in this collaboration involves a tool that helps experts classify free-text job descriptions into standard occupational codes. Machine-learning-based classification methods will also be used to help evaluate exposure-disease associations.
- In collaboration with NINDS, HPCIO has implemented and compared several methods to locate and characterize lysosomes in 3-D fluorescence images. The goal is to calculate the pH of each lysosome in the image, for which resolving the lysosomes' locations is an important step.
- In collaboration with NIA, we are applying machine learning and visualization techniques to large biological datasets to discover novel patterns of functional gene or protein interactions related to aging. Omnimorph, a graphic data analysis tool, is being developed for multidimensional data visualization. In this collaboration, we are also developing a model to predict the progression of Alzheimer's disease using plasma proteomic biomarker data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
- Machine-learning methods have been devised and implemented to identify and refine transcription start sites in the fruit fly genome found using cap analysis gene expression (CAGE). This effort is in collaboration with Brian Oliver of NIDDK.
- HPCIO has developed a standardized installation of various distributions of Linux, with customized configurations, that complies with many of the NIH computer configuration policies. This "Lab Linux" configuration leverages the existing IT infrastructure at NIH to comply with the user account life-cycle policy, user account password policy, and Incident Response Team (IRT) scanning policy, thereby making it easy for laboratory staff to use and maintain the system. We are currently seeking partners in the Institutes to test this deployment.
- In collaboration with CSR, HPCIO is applying text analytics to provide CSR leadership with evidence-based decision support in evaluating the grant review process. The effort so far has concentrated on exploratory analysis of the NIH portfolio to evaluate clustering methods and assess intrinsic measures of cluster quality.
- HPCIO has been collaborating with the Molecular Libraries Program (MLP), part of the NIH Common Fund, to develop the Common Assay Reporting System (CARS).
  CARS is an integrated system for managing bioassay information and facilitating communication among all the high-throughput screening centers within the Molecular Libraries Probe Production Centers Network (MLPCN). Goals for this collaboration include:
  1) Track project status and related issues at each of the screening centers within the MLPCN, and provide the means for information collection, sharing, and retrieval among the centers and the program office at NIH.
  2) Establish a standardized protocol to describe raw data from the experiments and report screening data to the scientific community.
- The human salivary protein catalog has been made available online on a community-based Web portal developed by HPCIO, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- We are working with the Office of Extramural Research (OER) on applying machine-learning methods to identify important terms that peer reviewers use to describe innovative applications. The goal of the effort is to develop a lexicon of terms that can help estimate the innovation level of a grant application based on the peer review critiques in the application's NIH Summary Statement.
- Although the scientific impact of NCI consortia on the advancement of cancer epidemiology research is understood to be significant, program leadership needs accurate quantitative metrics of this impact. We are developing methods to track citations to clinical guidelines in the context of evidence-based medicine, which could give funding agencies and program directors insight into individual consortia's contributions to advancing medical knowledge. This work is being conducted in collaboration with the Epidemiology and Genomics Research Program (EGRP), NCI.
- HPCIO is working with NINDS and the Office of Extramural Research (OER) to determine peer-review sentiment of grant applications based on the NIH Summary Statement. The sentiment analysis results can provide decision support information to NIH program directors considering applications for selective pay.
- Based on its experience in building novel models for classifying research grants and projects, HPCIO is collaborating with DPCPSI/OD and NCI to develop a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes to build and run their own classifiers. A prototype of this system is being tailored to assist NCI Intramural investigators in reporting their research to the Annual Report system.
- The Office of Behavioral and Social Sciences Research (OBSSR) is conducting a pilot investigation in collaboration with HPCIO to evaluate the efficacy of machine learning models for the classification of five BSSR-relevant research categories.
- In response to input from various collaborative groups, HPCIO is developing a portfolio visualization resource, known as PViz, that integrates visualization of categorical data with the results of clustering algorithms to allow analysts to gain new insight into their data. Users may either construct a portfolio from IMPAC II data or import their own custom portfolio of categorical data.
- In collaboration with various groups, including the Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI/OD), HPCIO has been developing an augmented support vector machine (SVM) that augments a training set by sampling from a corpus of unknowns and runs a large ensemble on various samples of this augmented space. The results obtained from this classifier suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at categorizing a research portfolio.
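
The augmented SVM in the last item can be illustrated with a minimal sketch. It is not HPCIO's implementation and rests on several assumptions that the abstract does not state: items sampled from the unknown corpus are treated as provisional negatives, each ensemble member is a linear SVM fitted with Pegasos-style stochastic subgradient descent, member scores are combined by simple averaging, and the "portfolio" is synthetic 2-D data. All function names and parameters below are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.1, epochs=30, seed=0):
    """Fit a linear SVM with Pegasos-style stochastic subgradient descent.

    A constant 1.0 is appended to every vector so that the bias is learned
    as an ordinary (regularized) weight.
    """
    rng = random.Random(seed)
    rows = [list(x) + [1.0] for x in X]
    w = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, rows[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, rows[i])]
    return w


def svm_score(w, x):
    """Signed decision value of one linear SVM for one item."""
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))


def train_augmented_ensemble(labeled_X, labeled_y, unknown_X,
                             n_members=25, sample_size=20, seed=0):
    """Train one SVM per random augmentation of the labeled training set."""
    rng = random.Random(seed)
    models = []
    for m in range(n_members):
        sampled = rng.sample(unknown_X, sample_size)
        X = list(labeled_X) + sampled
        # ASSUMPTION: sampled unknowns are labeled as provisional negatives.
        y = list(labeled_y) + [-1] * len(sampled)
        models.append(train_linear_svm(X, y, seed=seed + m))
    return models


def ensemble_score(models, x):
    """Average the members' decision values; a positive mean means in-category."""
    return sum(svm_score(w, x) for w in models) / len(models)


def make_toy_portfolio(seed=0):
    """Synthetic stand-in for a portfolio: two Gaussian clusters in 2-D."""
    rng = random.Random(seed)

    def point(center):
        return [rng.gauss(center, 0.5), rng.gauss(center, 0.5)]

    labeled_X = [point(2.0) for _ in range(15)] + [point(-2.0) for _ in range(5)]
    labeled_y = [1] * 15 + [-1] * 5
    # Unknowns are mostly out-of-category, with 10% hidden in-category items.
    unknown_X = [point(2.0) if i % 10 == 0 else point(-2.0) for i in range(200)]
    return labeled_X, labeled_y, unknown_X


if __name__ == "__main__":
    labeled_X, labeled_y, unknown_X = make_toy_portfolio()
    models = train_augmented_ensemble(labeled_X, labeled_y, unknown_X)
    for item in ([2.5, 2.0], [-2.5, -2.0]):
        print(item, round(ensemble_score(models, item), 3))
```

Because each member sees a different sample of unknowns, hidden in-category items that are mislabeled as negatives affect only some members, and averaging over the ensemble dilutes their influence. In practice the feature vectors would come from grant text rather than two synthetic coordinates.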

National Institutes of Health (NIH)
Center for Information Technology (CIT)
Scientific Computing Intramural Research (ZIH)
Schmitz, Roland; Wright, George W; Huang, Da Wei et al. (2018) Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N Engl J Med 378:1396-1407
Martins, Andrew J; Narayanan, Manikandan; Prüstel, Thorsten et al. (2017) Environment Tunes Propagation of Cell-to-Cell Variation in the Human Macrophage Gene Network. Cell Syst 4:379-392.e12
Wilcox, Amber N; Silverman, Debra T; Friesen, Melissa C et al. (2016) Smoking status, usual adult occupation, and risk of recurrent urothelial bladder carcinoma: data from The Cancer Genome Atlas (TCGA) Project. Cancer Causes Control 27:1429-1435
Liang, Ma; Raley, Castle; Zheng, Xin et al. (2016) Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads. BioData Min 9:13
Lau, William W; Tsang, John S (2016) Humoral Fingerprinting of Immune Responses: 'Super-Resolution', High-Dimensional Serology. Trends Immunol 37:167-169
Lau, William W; Sparks, Rachel; OMiCC Jamboree Working Group et al. (2016) Meta-analysis of crowdsourced data compendia suggests pan-disease transcriptional signatures of autoimmunity. F1000Res 5:2884
Sparks, Rachel; Lau, William W; Tsang, John S (2016) Expanding the Immunology Toolbox: Embracing Public-Data Reuse and Crowdsourcing. Immunity 45:1191-1204
Russ, Daniel E; Ho, Kwan-Yuet; Colt, Joanne S et al. (2016) Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 73:417-24
Maudsley, Stuart; Martin, Bronwen; Gesty-Palmer, Diane et al. (2015) Delineation of a conserved arrestin-biased signaling repertoire in vivo. Mol Pharmacol 87:706-17
Russ, Daniel E; Ho, Kwan-Yuet; Longo, Nancy S (2015) HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs. BMC Bioinformatics 16:170
