The Informatics, Machine Learning, and Biomedical Data Science program, which operates within the High Performance Computing and Informatics Office (HPCIO), Division of Computational Bioscience of CIT, is collaborating with NIH investigators to build critical mass in text and numerical analytics. The effort is envisioned to encompass a number of related disciplines in biomedical research, including semantic interoperability, computational linguistics, text and data mining, natural language processing, machine learning, longitudinal analysis, and visualization. The program is intended to foster advances in critical domains at NIH, including biomedical and clinical informatics, translational research, genomics, proteomics, systems biology, big data analysis, and portfolio analysis. In 2015, collaborative efforts in support of these goals included the following:
- In collaboration with NIA, we are applying machine learning and visualization techniques to large biological datasets to discover novel patterns of functional gene or protein interactions related to aging. In this collaboration, we are developing a machine learning method that models the temporal nature of longitudinal clinical data to predict the progression of amyotrophic lateral sclerosis. Such a method may also work well for prediction on high-dimensional time-series genomic data.
- In collaboration with NIAID, HPCIO has released HT JoinSolver(R), a new application capable of analyzing V(D)J recombination in thousands of immunoglobulin gene sequences produced by high-throughput sequencing.
- HPCIO is working with NCI to develop methodologies to incorporate occupational risk factors into epidemiological models. Novel classifiers are being developed to classify free-text job descriptions into the 840 codes of the 2010 U.S. Standard Occupational Classification (SOC) System. Agreement between our classification system and expert coders is measured using SOC code agreement and exposure agreement after applying CANJEM, a job-exposure matrix of over 250 exposure agents developed by Jérôme Lavoué at the University of Montreal.
- In collaboration with the Membrane Transport Biophysics Section of NINDS, HPCIO is 1) developing a tool to accurately identify the boundaries of lysosomes in fluorescence microscopy and 2) using the fluorescence ratio to measure lysosomal pH within each organelle, to better understand lysosomal pH regulation.
- HPCIO is collaborating with NIAID to study immune cell infiltration in various tissue samples from patients with metabolic diseases. Using systems-based approaches, we examine gene expression and genotyping data to understand the roles and interactions of different immune cells in response to metabolic disease signals and their associations with intervention outcomes and other phenotypes.
- A freely available plasmid database that is interoperable with popular freeware is currently being developed for the NIDA Optogenetics and Transgenic Technology Core. The Plasmid Manager offers a versatile yet simple platform for scientists to store and analyze their plasmid data. Motivated by the need for a more comprehensive approach to archiving plasmid data, the database platform is enriched with numerous components beyond the repository, serving as an informatics platform designed to enhance the efficiency and analytic capabilities of scientists.
- In collaboration with CSR, HPCIO is applying text analytics to provide CSR leadership with evidence-based decision support in evaluating the grant review process. A Web-based automated referral tool, called ART, is being developed to help PIs and SROs identify the most relevant study section(s) or special emphasis panel(s) based on the scientific content of an application. In addition, HPCIO is analyzing text from quick feedback surveys on peer review. This effort includes a pilot study to evaluate the feasibility of analyzing free text from peer reviewers on their perception of study section quality. If successful, the pilot results will be used as initial input for a full-scale implementation.
- The Human Salivary Protein wiki has been made available online on a community-based Web portal developed by HPCIO, in collaboration with NIDCR, to enable scientists to add their own research data, share results, and discover new knowledge. This is a major step towards the discovery and use of saliva biomarkers to diagnose oral and systemic diseases.
- In collaboration with the Office of Data Analysis Tools and Systems, NIH Office of the Director, HPCIO has been developing a standard database update pipeline for NIH Topic Maps, originally developed by Dr. Ned Talley of NINDS. We are evaluating whether this pipeline can be incorporated into a stable hosted instance.
- High-throughput next-generation sequencing (NGS) technology plays an important role in systematically identifying novel cancer driver mutations in genome-wide surveys. As a result, NGS data generation is rapidly increasing, currently accumulating at a rate of several terabytes every month at the Lymphoid Malignancies Section of NCI, and database platforms need to be enhanced in anticipation of even more growth in the near future. The recent emergence of Hadoop/NoSQL systems (e.g., HBase) has provided an alternative platform for querying large-scale genomic data. In addition, relational database providers have been enhancing their offerings to include products for explicitly distributing data across multiple nodes (e.g., Postgres-XL). We have sought to integrate these technologies with current relational database systems (e.g., Postgres) to improve performance in a parallel or distributed manner. The goal of our effort has been to investigate the potential of these distributed platforms for storing and querying the large volumes of data that NCI accumulates, thereby augmenting its current analytical capabilities.
- Based on its experience in building novel models for classifying research grants and projects, HPCIO has collaborated with DPCPSI/OD and other ICs to develop the Portfolio Learning Tool, a comprehensive classification workflow system that will allow users to select from multiple classification algorithms, feature spaces, and training regimes to build and run their own classifiers. HPCIO has developed an augmented support vector machine (SVM) that augments a training set by sampling from a corpus of unknowns and runs a large ensemble on various samples of this augmented space. The results obtained from this classifier suggest that, when coupled with an effective annotation strategy, such a classifier can be quite effective at categorizing a research portfolio.
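The augmented SVM ensemble described above can be illustrated with a minimal sketch. This is not HPCIO's implementation; the summary does not say how sampled unknowns are labeled, how large the ensemble is, or which solver is used. The sketch assumes that sampled unknowns are treated as provisional negatives, that each ensemble member is a linear SVM trained with Pegasos-style sub-gradient descent, and that the ensemble score is the fraction of members voting positive. All function and parameter names are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Train a linear SVM with Pegasos-style sub-gradient descent.

    X: list of feature lists (append a constant 1.0 to act as a bias term).
    y: list of labels, each +1 or -1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Regularization shrink, then hinge-loss update if margin is violated.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def train_augmented_ensemble(positives, negatives, unknowns,
                             n_members=25, sample_size=5, seed=0):
    """Train one SVM per random sample of the unknown corpus.

    ASSUMPTION: sampled unknowns are labeled -1 (provisional negatives).
    """
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        sampled = rng.sample(unknowns, min(sample_size, len(unknowns)))
        X = positives + negatives + sampled
        y = [1] * len(positives) + [-1] * (len(negatives) + len(sampled))
        members.append(train_linear_svm(X, y, seed=rng.randrange(10**6)))
    return members


def ensemble_score(members, x):
    """Return the fraction of ensemble members that vote x as positive."""
    votes = sum(1 for w in members
                if sum(wj * xj for wj, xj in zip(w, x)) > 0)
    return votes / len(members)
```

Because each member sees a different sample of unknowns, a hidden positive that is mislabeled in one sample affects only some of the votes, which is one plausible reason to average over a large ensemble.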
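For the occupational coding work with NCI, SOC code agreement between a classifier and expert coders can be computed at several levels of the SOC hierarchy. The following is a minimal sketch, assuming codes are written in the usual `NN-NNNN` form and that agreement is a simple proportion of matching code prefixes. The 2-, 3-, 5-, and 6-digit prefixes correspond roughly to the major group, minor group, broad occupation, and detailed occupation levels of the 2010 SOC. The function names are hypothetical, and exposure agreement after applying CANJEM is not covered here.

```python
def soc_prefix(code, digits):
    """Return the first `digits` digits of a SOC code such as '11-1011'."""
    return code.replace("-", "")[:digits]


def agreement_by_level(predicted, expert, levels=(2, 3, 5, 6)):
    """Proportion of jobs on which predicted and expert codes share a prefix.

    predicted, expert: equal-length lists of SOC code strings.
    Returns a dict mapping each prefix length to an agreement proportion.
    """
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    result = {}
    for digits in levels:
        matches = sum(
            soc_prefix(p, digits) == soc_prefix(e, digits)
            for p, e in zip(predicted, expert)
        )
        result[digits] = matches / len(predicted)
    return result
```

Agreement can only stay the same or drop as the prefix gets longer, so reporting all four levels shows how far down the hierarchy the classifier and the coders still match.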

National Institutes of Health (NIH)
Center for Information Technology (CIT)
Scientific Computing Intramural Research (ZIH)
Computer Research and Technology
Schmitz, Roland; Wright, George W; Huang, Da Wei et al. (2018) Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N Engl J Med 378:1396-1407
Martins, Andrew J; Narayanan, Manikandan; Prüstel, Thorsten et al. (2017) Environment Tunes Propagation of Cell-to-Cell Variation in the Human Macrophage Gene Network. Cell Syst 4:379-392.e12
Wilcox, Amber N; Silverman, Debra T; Friesen, Melissa C et al. (2016) Smoking status, usual adult occupation, and risk of recurrent urothelial bladder carcinoma: data from The Cancer Genome Atlas (TCGA) Project. Cancer Causes Control 27:1429-1435
Liang, Ma; Raley, Castle; Zheng, Xin et al. (2016) Distinguishing highly similar gene isoforms with a clustering-based bioinformatics analysis of PacBio single-molecule long reads. BioData Min 9:13
Lau, William W; Tsang, John S (2016) Humoral Fingerprinting of Immune Responses: 'Super-Resolution', High-Dimensional Serology. Trends Immunol 37:167-169
Lau, William W; Sparks, Rachel; OMiCC Jamboree Working Group et al. (2016) Meta-analysis of crowdsourced data compendia suggests pan-disease transcriptional signatures of autoimmunity. F1000Res 5:2884
Sparks, Rachel; Lau, William W; Tsang, John S (2016) Expanding the Immunology Toolbox: Embracing Public-Data Reuse and Crowdsourcing. Immunity 45:1191-1204
Russ, Daniel E; Ho, Kwan-Yuet; Colt, Joanne S et al. (2016) Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 73:417-24
Maudsley, Stuart; Martin, Bronwen; Gesty-Palmer, Diane et al. (2015) Delineation of a conserved arrestin-biased signaling repertoire in vivo. Mol Pharmacol 87:706-17
Russ, Daniel E; Ho, Kwan-Yuet; Longo, Nancy S (2015) HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs. BMC Bioinformatics 16:170
