The NHGRI Bioinformatics and Scientific Programming Core supports the research performed by NHGRI/DIR investigators by providing expertise and assistance in bioinformatics and computational analysis. The Core facilitates access to specialized software and hardware, develops generalized software solutions that can address a variety of questions in genomic research, develops database solutions for the efficient archiving and retrieval of experimental and clinical data, disseminates new software and database solutions to the genome community at large, collaborates with NHGRI researchers on computationally intensive projects, and provides educational opportunities in bioinformatics to NHGRI investigators and trainees. The majority of engagements between the Core and DIR investigators are collaborative interactions intended to advance specific research projects. The support provided for these projects includes not only data analysis but also related efforts focused on data collection and dissemination through the public NHGRI/DIR Web site.

Scientific projects undertaken during the reporting period include the development of a new variant discovery and phenotyping pipeline to address the increased demand for variant calling on human genome and exome data. The Core has maintained and updated a GATK-based pipeline that builds upon best practices published by the Broad Institute. This standardized and validated pipeline is currently being used in The Genome Ascertainment Consortium (TGAC) effort led by Dr. Biesecker; the goals of this effort are to improve our overall understanding of the phenotypic consequences of genetic variation and to predict phenotypes from genotypes. To that end, the pipeline has facilitated the creation of a uniformly processed and formatted genotype callset across multiple cohorts, based on data from multiple sources.
To date, we have processed over 1,500 exome samples from the ClinSeq cohort as well as a larger dataset of 4,600 genomes from the INOVA Translational Medicine Institute. The pipeline continues to be updated with new software versions and more recent releases of human genome sequences. It has also been optimized for the Biowulf HPC environment by parallelizing the per-sample processing steps and using local SSD storage on nodes (where available) to increase speed and reduce network overhead. Going forward, this efficient pipeline will allow for the re-calling of data from this growing cohort of individuals who have agreed to be re-contacted for secondary phenotyping studies, with the increased sample sizes affording greater power to discover important phenotype/genotype associations. Alongside this effort, the Core has developed an interactive browser for visualizing aggregate exome and genome data from the TGAC cohorts, using the gnomAD codebase as its foundation. The same variant calling pipeline is also being used by several additional NHGRI/DIR investigators in the course of their studies, and the codebase has been shared with the NIH Intramural Sequencing Center (NISC) for their own implementation.

The Core has also focused on developing an in-house somatic variant calling pipeline for the detection of mosaic and cancer somatic variants. This pipeline leverages the initial alignment stages of the existing germline pipeline but, in the variant calling step, uses SAMtools mpileup and VarScan to produce a highly sensitive variant caller capable of detecting alleles present in only 5% of the reads in a sample. Key to the utility of this pipeline are detailed quality and read count statistics broken down by strand direction, which allow the scientific end user to create a custom filtering strategy.
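As an illustration of the kind of custom filtering strategy that strand-specific read counts make possible, the following minimal Python sketch computes a variant allele fraction and requires alternate-allele support on both strands. It is not the Core's implementation: the `StrandCounts` class, its field names, and every threshold other than the 5% allele fraction mentioned above are assumptions made for this example, and the counts would need to be parsed from the pipeline's actual output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandCounts:
    """Per-strand read counts at one candidate site (hypothetical field names)."""
    ref_fwd: int  # reference-supporting reads on the forward strand
    ref_rev: int  # reference-supporting reads on the reverse strand
    alt_fwd: int  # variant-supporting reads on the forward strand
    alt_rev: int  # variant-supporting reads on the reverse strand

    @property
    def alt_total(self) -> int:
        return self.alt_fwd + self.alt_rev

    @property
    def depth(self) -> int:
        return self.ref_fwd + self.ref_rev + self.alt_total

    @property
    def vaf(self) -> float:
        """Fraction of reads supporting the variant allele (0.0 with no coverage)."""
        return self.alt_total / self.depth if self.depth else 0.0


def passes_filter(site: StrandCounts,
                  min_vaf: float = 0.05,         # 5% allele fraction, as in the text
                  min_depth: int = 20,           # assumed threshold
                  min_alt_per_strand: int = 2    # assumed threshold
                  ) -> bool:
    """Keep a site only if it is well covered, reaches the minimum allele
    fraction, and has variant-supporting reads on both strands."""
    if site.depth < min_depth:
        return False
    if site.vaf < min_vaf:
        return False
    # Variant reads confined to a single strand are a common artifact signature.
    return min(site.alt_fwd, site.alt_rev) >= min_alt_per_strand
```

For example, a site with counts `StrandCounts(90, 90, 6, 6)` would be kept under these defaults, whereas `StrandCounts(90, 90, 12, 0)` has the same allele fraction but would be rejected because all variant-supporting reads fall on one strand. An end user could tighten or relax any of the thresholds to suit a given project.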
The somatic pipeline has been applied to two projects to date: the detection of mosaicism (Biesecker lab) and the detection of mitochondrial genome heteroplasmy in cell-free DNA (McGuire lab).

Additional projects include:
- the development of computational methods to analyze RNA-seq data obtained from the zebrafish translatome
- the implementation of new gene prediction pipelines for annotation of whole-genome sequencing data
- comparisons between the translatomes of wild-type and Vegf-overexpressing zebrafish to identify genes under Vegf regulation
- annotation of samples from the TGAC cohorts with HLA genotypes and integration of these data into the gnomAD browser
- implementation of the GEMINI database to allow viewing of full sample-level genotypes from the TGAC cohort
- development of a website to return negative secondary findings to participants from the A2 ClinSeq cohort
- analysis of ClinSeq exomes for somatic variants in genes implicated in clonal hematopoiesis of indeterminate potential (CHIP)
- RNA-seq analysis of whole blood from sickle cell patients to identify differentially expressed genes in individuals with and without leg ulcers
- development of a public web browser and BLAT interface for the goldfish genome assembly
- an assessment of the feasibility of using single-cell RNA-seq to interrogate the transcriptomes of pancreatic islet cells obtained from post-autopsy tissue
- investigation of the functional bases of diabetes disease risk through the use of scRNA sequencing technology to interrogate the transcriptome at the single-cell level
- ChIP-seq analyses to determine how HIST1H1A dysregulation affects transcription factor and chromatin-associated protein binding
- ATAC-seq analyses to determine how HIST1H1A dysregulation impacts prostate cancer-specific chromatin structure
- RNA-seq analyses comparing gene expression in prostate tissue samples from wild-type vs. HIST1H1A knock-out mice, to determine how HIST1H1A affects metastasis susceptibility in prostate cancer
- RNA-seq analyses and eQTL mapping to identify modifier genes responsible for aggressive forms of prostate cancer in (TRAMP x WSB) F2 mice and (HiMyc x DO) F1 mice
- updating the Skippy web server to include additional complementary tools for splicing prediction
- identification of mitochondrial DNA variants in six mouse strains, with the eventual goal of determining whether metabolic phenotypes are associated with different mitochondrial haplotypes
- analysis of cell-free DNA (cfDNA) in patients with mitochondrial disease and in healthy controls in order to interrogate the serum virome
- implementation of a GEMINI database with data from 750 dogs to allow complex searches of dog genotypes
- design and implementation of surveys that assess the health of dogs whose DNA samples have been submitted to scientific studies
- RNA-seq analysis of fibroblasts with different cyclodextrin treatments and LysoTracker staining profiles to discover genetic modifiers of Niemann-Pick disease type C (NPC)
- RNA-seq analyses of post-mortem brain tissue to characterize neuronal gene expression in youth with a history of ADHD and matched non-psychiatric controls, with the goal of establishing a neuronal transcriptome and determining which genes and neural gene networks influence the development of ADHD
- RNA-seq analysis of the developing fetal brain to understand the effect of GM2 gangliosidosis on gene expression and to determine the regions of the brain most significantly dysregulated by storage of gangliosides
- identification of AAV integration sites in mouse and human genomes and development of methods to characterize the clustering and locations of these integration sites

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Scientific Cores Intramural Research (ZIC)