Bioinformatics developments
A major development within my group has been the creation of a suite of tools for variation analysis and annotation using NextGen sequence data, in particular data generated by Illumina's massively parallel sequencers, i.e., the GAIIx and HiSeq 2000. The Illumina data processing pipeline aligns sequence reads with ELAND, a hash-based alignment algorithm, but even in its most current version, ELAND is not accurate enough to support reliable variation detection. In 2009, my group developed diagCM, which realigns reads and their unaligned read pairs to 100 kb genomic windows determined by ELAND. This more refined alignment is performed with the program cross_match, a banded Smith-Waterman aligner written by Phil Green. In this way, sequence reads are aligned to a reference genomic sequence for the species being sequenced. Typically these are human samples, but the methods can be, and have been, applied to sequence from other species, e.g., mouse and fly. We convert diagCM alignments into BAM format, the binary alignment format developed for the 1000 Genomes Project, which in turn is the input format for our program bam2mpg (Teer et al., 2010). Bam2mpg is freely available at http://research.nhgri.nih.gov/software/bam2mpg. The algorithm used in bam2mpg, called MPG (Most Probable Genotype), is based on a Bayesian model of sampling from one or two chromosomes with sequencing error, and calculates the posterior probability of each possible genotype given the observed sequence data. The most probable genotype at each position is reported, along with its "MPG score", which is the value of ln(P(Gi|A)/P(Gj|A)), where Gi is the most probable genotype and Gj is the second most probable genotype. We have found empirically that, when calculated from well-aligned Illumina reads, genotypes with MPG scores of 10 or greater agree with Infinium genotypes about 99.8% of the time.
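To make the MPG score concrete, the following is a minimal sketch of a Bayesian genotype call at a single diploid position. It is not the bam2mpg implementation: the function names, the flat prior over genotypes, the single fixed per-base error rate, and the restriction to the two-chromosome case are all assumptions made for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_log_posteriors(observed, error_rate=0.01):
    """Log posterior of each diploid genotype given the observed bases.

    Assumptions (not taken from bam2mpg): a flat prior over the ten
    unordered genotypes and one fixed error rate for every base.
    """
    log_like = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for base in observed:
            # Each read base is sampled from one of the two chromosomes
            # with equal probability, then possibly miscalled.
            p1 = 1 - error_rate if base == a1 else error_rate / 3
            p2 = 1 - error_rate if base == a2 else error_rate / 3
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_like[a1 + a2] = total
    # With a flat prior the posterior is the normalized likelihood;
    # normalize in log space for numerical stability.
    peak = max(log_like.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_like.values()))
    return {g: v - log_norm for g, v in log_like.items()}


def most_probable_genotype(observed, error_rate=0.01):
    """Return (genotype, score), where score is the log ratio of the most
    probable genotype's posterior to the second most probable one's."""
    posteriors = genotype_log_posteriors(observed, error_rate)
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    (best, lp_best), (_, lp_second) = ranked[0], ranked[1]
    return best, lp_best - lp_second
```

Under these assumptions, a position covered by only a few concordant reads receives a low score, while deeper coverage pushes the score past the threshold of 10 discussed above. For example, `most_probable_genotype("A" * 20)` calls a homozygous genotype with a score above 10, whereas `most_probable_genotype("AAA")` makes the same call with a score well below it.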
Protein Integrated ANNOtation (PIANNO) and Conserved Domain-based Prediction (CDPred) are two key software suites that quickly and efficiently annotate variants de novo and prioritize them for further review. PIANNO annotates variants based on UCSC known gene annotations, and is designed to be versatile and adaptable to changes and upgrades to gene annotations in public databases. CDPred is a novel algorithm we developed to score and prioritize missense variants based on their evolutionary conservation. CDPred assigns scores that reflect the severity of substitutions residing in conserved domains by taking advantage of multiple sequence alignments in the Conserved Domain Database (CDD). We have compared CDPred with currently popular methods in the field (namely PolyPhen2 and SIFT) and found CDPred to perform better in classifying disease-causing variants. CDPred, in concert with PIANNO annotations (missense, nonsense, and splice-site), has proven to be extremely powerful in quickly discovering and pinpointing disease-causing variants within human genes. The CDPred software is available at http://research.nhgri.nih.gov/software/CDPred/guide.shtml. The Comparative Genomics Unit has used this analysis suite to analyze sequence data from more than 1000 samples captured with Agilent's whole exome and custom capture kits and sequenced at NISC. All discovered variants, genotypes, and annotations are stored in a custom Oracle database, where they are available for export and delivery to investigators. Within a project, all variants are genotyped across all samples (a process we call back genotyping) to allow us to determine whether samples have been completely interrogated at all variant positions, and to provide accurate allele frequency estimates for each variant. The interpretation of the results generated by the methods described above presents a challenge to the investigator who wishes to find the causal variant(s) in his or her study.
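As a small illustration of why back genotyping matters for allele frequency estimates, the sketch below summarizes one variant position across the samples in a project. The function name, the input layout (a dictionary from sample name to a two-letter diploid genotype, or None where no confident call was made), and the diploid assumption are hypothetical and are not taken from our pipeline or database schema.

```python
def summarize_variant(genotypes, alt_allele):
    """Return (call_rate, alt_allele_frequency) for one variant position.

    genotypes: dict mapping sample name to a diploid genotype string such
    as "AG", or None when the sample has no confident call (hypothetical
    layout, for illustration only).
    """
    called = [g for g in genotypes.values() if g is not None]
    # Fraction of samples actually interrogated at this position.
    call_rate = len(called) / len(genotypes) if genotypes else 0.0
    # Only called samples contribute chromosomes to the denominator, so
    # uncalled samples are not silently counted as homozygous reference.
    n_chromosomes = 2 * len(called)
    alt_count = sum(g.count(alt_allele) for g in called)
    frequency = alt_count / n_chromosomes if n_chromosomes else None
    return call_rate, frequency
```

For instance, `summarize_variant({"s1": "AG", "s2": "AA", "s3": None, "s4": "GG"}, "G")` reports that three of the four samples were called and estimates the frequency of G from those three samples only.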
To allow easier analysis of whole exome results by investigators with limited bioinformatics experience and resources, we developed a graphical tool, VarSifter, which reads sequence variation data from several formats (including the emerging standard Variant Call Format, VCF). Variants and annotation information are presented in tabular format, with each row linking to the genotypes for each sample at that variant position. VarSifter allows sorting and filtering of columns, and includes a framework for generating custom queries. This tool has been highly regarded by users, as it allows analysis of complex next-gen sequence data with little previous bioinformatics or computer programming background. To date, variants discovered using our tools have been the basis for a number of published manuscripts (Johnston et al., 2010; Pineda-Alvarez, 2011; Teer et al., 2010; Teer and Mullikin, 2010; Wei et al., 2011).
In addition to genomic sequencing, we have also been exploring transcriptome profiling using NextGen sequencing, generally called RNA-Seq, as well as gene-expression microarrays. For the ClinSeq study, participants have been divided into experimental and control groups from the two extremes of coronary artery calcification, as assessed by computed tomography scanning. Using two sources of RNA for each subject (lymphoblastoid cell lines and whole blood), we generated sequence for 16 transcriptomes (8 case, 8 control; matched for age and gender) and concurrently analyzed the samples using Affymetrix Human Exon 1.0 ST microarrays. Sequence data were processed through our custom bioinformatics/statistics pipeline, which interrogates multiple aspects of RNA-Seq whole-transcriptome data, including differential gene-expression levels, alternative splice-site usage patterns, SNP discovery, potential differences in allelic expression at known heterozygous sites, and annotation of newly detected transcribed regions.
After initial data processing, we applied a set of novel statistical methods to identify genes with consistent differences in expression levels and alternative-splicing patterns between the high-calcification and low-calcification groups. Using these methods, we have identified a set of 100 genes that clinically correlate with the atherosclerosis phenotype. This set comprises both genes for which previous studies have shown an association with atherosclerosis and previously unreported genes that represent new candidates of interest.
Sanger-based Medical Sequencing Collaborations
We participated in several cancer-related studies that used sequencing to identify novel variants that play important roles in cancer biology. In two cancer-related collaborations, we helped to reveal the role of ADAMTS18 as a novel oncogene in melanoma (Wei et al., 2010) and novel somatic mutations in heterotrimeric guanine nucleotide-binding proteins (G-proteins) (Cardenas-Navia et al., 2010). The systematic targeting of all genes known to be necessary for ciliary biogenesis and function led to the discovery of the key role of mutations in TTC21B, both causal and modifying, in the spectrum of ciliopathies (Davis et al., 2011). Another project targeted 37 human ARS genes in 355 patients with Charcot-Marie-Tooth (CMT) disease and identified KARS as the fourth gene associated with this disease (McLaughlin et al., 2010). A third used linkage to select a set of genes for sequencing and found that mutations in the lysosomal enzyme-targeting pathway can cause persistent stuttering (Kang et al., 2010). In a multi-tiered sequencing approach, we successfully identified NBEAL2 as the gene responsible for the gray-platelet syndrome (Gunay-Aygun et al., 2010; Gunay-Aygun et al., 2011).

Support Year: 7
Fiscal Year: 2011
Total Cost: $1,346,722
Name: National Human Genome Research Institute