Methods Development

Because non-independence of marker data is particularly relevant in next-generation sequencing data, most of the theoretical work during the past year has focused on the development, testing and implementation of tiled regression, a linear regression-based method for intra-familial tests of association that addresses non-independence at both the marker and the observational level. Tiled regression applies multiple and stepwise regression methods within predefined segments of the genome (tiles), defined by hotspot blocks, to identify independent sequence variants responsible for variation in quantitative traits or susceptibility in qualitative traits. Within each tile, multiple, stepwise (and other) regression methods test the sequence variants for association and select the independent markers. Higher-order regressions are then used to identify significant variants across tiles, chromosomes and the entire genome. Both quantitative and qualitative traits can be analyzed. With this approach, it becomes practical to analyze hundreds of thousands or millions of markers and their significant gene x gene interaction terms. This approach can substantially reduce the total number of tests, to a number closer to the number of tiles than to the number of markers. Furthermore, the tiled approach can be incorporated into a linear regression framework that allows for non-independence between observations by incorporating features from the Regression of Offspring on Mid-Parent (ROMP) and Generalized Estimating Equations approaches. The tiled regression method was tested with simulated mini-exome sequence data as part of Genetic Analysis Workshop 17, and the results are presented in detail in Sung et al. (BMC Proc, 2011).
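As an illustration only, the two-stage logic described above (select independent markers within each tile, then run a higher-order regression across the survivors) can be sketched in Python. This is not the TRAP implementation. The function names (`forward_select`, `tiled_regression`, `simulate`), the simulated genotypes and the entry threshold `f_enter` are all hypothetical. The sketch makes several simplifying assumptions: tiles are supplied as lists of marker indices rather than derived from hotspot blocks, individuals are unrelated (no ROMP or GEE correction), the trait is quantitative, and a forward-only selection with a partial F statistic threshold stands in for full stepwise regression with p-value criteria.

```python
import random

def _center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_select(y, columns, candidates, f_enter=20.0):
    """Forward selection on a partial F statistic (assumed entry rule).

    Candidates are orthogonalized against already-chosen markers, so a marker
    that only echoes a chosen one has little left to explain.
    """
    n = len(y)
    resid = _center(y)
    z = {j: _center(columns[j]) for j in candidates}
    chosen = []
    while z and n - len(chosen) - 2 > 0:
        rss = _dot(resid, resid)
        best, best_f = None, f_enter
        for j, zj in z.items():
            ss = _dot(zj, zj)
            if ss < 1e-9:  # monomorphic or fully redundant marker
                continue
            drop = _dot(zj, resid) ** 2 / ss  # RSS reduction if j enters
            f = drop / (max(rss - drop, 1e-12) / (n - len(chosen) - 2))
            if f > best_f:
                best, best_f = j, f
        if best is None:
            break
        zb = z.pop(best)
        ss = _dot(zb, zb)
        b = _dot(zb, resid) / ss
        resid = [r - b * v for r, v in zip(resid, zb)]
        for j in list(z):  # remove the chosen marker's component from the rest
            c = _dot(z[j], zb) / ss
            z[j] = [u - c * v for u, v in zip(z[j], zb)]
        chosen.append(best)
    return chosen

def tiled_regression(y, columns, tiles, f_enter=20.0):
    """Stage 1: select within each tile. Stage 2: select across tile survivors."""
    survivors = []
    for tile in tiles:
        survivors.extend(forward_select(y, columns, tile, f_enter))
    return forward_select(y, columns, survivors, f_enter)

def simulate(n=400, n_tiles=6, per_tile=5, seed=1):
    """Toy data: correlated markers within tiles; markers 0 and 15 are causal."""
    rng = random.Random(seed)

    def draw():  # genotype 0/1/2 with allele frequency 0.3
        return (rng.random() < 0.3) + (rng.random() < 0.3)

    columns, tiles = [], []
    for _ in range(n_tiles):
        base = [draw() for _ in range(n)]
        tile = []
        for k in range(per_tile):
            # Proxy markers copy the base genotype for roughly 60% of individuals.
            col = base if k == 0 else [draw() if rng.random() < 0.4 else g for g in base]
            tile.append(len(columns))
            columns.append(col)
        tiles.append(tile)
    y = [columns[0][i] + columns[15][i] + rng.gauss(0, 1) for i in range(n)]
    return y, columns, tiles

if __name__ == "__main__":
    y, columns, tiles = simulate()
    print(tiled_regression(y, columns, tiles))
```

In this toy setting the second stage tests only the markers that survived their own tile, which is the sense in which the number of tests tracks the number of tiles rather than the number of markers. The threshold of 20 is an arbitrary, deliberately stringent choice for the sketch, not a recommended value.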
The most striking finding from the Genetic Analysis Workshop 17 analysis was that methods using simple linear regression without considering correlations between markers have estimated type I error rates (false positive rates) that are inflated by as much as three orders of magnitude (up to 1000 times their expected type I error rates), depending on the underlying genetic model. The magnitude of the inflation appears to be related to the correlations due to unknown causal variants that contribute to a quantitative trait. This suggests that, even with permutation tests, the type I error rate for the analysis of sequence data with GWAS methods may be substantially inflated if marker-marker correlations are ignored, generating thousands of false positive results. Because the tiled regression method identifies only independent sequence variants, its type I error rate is stable regardless of the underlying genetic model. Permutation tests using the tiled regression method should yield appropriate type I error rates.

This approach has been applied to SNP data from fine-mapping studies of the scoliosis data, in collaboration with Dr. Nancy Miller (U of Colorado; two manuscripts submitted, 2011), and to two targeted candidate gene sequencing projects: an NF1 project in collaboration with Dr. Douglas Stewart and the ClinSeq project in collaboration with Dr. Les Biesecker. In 2011 the tiled regression methodology was implemented in TRAP, a software package written in the freely available R language. The package is freely available on the NHGRI website: http://research.nhgri.nih.gov/software/TRAP.

Two other projects involved the simulated mini-exome sequence data from Genetic Analysis Workshop 17, and the findings are now in press. As part of the first project (Simpson et al., BMC Proc 2011), we evaluated intrafamilial tests of association in order to compare the statistical properties of likelihood-based methods and regression-of-offspring-based (ROMP) methods.
In the samples considered, both methods were able to detect causal sequence variants with locus-specific heritabilities greater than about 0.1, but neither method was able to detect causal variants with locus-specific heritabilities near 0. There was some inflation of the type I error rates for both methods. In the second project (Kim et al., BMC Proc 2011), we evaluated machine learning methods to detect associations in the GAW 17 simulated data. These methods did not provide any substantial advantage over more traditional methods, although interaction effects, the strength of machine learning methods, were not included in the underlying simulation model.

Collaborations

Familial Idiopathic Scoliosis

Several analyses focusing on candidate regions and phenotypic subsets have been completed, and manuscripts have either been submitted or are in preparation. These included:

1) Statistical genetic analysis of two sets of families with familial idiopathic scoliosis with characteristics nearly identical to those of the sample analyzed in Miller et al. 2005. Linkage analysis and tests of association were performed in two regions on chromosome 1 previously identified as primary candidate regions. We have identified several regions of interest for subsequent next-generation sequencing (Behnemann, doctoral thesis, anticipated 2011).

2) Targeted sequencing of the IRX gene family in families with kyphoscoliosis. We have identified an association between kyphoscoliosis and a sequence variant in an upstream conserved region of one of the IRX genes. Association analysis yielded 12 SNPs with p-values <0.025, of which 11 are 500 kb from IRX1, including the most significant SNP (p = 0.000382). One of these SNPs is in an HCNR sharing 87% sequence identity with an HCNR upstream from IRX3 on 16q12 (Justice et al., submitted).

3) Statistical genetic analysis of STRPs and SNPs on chromosomes 9 and 16.
Fine mapping on chromosomes 9 and 16 was performed to narrow previously identified candidate regions. Linkage and association studies identified several highly significant regions that are candidates for next-generation sequencing (Miller et al., submitted).

4) A study based on the presence of males with severe scoliosis (Miller et al., submitted). The subset of males with severe curves comprised 25 families (207 individuals) in which at least one male was diagnosed in adolescence with a lateral curvature of ≥30°. The genome-wide linkage analysis for the qualitative and quantitative traits resulted in significant p-values (2 adjacent markers with p-values <0.01) on chromosomes 2, 16 and 22. Significant SNPs lie primarily in the introns of the LARGE gene, integral to the development and maintenance of skeletal muscle, and SFI1, responsible for the integrity of the chromosomal centromere complex.

Other large ongoing collaborations include:

1) Clinical characterization of NF1 (Dr. Douglas Stewart, NIH/NCI)
2) The ClinSeq project (Dr. Les Biesecker, NIH/NHGRI)
3) The GeneSTAR project (Drs. Diane and Lewis Becker, Johns Hopkins University School of Medicine; Mathias et al., 2010)
4) Variation in metabolites in the Irish (Dr. Larry Brody, NIH/NHGRI)