Genome-wide association studies (GWAS) have been moderately successful in identifying common variants that are associated with phenotypic differences. However, the greater part of the heritable component of any given complex trait has yet to be explained. New technologies that allow for the characterization of rare variants, structural variants, and expression data are providing new insights into trait association. Unfortunately, the data is being generated faster than the field is able to analyze it. Current common-variant analytical methods are not powered to handle sequence data; therefore, new methods designed to handle high-throughput data are necessary. These new methods should also be capable of analyzing interactions (epistasis and gene-environment) and prepared to incorporate other "-omic" data as it increasingly becomes available. A single method able to perform these complex tasks will enable the researcher to paint a complete picture of a trait that incorporates many forms of genetic and environmental information. Identifying gene-environment interactions is of particular importance, since environment is one of the few modifiable variables. One approach for developing this analytical tool is to use known biological information in a two-step analysis. The first step uses knowledge-based biology and predicted function as guides to collapse rare variants into weighted bins. This is necessary to decrease the computational load of sequence data, as well as to increase the power to detect an association among rare variants. The binned variants, along with common variants, can then be tested for association immediately (exiting the pipeline) or be packaged for Biofilter. In the second step, Biofilter creates and assesses potential interactions (gene-gene or gene-environment). These interaction models are then tested in genome-wide data for statistically significant association.
In both steps, the biological information is derived from a systematic integration of multiple public databases of gene groupings and sets of disease-related genes to produce multi-SNP models that have an established biological foundation. The advantages of incorporating prior knowledge are a reduced search space, increased power to identify associations, and inference of relevant biology for any statistically significant result. The first goal of this project is to develop BioBin, an algorithm that will use domain knowledge to guide the collapsing and binning of rare variants. The second goal is to compare this method to other published collapsing methods using simulated data. The third goal is to create a pipeline for data to be collapsed and evaluated by Biofilter, specifically to test for gene-environment interactions using individuals in an Age-related Macular Degeneration study.
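To make the idea of collapsing rare variants into weighted bins concrete, the following is a minimal sketch and not the BioBin implementation. It assumes a hypothetical variant-to-gene lookup standing in for the integrated knowledge sources, an illustrative rare-variant cutoff of 5% minor allele frequency, and an inverse-frequency weight of 1/sqrt(maf(1 - maf)) chosen only for illustration. The function name, variant IDs, and gene names are all invented for this example.

```python
from collections import defaultdict
from math import sqrt

RARE_MAF_THRESHOLD = 0.05  # assumed cutoff; the source does not specify one


def bin_rare_variants(genotypes, maf, variant_to_gene,
                      threshold=RARE_MAF_THRESHOLD):
    """Collapse rare variants into one weighted score per gene per sample.

    genotypes:       {sample: {variant: minor-allele count (0, 1, or 2)}}
    maf:             {variant: minor allele frequency}
    variant_to_gene: {variant: gene}, a stand-in for biological knowledge
    """
    scores = {}
    for sample, calls in genotypes.items():
        bins = defaultdict(float)
        for variant, count in calls.items():
            freq = maf[variant]
            gene = variant_to_gene.get(variant)
            # Skip unannotated, monomorphic, and common variants.
            if gene is None or freq <= 0 or freq >= threshold:
                continue
            # Rarer variants contribute more to the bin (illustrative weight).
            bins[gene] += count / sqrt(freq * (1 - freq))
        scores[sample] = dict(bins)
    return scores


# Hypothetical toy data: rs3 is common and is therefore left out of the bin.
maf = {"rs1": 0.01, "rs2": 0.02, "rs3": 0.30}
variant_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_A"}
genotypes = {
    "s1": {"rs1": 1, "rs2": 0, "rs3": 2},
    "s2": {"rs1": 0, "rs2": 0, "rs3": 1},
}
scores = bin_rare_variants(genotypes, maf, variant_to_gene)
```

Each sample ends up with a single numeric score per gene bin, which can be tested for association in place of many individual rare variants. This is how collapsing reduces both the computational load and the number of tests.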
This project was developed to meet the demands of increasing sequence data and to address the issue of missing heritability. BioBin is a novel method to collapse and bin rare variants based on biological information; the resulting bins can then be analyzed using Biofilter to test gene-environment interactions for association with the trait.
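The way prior knowledge reduces the search space for interaction models can be sketched as follows. This is an illustrative assumption and not Biofilter's actual algorithm or API: gene pairs are proposed only when genes share a group, and a pair is kept only when at least `min_sources` independent knowledge sources support it. The source names, gene names, and the `smoking` exposure are hypothetical.

```python
from collections import Counter
from itertools import combinations


def knowledge_supported_pairs(sources, min_sources=2):
    """Return gene pairs that share a group in at least `min_sources` sources.

    sources: {source name: {group name: [genes in that group]}}
    """
    support = Counter()
    for groups in sources.values():
        pairs_in_source = set()
        for genes in groups.values():
            # Sorting gives each pair one canonical orientation.
            for a, b in combinations(sorted(set(genes)), 2):
                pairs_in_source.add((a, b))
        # Count each source at most once per pair.
        support.update(pairs_in_source)
    return sorted(pair for pair, n in support.items() if n >= min_sources)


def gene_environment_models(genes, exposures):
    """Pair every gene bin with every environmental variable."""
    return [(g, e) for g in sorted(genes) for e in sorted(exposures)]


# Hypothetical knowledge sources and exposure.
sources = {
    "pathway_db": {"p1": ["GENE_A", "GENE_B", "GENE_C"]},
    "interaction_db": {"c1": ["GENE_B", "GENE_A"], "c2": ["GENE_C", "GENE_D"]},
}
gene_gene = knowledge_supported_pairs(sources)
gene_env = gene_environment_models({"GENE_A", "GENE_B"}, ["smoking"])
```

With four genes there are six possible pairs, but only the pair supported by both toy sources is kept for testing. Each retained model also carries its biological rationale.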